Hi, thank you so much for joining me here today. It'd be great to hear where you're all from, so please do leave a comment in the chat and introduce yourself. Likewise, please use the comments if you've got any questions throughout this webinar and I'll do my best to get to them at the end. I'm also joined by some friends helping me in the chat, who may get to them before I do.

---

So, to kick things off: my name is Chris Nesbitt-Smith. I'm based in London and currently an instructor for Learnk8s, a consultant to UK Government and a tinkerer of open source stuff. I've been using and abusing Kubernetes in production since it was 0.4; believe me when I say it's been a journey! I've definitely got the scars to show for it.

---

So you believed the hype: Kubernetes lets you scale infinitely, auto-heal and so on. Your cluster is self-monitoring, scaling up instances of your cloud native stateless applications on demand when you need more.

---

But all of a sudden your nodes are full, and you can scale no more.

---

Enter the cluster autoscaler, and of course a splash of YAML, to save the day.

---

It can integrate with your cloud vendor...

---

...to provision the necessary nodes.

---

And, good news, the autoscaler is configurable...

---

...though sadly, as we'll see, it's not quite as configurable as you might expect.

---

There are alternatives, but the official cluster autoscaler only scales up when there are pending pods...

---

...in order to satisfy the demand. That's probably a good idea, since there's little point adding more nodes unless you have workload that needs them.

---

OK, so first let's refresh ourselves on how the Kubernetes scheduler works.

---

If I create a deployment with two replicas...

---

...I do this by submitting my YAML to the API server, which writes it to etcd.

---

The controller is watching for this type of event, recognizes it needs to create some pods, and does so; these are now pending.

---

The scheduler is the component looking for pending pods; it sees these and then schedules them to a node.

---

The scheduling, however, is broken into a few steps: from the initial queue, through filtering viable nodes, to scoring them, before finally creating a binding.

---

But how does the scheduler know how much memory and CPU a pod uses? It does not…

---

You need to spoon-feed it with requests and limits.

---

If you don't specify requests and limits, Kubernetes will play blind: your cluster will become overloaded, nodes will become oversubscribed and you'll be constantly fighting fires. So if your only takeaway is that all your containers should have requests and limits defined, then we've done something useful here! Requests are the initial ask, and limits are the point at which your container is throttled (if it's CPU) or killed (if it's memory).

---

Applications come in all sorts of shapes and sizes, so you may have some applications that are CPU intensive but don't require much memory, while others...

---

...may have a greater memory than CPU footprint.

---

Those applications have to be deployed inside computing units which have (again) CPU and memory characteristics.

---

For every application deployed in the cluster, Kubernetes makes a note of the memory and CPU requirement.

---

It then decides where to place the application in the cluster. In this case, it's the node on the left.

---

If another application of the same size is deployed, Kubernetes goes through the same process and finds the best node to run the app.

---

In this case, Kubernetes places the app on the right side.
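To make those requests and limits concrete, here's a minimal, purely illustrative sketch of a deployment where every container declares them; the name, image and numbers are assumptions for the example, not taken from the demos:

```yaml
# Hypothetical deployment: every container declares requests (what the
# scheduler reserves) and limits (the throttle/kill point at runtime).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app              # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: web
          image: ghcr.io/example/web:1.0.0   # placeholder image
          resources:
            requests:
              cpu: 250m          # a quarter of a CPU, reserved at scheduling time
              memory: 1536Mi     # 1.5GB, matching the sizing example coming up
            limits:
              cpu: 500m          # throttled beyond this
              memory: 1536Mi     # OOM-killed beyond this
```

Remember it's only the requests that the scheduler uses when filtering and scoring nodes; the limits are enforced later, at runtime, by the kubelet and the container runtime.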
---

As more applications

---

are submitted to the cluster,

---

Kubernetes keeps making

---

notes of the CPU and

---

memory requirements...

---

... and allocating these apps in the cluster.

---

If you play the game long enough, you might notice that Kubernetes is a skilled Tetris player:

- Your servers are the board.
- The apps are the blocks.

Kubernetes tries to fit as many blocks as efficiently as possible.

---

But what about the size of the worker nodes? What kind of instance types can you use to build the cluster? Nowadays the cloud vendors make almost every instance type available to be part of a cluster, so you've got free choice. There's a catch, though.

---

You'd be forgiven for thinking that if you get a node with 8GB of RAM and 2 CPUs from your cloud vendor, you could deploy four pods that each use 1.5GB of RAM and need a quarter of a CPU.

---

However, it's not quite so.

---

One pod remains pending, which, if configured, will of course cause the...

---

...cluster autoscaler to create a new node, and then your workload is eventually scheduled.

---

But why is this?

---

When you provision a managed instance, you might think that the memory and CPU available can be used for running Pods. And you are right.

---

However, some memory and CPU should be saved for the operating system.

---

And you should also reserve memory and CPU for the kubelet.

---

Surely the rest is made available to the pods?

---

Not quite. You also need to reserve memory for the eviction threshold. If the kubelet notices that memory usage is going over that threshold, it will start evicting pods.

---

Your cloud vendor will usually choose these numbers for you. For example, AWS reserves 255MB of memory for the kubelet...

---

...and 11MB of memory for each Pod that you could deploy on that instance.

---

This is the reserved memory for the kubelet. The reserved CPU is usually around 0.3 to 0.4. For the operating system, they reserve 100MB of memory and 0.1 CPU, and for the eviction threshold, another 100MB.

---

In AWS, if you select an m5.large, here's a visual recap of how the resources are subdivided. With this particular instance, you can deploy up to 27 pods.

---

The other thing to consider is that all this takes time.

---

Let's assume you've configured the Horizontal Pod Autoscaler, or HPA, to scale up your pods dynamically; well, that's where the journey probably starts.

---

To start with, around 90 seconds for your Horizontal Pod Autoscaler to react and decide to scale up.

---

Then the cluster autoscaler takes around 30 seconds to request a new node from the cloud vendor.

---

Then around four minutes for the machine to boot.

---

Then around another 30 seconds for it to join the cluster and be ready to run workload. And you can of course add on time for pulling any container image that isn't already cached.

---

To help visualize the impact this can have, I've made a library that fakes a Kubernetes scheduler. It allows you to specify many different types of pods, model their scaling dynamics, track container startup times and so on, and define your node properties. It takes a lot of shortcuts in order to simulate hundreds of thousands of intervals, representing days, in tens of milliseconds; it is not the real Kubernetes scheduler. Pull requests are very welcome if you'd like to improve it!

---

And to give you a way to play with it, I also made a game, as a novelty for KubeCon last year, called Black Friday.
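Before we get into the game: the timings above assume an HPA along these lines. This is a minimal sketch with an illustrative name and threshold, not the exact configuration from the demos:

```yaml
# Hypothetical HPA: scales the earlier example deployment between 1 and 10
# replicas based on average CPU utilization across its pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app            # the deployment sketched earlier
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas once average CPU passes 70% of requests
```

Note that the utilization target is measured against the CPU requests we defined earlier, which is yet another reason every container needs them. Now, back to Black Friday.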
The scenario: you're an SRE team supporting a retailer facing a spike in traffic on Black Friday and again on Cyber Monday, with a lull between and a calm before and after. It's a three-tier service of frontend, backend and database, all of which have different scaling properties, startup times and so on. The scenario starts Thursday at midnight and ends Tuesday at 23:59. You're on the hook for it, so there are SLA penalties if you cause a request to fail. It's a simple three-tier app; if you go into the hints, you'll see some of the constraints. The goal is to configure your cluster to follow the traffic spike as closely as possible, with just enough infra; failing some requests and taking a few SLA penalties might actually result in a greater profit.

DEMO: POINT AND CLICK, CHANGE THE MINIMUMS TO 1 OF EACH AND DEMONSTRATE PROFIT, EXPLAIN THE GRAPHS

Please do feel free to play, and may the odds be ever in your favour.

---

So, what do we do to stack the odds in our favour?

---

Well, we could not scale at all; that's always an option, and one that's often overlooked.

---

Or, what if you could get a head start on the scaling?

---

Maybe "not scaling" sounds a bit flippant; what do I really mean by that?

---

Going back to our scenario of fitting our pods onto a machine...

---

...taking into account the reservations for the kubelet...

---

...if you size the machine correctly, you can fit all your workload on the node.

---

This isn't easy given the vast array of possible machine sizes, so we've done the hard work for you and created an instance calculator.

---

DEMO: drag the sliders about.

---

Finally, on to the topic of this webinar; the wait is over.

---

What if we could always have at least one node ready for when you need it, removing that four-minute wait? To do this we can create a placeholder pod, so that...

---

...as soon as your workload comes along needing the resources, the placeholder pod is evicted...

---

...causing the cluster autoscaler to boot a new machine to host the replacement placeholder. This continues as you scale onto further nodes, keeping you always one step ahead.

---

OK, now to pray to the demo gods while I do a real live demo.

---

I've got a simple application where you can see the effects of me clicking the scale buttons. Behind this is a real Kubernetes cluster running in Linode; I've just got some JavaScript driving the changes through the Kubernetes API to scale up and down. So we start with 1 replica, and I click to scale to 5: the current node gets saturated with 4 pods, and one is pending. Behind the scenes the cluster autoscaler is now going to request a new node from Linode, so while I stall for about three minutes of what would otherwise be silence and me praying for it to work, are there any questions?

As you saw, the node became available, and then it took a little longer for the CNI to come up before our pod could be scheduled. The timer shows that took .....

OK, now let's scale back down to 1 and enable our placeholder. Now that's all running, let's try scaling to 5 again, and as you can see that is far more performant. Phew!
That was more stressful than you can imagine. I can assure you there is a real cluster, and that really happened.

---

SKIP IF THE LIVE DEMO WORKED: I've built a simple application where you can see the effects of me clicking the scale button. We start with 1 replica and I scale to 5; as you can see, in the old world this takes some time to eventually add the additional node and scale.

---

SKIP IF THE LIVE DEMO WORKED: Same again, only now with our proactive approach of having a placeholder. As soon as I scale to 5, my placeholder is descheduled from the node it was occupying, so I've scaled up in around 10 seconds as opposed to the roughly three minutes we saw in the previous demo. Eventually a new node is provisioned with the placeholder on it again.

---

So how do we make this happen?

---

Firstly, we need a placeholder that we know will never be schedulable alongside any real workload on a node, so it should be sized big enough to fill the node.

---

Then you need to give it a low priority class, to make sure it is evicted as soon as there is real workload. The placeholder pod competes for resources, so we need to define that we want it to have a low priority. With a priority of -1, all other pods have precedence, and the placeholder is evicted as soon as the cluster runs out of space. (I'll show a sketch of the manifests at the end of this section.)

---

OK, now for a demo of how this works with pod autoscaling, in a real-ish world scenario. I'll be honest with you: I've exhausted my credit with the demo gods, so I'm going to play some video and provide a little narration.

---

Before I start, to provide a little orientation: on the left you can see the requests per second we're serving, and at the bottom left you can see the nodes and the pods on them. As you can see, my nodes can each take up to four of my workload pods. I've got a simple application that can handle a fixed number of requests, and I'm ramping up traffic. We start with two nodes and handle the increasing traffic sustainably until we fill both nodes, at which point we start backing up a list of pending pods that the HPA has decided it needs. Then we finally see the cluster autoscaler provide the nodes. I've manipulated these results a little so as not to leave you waiting the roughly four minutes for the node to become available. While we waited for the nodes, the traffic we're able to service really flattens out, but as soon as we've got more resource it goes up.

---

Now we can compare that to our proactive pattern, where we have a placeholder pod keeping a spare node ready at all times. As the traffic builds up, we see the placeholder quickly evicted and our workload pods scheduled onto that node. A new placeholder pod is created as pending, causing the cluster autoscaler to create a new node. Sometimes, however, as happens here, the traffic build-up and the HPA outpace the speed at which we can stand up a new node.

---

This all comes at an inevitable cost: the plan is to always have extra capacity ready and waiting for your workload to require it. What might the better answers be?
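Before we look at those, here's the sketch of the placeholder I promised: an assumed, minimal example of the priority class and placeholder deployment. The names, the pause image tag and the request sizes are illustrative only; in practice you'd size the requests to fill your node type's allocatable resources (what's left after the kubelet, OS and eviction reservations we looked at earlier):

```yaml
# Hypothetical priority class: -1 sits below the default of 0, so every
# ordinary pod can preempt anything that uses it.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder                  # illustrative name
value: -1
globalDefault: false
description: "Placeholder pods that keep a spare node warm; always evicted first."
---
# Hypothetical placeholder deployment: a single pod sized to fill a node,
# running the pause container, which does nothing but reserve the space.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: placeholder
  template:
    metadata:
      labels:
        app: placeholder
    spec:
      priorityClassName: placeholder
      terminationGracePeriodSeconds: 0        # release the space immediately when preempted
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9    # assumed image; any do-nothing container works
          resources:
            requests:
              cpu: 1500m                      # illustrative: roughly a node's allocatable CPU
              memory: 6Gi                     # illustrative: roughly a node's allocatable memory
```

Because the placeholder requests effectively a whole node, nothing else fits beside it; as soon as real pods need the space, the scheduler preempts it, it goes pending, and the cluster autoscaler spins up the next spare node.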
---

You can tune your workload to make sure you're not leaving gaps.

---

Or, better yet, remember that pod priority thing we used? If you've got workload that suits your cluster, that you'd like to run, that would give you more return on investment than an empty placeholder, and that can handle stopping and starting when you need it to, use that instead. Perhaps some housekeeping, analytics, machine learning, or maybe just less important services; for example, you might want to prioritize the pods supporting the shopping cart over the ones supporting the customer service desk. You can structure your cluster workload to be more aligned to your business's benefits.

---

I've been Chris Nesbitt-Smith; thank you again for joining me today. Like, subscribe, whatever the kids do these days, on LinkedIn, GitHub, whatever, and you can be assured there'll be no spam, or much content at all, since I'm awful at self-promotion, especially on social media. cns.me just points at my LinkedIn. talks.cns.me contains this and other talks; they're all open source. Questions are very welcome, on this or anything else. If I've not got to your question, or you're not watching this live, I'll do my best to get back to you: just leave the question in the kubernetes scaling Slack channel and feel free to @ me so I see it!