Hi --- So, to kick things off, my name is Chris Nesbitt-Smith. I'm based in London and currently work with some well known brands like learnk8s, control plane, and various bits of UK Government. I'm also a tinkerer of open source stuff. I've been using and abusing Kubernetes in production since it was 0.4, and believe me when I say it's been a journey! I've definitely got the war wounds to show for it. We should have time for questions and heckles at the end, but if we run out of time or you're not watching this in realtime, then please find me on LinkedIn or in the Linode Slack --- Kubernetes embraces the idea of treating your servers as a single unit and abstracts away how individual computer resources operate. From a collection of three servers, Kubernetes makes a single cluster that behaves like one. --- 1/4 So imagine having three servers. --- 2/4 You can use one of those servers to install the Kubernetes control plane. --- 3/4 The remaining servers can join the cluster as worker nodes. --- 4/4 Once the setup is completed, the servers are abstracted from you. You deal with Kubernetes as a single unit. --- 1/4 When you want to deploy a container, you submit your request to the cluster. Kubernetes takes care of executing `docker run` and scheduling the container on the right server. --- 2/4 The same happens for all other containers. --- 3/4 --- 4/4 For every deployment, Kubernetes finds the best place to run the application. --- Kubernetes can automatically scale your application for you --- But you'll likely find you run out of compute resource. All is not lost though: Kubernetes has yet another trick up its sleeve --- Given access to your underlying infrastructure, when you run out --- it can dynamically provision additional compute when you require it, using the Cluster Autoscaler --- If you didn't see Salman's fantastic talk on this a couple of weeks ago, I highly recommend watching that back for a really insightful walkthrough --- So far this is all sounding great: clusters can scale up and down both in workload and compute resource, and scheduling is all working perfectly --- but you'll quickly find that when your workload scales down, it might not happen how you would anticipate --- which can leave you with undesirably loaded clusters, when what you'd really like is for things to rebalance --- So if we look at a Deployment spec --- You'll notice there is no field or instruction for Kubernetes on how you'd like your workload to be rebalanced. Put simply, once Kubernetes has scheduled the workload, it considers its job done. That's the end, until something goes wrong.
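To make that concrete, here's a minimal Deployment sketch (the name and image are just placeholders): you can declare replicas, resource requests and scheduling constraints, but there's nothing you can set that asks Kubernetes to revisit its placement decisions later.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello                 # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
        - name: hello
          image: nginx:1.25   # placeholder image
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
# Note: nothing in this spec tells Kubernetes to rebalance these pods once they've been scheduled.
```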
--- This has been an issue that has played on many people's minds, and it has led to the desire to deschedule workloads in order to let the scheduler readjust with new information --- I might be showing my age now, but I remember a time when I used to have to defragment my hard disk --- Hours staring at a screen that looks like this, because of the way the file system works: data is written wherever there's space, and when it's later deleted it leaves gaps --- Not at all dissimilar to how the Kubernetes scheduler works --- When you delete workload you can find yourself with gaps that wouldn't exist if you were to reschedule everything from scratch --- The effect of descheduling then causes our old friend the Kubernetes scheduler --- to notice that a pod has been deleted from an undesirable location --- create a new pod --- and then go through the whole process --- of filtering which nodes are available to run the workload, with all sorts of complex rules --- then scoring them through some more very complex rules --- before deciding where best to place the workload --- In order to do this, the descheduler has policies you can define at a cluster level --- these are split into two categories, balance and deschedule: balance plugins are intended to redistribute your workload across the nodes, while deschedule plugins are intended to cause workload to be evicted based on rules such as the lifetime of a pod --- And the configuration looks very familiar next to other things in Kubernetes, in that it closely resembles a CRD. Don't be fooled by this though: it's not actually a CRD; you have to put it into a specifically named ConfigMap as a text blob --- So here we can define our plugin configurations in our profile --- And elect which plugins should be enabled --- and the categories we saw earlier are referred to as extension points --- there are other extension points available, but we'll be focusing on the deschedule and balance ones today --- A common use case might be that you've got applications that, for whatever reason, you want to restart once an hour, or once a night, because reasons --- So an example of that configuration might look like this, where we will look to restart pods over 10 seconds old --- This is the exciting part where I pray to the demo gods for the first time today...
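For reference, a sketch of what that demo policy might look like, wrapped in the ConfigMap the descheduler reads it from. The ConfigMap name and namespace here are taken from the descheduler's documented examples, so check your own install; the 10-second lifetime is deliberately aggressive, purely for the demo.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy-configmap   # name assumed from the descheduler docs; match your install
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha2"
    kind: "DeschedulerPolicy"
    profiles:
      - name: demo
        pluginConfig:
          - name: "PodLifeTime"
            args:
              maxPodLifeTimeSeconds: 10   # evict pods older than 10 seconds
        plugins:
          deschedule:                      # the "deschedule" extension point
            enabled:
              - "PodLifeTime"
```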
--- There are a few options for how you can run the descheduler on your cluster --- You could run it as a one-off Job, or perhaps you've got some other orchestration system that will create Jobs on your cluster for you --- Or a CronJob, resulting in periodic Jobs being created as you desire --- Or a Deployment that will run all the time in a loop --- To look at the CronJob option --- Here we can see a CronJob specification that will execute every minute; depending on the size of your cluster and the shape of your workload this may be undesirable, and you may want it to run more or less frequently --- Because it runs as a CronJob, the descheduler pod is created somewhere on your cluster, allocated dynamically, and of course uses resource of its own --- which could end up influencing the scheduler when it comes round to rescheduling all the work it deletes --- So having it disappear as soon as it has descheduled the other workloads allows the new, rebalanced scheduling to happen without the presence of the descheduler's influence --- Another approach is to run the descheduler in a Deployment --- It does support a highly available configuration where you can have multiple concurrent deschedulers running --- And they will periodically run to delete pods --- However only one descheduler is actually doing the work: they will elect a leader, and the other replicas will only become active if the current leader is unavailable --- So to look at another policy available to you, there's the duplicates policy --- if we consider a scenario like this --- if you were to lose the right-hand node, your orange pods would suffer a 67% impact and your green pods would be entirely unavailable until the node outage was picked up some five minutes later --- which is what the RemoveDuplicates balance plugin is intended for --- which should cause your workload to rebalance across your nodes and reduce your concentration of risk on a single node --- demo gods... --- Metrics are a pretty big deal in Kubernetes, as you are no doubt aware, and while there are more advanced metrics capabilities available --- the descheduler uses some more primitive and consistently available ones. If we remember how the kubelet that exists on every node works --- The job of the kubelet is to keep the current node synchronized with the state of the control plane. So it continuously polls the control plane for updates. Remember when we said that the scheduler assigns pods to nodes? If the kubelet finds a pod assigned to the current node, it will retrieve the spec for that pod
--- causing the Docker image to be pulled down, handed to containerd and started running --- the kubelet then reports the ongoing health of the node and its pods, and it uses cAdvisor to gather local metrics on CPU, memory and disk consumption --- to send off to the Kubernetes API server, where that is tracked. For the astute amongst you, cAdvisor is due to be replaced in the next release of Kubernetes, but the principle will remain the same --- using that data allows us to make some interesting decisions on how to deschedule our workloads to reach a more desirable balance --- an example of that is the HighNodeUtilization plugin, which will work to pack your workloads so as to maximise your bang for buck on compute nodes --- so if you have a node looking like this --- it will identify the under-utilized node and deschedule the workload in order to allow your cluster autoscaler to remove the node --- As a quick refresher --- 1/5 When you provision an EC2 instance, you might think that the memory and CPU available can be used for running Pods. And you are right. --- 2/5 However, some memory and CPU should be saved for the operating system. --- 3/5 And you should also reserve memory and CPU for the kubelet. --- 4/5 Is the rest made available to the pods? --- 5/5 Not quite yet. You also need to reserve memory for the eviction threshold. If the kubelet notices that memory usage is going over that threshold, it will start evicting pods. --- Daniele (or d5e to his friends) did a brilliant talk on this last month, and on the considerations of how to right-size your cluster, which if you didn't see, then please do seek out --- Next up is the low node utilisation policy --- which provides a few more options to try and achieve a sweet spot of node utilisation --- by providing an upper threshold and a lower threshold: in my scenario a node under 20% is considered under-utilised and one over 70% is over-utilised --- which will cause the descheduler to rebalance accordingly, to try and get the nodes into that sweet spot in between --- last prayer to the demo gods --- Related to a lot of this space is the Node Problem Detector. There are loads of things that can go wrong on a node, but without this installed, Kubernetes will be totally unaware and will continue to schedule workload onto an unhappy node until it is marked offline five minutes after it has totally failed --- which you can run as a DaemonSet in your cluster, meaning that it will run on every node --- and can detect things such as NTP being down or out of sync --- CPU, memory and disk issues --- kernel deadlocks, corrupted file systems --- issues with the container runtime --- or the kubelet --- and report that to the Kubernetes control plane --- the node controller is the Kubernetes component that is ready to process that information --- after it has arrived at the API server --- the node controller lives in the controller manager --- and can add taints such as unreachable to the node --- combining this all --- with the Node Problem Detector deployed --- if it detects that the node is unreachable --- the node can be tainted --- and the node taints violation policy --- could be configured to act on that taint --- causing the descheduler to evict all the pods --- and allow the scheduler to rebalance the workload --- preventing any pods being scheduled on the right-hand node --- in combination with the cluster autoscaler --- it can notice that the utilization is low and trigger the downscaling of that node --- That was a long journey!
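Before the takeaways, here's a sketch pulling together the plugins we've just walked through: RemoveDuplicates to spread replicas out, LowNodeUtilization with the 20%/70% thresholds from my scenario, and RemovePodsViolatingNodeTaints to react to the taints applied when the Node Problem Detector flags a node. It's illustrative rather than a drop-in config, so adjust the thresholds and profile name to suit your cluster.

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: rebalance                 # illustrative profile name
    pluginConfig:
      - name: "LowNodeUtilization"
        args:
          thresholds:               # below these a node counts as under-utilised
            cpu: 20
            memory: 20
            pods: 20
          targetThresholds:         # above these a node counts as over-utilised
            cpu: 70
            memory: 70
            pods: 70
    plugins:
      balance:                      # the "balance" extension point
        enabled:
          - "RemoveDuplicates"
          - "LowNodeUtilization"
      deschedule:                   # the "deschedule" extension point
        enabled:
          - "RemovePodsViolatingNodeTaints"
```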
So some key takeaways --- Vanilla Kubernetes will not rebalance or defrag your nodes and pods --- The descheduler exists as an add-on that will take on this task --- It is configured by policies that dictate its behaviour, and it will drive your cluster towards a more desirable configuration --- and it will take low-level metrics from the nodes and pods to inform this, rather than using metrics-server or similar --- And the Node Problem Detector can be used to provide early reactions to nodes becoming unhealthy and direct your workload to run elsewhere --- Thank you very much for your time. I've been Chris Nesbitt-Smith; find me on LinkedIn, and do be sure to check out the other webinars we've done with Linode. Like, subscribe and whatever the kids do these days --- I'll now open the floor to any questions. If we don't get to you or you're not watching this in realtime, then please do join the Slack community