Hi
So, to kick things off: my name is Chris Nesbitt-Smith, I'm based in London and currently work with some well known brands like learnk8s, control plane, and various bits of UK Government. I'm also a tinkerer of open source stuff. I've been using and abusing Kubernetes in production since it was 0.4, so believe me when I say it's been a journey! I've definitely got the war wounds to show for it. We should have time for questions and heckles at the end, but if we run out of time or you're not watching this in real time, then please find me on LinkedIn or in the Linode Slack.
Kubernetes embraces the idea of treating your servers as a single unit and abstracts away how individual computer resources operate. From a collection of three servers, Kubernetes makes a single cluster that behaves like one.
1/4 So imagine having three servers.
2/4 You can use one of those servers to install the Kubernetes control plane.
3/4 The remaining servers can join the cluster as worker nodes.
4/4 Once the setup is completed, the servers are abstracted from you. You deal with Kubernetes as a single unit.
1/4 When you want to deploy a container, you submit your request to the cluster. Kubernetes takes care of executing `docker run` and scheduling the container in the right server.
2/4 The same happens for all other containers.
3/4
4/4 For every deployment, Kubernetes finds the best place to run the application.
Kubernetes can automatically scale your application for you
But you'll likely find you run out of compute resources. All is not lost though, Kubernetes has yet another trick up its sleeve
Given access to your underlying infrastructure,
it can even dynamically provision additional compute when you require it, using the Cluster Autoscaler
If you didn't see Salman's fantastic talk on this a couple of weeks ago, I highly recommend watching that back for a really insightful walk through that
So far this is all sounding great, clusters can scale up and down both in workload and compute resource, and scheduling is all working perfectly
but you'll quickly find that when your workload scales down, it might not happen the way you'd anticipate
which can leave you with undesirably loaded clusters, when what you'd really like is for things to rebalance
So if we look at a deployment spec
You'll notice there is no field or instruction for Kubernetes on how you'd like your workload to be rebalanced. Put simply, once Kubernetes has scheduled the workload, it considers its job done. That's the end, until something goes wrong.
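To illustrate, here's a minimal, purely hypothetical Deployment (the name and image are just stand-ins): you describe how many replicas you want and what they should run, but there's nowhere to ask Kubernetes to revisit where those pods ended up later.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # hypothetical application
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: nginx:1.25    # stand-in image
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
# note: no field anywhere asks Kubernetes to ever rebalance these pods
```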
This has been an issue that has played on many people's minds, and has led to the desire for descheduling workloads in order to let the scheduler readjust with new information
I might be showing my age now, but I remember a time when I used to have to defragment my hard disk
Hours staring at a screen that looked like this, because of the way the file system works: data is written wherever there's space, and when it's later deleted it leaves gaps
Not at all dissimilar to how the Kubernetes scheduler works
When you delete workload you can find yourself with gaps that, if you were to reschedule everything from scratch, wouldn't exist
The effect of descheduling then causes our old friend the Kubernetes scheduler
to notice that a pod has been deleted from an undesirable location
create a new pod
and then go through the whole process
of filtering what nodes are available to run the workload with all sorts of complex rules
then scoring them through some more very complex rules
before deciding where best to place the workload
In order to do this the descheduler has policies you can define at a cluster level
these are split into two categories, balance and deschedule: balance plugins are intended to redistribute your workload across the nodes, while deschedule plugins cause workload to be evicted based on rules such as the lifetime of a pod
And the configuration looks very familiar to other things in Kubernetes, in that it closely resembles a CRD. Don't be fooled by this though: it's not actually a CRD, you have to put it into a specifically named ConfigMap as a text blob
So here we can define our plugin configurations in our profile
And elect which plugins should be enabled
and our categories we saw earlier are referred to as extension points
there are other extension points available, but we'll be focusing on the deschedule and balance ones today
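As a rough sketch, assuming the v1alpha2 policy API (the exact schema has shifted between descheduler releases, so check the docs for the version you run), the policy blob is shaped something like this, with nothing enabled yet:

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default
    pluginConfig: []        # per-plugin arguments go here
    plugins:
      deschedule:           # extension point: evict pods that break a rule
        enabled: []
      balance:              # extension point: redistribute pods across nodes
        enabled: []
```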
A common use case might be that you've got applications that, for whatever reason, you want to restart once an hour, or once a night, because reasons
So an example of that configuration might look like this, where we will look to restart pods over 10 seconds old
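A hedged sketch of that, wrapped in the ConfigMap the descheduler reads (the ConfigMap name and namespace here are assumptions; match whatever your descheduler deployment actually mounts):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy-configmap   # assumed name
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha2"
    kind: "DeschedulerPolicy"
    profiles:
      - name: pod-lifetime
        pluginConfig:
          - name: "PodLifeTime"
            args:
              maxPodLifeTimeSeconds: 10   # evict pods older than 10 seconds
        plugins:
          deschedule:
            enabled:
              - "PodLifeTime"
```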
this is the exciting part where I pray to the demo gods for the first time today....
There are a few options for how you can run the descheduler on your cluster
You could run it as a one off job, or perhaps you've got some other orchestration system that will create jobs on your cluster for you
Or a cronjob, resulting in periodic jobs being created as you desire
Or a deployment that will run all the time in a loop
To look at the CronJob option
Here we can see a CronJob specification that will execute every minute. Depending on the size of your cluster and the shape of your workload this may be undesirable; you may want it more or less frequent
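A sketch of such a CronJob, loosely following the upstream descheduler manifests (the image tag, service account and ConfigMap names are assumptions to adapt):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: descheduler-cronjob
  namespace: kube-system
spec:
  schedule: "*/1 * * * *"                    # every minute; tune to taste
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: descheduler-sa # assumed; needs RBAC to evict pods
          restartPolicy: "Never"
          containers:
            - name: descheduler
              image: registry.k8s.io/descheduler/descheduler:v0.29.0   # assumed tag
              command: ["/bin/descheduler"]
              args:
                - "--policy-config-file=/policy-dir/policy.yaml"
              volumeMounts:
                - name: policy-volume
                  mountPath: /policy-dir
          volumes:
            - name: policy-volume
              configMap:
                name: descheduler-policy-configmap   # assumed name
```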
Because it runs as a CronJob, the descheduler pod is created somewhere on your cluster, allocated dynamically, and of course uses resources of its own
which could end up influencing the scheduler when it comes round to rescheduling all the work it deletes
So having it disappear as soon as it's descheduled the other workloads allows the new, rebalanced scheduling to happen without the descheduler's own influence
Another approach is to run the descheduler in a deployment
It does support a highly available configuration where you can have multiple concurrent deschedulers running
And they will periodically run to delete pods
However only one descheduler is actually doing the work, they will elect a leader, and the other replicas will only become active if the current leader is unavailable
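A fragment of what that Deployment might look like (RBAC, the policy volume and so on are omitted; the flags exist in recent descheduler releases but verify them for yours):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: descheduler
  namespace: kube-system
spec:
  replicas: 2                                # extra replicas for availability
  selector:
    matchLabels:
      app: descheduler
  template:
    metadata:
      labels:
        app: descheduler
    spec:
      serviceAccountName: descheduler-sa     # assumed name
      containers:
        - name: descheduler
          image: registry.k8s.io/descheduler/descheduler:v0.29.0   # assumed tag
          command: ["/bin/descheduler"]
          args:
            - "--policy-config-file=/policy-dir/policy.yaml"
            - "--descheduling-interval=5m"   # how often the loop runs
            - "--leader-elect=true"          # only the elected leader evicts pods
```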
So to look at another policy available to you, there's the duplicates policy
if we consider a scenario like this
if you were to lose the right hand node, your orange pods would suffer a 67% impact and your green pods would be entirely unavailable until the node outage is picked up some 5 minutes later
which is what the remove duplicates balancer is intended for
which should cause your workload to rebalance across your nodes and reduce your concentration of risk on a single node
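A sketch of that policy, again assuming the v1alpha2 schema (the excludeOwnerKinds argument is optional and shown purely for illustration):

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: duplicates
    pluginConfig:
      - name: "RemoveDuplicates"
        args:
          excludeOwnerKinds:     # optionally ignore pods owned by these kinds
            - "Job"
    plugins:
      balance:
        enabled:
          - "RemoveDuplicates"
```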
demo gods..
Metrics are a pretty big deal in Kubernetes as you are no doubt aware, while there are more advanced metrics capabilities available
the descheduler uses some more primitive and consistently available ones. Let's remember how the kubelet that exists on every node works
The job of the kubelet is to keep the current node synchronized with the state of the control plane. So it continuously polls the control plane for updates. Remember when we said that the scheduler assigns pods to nodes? If the kubelet finds a pod assigned to the current node, it will retrieve the spec for that pod.
causing the container image to be pulled down, handed to containerd and the container to start running
the kubelet then reports the ongoing health of the node and the pods, and it uses cAdvisor to gather local metrics on CPU, memory and disk consumption
to send off to the Kubernetes API server, where that is tracked. For the astute amongst you, cAdvisor is due to be replaced in the next release of Kubernetes, but the principle will remain the same
using that data allows us to make some interesting decisions on how to deschedule our workloads to reach a more desirable balance
an example of that is the high node utilization plugin, which will work to schedule your workloads to maximise your bang for buck on compute nodes
so if you have a node looking like this
it will identify the under-utilised node and deschedule the workload, in order to allow your Cluster Autoscaler to remove the node
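A sketch of that plugin's configuration (v1alpha2 assumed; the percentage values are just examples), where nodes below the thresholds are treated as under-utilised and drained of workload:

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: bin-packing
    pluginConfig:
      - name: "HighNodeUtilization"
        args:
          thresholds:           # nodes below these percentages are under-utilised
            cpu: 20
            memory: 20
            pods: 20
    plugins:
      balance:
        enabled:
          - "HighNodeUtilization"
```

In practice this pairs with telling the kube-scheduler to prefer packed nodes (the MostAllocated scoring strategy), otherwise the evicted pods just get spread straight back out again.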
As a quick refresher
1/5 When you provision an EC2 instance, you might think that the memory and CPU available can be used for running Pods. And you are right.
2/5 However, some memory and CPU should be saved for the operating system.
3/5 And you should also reserve memory and CPU for the kubelet.
4/5 Is the rest made available to the pods?
5/5 Not quite yet. You also need to reserve memory for the Eviction threshold. If the kubelet notices that memory usage is going over that threshold, it will start evicting pods.
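As a rough illustration of where those reservations get expressed, they live in the kubelet's own configuration; the values here are made up purely for illustration:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:            # kept back for the operating system
  cpu: "100m"
  memory: "256Mi"
kubeReserved:              # kept back for the kubelet and node daemons
  cpu: "100m"
  memory: "256Mi"
evictionHard:              # the eviction threshold: beyond this the kubelet evicts pods
  memory.available: "200Mi"
```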
Daniele (or d5e to his friends) did a brilliant talk on this last month, covering the considerations on how to right-size your cluster, which if you didn't see, please do seek out
next up is the low utilisation policy
which provides a few more options to try and achieve a sweet spot of node utilisation, by providing an upper threshold and a lower threshold. In my scenario a node under 20% utilised is considered under-utilised, and over 70% is over-utilised
which will cause the descheduler to rebalance accordingly to try and get the nodes into that sweet spot between
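A sketch of that configuration (v1alpha2 assumed), matching the 20% and 70% figures from my scenario:

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: sweet-spot
    pluginConfig:
      - name: "LowNodeUtilization"
        args:
          thresholds:             # below this a node is under-utilised
            cpu: 20
            memory: 20
            pods: 20
          targetThresholds:       # above this a node is over-utilised
            cpu: 70
            memory: 70
            pods: 70
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"
```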
last demo gods
Related to a lot of this space is the Node Problem Detector. There are loads of things that can go wrong on a node, but without this installed Kubernetes will be totally unaware and continue to schedule workload onto an unhappy node, until it is marked offline some 5 minutes after it has totally failed
which you can run as a daemonset in your cluster, meaning that it will run on every node
and can detect things such as NTP being down or out of sync
CPU, memory and disk issues
kernel deadlocks, corrupted file systems
issues with the container runtime
or the kubelet
and report that to the Kubernetes control plane
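A minimal sketch of running it as a DaemonSet (the image tag, flags and monitor config paths here are assumptions; the node-problem-detector project ships maintained manifests you should prefer):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      containers:
        - name: node-problem-detector
          image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.14  # assumed tag
          command:
            - /node-problem-detector
            - --config.system-log-monitor=/config/kernel-monitor.json  # assumed monitor config
          volumeMounts:
            - name: log
              mountPath: /var/log     # read host logs to spot kernel issues
              readOnly: true
      volumes:
        - name: log
          hostPath:
            path: /var/log
```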
the node controller is the Kubernetes component that is ready to process that information
after it has arrived at the api server
the node controller lives in the controller manager
and can add taints such as unreachable to the node
combining this all
with the node problem detector deployed
if it detects that the node is unreachable
the node can be tainted
and the taints violation policy
could be configured to reschedule the workload on that node (I'll sketch that policy out in a moment)
causing the descheduler to evict all the pods
and allow the scheduler to rebalance the workload
preventing any pods being scheduled on the right hand node
in combination with the cluster autoscaler
it can notice that the utilization is low and trigger the downscaling of that node
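Pulling that flow together, a hedged sketch of the taints policy might look like this (v1alpha2 assumed); with no arguments the plugin evicts pods that don't tolerate a node's NoSchedule taints:

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: taints
    pluginConfig:
      - name: "RemovePodsViolatingNodeTaints"
    plugins:
      deschedule:
        enabled:
          - "RemovePodsViolatingNodeTaints"
```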
That was a long journey! So some key takeaways
Vanilla Kubernetes will not rebalance or defrag your nodes and pods
The descheduler exists as an add on that will take on this task
it is configured by policies that dictate its behaviour and will drive your cluster to a more desirable configuration
and it will take low level metrics from the nodes and pods in order to inform this rather than using metrics-server or similar
And the node problem detector can be used to provide early reactions to nodes becoming unhealthy and direct your workload to run elsewhere
Thank you very much for your time. I've been Chris Nesbitt-Smith; find me on LinkedIn, and do be sure to check out the other webinars we've done with Linode. Like, subscribe, and whatever the kids do these days
I'll now open the floor to any questions. If we don't get to you, or you're not watching this in real time, then please do join the Slack community