Kubelet: Reserved Flags
Thu May 10, 2018 · 811 words
kubernetes · kubelet

One of the benefits of moving to Kubernetes is letting it manage the deployment and scaling of our workloads. The downside is that without correct configuration our nodes can still suffer from resource exhaustion. By default Kubernetes assumes the entire node’s resources are available for pod workloads. This isn’t the case. System components, the kubelet and Docker all require some of those resources. They all sit outside of Kubernetes, so it is unaware of their usage.

So we need to explicitly tell the kubelet how much capacity these components need. Otherwise, if pods attempt to use all of a node’s resources, key components such as the kubelet or Docker will struggle with resource contention. This can end with a node being marked as unavailable.

Before looking at how to ‘reserve’ resources for these external components, we should understand the concept of ‘allocatable’. It is the term for the amount of resource available to pods. Without any changes it reports the full capacity of a node.

To see this, run either kubectl describe node <node_name> or the commands below to get the CPU and memory.

kubectl get node <node_name> -o jsonpath="{ .status.allocatable.cpu }"
kubectl get node <node_name> -o jsonpath="{ .status.allocatable.memory }"
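
On a fresh node with nothing reserved, allocatable simply mirrors capacity. The output below is illustrative, from a hypothetical 2-core, 8 GiB node:

# Illustrative output, hypothetical 2-core / 8 GiB node
2
8174568Ki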

As well as allocatable, Kubernetes also keeps a record of the node’s capacity.

kubectl get node <node_name> -o jsonpath="{ .status.capacity.cpu }"
kubectl get node <node_name> -o jsonpath="{ .status.capacity.memory }"

If you are wondering why Kubernetes records and displays both, it’s because the amount of a node’s resources available to pods can be configured to be less than its actual capacity.

To calculate the allocatable CPU and memory for a node, the kubelet takes the full node capacity and deducts three settings: kube-reserved, system-reserved and an eviction threshold specified by either eviction-hard or eviction-soft. The kubelet records the result as the ‘allocatable’ resource. These settings are all configurable on the kubelet itself, but in this post I want to focus on kube-reserved and system-reserved.
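
As a worked example, here is that arithmetic for a hypothetical node; the reservation values are illustrative, not recommendations:

# Hypothetical node: 2000m CPU, 8192Mi memory
# kube-reserved:   cpu=100m, memory=200Mi
# system-reserved: cpu=100m, memory=200Mi
# eviction-hard:   memory.available<100Mi
#
# allocatable cpu    = 2000m - 100m - 100m            = 1800m
# allocatable memory = 8192Mi - 200Mi - 200Mi - 100Mi = 7692Mi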

The kube-reserved setting should be configured to a CPU and memory value that is the sum of what the Kubernetes components running on a node, outside of pods, use. This can include the kubelet and Docker. Setting this flag doesn’t mean that Kubernetes will enforce this usage. It merely removes that amount of resource from the amount available for pod workloads.

If you’re setting this via a command line argument on the kubelet, it takes the format below. You don’t have to set both CPU and memory; you can choose to set just a single resource.

--kube-reserved=cpu=100m,memory=100Mi

If you are familiar with setting resource requests and limits for a pod then you should be familiar with these units. CPU is specified in cores, so in this example it’s 100 millicores. This can also be written as 0.1, where 1 represents a full core; either way, 1/10th of a CPU core is being reserved.

Memory is specified in bytes, so 100Mi means 100 MiB (mebibytes, roughly 105 MB) is being reserved.
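
If you run the kubelet from a configuration file rather than flags (the --config option on recent kubelet versions), the equivalent field is kubeReserved. A minimal sketch, with all other fields omitted:

# Minimal sketch; other KubeletConfiguration fields omitted
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
kubeReserved:
  cpu: 100m
  memory: 100Mi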

If you go ahead and set this on one of your nodes and restart the kubelet, you can verify the change by running the commands below. You will see that ‘allocatable’ differs from capacity by the amount you’ve set.

# Get node capacity
kubectl get node <node_name> -o jsonpath="{ .status.capacity.cpu }"
kubectl get node <node_name> -o jsonpath="{ .status.capacity.memory }"
# Get allocatable
kubectl get node <node_name> -o jsonpath="{ .status.allocatable.cpu }"
kubectl get node <node_name> -o jsonpath="{ .status.allocatable.memory }"
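
With --kube-reserved=cpu=100m,memory=100Mi set on the hypothetical 2-core, 8 GiB node from earlier, the output would look something like this (illustrative values):

# Capacity
2
8174568Ki
# Allocatable: lower by 100m CPU and 100Mi (102400Ki)
1900m
8072168Ki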

The system-reserved setting is configured in the same way. The difference is that it should be set to what OS components like the kernel or user login sessions use.

--system-reserved=cpu=100m,memory=100Mi

Again, setting this flag will not enforce the reservation; it just deducts it from the resources available to pods. If you set this flag you can perform the same check as above, after restarting the kubelet, to see that it has taken effect.

If you do want these values to be enforced, it is possible. You will need to look into the following settings: --kube-reserved-cgroup, --system-reserved-cgroup and --enforce-node-allocatable.
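
As a sketch, enforcing kube-reserved could look like the flags below. The cgroup name here is an assumption; it must match the cgroup your kubelet and container runtime actually run under:

# Sketch only: /kube.slice is an assumed cgroup, adjust for your setup
--enforce-node-allocatable=pods,kube-reserved \
--kube-reserved=cpu=100m,memory=100Mi \
--kube-reserved-cgroup=/kube.slice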

The documentation warns about making system-reserved enforceable, as starving key system components of CPU, or OOM-killing them, can cause node instability.

Finally, these settings are not something to set once and forget about. Memory and CPU usage is likely to change over time with newer versions of Kubernetes or as pod workloads change.

I’d recommend tuning these settings at regular intervals, based on actual resource usage. This makes sure you’re both reserving enough resources and not wasting them.
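
One way to sample that usage on a systemd-based node is systemd-cgtop, assuming the kubelet and Docker run under system.slice on your distribution:

# Show CPU and memory usage per cgroup under system.slice (kubelet, docker, etc.)
systemd-cgtop system.slice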

If you’re just starting out building a cluster, then the data from these Kubernetes performance tests can help you with a good starting point for the kubelet’s usage.

