The trouble with CPU limits
Sat Jul 13, 2019 · 409 words
kubernetes · docker

From past experience, I’ve seen people struggle with CPU limits in Kubernetes: what to set them to, and the issues caused when they’re set too low.

When a pod hits its memory limit, it is out-of-memory (OOM) killed. It’s clear in Kubernetes that the pod died: inspecting the resource will show that it ran out of memory.
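For illustration, this is roughly what that looks like in the pod’s status (as returned by kubectl get pod -o yaml); the container name and restart count here are made up:

```yaml
# Illustrative fragment of a pod's status after an OOM kill
# (as shown by: kubectl get pod <pod-name> -o yaml)
status:
  containerStatuses:
    - name: app             # hypothetical container name
      restartCount: 3
      lastState:
        terminated:
          reason: OOMKilled # Kubernetes records the kill reason explicitly
          exitCode: 137
```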

In contrast, when a pod hits its CPU limit it is throttled. This usually goes unnoticed until it manifests in some other way.

Two interesting cases I’ve seen both involve kube-state-metrics. This tool queries the API server for information on Kubernetes resources and exposes that information as a Prometheus endpoint.

I have seen the kube-state-metrics pod’s restart count increase a few times over the course of a day. Each time, the pod restarted because its liveness probe failed. Manually curling the liveness probe endpoint succeeded, so the failure was intermittent. Increasing the CPU limit resolved the problem.
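As a sketch of that fix, with illustrative numbers rather than the values we actually used, the change amounted to raising the cpu limit in the container’s resources block:

```yaml
# Fragment of the kube-state-metrics container spec; values are illustrative.
resources:
  requests:
    cpu: 100m
    memory: 150Mi
  limits:
    cpu: 200m       # raised from an overly tight value (e.g. 100m)
    memory: 150Mi
```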

In another, similar incident, Prometheus was timing out when scraping kube-state-metrics. Again, removing or increasing the CPU limit fixed the issue.

Both of these issues were caused by a CPU limit set too low on the kube-state-metrics pod. Unlike memory, a workload is throttled when it hits its CPU limit. This is not obvious, and the symptoms don’t point clearly towards CPU.
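You can make throttling visible with the cAdvisor metrics container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total. Below is a minimal sketch of a Prometheus Operator rule that alerts on sustained throttling; the threshold, duration and label names (container vs container_name varies by Kubernetes version) are assumptions to adapt:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-throttling
spec:
  groups:
    - name: cpu-throttling
      rules:
        - alert: CPUThrottlingHigh
          # Fraction of CFS periods in which the container was throttled.
          expr: |
            rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
              / rate(container_cpu_cfs_periods_total{container!=""}[5m])
              > 0.25
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is being CPU throttled"
```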

How do we solve this? Could we ignore limits?

Limits and requests are the main method for preventing oversubscribed nodes, and I always tell clients to set them. You’re no longer allocating workloads to nodes yourself; you’re letting Kubernetes do it for you. If you don’t tell Kubernetes how much a workload uses, it doesn’t know.
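For reference, here is a minimal pod spec with both set; the names and values are placeholders. Requests are what the scheduler uses to place the workload, limits are the ceiling it is allowed to hit.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app            # hypothetical workload
spec:
  containers:
    - name: app
      image: example-app:1.0.0 # placeholder image
      resources:
        requests:              # used by the scheduler to pick a node
          cpu: 250m
          memory: 256Mi
        limits:                # the ceiling: CPU is throttled, memory is OOM killed
          cpu: 500m
          memory: 256Mi
```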

We have to lean on metrics. Be generous at first. Then, once you have data, revise. And continue to track as workloads change. Configure alerts for when limits are reached or nearly reached.
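As a sketch of that alerting, here is a rule that fires when a container sits close to its CPU limit. It assumes kube-state-metrics v2-style labels on kube_pod_container_resource_limits; older versions expose kube_pod_container_resource_limits_cpu_cores instead, so adjust the query to your version.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-near-limit
spec:
  groups:
    - name: cpu-near-limit
      rules:
        - alert: CPUNearLimit
          # CPU usage as a fraction of the configured limit, per container.
          expr: |
            sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod, container)
              / sum(kube_pod_container_resource_limits{resource="cpu"}) by (namespace, pod, container)
              > 0.8
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is using over 80% of its CPU limit"
```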

One of the above examples came from a team using a supplied example deployment, taken from the project’s GitHub repository. When there was a new release, they changed the version number and deployed. What they didn’t realise was that the example’s limits had been increased because the app had changed.

Don’t get caught out. Use limits but don’t set and forget.

Below are some links to further resources. I’ve included a great post on JVM memory footprint in case you’re running Java.

Kubernetes - Managing Compute Resources for Containers

Datadog - How to collect docker metrics - Throttled CPU

Spring - Memory footprint of the JVM
