Lei Zhilong

The best way to input is to output

May 10, 2020 - 5 minute read - Kubernetes

Why Swap should be disabled on Kubernetes

Issue Description

Since Kubernetes 1.8, the kubelet flag fail-swap-on has defaulted to true, which means that swap is not supported by default on Kubernetes. Swap has been a standard part of Unix and Linux systems since their early days, so people are often astonished to learn that it has to be disabled on Kubernetes, which is supposed to be able to make full use of the Linux system. Several GitHub issues have discussed this change since 1.8, but unfortunately no official progress has been made. Here I’m trying to understand the whole story, and this is what I found.
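To make the check concrete, here is a minimal sketch of how one could detect active swap on a Linux node by parsing /proc/swaps. As far as I know the kubelet does something similar when fail-swap-on is true, but this is only an approximation of that idea, not the kubelet's actual code; the function names are hypothetical.

```python
from pathlib import Path
from typing import List


def active_swap_devices(proc_swaps_text: str) -> List[str]:
    """Return the swap device/file names found in /proc/swaps-style text."""
    lines = [line for line in proc_swaps_text.splitlines() if line.strip()]
    # The first non-empty line is the column header (Filename, Type, Size, ...);
    # every following line describes one active swap area.
    return [line.split()[0] for line in lines[1:]]


def swap_enabled(proc_swaps_text: str) -> bool:
    """True if at least one swap area is listed."""
    return len(active_swap_devices(proc_swaps_text)) > 0


if __name__ == "__main__":
    swaps = Path("/proc/swaps")  # only present on Linux
    if swaps.exists():
        print("swap enabled:", swap_enabled(swaps.read_text()))
```

If this reports swap as enabled on a node, a kubelet running with the default setting would be expected to refuse to start there.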

Why

As I understand it, swap is disabled because the whole Kubernetes resource QoS (Quality of Service) policy is designed around explicit resource limits and scheduling. As we all know, each Pod on Kubernetes falls into one of 3 QoS classes:

  • Best-Effort pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory. These containers can use any amount of free memory in the node though.
  • Guaranteed pods are considered top-priority and are guaranteed to not be killed until they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.
  • Burstable pods have some form of minimal resource guarantee, but can use more resources when available. Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no Best-Effort pods exist.
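The three classes above are derived from the requests and limits declared on a pod's containers. The following is a simplified sketch of that rule, assuming each container is represented as a plain dict with optional "requests" and "limits" maps; the function name and data shape are my own, and quantities are compared as raw strings, so "1Gi" and "1024Mi" would wrongly count as different.

```python
from typing import Dict, List


def qos_class(containers: List[Dict[str, Dict[str, str]]]) -> str:
    """Classify a pod as Guaranteed, Burstable or BestEffort (simplified)."""
    any_set = False      # does any container declare a request or limit?
    guaranteed = True    # do all containers satisfy the Guaranteed rule?
    for container in containers:
        requests = container.get("requests") or {}
        limits = container.get("limits") or {}
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits:
                guaranteed = False
                continue
            # A missing request defaults to the limit, so only an
            # explicit, different request breaks the Guaranteed rule.
            if requests.get(resource, limits[resource]) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    pod = [{"requests": {"memory": "512Mi"}, "limits": {"memory": "1Gi"}}]
    print(qos_class(pod))
```

Note that nothing in this rule mentions swap: the classification only reasons about declared CPU and memory.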

According to this design, Guaranteed pods should never have to worry about memory, and Best-Effort pods should only run on nodes with free memory. When Burstable pods exceed their memory requests, they should either be given more memory within their limits, if the underlying node can spare it, or be killed under pressure. None of these pods should ever be faced with the decision of whether or not to use swap.

Of course, if the kubelet were smart enough and the Linux kernel provided deterministic isolation behavior for swap space, swap might be a good option. But even if it’s technically possible, there is definitely a ton of work to do. “Support for swap is non-trivial”, says the Kubernetes community. Considering the effort needed to make swap usable and the gains it could bring, optimizing for swap is given much lower priority than improving reliability around pressure detection, fixing latency issues, or other similar features. That’s why Kubernetes issue #53533 on GitHub has been open for quite a long time. There is even a passage in the official design documentation that addresses this issue:

The current QoS policy assumes that swap is disabled. If swap is enabled, then resource guarantees (for pods that specify resource requirements) will not hold. For example, suppose 2 guaranteed pods have reached their memory limit. They can continue allocating memory by utilizing disk space. Eventually, if there isn’t enough swap space, processes in the pods might get killed. The node must take into account swap space explicitly for providing deterministic isolation behavior.

To summarize, swap is unsupported because Kubernetes does not expect swap to be used at all, and there is an enormous amount of work to be done before swap can be used in production scenarios. IMHO, this work is not just about Kubernetes itself but even more about the Linux kernel. It might be hard for the Kubernetes community to find a strong motivation to tackle this issue considering the huge amount of effort ahead.

Argument

Even though it will still be a long time before swap is officially supported, the community keeps providing more and more scenarios that prove the necessity. One such case is applications that are designed to use swap in order to handle peak-time tasks that need 10 or even 100 times more memory than normal tasks. In this case, it might be impossible to provision enough physical memory ahead of time because the cost would be unacceptable. In this scenario, swap is crucial, and the absence of swap support might drive users away from Kubernetes.

In my case, I’ve been seeing some strange behavior in my workload containers, in which a MySQL server runs with swap enabled and resources guaranteed. Since we have been using Kubernetes for more than 2 years, our Kubernetes version is quite outdated and does not yet require swap to be disabled by default. What we have observed is that even when there is sufficient memory for the workload, swap usage keeps going up slowly until the process gets killed by the OOM killer. I guess that might be one of the reasons handling swap is pretty complicated.
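One way to watch this kind of slow growth is to read the VmSwap field from /proc/<pid>/status, which on Linux reports how much of a process's memory is currently swapped out. The snippet below is a small sketch under that assumption; the helper name is hypothetical, and it is not necessarily the tooling we used.

```python
import re
from pathlib import Path
from typing import Optional


def vm_swap_kib(status_text: str) -> Optional[int]:
    """Extract the VmSwap value (in kB) from /proc/<pid>/status text.

    Returns None when the field is absent, e.g. for kernel threads.
    """
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    status = Path("/proc/self/status")  # replace "self" with a real PID
    if status.exists():
        print("swapped out (kB):", vm_swap_kib(status.read_text()))
```

Sampling this value periodically for the mysqld process would show whether swap usage really keeps climbing while free memory is still available.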

As far as I’m concerned, supporting swap is not just about turning on some Kubernetes flags; it involves establishing a whole mechanism of swap utilization, limitation, regulation and isolation, from the Linux kernel through Kubernetes to the application layer. Until that is done, I would argue that people should be cautious about running workloads in Kubernetes with swap on.

Reference

Here are some links to the community discussions. Jump in, it’s still going on.

(P.S. The kubelet flag fail-swap-on was deprecated and moved to the config file specified by the --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)