How Kubernetes picks which pods to delete during scale-in

Have you ever wondered how K8s chooses which pods to delete when a Deployment is scaled down? Since the behavior is not properly documented, I dove into the source code to find out.

It doesn’t matter if the number of pods is changed manually in a Deployment or through a horizontal pod autoscaler (HPA): Kubernetes (K8s) must pick which pods to delete, and it doesn’t choose randomly.

Users can influence the choice by setting a value for the annotation controller.kubernetes.io/pod-deletion-cost, which we will cover later in the article, but the annotation is only one part of the algorithm. The rest of this article explains the full behavior, based on the source code.

The article will dive deep into the details of the implementation: if you are not interested in the low-level workings, there is a summary at the end of the page. If you want to customize the scale-in behavior, please jump to the section about Pod deletion cost.

Scaling-in

Scaling-in means reducing the number of pods available in a deployment. There could have been a spike in traffic in the application, so more pods were necessary to serve all the clients, and after the spike is gone, we can save resources by reducing the number of pods we request K8s to run.

This can be done manually, by patching a Deployment with kubectl, or automatically, by an autoscaler that monitors some metrics to decide how many pods are necessary.

ReplicaSet

The logic for the scale-in of pods, behind the curtain, is managed by the ReplicaSet controller. When the number of expected pods decreases (but it doesn’t become zero), it does two things:

It ranks the pods it manages: this ranking is one of the criteria used for sorting the pods, but neither the only one nor the most important one;
It sorts the pods using 8 different rules (including the ranking above).

Ranking calculation

When the ReplicaSet sees that it must decrement the number of pods it manages, it invokes the function getPodsToDelete(filteredPods, relatedPods, diff)

The three arguments are:

filteredPods: the active pods managed by this ReplicaSet: inactive pods are filtered out because they won’t be deleted;
relatedPods: all the pods owned by any ReplicaSet that has the same owner as this ReplicaSet, including its own pods;
diff: the number of pods to delete.

RelatedPods

The relatedPods list drives the ranking, so let’s look more closely at what it represents. It is calculated in a dedicated function that returns all the pods owned by any ReplicaSet that shares the given ReplicaSet's owner.

This means that relatedPods is a superset of filteredPods: it contains all the pods that are in filteredPods, plus others if there is any other ReplicaSet with the same owner.

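In this way, we have a list of all the pods that are somehow related to the ReplicaSet which is scaling in. As an example, if two ReplicaSets, an application and a database, share the same owner, and the application is being scaled in, the database pods are part of the relatedPods. Keep in mind that the owner is the object referenced in the ReplicaSet's owner reference, which is usually a Deployment, and not a tool such as Helm: in practice, the most common case of ReplicaSets sharing an owner is a rolling update, when the old and the new ReplicaSet of the same Deployment coexist.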

The getPodsToDelete function first invokes getPodsRankedByRelatedPodsOnSameNode, and then it sorts the pods.

The function getPodsRankedByRelatedPodsOnSameNode calculates a rank for each pod, based on how many relatedPods are running on the same node. The rank of a pod is the number of active related pods on its node.

Two pods on the same node always have the same ranking.

If there is only one ReplicaSet with a given owner, then the ranking is simply the number of its pods on each node: this means that if multiple pods are colocated on the same node, they will have a higher ranking than a pod running alone on a node.

Things become muddier when you have multiple ReplicaSets with the same owner. The ranking will be based on all the pods across those ReplicaSets. Sticking with the example above, let’s say we have two nodes and two ReplicaSets, app and db, and we are scaling in the app.

If the pods are deployed in this way:

Node 1: app, db, db
Node 2: app, app

The app pod on the first node will have a ranking of 3, while the two app pods on the second node will have a ranking of 2.
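
To make the ranking concrete, here is a minimal sketch of the counting logic described above. It is not the upstream code: the Pod struct and the function name rankByRelatedPodsOnSameNode are hypothetical simplifications I made up for illustration, keeping only the node name and an "active" flag.

```go
package main

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod:
// only the fields needed for the ranking are kept.
type Pod struct {
	Name   string
	Node   string // name of the node the pod runs on
	Active bool   // false for pods that are terminated or being deleted
}

// rankByRelatedPodsOnSameNode returns, for each pod in podsToRank, the number
// of active related pods running on the same node.
func rankByRelatedPodsOnSameNode(podsToRank, relatedPods []Pod) []int {
	// Count the active related pods per node.
	podsOnNode := make(map[string]int)
	for _, p := range relatedPods {
		if p.Active {
			podsOnNode[p.Node]++
		}
	}
	// Each pod inherits the count of its node, so two pods on the same
	// node always get the same rank.
	ranks := make([]int, len(podsToRank))
	for i, p := range podsToRank {
		ranks[i] = podsOnNode[p.Node]
	}
	return ranks
}
```

Called with the five pods of the example as relatedPods and the three app pods as podsToRank, this sketch returns 3 for the app pod on the first node and 2 for each app pod on the second node.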

Sorting

Now that the ReplicaSet has assigned a rank to each pod, it delegates the sorting to the ActivePodsWithRanks structure, which implements Go's sort.Interface.

The sorting logic, which is what we are interested in, is all contained in the Less() implementation. The pods sorted to the front of the list will be the first ones to be deleted.

There are 8 different rules: when comparing two pods, each rule is applied in turn until one of them tells the two pods apart.

The first thing that is compared is if a pod is assigned to a node: the ones that are not assigned are deleted first;
Then, the phase of the pods is compared: a pod in the Pending phase will be deleted before a pod in the Unknown phase, and pods in the Running phase will be deleted last;
Then, the Ready status is compared: pods not Ready will be deleted before pods marked as Ready;
If the pod deletion cost feature is enabled (we will talk about it later, as it is the only way to influence which pod gets deleted), the pod with a lower controller.kubernetes.io/pod-deletion-cost (if any) will be deleted first;
Then, Kubernetes uses the rank of the pod which, as explained above, is the number of related pods running on the same node: the pod with the higher rank will be deleted first;
Then, if both pods are Ready, the pod that has been ready for a shorter amount of time will be deleted before the pod that has been ready for longer;
Then, everything else equal, the pods that have restarted the most will be deleted first;
If nothing else matches, the pod that has been created most recently, according to the CreationTimestamp field, will be deleted first.

If all these 8 criteria are the same, so there is no clear indication of which pod should be deleted first, the pods are sorted by UUID to provide a pseudorandom order. The pod whose UUID comes first in alphabetical order will be deleted first.
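
The cascade of rules is easier to follow in code. The sketch below reimplements the 8 rules and the final tie-breaker on a flattened structure; it is my own simplification and not the upstream ActivePodsWithRanks code. The podInfo type, its field names, and the deletionOrder helper are all hypothetical.

```go
package main

import (
	"sort"
	"time"
)

// podInfo is a hypothetical, flattened view of a pod with only the
// fields the comparison needs.
type podInfo struct {
	UID          string
	Node         string    // empty means not assigned to a node
	Phase        string    // "Pending", "Unknown" or "Running"
	Ready        bool      // Ready condition
	ReadySince   time.Time // only meaningful when Ready is true
	DeletionCost int32     // value of the pod-deletion-cost annotation
	Rank         int       // related pods on the same node
	Restarts     int32
	Created      time.Time
}

// phaseOrder encodes Pending < Unknown < Running.
// Any other phase string maps to 0 in this sketch.
var phaseOrder = map[string]int{"Pending": 0, "Unknown": 1, "Running": 2}

// byDeletionOrder implements sort.Interface: the first element after
// sorting is the first pod to be deleted.
type byDeletionOrder []podInfo

func (s byDeletionOrder) Len() int      { return len(s) }
func (s byDeletionOrder) Swap(i, j int) { s[i], s[j] = s[j], s[i] }

func (s byDeletionOrder) Less(i, j int) bool {
	a, b := s[i], s[j]
	// 1. Unassigned < assigned.
	if (a.Node == "") != (b.Node == "") {
		return a.Node == ""
	}
	// 2. Pending < Unknown < Running.
	if phaseOrder[a.Phase] != phaseOrder[b.Phase] {
		return phaseOrder[a.Phase] < phaseOrder[b.Phase]
	}
	// 3. Not ready < ready.
	if a.Ready != b.Ready {
		return !a.Ready
	}
	// 4. Lower deletion cost < higher deletion cost.
	if a.DeletionCost != b.DeletionCost {
		return a.DeletionCost < b.DeletionCost
	}
	// 5. Higher rank (more related pods on the node) goes first.
	if a.Rank != b.Rank {
		return a.Rank > b.Rank
	}
	// 6. Both ready: the pod that became ready more recently goes first.
	if a.Ready && b.Ready && !a.ReadySince.Equal(b.ReadySince) {
		return a.ReadySince.After(b.ReadySince)
	}
	// 7. More restarts goes first.
	if a.Restarts != b.Restarts {
		return a.Restarts > b.Restarts
	}
	// 8. Newer pods go first.
	if !a.Created.Equal(b.Created) {
		return a.Created.After(b.Created)
	}
	// Tie-breaker: alphabetical order of the identifier.
	return a.UID < b.UID
}

// deletionOrder returns a sorted copy, leaving the input untouched.
func deletionOrder(pods []podInfo) []podInfo {
	out := append([]podInfo(nil), pods...)
	sort.Sort(byDeletionOrder(out))
	return out
}
```

Writing each rule as "if the two pods differ on this criterion, decide now, otherwise fall through" keeps the priority order explicit: a later rule is only consulted when every earlier rule considers the two pods equal.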

Pod deletion cost

The pod deletion cost is a feature introduced as Alpha in Kubernetes v1.21; since v1.22 it is in the Beta state and enabled by default.

It allows users to set an annotation on a pod, controller.kubernetes.io/pod-deletion-cost, which represents the cost of deleting a pod. The cost can be any value between -2147483648 and 2147483647, and the pods with a lower value will be deleted first.
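
As an illustration of the value range, here is a small sketch of how a tool could read the annotation. The deletionCost function is a hypothetical helper of mine, not the upstream parser, and it assumes that a pod without the annotation has an implicit cost of 0.

```go
package main

import (
	"fmt"
	"strconv"
)

// podDeletionCostAnnotation is the annotation key discussed in the article.
const podDeletionCostAnnotation = "controller.kubernetes.io/pod-deletion-cost"

// deletionCost returns the cost stored in the pod annotations.
// Assumption: a missing annotation is treated as a cost of 0.
func deletionCost(annotations map[string]string) (int32, error) {
	value, ok := annotations[podDeletionCostAnnotation]
	if !ok {
		return 0, nil
	}
	// A bitSize of 32 makes ParseInt reject anything outside
	// the range [-2147483648, 2147483647].
	cost, err := strconv.ParseInt(value, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid %s value %q: %w", podDeletionCostAnnotation, value, err)
	}
	return int32(cost), nil
}
```

Note that annotation values are always strings, so the cost has to be written as "100" and not as a bare number in the pod manifest.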

As we have seen, this is on a best-effort basis: it is not the first criterion that Kubernetes will use to pick a pod to delete.

You shouldn’t update this value too often, to avoid putting too much pressure on the api-server, and you shouldn’t update it manually: if you want to use it, I suggest writing a controller that implements the logic you’d like to see applied.
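
If you go down the controller route, the two pieces you will likely need are a check that skips no-op updates and the body of the patch itself. The sketch below shows both, using only the standard library; needsUpdate and costPatch are hypothetical names, and actually sending the patch is left out (I would expect to do it with a Kubernetes client using the JSON merge patch type, but that part is an assumption and not shown).

```go
package main

import (
	"encoding/json"
	"strconv"
)

// costAnnotation is the annotation key discussed in the article.
const costAnnotation = "controller.kubernetes.io/pod-deletion-cost"

// needsUpdate reports whether the annotation differs from the desired cost.
// Skipping no-op updates is a simple way to avoid needless API calls.
func needsUpdate(annotations map[string]string, desired int32) bool {
	return annotations[costAnnotation] != strconv.FormatInt(int64(desired), 10)
}

// costPatch builds a JSON merge patch body that sets the annotation
// to the desired cost. Annotation values must be strings.
func costPatch(desired int32) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				costAnnotation: strconv.FormatInt(int64(desired), 10),
			},
		},
	}
	return json.Marshal(patch)
}
```

A controller built around these helpers would compute the desired cost for each pod, call needsUpdate, and send the patch only when the value actually changed.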

Summary

Long story short, the algorithm compares all the pods and orders them following these criteria:

Unassigned < assigned;
PodPending < PodUnknown < PodRunning;
Not ready < ready;
Lower pod-deletion-cost < higher pod-deletion cost;
Doubled up < not doubled up;
Been ready for empty time < less time < more time;
Pods with containers with higher restart counts < lower restart counts;
Empty creation time pods < newer pods < older pods;

The first pod in the list will be the first to be deleted.

I hope you found this article useful. Please let me know in the comments if you have any feedback, suggestions, or questions!

Ciao,