Kubernetes in production — PodDisruptionBudget
How to manage disruptions in Kubernetes? Setting a proper RollingUpdate strategy spec solves only one type of disruption. What about other disruptions like deleting deployments by accident, network partitions, kernel panics, hardware failures, or pod evictions?
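A RollingUpdate strategy only protects you while the Deployment itself is rolling out a new version; for reference, a minimal sketch (the mysvc name and image are made up):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysvc
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mysvc
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take a pod down before its replacement is ready
      maxSurge: 1         # allow one extra pod during the rollout
  template:
    metadata:
      labels:
        app: mysvc
    spec:
      containers:
        - name: mysvc
          image: myrepo/mysvc:1.0.0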
PodDisruptionBudget for the win!
Voluntary and Involuntary Disruptions
Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.
We call these unavoidable cases involuntary disruptions to an application. Examples are:
- a hardware failure of the physical machine backing the node
- cluster administrator deletes VM (instance) by mistake
- cloud provider or hypervisor failure makes VM disappear
- a kernel panic
- the node disappears from the cluster due to cluster network partition
- eviction of a pod due to the node being out-of-resources.
Except for the out-of-resources condition, all these conditions should be familiar to most users; they are not specific to Kubernetes.
We call other cases voluntary disruptions. These include both actions initiated by the application owner and those initiated by a Cluster Administrator. Typical application owner actions include:
- deleting the deployment or other controller that manages the pod
- updating a deployment’s pod template causing a restart
- directly deleting a pod (e.g. by accident)
Cluster Administrator actions include:
- Draining a node for repair or upgrade.
- Draining a node from a cluster to scale the cluster down (learn about Cluster Autoscaling).
- Removing a pod from a node to permit something else to fit on that node.

Dealing with Disruptions
Here are some ways to mitigate involuntary disruptions:
- Ensure your pod requests the resources it needs.
- Replicate your application if you need higher availability. (Learn about running replicated stateless and stateful applications.)
- For even higher availability when running replicated applications, spread applications across racks (using anti-affinity) or across zones (if using a multi-zone cluster); see the sketch after this list.
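To make the first and last points concrete, here is a minimal sketch of a Deployment that sets resource requests and uses pod anti-affinity to spread replicas across nodes (all names, images and values are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysvc
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mysvc
  template:
    metadata:
      labels:
        app: mysvc
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: mysvc
              topologyKey: kubernetes.io/hostname   # spread across nodes; use a zone label to spread across zones
      containers:
        - name: mysvc
          image: myrepo/mysvc:1.0.0
          resources:
            requests:
              cpu: 100m
              memory: 128Mi

The required anti-affinity variant refuses to schedule two replicas on the same node; preferredDuringSchedulingIgnoredDuringExecution is the softer option if you have fewer nodes than replicas.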
PodDisruptionBudget
An Application Owner can create a PodDisruptionBudget object (PDB) for each application. A PDB limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions. For example, a quorum-based application would like to ensure that the number of replicas running is never brought below the number needed for a quorum. A web front end might want to ensure that the number of replicas serving load never falls below a certain percentage of the total.
Cluster managers and hosting providers should use tools which respect Pod Disruption Budgets by calling the Eviction API instead of directly deleting pods. Examples are the kubectl drain command and the Kubernetes-on-GCE cluster upgrade script cluster/gce/upgrade.sh.
When a cluster administrator wants to drain a node, they use the kubectl drain command. That tool tries to evict all the pods on the machine. The eviction request may be temporarily rejected, and the tool periodically retries all failed requests until all pods are terminated, or until a configurable timeout is reached.
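For the curious, kubectl drain does not delete pods; it POSTs Eviction objects to each pod's eviction subresource. A rough equivalent done by hand through kubectl proxy (pod name, namespace and port are placeholders):

kubectl proxy --port=8001 &

curl -v -H 'Content-Type: application/json' \
  http://127.0.0.1:8001/api/v1/namespaces/prod/pods/mysvc-prod-6856d59f9b-lzrtf/eviction \
  -d '{
    "apiVersion": "policy/v1beta1",
    "kind": "Eviction",
    "metadata": {
      "name": "mysvc-prod-6856d59f9b-lzrtf",
      "namespace": "prod"
    }
  }'

If the eviction would violate a PodDisruptionBudget, the API answers with 429 Too Many Requests and the pod keeps running; that is the "temporarily rejected" case kubectl drain keeps retrying.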
Example PDB Using minAvailable:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper
Example PDB Using maxUnavailable (Kubernetes 1.7 or higher):
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: zookeeper
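Either budget only applies to pods whose labels match spec.selector. A rough sketch of the owning workload for the zookeeper example, assuming a StatefulSet whose pod template carries the same app: zookeeper label (names and image are illustrative):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-hs          # assumed headless service
  replicas: 3
  selector:
    matchLabels:
      app: zookeeper          # must match the PDB's spec.selector
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
        - name: zookeeper
          image: zookeeper:3.4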
Helm
use this in your Chart!
# templates/pdb.yaml
{{- if .Values.budget.minAvailable -}}
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: {{ template "app.fullname" . }}
  namespace: {{ .Values.namespace }}
  labels:
    app: {{ template "app.name" . }}
    chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  selector:
    matchLabels:
      app: {{ template "app.name" . }}
      env: {{ .Values.env.name }}
  minAvailable: {{ .Values.budget.minAvailable }}
{{- end -}}
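The template assumes a few keys in values.yaml; a minimal sketch matching the references above:

# helm/values.yaml
namespace: prod
replicas: 2
env:
  name: prod
budget:
  minAvailable: 1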
Imagine you have a service with 2 replicas and you need at least 1 to be available even during node upgrades and other ops tasks.
install / upgrade your release:
helm upgrade --install --debug "$RELEASE_NAME" -f helm/values.yaml \
  --set replicas=2,budget.minAvailable=1 myrepo/mychart
run kubectl describe pdb "$RELEASE_NAME"
Name:           mysvc-prod
Namespace:      prod
Min available:  1
Selector:       app=myservice,env=prod
Status:
    Allowed disruptions:  1
    Current:              2
    Desired:              1
    Total:                2
Events:         <none>
drain a node with one of your pods running:
kubectl drain --delete-local-data --force --ignore-daemonsets gke-mycluster-prod-pool-2fca4c85-k6g5
node "gke-mycluster-prod-pool-2fca4c85-k6g5" already cordoned
WARNING: Deleting pods with local storage: sqlproxy-67f695889d-t778w; Ignoring DaemonSet-managed pods: fluentd-gcp-v3.0.0-llp5s; Deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: kube-proxy-gke-testing-dev-pool-2fca4c85-k6g5
pod "tiller-deploy-7b7b795779-rcvkd" evicted
pod "mysvc-prod-6856d59f9b-lzrtf" evicted
node "gke-mycluster-prod-pool-2fca4c85-k6g5" drained
again run kubectl describe pdb "$RELEASE_NAME"
Name:           mysvc-prod
Namespace:      prod
Min available:  1
Selector:       app=myservice,env=prod
Status:
    Allowed disruptions:  0
    Current:              1
    Desired:              1
    Total:                2
Events:         <none>
Tadaaa! We drained a node without any disruption to our service. Note that Allowed disruptions is now 0: with Current at 1 and Min available at 1, the eviction API would reject further voluntary disruptions until the evicted pod is rescheduled and ready again.
PDB with 1 replica only?
If we had only 1 replica, kubectl drain would always get stuck, and node drains / upgrades would have to be handled manually.
You might expect the eviction API to surge an extra replica in order to comply with the minAvailable condition; instead, the drain simply gets stuck and it is your responsibility to resolve the situation yourself. Is that a bug or a feature? The Kubernetes community says you shouldn't run a single replica in production at all if you want HA, which is fair :)
It does what is expected, though.
If you don't want your kubectl drains to get stuck, you might want to create the PDB only for deployments with more than 1 replica.
Edit your Helm template:
# templates/pdb.yaml
{{- if .Values.budget.minAvailable -}}
{{- if gt (int .Values.replicas) 1 -}}
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  ...
{{- end -}}
{{- end -}}
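A quick way to verify the guard is helm template, which renders the chart locally (the chart path is an assumption):

# renders templates/pdb.yaml
helm template -f helm/values.yaml --set replicas=2,budget.minAvailable=1 ./helm/mychart

# renders no PDB at all
helm template -f helm/values.yaml --set replicas=1,budget.minAvailable=1 ./helm/mychart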
How to perform Disruptive Actions on your Cluster
If you are a Cluster Administrator, and you need to perform a disruptive action on all the nodes in your cluster, such as a node or system software upgrade, here are some options:
- Accept downtime during the upgrade.
- Fail over to another complete replica cluster.
  - No downtime, but may be costly both for the duplicated nodes and for the human effort to orchestrate the switchover.
- Write disruption-tolerant applications and use PDBs (a drain-loop sketch follows this list).
  - No downtime.
  - Minimal resource duplication.
  - Allows more automation of cluster administration.
  - Writing disruption-tolerant applications is tricky, but the work to tolerate voluntary disruptions largely overlaps with work to support autoscaling and tolerating involuntary disruptions.
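For the third option, the per-node loop can be as simple as the sketch below; because kubectl drain goes through the eviction API, it respects your PDBs (flags mirror the drain example above, the actual upgrade step is up to you):

# cordon + drain each node in turn; evictions blocked by a PDB are retried until --timeout
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  kubectl drain "$node" --ignore-daemonsets --delete-local-data --force --timeout=10m
  # ... upgrade or reboot the node here ...
  kubectl uncordon "$node"
done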
Links
- docs: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/