Pod Placement
Working with Environments
Imagine you have one OpenShift cluster and you would like to create 2 or more environments inside this cluster, but also separate them and force the environments to specific nodes, or use specific inbound routers. All this can be achieved using labels, IngressControllers and so on. The following article will guide you to set up dedicated compute nodes for infrastructure, development and test environments as well as the creation of IngressController which are bound to the appropriate nodes.
Introduction
Pod scheduling is an internal process that determines placement of new pods onto nodes within the cluster. It is probably one of the most important tasks for a Day-2 scenario and should be considered at a very early stage for a new cluster. OpenShift/Kubernetes is already shipped with a default scheduler which schedules pods as they get created accross the cluster, without any manual steps.
However, there are scenarios where a more advanced approach is required, like for example using a specifc group of nodes for dedicated workload or make sure that certain applications do not run on the same nodes. Kubernetes provides different options:
Controlling placement with node selectors
Controlling placement with pod/node affinity/anti-affinity rules
Controlling placement with taints and tolerations
Controlling placement with topology spread constraints
This series will try to go into the detail of the different options and explains in simple examples how to work with pod placement rules. It is not a replacement for any official documentation, so always check out Kubernetes and or OpenShift documentations.
Node Affinity
Node Affinity allows to place a pod to a specific group of nodes. For example, it is possible to run a pod only on nodes with a specific CPU or disktype. The disktype was used as an example for the nodeSelector
and yes … Node Affinity is conceptually similar to nodeSelector but allows a more granular configuration.
NodeSelector
One of the easiest ways to tell your Kubernetes cluster where to put certain pods is to use a nodeSelector
specification. A nodeSelector defines a key-value pair and are defined inside the specification of the pods and as a label on one or multiple nodes (or machine set or machine config). Only if selector matches the node label, the pod is allowed to be scheduled on that node.
Pod Affinity/Anti-Affinity
While noteSelector provides a very easy way to control where a pod shall be scheduled, the affinity/anti-affinity feature, expands this configuration with more expressive rules like logical AND operators, constraints against labels on other pods or soft rules instead of hard requirements.
The feature comes with two types:
pod affinity/anti-affinity - allows constrains against other pod labels rather than node labels.
node affinity - allows pods to specify a group of nodes they can be placed on
Taints and Tolerations
While Node Affinity is a property of pods that attracts them to a set of nodes, taints are the exact opposite. Nodes can be configured with one or more taints, which mark the node in a way to only accept pods that do tolerate the taints. The tolerations themselves are applied to pods, telling the scheduler to accept a taints and start the workload on a tainted node.
A common use case would be to mark certain nodes as infrastructure nodes, where only specific pods are allowed to be executed or to taint nodes with a special hardware (i.e. GPU).
Topology Spread Constraints
Topology spread constraints is a new feature since Kubernetes 1.19 (OpenShift 4.6) and another way to control where pods shall be started. It allows to use failure-domains, like zones or regions or to define custom topology domains. It heavily relies on configured node labels, which are used to define topology domains. This feature is a more granular approach than affinity, allowing to achieve higher availability.
Using Descheduler
Descheduler is a new feature which is GA since OpenShift 4.7. It can be used to evict pods from nodes based on specific strategies. The evicted pod is then scheduled on another node (by the Scheduler) which is more suitable.
This feature can be used when:
nodes are under/over-utilized
pod or node affinity, taints or labels have changed and are no longer valid for a running pod
node failures
pods have been restarted too many times
Copyright © 2020 - 2024 Toni Schmidbauer & Thomas Jungbauer