NodeSelector

- By: Thomas Jungbauer ( Lastmod: 2021-09-03 )

One of the easiest ways to tell your Kubernetes cluster where to put certain pods is to use a nodeSelector specification. A nodeSelector is a set of key-value pairs defined inside the specification of the pods. The same key-value pairs are set as labels on one or multiple nodes (or on a machine set or machine config). A pod is only allowed to be scheduled on a node whose labels match the selector.

Kubernetes distinguishes between two types of selectors:

  1. cluster-wide node selectors: defined by the cluster administrators and valid for the whole cluster

  2. project node selectors: defined per project and used to place new pods of that project on specific nodes.

Using nodeSelector

As previously described, we have a cluster with an example application that the scheduler has spread evenly across the worker nodes.

oc get pods -n podtesting -o wide | grep Running

django-psql-example-1-842fl    1/1     Running             0          2m7s   10.131.0.65   compute-3   <none>           <none>
django-psql-example-1-h6kst    1/1     Running             0          24m    10.130.2.97   compute-2   <none>           <none>
django-psql-example-1-pxhlv    1/1     Running             0          2m7s   10.128.2.13   compute-0   <none>           <none>
django-psql-example-1-xms7x    1/1     Running             0          2m7s   10.129.2.10   compute-1   <none>           <none>
postgresql-1-4pcm4             1/1     Running             0          26m    10.131.0.51   compute-3   <none>           <none>
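
To see the distribution at a glance, the Running pods can also be counted per node. The following is a minimal sketch, not part of the original setup: the function name pods_per_node is hypothetical, and it assumes that oc is logged in to the cluster.

```shell
# Hypothetical helper: count the Running pods per node in a namespace.
# Note: the field selector filters on the pod phase, which is close to,
# but not identical with, the STATUS column shown by "oc get pods".
pods_per_node() {
  # $1 = namespace
  oc get pods -n "$1" --field-selector=status.phase=Running \
    -o custom-columns=NODE:.spec.nodeName --no-headers \
    | sort | uniq -c | awk '{print $2, $1}'
}

# Usage: pods_per_node podtesting
```

Each output line contains a node name followed by the number of pods running on it.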

However, our four compute nodes are built with different hardware specifications and use different disk types (SSD vs. HDD).

Figure 1. Nodes with Different Specifications

Since our web application must run on fast disks, we must configure the cluster to schedule its pods only on nodes with an SSD.

To start using nodeSelectors we first label our nodes accordingly:

  • compute-0 and compute-1 are faster nodes with an SSD attached.

  • compute-2 and compute-3 have an HDD attached.

oc label nodes compute-0 compute-1 disktype=ssd (1)

oc label nodes compute-2 compute-3 disktype=hdd
1 We use disktype as the label key.

As a cross-check, we can list the nodes with a specific label:

oc get nodes -l disktype=ssd
NAME        STATUS   ROLES    AGE     VERSION
compute-0   Ready    worker   7h32m   v1.19.0+d59ce34
compute-1   Ready    worker   7h31m   v1.19.0+d59ce34

oc get nodes -l disktype=hdd
NAME        STATUS   ROLES    AGE     VERSION
compute-2   Ready    worker   7h32m   v1.19.0+d59ce34
compute-3   Ready    worker   7h32m   v1.19.0+d59ce34
If no node carries a matching label, the pod cannot be scheduled and stays in the Pending state. Therefore, always label the nodes first.
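
To make sure that no worker node was forgotten, the nodes without the label can be listed as well. This is a minimal sketch using a label selector that combines "label exists" (worker role) with "label does not exist" (the given key). The function name workers_without_label is hypothetical.

```shell
# Hypothetical helper: list the worker nodes that do NOT carry a given label key.
workers_without_label() {
  # $1 = label key, for example disktype
  # The "!" stays in single quotes so that an interactive shell
  # does not treat it as history expansion.
  oc get nodes -l 'node-role.kubernetes.io/worker,!'"$1" -o name
}

# Usage: workers_without_label disktype
```

An empty result means that every worker node has the label set.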

The second step is to add the node selector to the specification of the pod. In our example we are using a DeploymentConfig, so let’s add it to its pod template:

oc patch dc django-psql-example -n podtesting --patch '{"spec":{"template":{"spec":{"nodeSelector":{"disktype":"ssd"}}}}}'

This adds the nodeSelector into: spec/template/spec

       nodeSelector:
         disktype: ssd
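
To confirm that the patch landed, the selector can be read back from the DeploymentConfig, and the rollout can be followed until it has finished. A minimal sketch, assuming oc is logged in. The function name check_dc_selector is hypothetical.

```shell
# Hypothetical helper: print the nodeSelector of a DeploymentConfig
# and wait until its current rollout has finished.
check_dc_selector() {
  # $1 = namespace, $2 = name of the DeploymentConfig
  oc get dc "$2" -n "$1" -o 'jsonpath={.spec.template.spec.nodeSelector}'
  echo   # jsonpath output has no trailing newline
  oc rollout status "dc/$2" -n "$1"
}

# Usage: check_dc_selector podtesting django-psql-example
```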

The change triggers a new rollout, and the pods are restarted on the nodes that match the selector.

oc get pods -n podtesting -o wide | grep Running

django-psql-example-3-4j92k    1/1     Running       0          42s   10.129.2.7    compute-1   <none>           <none>
django-psql-example-3-d7hsd    1/1     Running       0          42s   10.129.2.8    compute-1   <none>           <none>
django-psql-example-3-fkbfm    1/1     Running       0          14m   10.128.2.18   compute-0   <none>           <none>
django-psql-example-3-psskb    1/1     Running       0          14m   10.128.2.17   compute-0   <none>           <none>

As you can see, only nodes with an SSD (compute-0 and compute-1) are being used.

Controlling pod placement with project-wide selector

Adding a nodeSelector to a deployment seems fine… until somebody forgets to add it. Then the pods would be started anywhere the scheduler finds suitable. Therefore, it might make sense to use a project-wide node selector, which is automatically applied to all pods in that project. The project selector is added by the cluster administrator to the Namespace object (no matter what the OpenShift documentation says in its example) as the openshift.io/node-selector annotation.

Let’s remove our previous configuration and add the setting to our namespace podtesting:

  1. Cleanup

    Remove the nodeSelector from the deployment configuration and wait until all pods have been rescheduled

    oc patch dc django-psql-example -n podtesting --type='json' -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector"}]'
  2. Add the annotation to the project

    oc annotate ns/podtesting openshift.io/node-selector="disktype=ssd"

The OpenShift scheduler will now spread the pods across compute-0 and compute-1 again, but not across compute-2 or compute-3.
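
The annotation can be read back from the namespace to verify the setting. A minimal sketch: the function name project_node_selector is hypothetical, and the dots in the annotation key are escaped with a backslash so that jsonpath treats the key as one field.

```shell
# Hypothetical helper: print the project-wide node selector of a namespace.
project_node_selector() {
  # $1 = namespace
  oc get namespace "$1" \
    -o 'jsonpath={.metadata.annotations.openshift\.io/node-selector}'
  echo   # jsonpath output has no trailing newline
}

# Usage: project_node_selector podtesting
```

If the annotation is not set, only an empty line is printed.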

We can prove that by stressing our cluster (and nodes) a little bit and scaling our frontend application to 10 replicas:

oc get pods -n podtesting -o wide | grep Running
django-psql-example-4-2jn2l    1/1     Running     0          27s     10.128.2.8    compute-0   <none>           <none>
django-psql-example-4-6g7ks    1/1     Running     0          7m47s   10.129.2.23   compute-1   <none>           <none>
django-psql-example-4-752nm    1/1     Running     0          7m47s   10.128.2.7    compute-0   <none>           <none>
django-psql-example-4-c5jvm    1/1     Running     0          27s     10.129.2.4    compute-1   <none>           <none>
django-psql-example-4-f5kwg    1/1     Running     0          27s     10.129.2.5    compute-1   <none>           <none>
django-psql-example-4-g7bcs    1/1     Running     0          7m47s   10.129.2.24   compute-1   <none>           <none>
django-psql-example-4-h5tgb    1/1     Running     0          27s     10.129.2.6    compute-1   <none>           <none>
django-psql-example-4-spvpp    1/1     Running     0          28s     10.128.2.5    compute-0   <none>           <none>
django-psql-example-4-v9qwj    1/1     Running     0          7m48s   10.129.2.22   compute-1   <none>           <none>
django-psql-example-4-zgwcv    1/1     Running     0          27s     10.128.2.6    compute-0   <none>           <none>

As you can see, compute-0 and compute-1 are the only nodes in use.
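
The scale command itself is not shown in the step above. A minimal sketch, assuming the same oc scale form that the Cleanup section of this article uses. The function name scale_dc is hypothetical.

```shell
# Hypothetical helper: scale a DeploymentConfig to a given number of replicas.
scale_dc() {
  # $1 = namespace, $2 = name of the DeploymentConfig, $3 = replicas
  oc scale --replicas="$3" "dc/$2" -n "$1"
}

# Usage: scale_dc podtesting django-psql-example 10
```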

Well-Known Labels

nodeSelector is one of the easiest ways to control where an application is started. Working with labels therefore becomes very important as soon as workloads are added to the cluster. Kubernetes reserves some labels that can be leveraged, and some of them are already predefined on the nodes, for example:

  • beta.kubernetes.io/arch=amd64

  • kubernetes.io/hostname=compute-0

  • kubernetes.io/os=linux

  • node-role.kubernetes.io/worker=

  • node.openshift.io/os_id=rhcos

A list of all well-known labels can be found at: [1]

Two of them I would like to mention here, since they might become very important when designing the placement of pods:

  • topology.kubernetes.io/zone

  • topology.kubernetes.io/region

With these two labels you can create availability zones for your cluster. A zone can be seen as a logical failure domain, and a cluster typically spans multiple zones. A zone could be, for example, a rack in a data center, hardware sharing the same switch, or simply a separate data center. Zones are considered independent of each other.

A region is made up of one or more zones. A cluster usually does not span multiple regions.

Kubernetes makes a few assumptions about the structure of zones and regions:

  • regions and zones are hierarchical: zones are strict subsets of regions and no zone can be in 2 regions

  • zone names are unique across regions; for example region "africa-east-1" might be comprised of zones "africa-east-1a" and "africa-east-1b"
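
To check which region and zone the nodes of a cluster belong to, the two labels can be shown as additional columns. A minimal sketch. The function name node_topology is hypothetical. On clusters where these labels are not set (they are usually populated by the cloud provider integration, otherwise they have to be added manually) the columns stay empty.

```shell
# Hypothetical helper: list all nodes with their region and zone labels as columns.
node_topology() {
  oc get nodes \
    -L topology.kubernetes.io/region \
    -L topology.kubernetes.io/zone
}

# Usage: node_topology
```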

Cleanup

This concludes the chapter about nodeSelectors. For the next chapter of the Pod Placement Series (Pod Affinity and Anti Affinity) we need to clean up our configuration.

  1. Scale the frontend down to 2

    oc scale --replicas=2 dc/django-psql-example -n podtesting
  2. Remove the annotation from the namespace

    oc annotate ns/podtesting openshift.io/node-selector- (1)
    1 The trailing minus means that this annotation is removed
  3. And, in case you have not done this before, remove the nodeSelector from the DeploymentConfig

    oc patch dc django-psql-example -n podtesting --type='json' -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector"}]'
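
A JSON Patch remove operation fails when the target path does not exist, so the patch above is expected to return an error if the nodeSelector is already gone. The following minimal sketch only patches when a nodeSelector is still present. The function name remove_node_selector is hypothetical, and the sketch relies on jsonpath printing nothing for a missing field.

```shell
# Hypothetical helper: remove the nodeSelector from a DeploymentConfig,
# but only if one is currently set.
remove_node_selector() {
  # $1 = namespace, $2 = name of the DeploymentConfig
  current="$(oc get dc "$2" -n "$1" -o 'jsonpath={.spec.template.spec.nodeSelector}')"
  if [ -n "$current" ]; then
    oc patch dc "$2" -n "$1" --type=json \
      -p '[{"op": "remove", "path": "/spec/template/spec/nodeSelector"}]'
  else
    echo "no nodeSelector set on dc/$2, nothing to remove"
  fi
}

# Usage: remove_node_selector podtesting django-psql-example
```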