Understanding RWO block device handling in OpenShift
- By: Toni Schmidbauer ( Lastmod: 2021-08-14 ) - 4 min read
In this blog post we would like to explore OpenShift / Kubernetes block device handling. We try to answer the following questions:
What happens if multiple pods try to access the same block device?
What happens if we scale a deployment using block devices to more than one replica?
And finally we want to give a short, high level overview about how the container storage interface (CSI) actually works.
A block device provides Read-Write-Once (RWO) storage. This basically means a local file system that can be mounted by a single node only. Do not confuse this with a cluster file system (CephFS, GlusterFS) or a network file system (NFS). These file systems provide Read-Write-Many (RWX) storage that is mountable on more than one node.
For running our tests we need the following resources:
A new namespace/project for running our tests
A persistent volume claim (PVC) to be mounted in our test pods
Three pod definitions for mounting the PVC
Step 1: Creating a new namespace/project
To run our test cases we created a new project with OpenShift
oc new-project blockdevices
Step 2: Defining a block PVC
Our cluster is running the rook operator (https://rook.io) and provides a ceph-block storage class for creating block devices:
    $ oc get sc
    NAME         PROVISIONER                  RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
    ceph-block   rook-ceph.rbd.csi.ceph.com   Delete          Immediate           false                  4d14h
Let’s take a look at the details of the storage class:
    $ oc get sc -o yaml ceph-block
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: ceph-block
    parameters:
      clusterID: rook-ceph
      csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
      csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
      csi.storage.k8s.io/fstype: ext4  # (1)
      csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
      csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
      csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
      csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
      imageFeatures: layering
      imageFormat: "2"
      pool: blockpool
    provisioner: rook-ceph.rbd.csi.ceph.com
    reclaimPolicy: Delete
    volumeBindingMode: Immediate
(1) So whenever we create a PVC using this storage class, the Ceph CSI driver will also create an ext4 file system on the block device.
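If we only care about the file system type, we do not have to read the whole object. The following is a minimal sketch using a jsonpath query. The function name sc_fstype is our own invention and not part of oc. The dots inside the parameter key have to be escaped with a backslash.

```shell
# Print the file system type configured in a storage class.
# sc_fstype is a hypothetical helper name, not part of oc.
sc_fstype() {
  # Dots inside the parameter key are escaped so that jsonpath
  # treats "csi.storage.k8s.io/fstype" as a single key.
  oc get sc "$1" -o jsonpath='{.parameters.csi\.storage\.k8s\.io/fstype}'
}

# Usage: sc_fstype ceph-block
```

For the storage class shown above this should print ext4.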
To test block device handling we create the following persistent volume claim (PVC):
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: block-claim
    spec:
      accessModes:
        - ReadWriteOnce  # (1)
      resources:
        requests:
          storage: 1Gi
      storageClassName: ceph-block
(1) The access mode is set to ReadWriteOnce (RWO), as block devices can only be mounted by a single node.
oc create -f pvc.yaml
    $ oc get pvc
    NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    block-claim   Bound    pvc-bd68be5d-c312-4c31-86a8-63a0c22de844   1Gi        RWO            ceph-block     91s
To test our shiny new block device we are going to use the following three pod definitions:
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        run: block-pod-a
      name: block-pod-a
    spec:
      containers:
        - image: registry.redhat.io/ubi8/ubi:8.3
          name: block-pod-a
          command:
            - sh
            - -c
            - 'df -h /block && findmnt /block && sleep infinity'
          volumeMounts:
            - name: blockdevice
              mountPath: /block
      volumes:
        - name: blockdevice
          persistentVolumeClaim:
            claimName: block-claim
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        run: block-pod-b
      name: block-pod-b
    spec:
      affinity:
        podAntiAffinity:  # (1)
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: run
                    operator: In
                    values:
                      - block-pod-a
              topologyKey: kubernetes.io/hostname
      containers:
        - image: registry.redhat.io/ubi8/ubi:8.3
          name: block-pod-b
          command:
            - sh
            - -c
            - 'df -h /block && findmnt /block && sleep infinity'
          volumeMounts:
            - name: blockdevice
              mountPath: /block
      volumes:
        - name: blockdevice
          persistentVolumeClaim:
            claimName: block-claim
(1) We use an anti-affinity rule to make sure that block-pod-b runs on a different node than block-pod-a.
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        run: block-pod-c
      name: block-pod-c
    spec:
      affinity:
        podAffinity:  # (1)
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: run
                      operator: In
                      values:
                        - block-pod-a
                topologyKey: kubernetes.io/hostname
      containers:
        - image: registry.redhat.io/ubi8/ubi:8.3
          name: block-pod-c
          command:
            - sh
            - -c
            - 'df -h /block && findmnt /block && sleep infinity'
          volumeMounts:
            - name: blockdevice
              mountPath: /block
      volumes:
        - name: blockdevice
          persistentVolumeClaim:
            claimName: block-claim
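(1) We use an affinity rule so that block-pod-c runs on the same node as block-pod-a. Note that this rule is only a preference (preferredDuringSchedulingIgnoredDuringExecution), so the scheduler tries to honor it but does not guarantee it.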
In our first test we want to make sure that both pods are running on separate cluster nodes. So we create block-pod-a and block-pod-b:
    $ oc create -f block-pod-a.yml
    $ oc create -f block-pod-b.yml
After a few seconds we can check the state of our pods:
    $ oc get pods -o wide
    NAME          READY   STATUS              RESTARTS   AGE   IP           NODE                    NOMINATED NODE   READINESS GATES
    block-pod-a   1/1     Running             0          46s   10.130.6.4   infra02.lan.stderr.at   <none>           <none>
    block-pod-b   0/1     ContainerCreating   0          16s   <none>       infra01                 <none>           <none>
Hm, block-pod-b is stuck in the state ContainerCreating, so let’s check the events. Also note that it was scheduled on a different node (infra01) than block-pod-a (infra02).
10s Warning FailedAttachVolume pod/block-pod-b Multi-Attach error for volume "pvc-bd68be5d-c312-4c31-86a8-63a0c22de844" Volume is already used by pod(s) block-pod-a
Ah, so because our block device uses the RWO access mode and block-pod-b was scheduled on a separate cluster node, OpenShift (or Kubernetes) can’t attach the volume for block-pod-b.
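To dig into a situation like this, we can query the events of a single pod and look up the node a volume is currently attached to. The following is a minimal sketch. The function names are made up for this post, and we assume that the CSI driver works with VolumeAttachment objects, which the FailedAttachVolume event above suggests.

```shell
# Show only the events that belong to one pod in the current project.
pod_events() {
  oc get events --field-selector "involvedObject.name=$1"
}

# List which persistent volume is attached to which node.
# VolumeAttachment objects are cluster scoped, so this may require
# elevated permissions.
attached_volumes() {
  oc get volumeattachment \
    -o custom-columns=PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached
}

# Usage:
#   pod_events block-pod-b
#   attached_volumes
```

In our case we would expect the volume of block-claim to be listed for the node running block-pod-a only.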
But let’s try another test and let’s create a third pod block-pod-c that should run on the same node as block-pod-a:
$ oc create -f block-pod-c.yml
Now let’s check the status of block-pod-c:
    $ oc get pods -o wide
    NAME          READY   STATUS              RESTARTS   AGE     IP           NODE                    NOMINATED NODE   READINESS GATES
    block-pod-a   1/1     Running             0          6m49s   10.130.6.4   infra02.lan.stderr.at   <none>           <none>
    block-pod-b   0/1     ContainerCreating   0          6m19s   <none>       infra01                 <none>           <none>
    block-pod-c   1/1     Running             0          14s     10.130.6.5   infra02.lan.stderr.at   <none>           <none>
Oh, block-pod-c is running on node infra02 and mounted the RWO volume. Let’s check the events for block-pod-c:
    3m6s    Normal   Scheduled        pod/block-pod-c   Successfully assigned blockdevices/block-pod-c to infra02.lan.stderr.at
    2m54s   Normal   AddedInterface   pod/block-pod-c   Add eth0 [10.130.6.5/23]
    2m54s   Normal   Pulled           pod/block-pod-c   Container image "registry.redhat.io/ubi8/ubi:8.3" already present on machine
    2m54s   Normal   Created          pod/block-pod-c   Created container block-pod-c
    2m54s   Normal   Started          pod/block-pod-c   Started container block-pod-c
When we compare this with the events for block-pod-a:
    9m41s   Normal   Scheduled                pod/block-pod-a   Successfully assigned blockdevices/block-pod-a to infra02.lan.stderr.at
    9m41s   Normal   SuccessfulAttachVolume   pod/block-pod-a   AttachVolume.Attach succeeded for volume "pvc-bd68be5d-c312-4c31-86a8-63a0c22de844"
    9m34s   Normal   AddedInterface           pod/block-pod-a   Add eth0 [10.130.6.4/23]
    9m34s   Normal   Pulled                   pod/block-pod-a   Container image "registry.access.redhat.com/ubi8/ubi:8.3" already present on machine
    9m34s   Normal   Created                  pod/block-pod-a   Created container block-pod-a
    9m34s   Normal   Started                  pod/block-pod-a   Started container block-pod-a
So the AttachVolume.Attach message is missing in the events for block-pod-c, because the volume is already attached to the node. Interesting.
Even with RWO block device volumes it is possible to use the same volume in multiple pods if the pods are running on the same node.
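One way to convince ourselves that both pods really see the same file system is to write a file through one pod and read it back through the other. The following is only a sketch. The function name and the marker file name are arbitrary choices, only the mount path /block comes from the pod definitions above.

```shell
# Write a marker file through one pod and read it back through another.
# Prints the marker content as seen by the second pod.
# check_shared_volume and /block/marker are hypothetical names.
check_shared_volume() {
  writer="$1"
  reader="$2"
  marker="hello-from-$writer"
  # Write via the first pod, then read via the second pod.
  oc exec "$writer" -- sh -c "echo '$marker' > /block/marker" &&
    oc exec "$reader" -- cat /block/marker
}

# Usage: check_shared_volume block-pod-a block-pod-c
```

Called with block-pod-a and block-pod-c, this should print hello-from-block-pod-a if both pods share the volume.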
I was not aware of this possibility and always believed that with an RWO block device only one pod can access the volume. That’s the problem with believing :-)
Thanks for reading this far.