Understanding RWO block device handling in OpenShift
In this blog post we would like to explore OpenShift / Kubernetes block device handling. We try to answer the following questions:
What happens if multiple pods try to access the same block device?
What happens if we scale a deployment using block devices to more than one replica?
And finally, we want to give a short, high-level overview of how the Container Storage Interface (CSI) actually works.
A block device provides ReadWriteOnce (RWO) storage. This basically means a local file system mounted by a single node. Do not confuse this with a cluster file system (CephFS, GlusterFS) or a network file system (NFS). Those file systems provide ReadWriteMany (RWX) storage that can be mounted on more than one node.
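For reference, the access mode is just a field in the PVC spec. A minimal, illustrative snippet (not one of the manifests we use below) shows where it lives and what the three modes mean:
spec:
  accessModes:
  - ReadWriteOnce    # RWO: mountable read-write by a single node
  # - ReadOnlyMany   # ROX: mountable read-only by many nodes
  # - ReadWriteMany  # RWX: mountable read-write by many nodes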
Test setup
For running our tests we need the following resources:
A new namespace/project for running our tests
A persistent volume claim (PVC) to be mounted in our test pods
Three pod definitions for mounting the PVC
Step 1: Creating a new namespace/project
To run our test cases, we created a new project in OpenShift:
$ oc new-project blockdevices
Step 2: Defining a block PVC
Our cluster is running the rook operator (https://rook.io) and provides a ceph-block storage class for creating block devices:
$ oc get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
ceph-block rook-ceph.rbd.csi.ceph.com Delete Immediate false 4d14h
Let’s take a look at the details of the storage class:
$ oc get sc -o yaml ceph-block
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4 (1)
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  imageFeatures: layering
  imageFormat: "2"
  pool: blockpool
provisioner: rook-ceph.rbd.csi.ceph.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
1 | Whenever we create a PVC using this storage class, the Ceph provisioner will also create an ext4 file system on the block device.
To test block device handling we create the following persistent volume claim (PVC):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-claim
spec:
  accessModes:
  - ReadWriteOnce (1)
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-block
1 | The access mode is set to ReadWriteOnce (RWO), as block devices can only be mounted by a single node at a time.
$ oc create -f pvc.yaml
$ oc get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
block-claim Bound pvc-bd68be5d-c312-4c31-86a8-63a0c22de844 1Gi RWO ceph-block 91s
To test our shiny new block device we are going to use the following three pod definitions:
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: block-pod-a
  name: block-pod-a
spec:
  containers:
  - image: registry.redhat.io/ubi8/ubi:8.3
    name: block-pod-a
    command:
    - sh
    - -c
    - 'df -h /block && findmnt /block && sleep infinity'
    volumeMounts:
    - name: blockdevice
      mountPath: /block
  volumes:
  - name: blockdevice
    persistentVolumeClaim:
      claimName: block-claim
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: block-pod-b
  name: block-pod-b
spec:
  affinity:
    podAntiAffinity: (1)
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: run
            operator: In
            values:
            - block-pod-a
        topologyKey: kubernetes.io/hostname
  containers:
  - image: registry.redhat.io/ubi8/ubi:8.3
    name: block-pod-b
    command:
    - sh
    - -c
    - 'df -h /block && findmnt /block && sleep infinity'
    volumeMounts:
    - name: blockdevice
      mountPath: /block
  volumes:
  - name: blockdevice
    persistentVolumeClaim:
      claimName: block-claim
1 | We use an anti-affinity rule to make sure that block-pod-b runs on a different node than block-pod-a.
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: block-pod-c
  name: block-pod-c
spec:
  affinity:
    podAffinity: (1)
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: run
              operator: In
              values:
              - block-pod-a
          topologyKey: kubernetes.io/hostname
  containers:
  - image: registry.redhat.io/ubi8/ubi:8.3
    name: block-pod-c
    command:
    - sh
    - -c
    - 'df -h /block && findmnt /block && sleep infinity'
    volumeMounts:
    - name: blockdevice
      mountPath: /block
  volumes:
  - name: blockdevice
    persistentVolumeClaim:
      claimName: block-claim
1 | We use an affinity rule to make sure that block-pod-c runs on the same node as block-pod-a.
In our first test we want to make sure that both pods are running on separate cluster nodes. So we create block-pod-a and block-pod-b:
$ oc create -f block-pod-a.yml
$ oc create -f block-pod-b.yml
After a few seconds we can check the state of our pods:
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
block-pod-a 1/1 Running 0 46s 10.130.6.4 infra02.lan.stderr.at <none> <none>
block-pod-b 0/1 ContainerCreating 0 16s <none> infra01 <none> <none>
Hm, block-pod-b is stuck in the state ContainerCreating, so let’s check the events. Also note that it was scheduled to a different node (infra01) than block-pod-a (infra02).
10s Warning FailedAttachVolume pod/block-pod-b Multi-Attach error for volume "pvc-bd68be5d-c312-4c31-86a8-63a0c22de844" Volume is already used by pod(s) block-pod-a
Ah, so because our block device has the RWO access mode and block-pod-b was scheduled to a separate cluster node, OpenShift/Kubernetes cannot attach the volume to block-pod-b.
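Under the hood this per-node attachment is tracked in a cluster-scoped VolumeAttachment object, and for an RWO volume only one node can hold the attachment at a time. A rough, illustrative sketch of such an object (the name is generated, and the values below are only assumptions based on our PVC and nodes):
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-0123abcd                     # generated name, shortened here
spec:
  attacher: rook-ceph.rbd.csi.ceph.com   # CSI driver responsible for attaching
  nodeName: infra02.lan.stderr.at        # the single node the volume is attached to
  source:
    persistentVolumeName: pvc-bd68be5d-c312-4c31-86a8-63a0c22de844
status:
  attached: true
Listing these objects with oc get volumeattachment shows which node currently holds the attachment for a given persistent volume.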
But let’s try another test and create a third pod, block-pod-c, which should run on the same node as block-pod-a:
$ oc create -f block-pod-c.yml
Now let’s check the status of block-pod-c:
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
block-pod-a 1/1 Running 0 6m49s 10.130.6.4 infra02.lan.stderr.at <none> <none>
block-pod-b 0/1 ContainerCreating 0 6m19s <none> infra01 <none> <none>
block-pod-c 1/1 Running 0 14s 10.130.6.5 infra02.lan.stderr.at <none> <none>
Oh, block-pod-c is running on node infra02 and mounted the RWO volume. Let’s check the events for block-pod-c:
3m6s Normal Scheduled pod/block-pod-c Successfully assigned blockdevices/block-pod-c to infra02.lan.stderr.at
2m54s Normal AddedInterface pod/block-pod-c Add eth0 [10.130.6.5/23]
2m54s Normal Pulled pod/block-pod-c Container image "registry.redhat.io/ubi8/ubi:8.3" already present on machine
2m54s Normal Created pod/block-pod-c Created container block-pod-c
2m54s Normal Started pod/block-pod-c Started container block-pod-c
When we compare this with the events for block-pod-a:
9m41s Normal Scheduled pod/block-pod-a Successfully assigned blockdevices/block-pod-a to infra02.lan.stderr.at
9m41s Normal SuccessfulAttachVolume pod/block-pod-a AttachVolume.Attach succeeded for volume "pvc-bd68be5d-c312-4c31-86a8-63a0c22de844"
9m34s Normal AddedInterface pod/block-pod-a Add eth0 [10.130.6.4/23]
9m34s Normal Pulled pod/block-pod-a Container image "registry.access.redhat.com/ubi8/ubi:8.3" already present on machine
9m34s Normal Created pod/block-pod-a Created container block-pod-a
9m34s Normal Started pod/block-pod-a Started container block-pod-a
So the AttachVolume.Attach message is missing from the events for block-pod-c, because the volume is already attached to the node. Interesting.
Even with RWO block device volumes it is possible to use the same volume in multiple pods, as long as the pods are running on the same node.
I was not aware of this possibility and always believed that with an RWO block device only one pod can access the volume. That’s the problem with believing :-)
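This also hints at an answer to the scaling question from the beginning: a Deployment using an RWO volume can run more than one replica, but only if all replicas end up on the same node. A sketch of how this could be expressed with a required self-affinity rule (the manifest below is our own illustration and was not part of the tests above):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: block-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      run: block-deployment
  template:
    metadata:
      labels:
        run: block-deployment
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution: # co-locate all replicas on one node
          - labelSelector:
              matchExpressions:
              - key: run
                operator: In
                values:
                - block-deployment
            topologyKey: kubernetes.io/hostname
      containers:
      - image: registry.redhat.io/ubi8/ubi:8.3
        name: block-deployment
        command:
        - sh
        - -c
        - 'df -h /block && sleep infinity'
        volumeMounts:
        - name: blockdevice
          mountPath: /block
      volumes:
      - name: blockdevice
        persistentVolumeClaim:
          claimName: block-claim
Without such a co-location rule, any replica scheduled to a different node would hit the same Multi-Attach error we saw with block-pod-b.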
Thanks for reading this far.