Understanding RWO block device handling in OpenShift

- By: Toni Schmidbauer ( Lastmod: 2021-08-14 )

In this blog post we would like to explore OpenShift / Kubernetes block device handling. We try to answer the following questions:

  • What happens if multiple pods try to access the same block device?

  • What happens if we scale a deployment using block devices to more than one replica?

And finally we want to give a short, high level overview about how the container storage interface (CSI) actually works.

A block device provides Read-Write-Once (RWO) storage. This basically means a local file system mounted by a single node. Do not confuse this with a cluster (CephFS, GlusterFS) or network file system (NFS). These file systems provide Read-Write-Many (RWX) storage mountable on more than one node.

Test setup

For running our tests we need the following resources

  • A new namespace/project for running our tests

  • A persistent volume claim (PVC) to be mounted in our test pods

  • Two pods definitions for mounting the PVC

Step 1: Creating a new namespace/project

To run our test cases we created a new project with OpenShift

oc new-project blockdevices

Step 2: Defining a block PVC

Our cluster is running the rook operator (https://rook.io) and provides a ceph-block storage class for creating block devices:

$ oc get sc
NAME                 PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
ceph-block           rook-ceph.rbd.csi.ceph.com     Delete          Immediate              false                  4d14h

Let’s take a look a the details of the storage class:

$ oc get sc -o yaml ceph-block
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4 (1)
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  imageFeatures: layering
  imageFormat: "2"
  pool: blockpool
provisioner: rook-ceph.rbd.csi.ceph.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
1 So whenever we create a PVC using this storage class the Ceph provisioner will also create an EXT4 file system on the block device.

To test block device handling we create the following persistent volume claim (PVC):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-claim
spec:
  accessModes:
    - ReadWriteOnce (1)
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-block
1 The access mode is set to ReadWriteOnce (RWO), as block devices
oc create -f pvc.yaml
$ oc get pvc
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
block-claim   Bound    pvc-bd68be5d-c312-4c31-86a8-63a0c22de844   1Gi        RWO            ceph-block     91s

To test our shiny new block device we are going to use the following three pod definitions:

block-pod-a
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: block-pod-a
  name: block-pod-a
spec:
  containers:
  - image: registry.redhat.io/ubi8/ubi:8.3
    name: block-pod-a
    command:
      - sh
      - -c
      - 'df -h /block && findmnt /block && sleep infinity'
    volumeMounts:
    - name: blockdevice
      mountPath: /block
  volumes:
  - name: blockdevice
    persistentVolumeClaim:
      claimName: block-claim
block-pod-b
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: block-pod-b
  name: block-pod-b
spec:
  affinity:
    podAntiAffinity: (1)
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: run
                operator: In
                values:
                  - block-pod-a
          topologyKey: kubernetes.io/hostname
  containers:
  - image: registry.redhat.io/ubi8/ubi:8.3
    name: block-pod-b
    command:
      - sh
      - -c
      - 'df -h /block && findmnt /block && sleep infinity'
    volumeMounts:
    - name: blockdevice
      mountPath: /block
  volumes:
  - name: blockdevice
    persistentVolumeClaim:
      claimName: block-claim
1 We use an AntiAffinity rule for making sure that block-pod-b runs on a different node than block-pod-a.
block-pod-c
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: block-pod-c
  name: block-pod-c
spec:
  affinity:
    podAffinity: (1)
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: run
              operator: In
              values:
              - block-pod-a
          topologyKey: kubernetes.io/hostname
  containers:
  - image: registry.redhat.io/ubi8/ubi:8.3
    name: block-pod-c
    command:
      - sh
      - -c
      - 'df -h /block && findmnt /block && sleep infinity'
    volumeMounts:
    - name: blockdevice
      mountPath: /block
  volumes:
  - name: blockdevice
    persistentVolumeClaim:
      claimName: block-claim
1 We use an Affinity rule for making sure that block-pod-c runs on the same node as block-pod-a.

In our first test we want to make sure that both pods are running on separate cluster nodes. So we create block-pod-a and block-pod-b:

$ oc create -f block-pod-a.yml
$ oc create -f block-pod-b.yml

After a few seconds we can check the state of our pods:

$ oc get pods -o wide
NAME          READY   STATUS              RESTARTS   AGE   IP           NODE                    NOMINATED NODE   READINESS GATES
block-pod-a   1/1     Running             0          46s   10.130.6.4   infra02.lan.stderr.at   <none>           <none>
block-pod-b   0/1     ContainerCreating   0          16s   <none>       infra01                 <none>           <none>

Hm, block-pod-b is in the state ContainerCreating, let’s check the events. Also note that it is running on another node (infra01) then block-pod-a (infra02).

10s         Warning   FailedAttachVolume       pod/block-pod-b                     Multi-Attach error for volume "pvc-bd68be5d-c312-4c31-86a8-63a0c22de844" Volume is already used by pod(s) block-pod-a

Ah, so because of our block device with RWO access mode and block-pod-b running on separate cluster node, OpenShift or K8s can’t attach the volume to our block-pod-b.

But let’s try another test and let’s create a third pod block-pod-c that should run on the same node as block-pod-a:

$ oc create -f block-pod-c.yml

Now let’s check the status of block-pod-c:

$ oc get pods -o wide
NAME          READY   STATUS              RESTARTS   AGE     IP           NODE                    NOMINATED NODE   READINESS GATES
block-pod-a   1/1     Running             0          6m49s   10.130.6.4   infra02.lan.stderr.at   <none>           <none>
block-pod-b   0/1     ContainerCreating   0          6m19s   <none>       infra01                 <none>           <none>
block-pod-c   1/1     Running             0          14s     10.130.6.5   infra02.lan.stderr.at   <none>           <none>

Oh, block-pod-c is running on node infra02 and mounted the RWO volume. Let’s check the events for block-pod-c:

3m6s        Normal    Scheduled                pod/block-pod-c   Successfully assigned blockdevices/block-pod-c to infra02.lan.stderr.at
2m54s       Normal    AddedInterface           pod/block-pod-c   Add eth0 [10.130.6.5/23]
2m54s       Normal    Pulled                   pod/block-pod-c   Container image "registry.redhat.io/ubi8/ubi:8.3" already present on machine
2m54s       Normal    Created                  pod/block-pod-c   Created container block-pod-c
2m54s       Normal    Started                  pod/block-pod-c   Started container block-pod-c

When we compare this with the events for block-pod-a:

9m41s       Normal    Scheduled                pod/block-pod-a   Successfully assigned blockdevices/block-pod-a to infra02.lan.stderr.at
9m41s       Normal    SuccessfulAttachVolume   pod/block-pod-a   AttachVolume.Attach succeeded for volume "pvc-bd68be5d-c312-4c31-86a8-63a0c22de844"
9m34s       Normal    AddedInterface           pod/block-pod-a   Add eth0 [10.130.6.4/23]
9m34s       Normal    Pulled                   pod/block-pod-a   Container image "registry.access.redhat.com/ubi8/ubi:8.3" already present on machine
9m34s       Normal    Created                  pod/block-pod-a   Created container block-pod-a
9m34s       Normal    Started                  pod/block-pod-a   Started container block-pod-a

So the AttachVolume.Attach message is missing in the events for block-pod-c. Because the volume is already attached to the node, interesting.

Even with RWO block device volumes it is possible to use the same volume in multiple pods if the pods a running on the same node.

I was not aware of this possibility and always had the believe with an RWO block device only one pod can access the volume. That’s the problem with believing :-)

Thanks or reading this far.