Automated ETCD Backup

- By: Thomas Jungbauer ( Lastmod: 2021-11-30 )

Backing up ETCD is one of the major Day-2 tasks for a Kubernetes cluster. This article explains how to automate that backup using an OpenShift CronJob.

There is absolutely no warranty. Verify your backups regularly and perform restore tests.

Prerequisites

The following is required:

  • OpenShift Cluster 4.x

  • Integrated storage, for example NFS. Best practice is storage that supports RWX (ReadWriteMany).

Configure Project & Cronjob

Create the following objects in OpenShift. This will create:

  1. A Project called ocp-etcd-backup

  2. A PersistentVolumeClaim to store the backups. Change to your appropriate StorageClass and accessMode

  3. A ServiceAccount called openshift-backup

  4. A dedicated ClusterRole which is able to start (debug) pods

  5. A ClusterRoleBinding between the created ServiceAccount and the custom ClusterRole

  6. A second ClusterRoleBinding, which gives our ServiceAccount the permission to start privileged containers. This is required to start a debug pod on a control plane node.

  7. A CronJob which performs the backup (see the callouts for inline explanations).

A Helm chart that creates the objects below can be found at: https://github.com/tjungbauer/ocp-auto-backup. Managing the variables through its values.yaml file is probably the better approach.
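
If the manifest below is saved to a file, with the callout markers such as (1) removed or kept only as comments, all objects can be created in one step. The following is a minimal sketch: the file name etcd-backup.yaml and the function name are assumptions made for this example and are not part of the original setup.

```shell
# Sketch: create the objects from one file and check that they exist.
# MANIFEST is a hypothetical file name; save the YAML from this section under it.
MANIFEST=${MANIFEST:-etcd-backup.yaml}
NAMESPACE=ocp-etcd-backup

create_backup_objects() {
  # Create (or update) all objects defined in the manifest.
  oc apply -f "$MANIFEST" || return 1
  # List the namespaced objects. Depending on the StorageClass, the PVC may
  # stay Pending until the first backup Pod mounts it.
  oc get pvc,serviceaccount,cronjob -n "$NAMESPACE"
}
```

Calling create_backup_objects applies the file and lists the PVC, the ServiceAccount and the CronJob. Running oc get storageclass beforehand shows which StorageClass names are available for the PVC.
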
kind: Namespace
apiVersion: v1
metadata:
  name: ocp-etcd-backup
  annotations:
    openshift.io/description: Openshift Backup Automation Tool
    openshift.io/display-name: Backup ETCD Automation
spec: {}
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: etcd-backup-pvc
  namespace: ocp-etcd-backup
spec:
  accessModes:
    - ReadWriteOnce # (1)
  resources:
    requests:
      storage: 100Gi
  storageClassName: gp2
  volumeMode: Filesystem
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: openshift-backup
  namespace: ocp-etcd-backup
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-etcd-backup
rules:
- apiGroups: [""]
  resources:
     - "nodes"
  verbs: ["get", "list"]
- apiGroups: [""]
  resources:
     - "pods"
     - "pods/log"
  verbs: ["get", "list", "create", "delete", "watch"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: openshift-backup
subjects:
  - kind: ServiceAccount
    name: openshift-backup
    namespace: ocp-etcd-backup
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-etcd-backup
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: etcd-backup-scc-privileged
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:scc:privileged
subjects:
- kind: ServiceAccount
  name: openshift-backup
  namespace: ocp-etcd-backup
---
kind: CronJob
apiVersion: batch/v1
metadata:
  name: cronjob-etcd-backup
  namespace: ocp-etcd-backup
  labels:
    purpose: etcd-backup
spec:
  schedule: '*/5 * * * *' # (2)
  startingDeadlineSeconds: 200
  concurrencyPolicy: Forbid
  suspend: false
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      backoffLimit: 0
      template:
        metadata:
          creationTimestamp: null
        spec:
          nodeSelector:
            node-role.kubernetes.io/master: '' # (3)
          restartPolicy: Never
          activeDeadlineSeconds: 200
          serviceAccountName: openshift-backup
          schedulerName: default-scheduler
          hostNetwork: true
          terminationGracePeriodSeconds: 30
          securityContext: {}
          containers:
            - resources:
                requests:
                  cpu: 300m
                  memory: 250Mi
              terminationMessagePath: /dev/termination-log
              name: etcd-backup
              command: # (4)
                - /bin/bash
                - '-c'
                - >-
                  oc get no -l node-role.kubernetes.io/master --no-headers -o
                  name | grep `hostname` | head -n 1 | xargs -I {} -- oc debug
                  {} -- bash -c 'chroot /host sudo -E
                  /usr/local/bin/cluster-backup.sh /home/core/backup' ; echo
                  'Moving Local Master Backups to target directory (from
                  /home/core/backup to mounted PVC)'; mv /home/core/backup/*
                  /etcd-backup/; echo 'Deleting files older than 30 days' ; find
                  /etcd-backup/ -type f  -mtime +30 -exec rm {} \;
              securityContext:
                privileged: true
                runAsUser: 0
              imagePullPolicy: IfNotPresent
              volumeMounts:
                - name: temp-backup
                  mountPath: /home/core/backup # (5)
                - name: etcd-backup
                  mountPath: /etcd-backup # (6)
              terminationMessagePolicy: FallbackToLogsOnError
              image: registry.redhat.io/openshift4/ose-cli
          serviceAccount: openshift-backup
          volumes:
            - name: temp-backup
              hostPath:
                path: /home/core/backup
                type: ''
            - name: etcd-backup
              persistentVolumeClaim:
                claimName: etcd-backup-pvc
          dnsPolicy: ClusterFirst
          tolerations:
            - operator: Exists
              effect: NoSchedule
            - operator: Exists
              effect: NoExecute
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
1 RWO is used here, since I have no other available storage on my test cluster.
2 How often the job is executed. Here, every 5 minutes.
3 Binds the job to control plane ("master") nodes.
4 Command to be executed. It determines the name of the local control plane node and starts a debug Pod there. The backup script is called and writes the backup to /home/core/backup, which is a folder on the control plane node itself. The mv command then moves the backups from this local folder to the actual backup target volume. Finally, backups older than 30 days are removed.
5 /home/core/backup of the control plane node, mounted into the Pod. The backup script stores the backups here before they are moved.
6 Target destination for the etcd backup on the mounted PVC
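
Because the command in callout 4 is folded into a single line, it can be hard to review. The sketch below unfolds the same steps into a small script. It is an illustration, not the CronJob's actual command: the function and variable names are made up for this example, and unlike the original, which chains the steps with ;, it stops as soon as one step fails.

```shell
# Sketch: the CronJob command from callout (4), unfolded into single steps.
# Function and variable names are illustrative, not part of the original CronJob.
LOCAL_DIR=/home/core/backup   # hostPath on the control plane node, callout (5)
TARGET_DIR=/etcd-backup       # mounted PVC, callout (6)
RETENTION_DAYS=30

backup_etcd() {
  local node
  # Pick the control plane node whose name matches this Pod's hostname.
  # This relies on hostNetwork: true, so that the Pod reports the node's hostname.
  node=$(oc get nodes -l node-role.kubernetes.io/master --no-headers -o name \
    | grep "$(hostname)" | head -n 1)
  if [ -z "$node" ]; then
    echo "no matching control plane node found" >&2
    return 1
  fi

  # Run the backup script on that node through a debug Pod.
  oc debug "$node" -- bash -c \
    "chroot /host sudo -E /usr/local/bin/cluster-backup.sh $LOCAL_DIR" || return 1

  echo "Moving local backups from $LOCAL_DIR to $TARGET_DIR"
  mv "$LOCAL_DIR"/* "$TARGET_DIR"/ || return 1

  echo "Deleting files older than $RETENTION_DAYS days"
  find "$TARGET_DIR"/ -type f -mtime +"$RETENTION_DAYS" -exec rm {} \;
}
```

Stopping on the first error is a design choice of this sketch: it avoids running the mv and find steps when the backup itself has failed, at the price of leaving files in the local folder until the next successful run.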

Start a Job

If you do not want to wait until the CronJob is triggered, you can manually start the Job using the following command:

oc create job backup --from=cronjob/cronjob-etcd-backup -n ocp-etcd-backup

This will start a Pod which will do the backup:

Starting pod/ip-10-0-196-187us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-15
found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-10
found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-9
found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-3
etcdctl is already installed
{"level":"info","ts":1638199790.980932,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"/home/core/backup/snapshot_2021-11-29_152949.db.part"}
{"level":"info","ts":"2021-11-29T15:29:50.991Z","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1638199790.9912837,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"https://10.0.196.187:2379"}
{"level":"info","ts":"2021-11-29T15:29:53.306Z","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
Snapshot saved at /home/core/backup/snapshot_2021-11-29_152949.db
{"level":"info","ts":1638199793.3482974,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"https://10.0.196.187:2379","size":"180 MB","took":2.367303503}
{"level":"info","ts":1638199793.348459,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"/home/core/backup/snapshot_2021-11-29_152949.db"}
{"hash":1180914745,"revision":10182252,"totalKey":19360,"totalSize":179896320}
snapshot db and kube resources are successfully saved to /home/core/backup

Removing debug pod ...
Moving Local Master Backups to target directory (from /home/core/backup to mounted PVC)
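
To follow such a manually started Job from the command line, something like the following could be used. It is a sketch under assumptions: the Job name backup matches the oc create job command above, the function name is made up, and the 300 second timeout is an arbitrary value chosen to be longer than the activeDeadlineSeconds of the CronJob.

```shell
# Sketch: wait for the manually created Job, print its log, then remove it.
NAMESPACE=ocp-etcd-backup
JOB_NAME=backup   # same name as used with `oc create job`

follow_manual_backup() {
  # Block until the Job reports the Complete condition or the timeout expires.
  oc wait --for=condition=complete "job/$JOB_NAME" -n "$NAMESPACE" --timeout=300s || return 1
  # Print the output of the Job's Pod.
  oc logs "job/$JOB_NAME" -n "$NAMESPACE" || return 1
  # Delete the finished Job so that the name can be reused for the next run.
  oc delete job "$JOB_NAME" -n "$NAMESPACE"
}
```

Note that a failed Job never reaches the Complete condition, so in that case the wait step only returns once the timeout is reached.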

Verifying the Backup

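Let’s start a dummy Pod which can access the PVC to verify that the backup is really there. The Pod has to run in the ocp-etcd-backup namespace, since a PVC can only be mounted by Pods of its own namespace.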

apiVersion: v1
kind: Pod
metadata:
  name: verify-etcd-backup
  namespace: ocp-etcd-backup
spec:
  containers:
  - name: verify-etcd-backup
    image: registry.access.redhat.com/ubi8/ubi
    command: ["sleep", "3000"]
    volumeMounts:
    - name: etcd-backup
      mountPath: /etcd-backup
  volumes:
  - name: etcd-backup
    persistentVolumeClaim:
      claimName: etcd-backup-pvc
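
The Pod definition above still has to be created in the cluster. A minimal sketch, assuming the YAML is saved under the hypothetical file name verify-etcd-backup.yaml (the function name is made up as well):

```shell
# Sketch: create the verification Pod and wait until it is ready.
NAMESPACE=ocp-etcd-backup
POD_MANIFEST=${POD_MANIFEST:-verify-etcd-backup.yaml}   # hypothetical file name

start_verify_pod() {
  # -n places the Pod in the namespace that owns the PVC.
  oc apply -n "$NAMESPACE" -f "$POD_MANIFEST" || return 1
  # Block until the Pod is Ready, so that the following commands can connect to it.
  oc wait --for=condition=Ready pod/verify-etcd-backup -n "$NAMESPACE" --timeout=120s
}
```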

Logging into that Pod will show the available backups stored at /etcd-backup, which is the mounted PVC.

oc rsh -n ocp-etcd-backup verify-etcd-backup ls -la /etcd-backup
total 1406196
drwxr-xr-x. 3 root root      4096 Nov 29 17:00 .
dr-xr-xr-x. 1 root root        25 Nov 29 17:06 ..
drwx------. 2 root root     16384 Nov 29 15:21 lost+found
-rw-------. 1 root root 179896352 Nov 29 15:21 snapshot_2021-11-29_152150.db
-rw-------. 1 root root 179896352 Nov 29 15:29 snapshot_2021-11-29_152949.db
-rw-------. 1 root root 179896352 Nov 29 15:32 snapshot_2021-11-29_153159.db
-rw-------. 1 root root 179896352 Nov 29 15:36 snapshot_2021-11-29_153618.db
-rw-------. 1 root root 179896352 Nov 29 15:55 snapshot_2021-11-29_155513.db
-rw-------. 1 root root 179896352 Nov 29 16:00 snapshot_2021-11-29_160020.db
-rw-------. 1 root root 179896352 Nov 29 16:55 snapshot_2021-11-29_165521.db
-rw-------. 1 root root 179896352 Nov 29 17:00 snapshot_2021-11-29_170020.db
-rw-------. 1 root root     89875 Nov 29 15:21 static_kuberesources_2021-11-29_152150.tar.gz
-rw-------. 1 root root     89875 Nov 29 15:29 static_kuberesources_2021-11-29_152949.tar.gz
-rw-------. 1 root root     89875 Nov 29 15:32 static_kuberesources_2021-11-29_153159.tar.gz
-rw-------. 1 root root     89875 Nov 29 15:36 static_kuberesources_2021-11-29_153618.tar.gz
-rw-------. 1 root root     89875 Nov 29 15:55 static_kuberesources_2021-11-29_155513.tar.gz
-rw-------. 1 root root     89875 Nov 29 16:00 static_kuberesources_2021-11-29_160020.tar.gz
-rw-------. 1 root root     89875 Nov 29 16:55 static_kuberesources_2021-11-29_165521.tar.gz
-rw-------. 1 root root     89875 Nov 29 17:00 static_kuberesources_2021-11-29_170020.tar.gz
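
In the listing above, every snapshot_<timestamp>.db has a static_kuberesources_<timestamp>.tar.gz with the same timestamp. This pairing can be checked automatically. The following is a sketch only: the function name is made up, it reads plain file names from standard input, and it checks that the files exist, not that a restore would succeed.

```shell
# Sketch: report snapshots that have no static_kuberesources archive
# with the same timestamp. Reads one file name per line from stdin.
check_backup_pairs() {
  local names name ts missing=0
  names=$(cat)
  for name in $names; do
    case "$name" in
      snapshot_*.db)
        # Extract the timestamp between "snapshot_" and ".db".
        ts=${name#snapshot_}
        ts=${ts%.db}
        if ! printf '%s\n' "$names" | grep -qxF "static_kuberesources_${ts}.tar.gz"; then
          echo "missing archive for snapshot $ts"
          missing=1
        fi
        ;;
    esac
  done
  # Succeed only if every snapshot had a matching archive.
  [ "$missing" -eq 0 ]
}
```

It could be fed with the file names from the verification Pod, for example with oc exec -n ocp-etcd-backup verify-etcd-backup -- ls /etcd-backup | check_backup_pairs. Once the check is done, the Pod can be removed again with oc delete pod verify-etcd-backup -n ocp-etcd-backup.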