Fault Injection

- By: Thomas Jungbauer ( Lastmod: 2021-08-14 ) - 3 min read

Tutorial 8 of OpenShift 4 and Service Mesh covers Fault Injection, a chaos-testing technique that verifies how your application behaves when failures occur. This is done by adding an HTTPFaultInjection (fault) section to the VirtualService. The possible settings for this section include delay, to delay the access, and abort, to completely abort the connection.
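
Both settings live under the fault section of an http route. The snippet below is only an illustrative sketch of the two options (they can even be combined); the complete, working examples follow later in this tutorial:

http:
- fault:
    delay:
      fixedDelay: 5s     # hold the request back for 5 seconds...
      percent: 50        # ...for half of the traffic
    abort:
      httpStatus: 503    # answer with HTTP 503...
      percent: 50        # ...for half of the traffic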

"Adopting microservices often means more dependencies, and more services you might not control. It also means more requests on the network, increasing the possibility for errors. For these reasons, it’s important to test your services’ behavior when upstream dependencies fail." [1]

Preparation

Before we start this tutorial, we need to clean up our cluster. This is especially important if you completed the previous tutorial, Limit Egress/External Traffic.

oc delete deployment recommendation-v3
oc scale deployment recommendation-v2 --replicas=1
oc delete serviceentry worldclockapi-egress-rule
oc delete virtualservice worldclockapi-timeout

Verify that 2 pods for the recommendation service are running (with 2 containers each):

oc get pods -l app=recommendation -n tutorial
NAME                                 READY   STATUS    RESTARTS   AGE
recommendation-v1-69db8d6c48-h8brv   2/2     Running   0          4d20h
recommendation-v2-6c5b86bbd8-jnk8b   2/2     Running   0          4d19h

Abort Connection with HTTP Error 503

For the first example, we need to modify the VirtualService and the DestinationRule. The VirtualService must be extended with an http fault section, which will abort the traffic 50% of the time.

  1. Create the VirtualService

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: recommendation
    spec:
      hosts:
      - recommendation
      http:
      - fault:
          abort:
            httpStatus: 503
            percent: 50
        route:
        - destination:
            host: recommendation
            subset: app-recommendation

    Apply the change

    oc replace -f VirtualService-abort.yaml
    The existing VirtualService named recommendation will be overwritten.
  2. Create the DestinationRule

    apiVersion: networking.istio.io/v1alpha3
    kind: DestinationRule
    metadata:
      name: recommendation
    spec:
      host: recommendation
      subsets:
      - labels:
          app: recommendation
        name: app-recommendation

    Apply the change

    oc replace -f destinationrule-faultinj.yaml
    The existing DestinationRule named recommendation will be overwritten.
  3. Check the traffic and verify that 50% of the connections will end with a 503 error (an alternative check without the run.sh helper is sketched below):

    export GATEWAY_URL=$(oc get route customer -n tutorial -o 'jsonpath={.spec.host}')
    sh ~/run.sh 1000 $GATEWAY_URL
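
    If the run.sh helper is not at hand, a simple loop like this sketch counts the returned HTTP status codes instead (plain curl and the $GATEWAY_URL variable from above are assumed):

    for i in $(seq 1 100); do
      # print only the HTTP status code of each request
      curl -s -o /dev/null -w "%{http_code}\n" http://$GATEWAY_URL
    done | sort | uniq -c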

Clean Up

oc delete virtualservice recommendation

Test slow connection with Delay

In my opinion, a slow connection is even more interesting to test. It can be simulated by adding the fixedDelay property to the VirtualService. The example below again uses a VirtualService, but this time with delay instead of abort: fixedDelay defines a delay of 7 seconds for 50% of the traffic.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation
spec:
  hosts:
  - recommendation
  http:
  - fault:
      delay:
        fixedDelay: 7.000s
        percent: 50
    route:
    - destination:
        host: recommendation
        subset: app-recommendation

If you now send traffic into the application, you will see that some answers have a delay of 7 seconds. Keep sending traffic in a loop, for example with the sketch below.
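
The following loop is a minimal sketch that prints the total response time of each request, so the injected delay becomes visible on the command line (it assumes the route host is still stored in $GATEWAY_URL):

while true; do
  # print how long each request took; roughly half should take about 7 seconds
  curl -s -o /dev/null -w "%{time_total}s\n" http://$GATEWAY_URL
done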

This becomes even more visible when you go to "Distributed Tracing" in the Kiali UI, select the service recommendation and a small lookback of maybe 5 minutes. You will find that some requests are very fast, while others take about 7 seconds.

Kiali delayed traffic
Figure 1. Jaeger with delayed traffic.

Retry on errors

If a microservice answers with an error, Service Mesh/Istio will automatically try to reach another pod providing the service. These retries can be modified, as sketched below. To make everything visible, we will use Kiali to monitor the traffic.
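
Retries are configured per route in the VirtualService. The snippet below is only an illustrative sketch of the Istio retries field, not one of this tutorial's steps; the attempts, per-try timeout and retry condition are example values:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation
spec:
  hosts:
  - recommendation
  http:
  - retries:
      attempts: 3          # allow up to 3 retries for a failed request
      perTryTimeout: 2s    # timeout for each individual attempt
      retryOn: 5xx         # retry on HTTP 5xx responses
    route:
    - destination:
        host: recommendation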

  1. We start by sending traffic into the application. This should be split evenly between v1 and v2 of the recommendation microservice:

    sh ~/run.sh 1000 $GATEWAY_URL
    # 8329: customer => preference => recommendation v1 from 'f11b097f1dd0': 11145
    # 8330: customer => preference => recommendation v2 from '3cbba7a9cde5': 9712
    # 8331: customer => preference => recommendation v1 from 'f11b097f1dd0': 11146
    # 8332: customer => preference => recommendation v2 from '3cbba7a9cde5': 9713
    # 8333: customer => preference => recommendation v1 from 'f11b097f1dd0': 11147

    In Kiali this is visible in the Graph, using the settings "Versioned app graph" and "Requests percentage".

    Kiali retry traffic split 50
    Figure 2. Traffic is split by 50% between recommendation v1 and v2.
  2. As a second step, we need to enable the nasty mode of the v2 microservice. This will simulate an outage, responding with error 503 all the time. The change must be done inside the container:

    oc exec -it $(oc get pods|grep recommendation-v2|awk '{ print $1 }'|head -1) -c recommendation -- /bin/bash
  3. Inside the container, run the following command and then exit the container again:

    curl localhost:8080/misbehave

    Kiali will now show that v1 receives 100% of the traffic, while v2 is shown in red. When you select the red square of v2 and move the mouse over the red cross of the failing application, you will see that the pod itself is ready, but that 100% of the traffic is currently failing.

    Kiali retry traffic retry
    Figure 3. Traffic for v2 is failing
  4. Revert the change and fix the v2 service:

    oc exec -it $(oc get pods|grep recommendation-v2|awk '{ print $1 }'|head -1) -c recommendation -- /bin/bash
    curl localhost:8080/behave

    Verify in Kiali that everything is "green" again and that the traffic is split by 50% between v1 and v2.