Chaos Testing with Istio

Written by Consultant James Mak, at Airwalk Reply.

Even though Kubernetes (K8s) and Istio offer many useful features, it is always better to prepare for the worst before it arrives. This is where Chaos testing comes in: we explore what happens when different components in our system break. This is carried out in a controlled environment, where we devise ways to break the system, for example by reducing infrastructure capacity, creating high load on compute resources, causing network outages, or triggering application failures. Any outage scenario you can think of, common or uncommon, can be included in your Destroyer plan.

On the other hand, we also need a Savior repair strategy to restore things once Doomsday occurs. We need to experiment with this plan and assess whether it returns our configuration to the stable state we want. In this way we build confidence that the service mesh can tolerate failing nodes and prevent localised failures from cascading to other nodes.

It is becoming popular for enterprise IT teams to hold a Game Day so that their engineers can rehearse such situations.

Technically speaking, Envoy, an open source lightweight proxy, is the building block of Istio. Envoy runs as a sidecar alongside the application container in the Kubernetes workload pod and acts as a gateway between the workload and the rest of the service mesh. It intercepts all inbound and outbound traffic to and from the app workload, so we can use its versatile routing features to manipulate that traffic.

In the following, I will focus on using Istio to carry out Chaos testing by introducing network delays and HTTP error responses to emulate network issues in microservice-based applications.

Prerequisites

  1. Basic knowledge of Istio



The client request first reaches the Istio Ingress Gateway, which matches it against the Virtual Service and Destination Rule (if any). Based on the routing configuration, the request is then dispatched to the Backend.
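The Virtual Service examples later in this article bind to a gateway named ingress-gateway. As a rough sketch only, a Gateway resource with that name might look like the following. The selector label, the port, and the wildcard host are assumptions made for illustration and are not taken from the original setup; adjust them to your own installation.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: ingress-gateway        # name referenced under "gateways" in the Virtual Service
spec:
  selector:
    istio: ingressgateway      # assumption: label of the ingress gateway pods in your cluster
  servers:
  - port:
      number: 80               # assumption: plain HTTP on port 80
      name: http
      protocol: HTTP
    hosts:
    - "*"                      # assumption: accept any host; narrow this in a real setup
```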

Istio provides two kinds of HTTP fault injection at the Virtual Service level:

  1. HTTP delay fault
  2. HTTP abort fault

We can use the HTTP delay fault to introduce network latency when the request reaches the Ingress Gateway. The Envoy proxy response flag will be set to DI, indicating that request processing was delayed for the period specified via fault injection. For more granular control, you can specify what percentage of traffic to delay. The following example YAML creates a virtual service that injects a five-second delay into ALL matched virtual service traffic.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: test-vs
spec:
  hosts:
  - backend
  http:
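  - fault:
      delay:
        # Share of matched requests to delay (100 = all of them)
        percentage:
          value: 100
        # How long each affected request is held before being forwarded
        fixedDelay: 5s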
    route:
    - destination:
        host: backend
  gateways:
  - ingress-gateway
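To delay only part of the traffic, the percentage value can be lowered. The variant below is a sketch that reuses the names from the example above; the figure of 10 percent is an arbitrary value chosen for illustration.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: test-vs
spec:
  hosts:
  - backend
  gateways:
  - ingress-gateway
  http:
  - fault:
      delay:
        percentage:
          value: 10          # illustrative: delay roughly 1 in 10 matched requests
        fixedDelay: 5s
    route:
    - destination:
        host: backend
```

The remaining requests pass through without any injected delay, which makes it easier to observe how clients cope with intermittent slowness rather than a total slowdown.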

Next comes the HTTP abort fault. In the following example YAML, the HTTP response code “500 Internal Server Error” is returned to the client for matched traffic. The Envoy proxy response flag will be set to FI, indicating that the request was aborted with the specified response code.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: test-vs
spec:
  hosts:
  - backend
  http:
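  - fault:
      abort:
        # HTTP status code returned to the client instead of calling the backend
        httpStatus: 500
        # Share of matched requests to abort (100 = all of them)
        percentage:
          value: 100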
    route:
    - destination:
        host: backend
  gateways:
  - ingress-gateway
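Delay and abort faults can also be declared together in the same fault block, in which case Istio treats them as independent of one another. The following is a sketch under that assumption, reusing the names from the examples above; the percentages are arbitrary illustrative values.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: test-vs
spec:
  hosts:
  - backend
  gateways:
  - ingress-gateway
  http:
  - fault:
      delay:
        percentage:
          value: 50          # illustrative: delay about half of the matched requests
        fixedDelay: 5s
      abort:
        percentage:
          value: 10          # illustrative: abort about 1 in 10 matched requests
        httpStatus: 500
    route:
    - destination:
        host: backend
```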

You can use an Istio Virtual Service to do Chaos testing at the application layer transparently, by injecting delays or HTTP errors into your services without actually updating your app code. Testing the system in distress to ensure its resilience is extremely important for modern microservice applications with little tolerance for downtime.

For a more orchestrated Chaos Engineering platform, Chaos Mesh is one option. It not only does Network Chaos, but can also carry out Pod Chaos, DNS Chaos, IO Chaos, etc., and visualises the operation.
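For comparison, a Network Chaos experiment in Chaos Mesh is also declared as a Kubernetes resource. The sketch below assumes Chaos Mesh is installed in the cluster; the resource name, the namespace, the app: backend label, and the duration are hypothetical values chosen for illustration.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: backend-delay        # hypothetical experiment name
spec:
  action: delay              # inject latency into pod network traffic
  mode: all                  # apply to every pod matched by the selector
  selector:
    namespaces:
    - default                # assumption: the workload runs in "default"
    labelSelectors:
      app: backend           # assumption: the backend pods carry this label
  delay:
    latency: '5s'
  duration: '60s'            # illustrative: stop the experiment after one minute
```

Unlike the Virtual Service faults above, this kind of experiment acts on the pods' network traffic rather than on individual HTTP requests.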