June 23, 2022

Chaos Testing an Application on AWS

Sindhuja Durai, Vice President, Consumer Engineering

Well-maintained applications usually have generous test coverage, spread across unit, integration and performance test suites. Despite this coverage, incidents in production can still arise from unfavorable operational scenarios, ranging from infrastructure and network faults to unexpected traffic patterns. Circumstances like underlying database failures, domain name resolution failures or compute node failures have resulted in high Mean Time To Detect (MTTD) and Mean Time To Recover (MTTR), thereby breaching the application’s Service Level Objectives (SLOs). These incidents have highlighted gaps in the existing test patterns and in the timing and verbiage of the monitoring, as well as engineers’ unfamiliarity with system behavior under various operational faults. The nature of these failures is such that they cannot be covered by the unit, integration or performance test suites. Resilience tests are used to assert the application’s ability to keep to its promised SLOs in the face of incidents. This allows developers to reason about application behavior in undesirable operational scenarios in a data-driven manner, assess operational impact, evaluate mitigation strategies and execute improvements in a controlled environment.

While the resiliency test suite helps identify opportunities to improve the overall MTTD/MTTR of the application, its manual execution tends to be a time-consuming, one-off, infrequent exercise that requires considerable planning effort. This leads to application observability and incident runbooks being frequently outdated and makes scaling the tests across an entire application cumbersome, especially since applications evolve faster than the tests are executed. Performing the execution through standard Continuous Integration and Continuous Delivery (CI/CD) pipelines enables engineers to continuously assess the resiliency of the system throughout the Software Development Life Cycle (SDLC), thus allowing resiliency gaps to be detected in a timely fashion. Executing resilience tests as part of the pipeline ensures that they can be scheduled frequently with minimal effort. Additionally, application developers can use them as a control gate to flag resiliency weaknesses introduced as part of a deployment. For this exercise, we leveraged ChaosToolkit, an open source chaos engineering tool, to orchestrate the test execution, along with GitLab as the CI/CD platform. While the toolkit itself comes with built-in support for fault injection into multiple cloud provider offerings, we have also authored additional plugins for custom faults.
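To make the orchestration concrete, here is a minimal sketch of what a ChaosToolkit experiment for a compute node failure might look like, built as a Python dictionary and serialized to JSON. It is an illustration under stated assumptions, not the experiment used in this exercise: the cluster name, service name and health URL are hypothetical, and the `stop_task` action is assumed to come from the open source `chaosaws` extension (check its signature against the version you have installed).

```python
import json


def build_experiment(cluster: str, service: str, health_url: str) -> dict:
    """Build a ChaosToolkit-style experiment that stops one ECS task.

    All argument values are supplied by the caller; nothing here is taken
    from the application described in the article.
    """
    return {
        "version": "1.0.0",
        "title": "Service stays available when a compute node fails",
        "description": "Stop one ECS task and check the health endpoint.",
        # Checked before and after the method runs.
        "steady-state-hypothesis": {
            "title": "Health endpoint returns HTTP 200",
            "probes": [
                {
                    "type": "probe",
                    "name": "health-endpoint-responds",
                    "tolerance": 200,
                    "provider": {"type": "http", "url": health_url, "timeout": 3},
                }
            ],
        },
        # The fault injection itself.
        "method": [
            {
                "type": "action",
                "name": "stop-one-ecs-task",
                "provider": {
                    "type": "python",
                    # Assumption: action provided by the chaosaws extension.
                    "module": "chaosaws.ecs.actions",
                    "func": "stop_task",
                    "arguments": {"cluster": cluster, "service": service},
                },
                # Give the scheduler time to replace the task.
                "pauses": {"after": 30},
            }
        ],
        "rollbacks": [],
    }


def write_experiment(experiment: dict, path: str) -> None:
    """Serialize the experiment so the pipeline can pass it to the CLI."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(experiment, handle, indent=2)
```

A pipeline job could write the file with `write_experiment(build_experiment("sandbox-cluster", "orders-service", "https://sandbox.example.internal/health"), "experiment.json")` and then invoke `chaos run experiment.json`; all three argument values are placeholders.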

Resilience Test Suite Execution

The System Under Test (SUT) in this case is an architecture on AWS that consists of 7 microservices spread across 5 AWS accounts and multiple Virtual Private Clouds (VPCs). The architecture comprises Fargate clusters, Aurora PostgreSQL databases, an Elasticsearch cluster, Lambdas and Network Load Balancers, along with connectivity to external dependencies through VPC Endpoints. The application has promised Service Level Objectives (SLOs) of 99.99% availability and a p95 latency of 50ms.
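As an illustration of how the two SLOs above could be asserted after a fault injection, the sketch below evaluates a list of request samples against an availability target and a p95 latency target. The sample format, the function names and the nearest-rank percentile method are assumptions made for this example; a real setup would more likely read these figures from its monitoring system.

```python
import math
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float
    p95_latency_ms: float
    meets_slo: bool


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate_slo(samples, min_availability=0.9999, max_p95_ms=50.0):
    """Evaluate (latency_ms, succeeded) samples against the SLO targets.

    The defaults mirror the targets quoted in the text: 99.99% availability
    and a p95 latency of 50 ms.
    """
    if not samples:
        raise ValueError("evaluate_slo() needs at least one sample")
    successes = sum(1 for _, succeeded in samples if succeeded)
    availability = successes / len(samples)
    p95 = percentile([latency for latency, _ in samples], 95)
    meets = availability >= min_availability and p95 <= max_p95_ms
    return SloReport(availability, p95, meets)
```

For example, 100 samples containing a single failed request give an availability of 0.99, which is below the 99.99% target, so `meets_slo` is `False` even if every request was fast.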

A dedicated sandbox environment, provisioned as an AWS VPC, is set up for the resilience test execution. This environment is set up and torn down via Infrastructure as Code (IaC) as part of the pipeline. The resilience test execution is invoked from the pipeline and uses a short-lived access token to inject chaos into the application. The test execution environment is ephemeral, which ensures that it has elevated privileges over the SUT only for the duration of the test execution and thereby reduces the risk to the application. Failure of a test results in the failure of the pipeline. Once the experiments are executed, the pipeline uploads the results to the test result repository. This also serves as an audit trail of all fault injections.
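The rule that a failed test fails the pipeline can be sketched as a small gate that reads the journal written by a ChaosToolkit run and turns it into a process exit code. This is a hedged example rather than the pipeline's actual implementation: it assumes the journal is a JSON file with a top-level `status` string and a `deviated` boolean, which should be verified against the ChaosToolkit version in use, and the function names are hypothetical.

```python
import json


def gate_exit_code(journal: dict) -> int:
    """Return 0 when the experiment passed, 1 otherwise.

    Assumption: the journal carries a "status" field that equals
    "completed" on success and a "deviated" flag that is true when the
    steady-state hypothesis no longer held after the fault.
    """
    status = journal.get("status")
    deviated = bool(journal.get("deviated", False))
    return 0 if status == "completed" and not deviated else 1


def gate_from_file(path: str) -> int:
    """Load a journal file and compute the exit code for the CI job."""
    with open(path, encoding="utf-8") as handle:
        return gate_exit_code(json.load(handle))
```

A CI job could end with `sys.exit(gate_from_file("journal.json"))`, so that any non-zero result fails the job, and then archive the same journal file as the audit record of the fault injection.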

GitLab Pipeline diagram illustrating the flow of Chaos testing experiments. Described in detail in the text of the article.

The chaos simulated in the system can vary in blast radius and is categorized as follows.

  • Infrastructure Faults: Compute node failures are simulated by bringing down one or more Elastic Container Service (ECS) task instances. Underlying database failures are simulated by bringing down or failing over the database cluster. The impact of an infrastructure fault can range from a small blast radius (a single compute node failure) to a much wider blast radius (the failure of an entire availability zone or region).
  • Network Faults: The application’s VPC endpoint connectivity is tampered with to affect egress to a dependency. The system behavior can help identify if the application degrades gracefully and if there are any unintended dependencies that result in complete application failure. Network faults can be used to simulate connectivity failures to internal or external dependencies.
  • User Traffic Patterns: Different user input patterns, such as increasing concurrent users, spiky traffic, retry storms and traffic with varying payload sizes, can be simulated. k6 is integrated with ChaosToolkit and can be leveraged to simulate these patterns. This helps verify whether the application handles unexpected traffic patterns gracefully.
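The notion of blast radius in the infrastructure fault category above can be illustrated with a short sketch that chooses which compute tasks an experiment should stop. The `Task` type, the radius names `"single"` and `"zone"`, and the zone identifiers are hypothetical and exist only for this example; the actual fault injection would be carried out by the chaos tooling.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    availability_zone: str


def select_targets(tasks, blast_radius, zone=None, rng=None):
    """Pick the tasks to stop for a given blast radius.

    "single" picks one random task (smallest radius); "zone" picks every
    task in the named availability zone (a much wider radius).
    """
    rng = rng or random.Random()
    if blast_radius == "single":
        return [rng.choice(tasks)] if tasks else []
    if blast_radius == "zone":
        if zone is None:
            raise ValueError("a zone is required for the 'zone' blast radius")
        return [task for task in tasks if task.availability_zone == zone]
    raise ValueError(f"unknown blast radius: {blast_radius!r}")
```

Passing a seeded `random.Random` makes the selection reproducible between pipeline runs, which is useful when comparing results before and after a fix.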

Game Days

Game days are dedicated days that involve getting a team together in a room, either virtually or in person. A scenario is chosen from a suite of pre-identified resilience tests and executed without disclosing the specific nature of the fault to the team. The application is observed to see if the system behavior is as expected and, when it is not, engineers debug through the scenario to restore the application to its expected performance within the promised SLOs. If the application is found not to be resilient to specific operational outages, these gaps can be fixed before the weaknesses impact customers. Runbooks are also updated to keep up with the application, and any process gaps in incident handling are highlighted. Leveraging the resiliency test suite setup, game days were conducted for the application on AWS. While the resiliency suite execution through the CI/CD pipeline allows for continuous resiliency certification of the application, game days help engineers become better equipped to handle similar incidents in production environments. By conducting frequent game days, they become increasingly familiar with the application ecosystem, including its dependencies and resilience gaps, as well as the steps to triage issues.

Resilience Test Findings

The findings from the resilience tests and game days fall into the following categories.

  1. Architecture Improvements: Simulating the failure of an external dependency identified an underlying risk that would silently allow malicious traffic into the system. This prompted a review of the integration pattern between the application and the dependency to ensure the application is exposed only as much as necessary. Infrastructure faults can help identify single points of failure in the application and show where redundancy can be introduced to remove them. For example, simulating database failures can highlight the need for a caching layer in the application to improve availability and latency.
  2. Observability, Monitoring & Capacity Enhancement: Simulating the compute node failure identified long health check probe intervals (30 seconds), leading to delayed MTTD and a high error rate (0.14%). By reducing the probe interval, the error rate was reduced to 0.07%. Simulating connectivity issues from a single compute instance in the cluster resulted in an increased error rate of 10%, highlighting the need for robust and frequent health checks by the load balancer. Simulating external dependency failures highlighted the lack of robust health checks and therefore the increased MTTD of the failure. By simulating varying high load patterns on the application using k6, it was identified that the scale-up configuration was not robust; improving it led to a 50% improvement in cluster capacity scale-up time. Monitoring should be set up at the appropriate levels of sensitivity, such that routine fluctuations in traffic throughout the day do not result in alert noise. As part of game days, with engineers handling issues, the monitoring and observability gaps of the application are easily identified. The application should have the right observability tooling to help triage issues like increased latency or failures and debug outages, thereby reducing the overall MTTR. Alarm thresholds should be neither so sensitive that they generate noise nor so lax that they delay detection, in order to improve MTTD.
  3. Detection of Outdated Runbook & Documentation: The external dependency fault injection highlighted that the escalation methods and contacts for an external dependency should be well documented to reduce the MTTR of the system. The steps followed to investigate, mitigate or remediate an issue have to be updated in the runbooks. Outdated runbooks significantly increase the MTTR of the application.
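The health check finding in item 2 above can be reasoned about with simple arithmetic. The sketch below is a simplified model, not a reproduction of the measured results: it assumes a failed node keeps receiving its share of traffic until a fixed number of consecutive probes have failed, and that every request sent to it in that window errors. The unhealthy threshold, traffic share and observation period are hypothetical parameters.

```python
def detection_window_s(interval_s: float, unhealthy_threshold: int) -> float:
    """Approximate time before a failed node is marked unhealthy.

    Assumption: the node is removed after `unhealthy_threshold`
    consecutive failed probes, one probe every `interval_s` seconds.
    """
    return interval_s * unhealthy_threshold


def expected_error_rate(
    interval_s: float,
    unhealthy_threshold: int,
    failed_share: float,
    observation_s: float,
) -> float:
    """Fraction of requests that fail during the observation period.

    `failed_share` is the fraction of traffic routed to the failed node
    while it is still considered healthy.
    """
    window = min(detection_window_s(interval_s, unhealthy_threshold), observation_s)
    return failed_share * window / observation_s


# Hypothetical inputs: 2 failed probes to evict, the failed node serves a
# quarter of the traffic, errors are measured over one hour.
rate_30s = expected_error_rate(30, 2, 0.25, 3600)
rate_15s = expected_error_rate(15, 2, 0.25, 3600)
```

In this model the error rate scales linearly with the probe interval, so `rate_15s` is exactly half of `rate_30s`. That is in line with the direction of the reported drop from 0.14% to 0.07%, although the article does not state the new interval that was chosen.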

Summary

Chaos testing helped identify resilience gaps, measure and make improvements iteratively, improve the overall stability of the plant and make the test execution repeatable with minimal effort. Using chaos tests, the application’s ability to fulfill its SLOs was evaluated. Chaos tests also helped to identify unintended startup dependencies and single points of failure in the application. With periodic, automated, up-to-date resilience tests executed as part of game days, engineers have the training to handle incidents when faced with unexpected outages.
