Chaos Engineering: A Comprehensive Guide


August 28, 2019


Before defining the term ‘Chaos Engineering,’ it helps to understand the importance of software testing and QA in the software development cycle.

The following quote captures why designing a good set of software test cases is so demanding.

“No amount of testing can prove software right; a single test can prove software wrong.”— Amir Ghahrai

That’s where intelligent test cases and concepts come into the picture.

In this era of incremental releases and agile development, continuous examination of software is imperative to offer users a seamless, faultless, and consistent experience. Tech organizations are therefore adopting modern, automated ways to test software thoroughly.

With modern tech infrastructure and distributed systems taken into account, testing becomes trickier, because a system is harder to examine from every standpoint. Organizations are therefore developing smarter ways to test their systems, from unit testing through functional testing to infrastructure testing.

One such concept, introduced by Netflix, is ‘Chaos Engineering,’ and this post is about it.

Chaos Engineering: Infrastructure Testing The Netflix Way

Yes, you heard it right.

The concept of chaos engineering was introduced by Netflix, one of the largest media subscription services, which has around 150 million paid subscriptions worldwide.

Before we look at the concept, here is a brief explanation of the terms used in this post.

In the modern application life cycle, tech companies around the globe typically use four environments to develop software:

  • Development Environment: The program is developed and coded on the development system
  • Test Environment: The product is copied to a test environment that mirrors the target environment and is tested carefully there
  • Acceptance Test Environment: The client tests the system and verifies whether it meets expectations
  • Production Environment: Once the customer accepts the system, it is copied to the production (live) environment

Now, understanding the concept of chaos engineering would be easy.

What Is Chaos Engineering?

In the simplest terms, chaos engineering is a disciplined approach to identifying vulnerabilities in systems by experimenting on them in the production environment.

It is practiced to check a system’s reliability, stability, and ability to survive unstable and unexpected conditions.

Large-scale distributed systems can fail in many ways, including application failure, network failure, infrastructure failure, dependency failure, and so on.

Moreover, systems are now built from small components (microservices) and deployed on cloud architectures, which adds more potential points of failure and outages.

Why Is There A Need For Chaos Engineering? Is It Different From Testing?

Here are some points that justify the use of chaos engineering:

  • It improves the resilience of the system
  • You will get to know the weaknesses of the system
  • It is proactive in nature, in contrast to the reactive nature of traditional testing
  • It exposes hidden threats and minimizes the risks

Difference Between Chaos Engineering And Testing

Testing starts from a fundamental idea: it uses sets of known inputs and predicted outputs to confirm the desired system behavior. Its scope is limited, because it does not generate new knowledge about how the system behaves when something unexpected goes wrong.

Chaos engineering, in contrast, runs broad, carefully controlled experiments whose outcomes are not known in advance, which generates new knowledge about the system’s behavior, properties, and performance. It has a wider scope and uses combinations of conditions that ordinary test cases do not cover, so the system can be observed closely under many scenarios.

The space of possible chaos experiments is vast, which creates more opportunities to examine the system from every point of view. You can intentionally create chaos to check whether a system can withstand it.

This is more than a discipline or principle.

Who Introduced Chaos Engineering?

It all started when large-scale distributed systems became the talk of the town. It was difficult to test the resilience of a system in a distributed environment. Here, resilience means not only the system’s ability to withstand failures but also ensuring the highest possible quality of the system.

A Brief History Of Chaos Engineering:

Netflix began moving from its own physical data centers to the cloud in 2008 to provide users with a better video streaming experience. During that migration, the Netflix Eng Tools team came up with an innovative idea to test the fault tolerance of the system without any impact on customer service, and Netflix described the approach publicly on its technology blog in 2011.

They created the ‘Chaos Monkey’ tool, which is inspired by the idea of a wild monkey let loose in a data center, randomly shutting down instances.

According to the Netflix Technology Blog, Chaos Monkey is defined as follows:

Chaos Monkey is a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact.

Compared with a physical data center, Netflix’s cloud environment was expected to see more breakdowns, since it is abstracted and distributed in nature.

The reason for running Chaos Monkey against Netflix’s own systems is simple:

The cloud is all about redundancy and fault tolerance. Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system. In effect, we have to be stronger than our weakest link. [Source]
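The idea in the two quotes above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Netflix’s implementation: the `Service` class, the `disable_random_instance` function, and the service names are hypothetical, and "disabling" an instance just flips a flag in memory. It shows why redundancy matters: a service with several instances survives the loss of one, while a single-instance service does not.

```python
import random


class Service:
    """A toy service backed by redundant instances (hypothetical model)."""

    def __init__(self, name, instance_count):
        self.name = name
        # Map instance id -> True while the instance is running.
        self.instances = {f"{name}-{i}": True for i in range(instance_count)}

    def running_instances(self):
        return [iid for iid, up in self.instances.items() if up]

    def is_available(self):
        # The service stays available while at least one instance is running.
        return len(self.running_instances()) > 0


def disable_random_instance(service, rng):
    """Disable one randomly chosen running instance; return its id, or None."""
    candidates = service.running_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    service.instances[victim] = False
    return victim


rng = random.Random(42)  # fixed seed so the run is repeatable
redundant = Service("api", instance_count=3)
single = Service("billing", instance_count=1)

for svc in (redundant, single):
    victim = disable_random_instance(svc, rng)
    status = "still available" if svc.is_available() else "OUTAGE"
    print(f"{svc.name}: disabled {victim}, service is {status}")
```

In this toy model the three-instance service keeps serving after losing one instance, whereas the single-instance service goes down, which is exactly the kind of weak link the quote warns about.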

After the success of ‘Chaos Monkey,’ the Netflix team created a suite of tools that supports chaos engineering principles, named the ‘Simian Army,’ to check the reliability and resiliency of its services running on AWS infrastructure.

List Of Tools Developed By Netflix:
  • Chaos Monkey
  • Latency Monkey
  • Doctor Monkey
  • Conformity Monkey
  • Janitor Monkey
  • Security Monkey
  • Chaos Gorilla
  • 10–18 Monkey

Together, these tools constantly test the system against many kinds of failures, building a higher level of confidence in the system’s ability to survive.

How Can You Practice Chaos Engineering In Your System?

You might say, “We are not Netflix, and we don’t have a large-scale system or a huge customer base like Netflix does.”

That’s true. But over time, chaos engineering has evolved and is no longer limited to one organization or to digital-native companies like Netflix. Many companies with large customer bases are dedicated to offering a seamless experience to their users. To ensure consistent performance and constant availability, healthcare, education, and finance organizations are also running chaos experiments.

Four Basic Steps To Perform Chaos Engineering:

A chaos experiment in a distributed system uses two groups so that the effect of a fault can be isolated: the experimental group, which is subjected to the injected failure, and the control group, which is left untouched and serves as the baseline for comparison.

  • Define a ‘steady state’ that represents the normal behavior of the system
  • Hypothesize the expected outcome when something goes wrong, typically that the steady state will continue in both groups
  • Design experiments with variables that reflect real-world events such as dependency failure, server failure, network or memory malfunction, and so on
  • Measure the impact of the experiment by comparing the steady state of the two groups

If the steady state of the experimental group deviates from that of the control group, the experiment has uncovered a weakness, and that counts as a successful chaos experiment. If it does not, the team gains confidence in the system and can widen the scope of its hypotheses.

When weaknesses are found, the team should address and fix them before they turn into system-wide trouble.
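The four steps can be sketched as a tiny simulated experiment. Everything here is an assumption made for illustration: the request handler is simulated, the steady-state metric is the request success rate, the injected fault is "a dependency is down," and the names `handle_request`, `success_rate`, `run_experiment`, and the 2% tolerance are hypothetical.

```python
import random


def handle_request(rng, dependency_up, has_fallback):
    """Return True if one simulated request succeeds."""
    if rng.random() < 0.01:
        return False  # small background error rate, present in both groups
    if dependency_up:
        return True
    # Dependency is down: the request only succeeds if a fallback exists.
    return has_fallback


def success_rate(rng, requests, dependency_up, has_fallback):
    ok = sum(handle_request(rng, dependency_up, has_fallback) for _ in range(requests))
    return ok / requests


def run_experiment(has_fallback, requests=10_000, tolerance=0.02, seed=7):
    rng = random.Random(seed)
    # Step 1: measure the steady state on the untouched control group.
    control = success_rate(rng, requests, dependency_up=True, has_fallback=has_fallback)
    # Step 2: hypothesis: the experimental group stays within `tolerance` of it.
    # Step 3: inject a real-world event (dependency failure) into the experimental group.
    experimental = success_rate(rng, requests, dependency_up=False, has_fallback=has_fallback)
    # Step 4: compare the steady state of both groups.
    holds = abs(control - experimental) <= tolerance
    return {"control": control, "experimental": experimental, "hypothesis_holds": holds}


for fallback in (True, False):
    result = run_experiment(has_fallback=fallback)
    verdict = "steady state held" if result["hypothesis_holds"] else "weakness found"
    print(f"fallback={fallback}: {verdict}")
```

With a fallback in place the two groups look alike and the hypothesis holds. Without one, the experimental group’s success rate collapses, which is the kind of weakness the experiment is designed to surface.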

Note: Because chaos experiments run in production or in an environment close to production, customer experience might be affected. So it is always wise to start with the smallest possible experiments and be ready to contain the impact.
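One common way to keep an experiment small is to expose only a small share of users to it and to stop as soon as errors climb too high. The sketch below assumes exactly that; the function names `in_experiment` and `should_abort`, the 1% exposure, and the 5% error limit are hypothetical values chosen for illustration.

```python
import hashlib


def in_experiment(user_id, percent):
    """Deterministically place roughly `percent`% of users in the experiment."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < percent


def should_abort(errors, requests, max_error_rate):
    """Signal that the experiment should stop once the error rate exceeds the limit."""
    if requests == 0:
        return False
    return errors / requests > max_error_rate


users = [f"user-{i}" for i in range(1000)]
selected = [u for u in users if in_experiment(u, percent=1)]
print(f"{len(selected)} of {len(users)} users are exposed to the experiment")

print(should_abort(errors=3, requests=100, max_error_rate=0.05))   # False
print(should_abort(errors=12, requests=100, max_error_rate=0.05))  # True
```

Hashing the user id keeps the same users in the experiment on every request, so the impact stays confined to a known, small group.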

Ultimate Goal Of Chaos Engineering: Discover the “What-If” Scenario

A distributed system tends to have more failure points because of its complexity and scale.

Chaos engineering tries to discover those failure points and identify what will happen when a resource or component becomes unavailable.

This practice suits modern software development approaches such as DevOps and microservices architecture.

Today, not just Netflix but many large organizations use it to check whether a system can withstand breakdowns, and they then fix the issues that the chaos experiments uncover.

Companies That Use Chaos Tools:
  • Facebook
  • Google
  • Microsoft
  • Amazon
  • Twilio
  • LinkedIn

Chaos Engineering And DevOps: Better Understand Your System Amidst Frequent Releases

DevOps is all about continuous improvement and frequent releases.

Chaos principles are a strong approach to testing a system’s ability to withstand failures in DevOps-driven software development. System architects and testers are often in a hurry to release software, and chaos engineering helps uncover ‘unknown’ conditions in distributed, continuously changing, and complex environments.

Chaos Engineering And Microservices: Understand It, Learn It, Improve It

We have seen drastic changes in software development frameworks and methods in the last few years. Monolithic architectures are increasingly being replaced by cloud and microservice architectures to build software at high velocity.

Here, too, chaos engineering works well, since it can identify dependency failures and failure points at the junctions between services, which are common in microservice-based systems.

Chaos Engineering: More Than A Preventive Mechanism
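To see why dependency failures matter so much in a microservice system, it helps to trace how one failed service affects the others. The sketch below is a minimal illustration: the service names and the `HARD_DEPENDENCIES` graph are invented for the example, and `impacted_by` is a hypothetical helper that walks the graph breadth-first.

```python
from collections import deque

# Hypothetical service graph: each service lists the services it cannot work without.
HARD_DEPENDENCIES = {
    "web": ["catalog", "recommendations"],
    "catalog": ["database"],
    "recommendations": ["catalog"],
    "billing": ["database", "payments"],
    "database": [],
    "payments": [],
}


def impacted_by(failed, dependencies):
    """Return every service that breaks, directly or transitively, when `failed` is down."""
    # Invert the graph: service -> services that depend on it.
    dependents = {name: [] for name in dependencies}
    for service, deps in dependencies.items():
        for dep in deps:
            dependents[dep].append(service)

    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


for failed in ("payments", "database"):
    print(failed, "->", sorted(impacted_by(failed, HARD_DEPENDENCIES)))
# payments -> ['billing']
# database -> ['billing', 'catalog', 'recommendations', 'web']
```

A map like this is only a prediction. A chaos experiment checks whether the real system behaves the way the map says, or whether hidden dependencies make the impact larger.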

Failure is a success if we learn from it.  

This quote captures the idea behind chaos principles. You need to learn from failure to improve your system, make it more resilient, and increase your confidence in its capabilities.

Many tools are available for chaos engineering, and organizations are experimenting with different techniques and tools to make it a more mature and useful approach. By intentionally creating chaos in the system, an organization can achieve long-term software resiliency. Resiliency and quality are important factors in distributed systems with fast release cycles.

Did you find ‘chaos engineering’ interesting? What are your thoughts? If you want to know more about chaos engineering and failure-as-a-service, stay tuned!

Delivering Digital Outcomes To Accelerate Growth
Let’s Talk

SPEC INDIA, as your single stop IT partner has been successfully implementing a bouquet of diverse solutions and services all over the globe, proving its mettle as an ISO 9001:2015 certified IT solutions organization. With efficient project management practices, international standards to comply, flexible engagement models and superior infrastructure, SPEC INDIA is a customer’s delight. Our skilled technical resources are apt at putting thoughts in a perspective by offering value-added reads for all.
