August 28, 2019
Before explaining the term ‘Chaos Engineering,’ it helps to understand the importance of software testing and QA in the software development cycle.
The following quote captures why designing a good set of software test cases is hard.
“No amount of testing can prove software right; a single test can prove software wrong.”— Amir Ghahrai
That’s where intelligent test cases and concepts come into the picture.
In this era of incremental releases and agile development, continuous examination of software is imperative to offer users a seamless and consistent experience with as few faults as possible. Today, tech organizations are adopting modern, automated ways to test software thoroughly.
With modern tech infrastructure and distributed systems, testing becomes trickier, and it is harder to examine a system from every standpoint. Organizations are therefore developing smarter ways to test their systems, from unit testing to functional testing to infrastructure testing.
One such concept, introduced by Netflix, is ‘Chaos Engineering,’ and this post is about it.
Yes, you heard it right.
The concept of chaos engineering was introduced by Netflix, one of the largest media subscription services, which has around 150 million paid subscriptions worldwide.
Before we understand this concept, here is a brief explanation of terms we are going to use in this blog:
In the modern application life cycle, tech companies around the globe use four environments to develop software.
Now, understanding the concept of chaos engineering would be easy.
To define it in simplest terms, chaos engineering is a disciplined approach to identifying vulnerabilities in systems in the production environment.
It is implemented to check the system’s reliability, stability, and capability to survive against all unstable and unexpected conditions.
When we consider large-scale distributed systems, there are numerous chances of failures including application failure, network failure, infrastructure failure, dependency failure, and so on.
Moreover, systems are now built from micro components and deployed on cloud-based architecture, which makes them more prone to points of failure and outages.
Here are some points that justify why chaos engineering is needed:
Traditional testing works from sets of inputs and predicted outputs to confirm the desired system behavior. Its scope is limited because it does not generate new knowledge about how the system will behave when something goes wrong.
In contrast, chaos engineering runs broad, carefully designed experiments whose outcomes are not known in advance, generating new knowledge about the system’s behavior, properties, and performance. It has a wider scope and uses unplanned combinations of conditions to observe the system closely.
Chaos experiments are open-ended, creating more opportunities to test the system from every point of view. You can intentionally create chaos to check whether a system can withstand it.
This is more than a discipline or principle.
It all started when large-scale distributed systems became the talk of the town. It was difficult to test the resilience of a system in a distributed environment. Here, resilience means not only the system’s ability to withstand failures but also its ability to maintain quality while doing so.
In 2011, Netflix was moving from physical infrastructure to the cloud to provide users with a better video streaming experience. The Netflix Eng Tools team came up with an innovative idea to test the fault tolerance of the system without any impact on customer service.
They created the ‘Chaos Monkey’ tool, inspired by the idea of a monkey that enters a farm and randomly destroys things.
According to the Netflix Technology Blog, the definition of Chaos Monkey goes like this:
Chaos Monkey is a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact.
Compared with a physical environment, Netflix’s cloud environment was expected to see more breakdowns, since it is abstract and distributed in nature.
The reason behind running the Chaos Monkey tool in the Netflix system is simple:
The cloud is all about redundancy and fault tolerance. Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system. In effect, we have to be stronger than our weakest link. [Source]
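The idea in the quote above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Netflix's actual Chaos Monkey: the `Instance`, `ServiceCluster`, `make_cluster`, and `chaos_monkey` names are hypothetical, and "disabling" an instance here just flips a flag in memory. It shows that a redundant pool stays available when one instance is removed at random, while a pool of one does not.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """One hypothetical server instance."""
    name: str
    healthy: bool = True


@dataclass
class ServiceCluster:
    """A hypothetical pool of redundant instances behind one service."""
    instances: list

    def healthy_instances(self):
        return [i for i in self.instances if i.healthy]

    def is_available(self):
        # The service stays up while at least one instance can serve traffic.
        return len(self.healthy_instances()) > 0


def make_cluster(size):
    return ServiceCluster([Instance(f"instance-{n}") for n in range(size)])


def chaos_monkey(cluster, rng):
    """Disable one randomly chosen healthy instance; return it, or None."""
    candidates = cluster.healthy_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so the run is repeatable
    cluster = make_cluster(3)
    victim = chaos_monkey(cluster, rng)
    print(f"disabled {victim.name}; service available: {cluster.is_available()}")
```

With three instances the service survives the loss of any one of them; calling `make_cluster(1)` instead exposes the single point of failure right away.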
After the success of Chaos Monkey, the Netflix team created a suite of tools that support chaos engineering principles, named the ‘Simian Army,’ to check the reliability and resiliency of its AWS infrastructure.
All of these chaos tools constantly test the system against many kinds of failure, building a higher level of confidence in the system’s ability to survive.
You might say, “We are not Netflix, and we don’t have a large-scale system or a huge customer base like Netflix.”
That’s true. But over time, chaos engineering has evolved, and it is no longer limited to one organization or to digital-native companies like Netflix. Many companies with huge customer bases are dedicated to offering a seamless experience to their users. To ensure consistent performance and constant availability, healthcare, education, and finance organizations are also implementing chaos experiments.
Chaos experiments in distributed systems use two groups: the experimental group, which is exposed to the injected failure, and the control group, which is left untouched so that the behavior of the two can be compared.
If the engineering team finds a weakness in the system, the chaos experiment is considered successful; otherwise, the team widens the scope of its hypothesis and experiments again.
When weaknesses are found, the team should address and fix them before they turn into system-wide problems.
Note: Because chaos experiments run in the production environment or close to it, customer experience might be affected. It is always wise to start with the smallest possible experiments and be ready to handle the impact carefully.
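A comparison between a control group and an experimental group can be sketched as a small simulation. Everything here is an assumption made for illustration: the function names, the 1% baseline error rate, the 30% injected failure rate, and the 2% tolerance are hypothetical and not taken from any real chaos tool. The experiment reports a weakness when the experimental group's error rate drifts from the control group's by more than the tolerance.

```python
import random


def handle_request(rng, base_error_rate, injected_failure_rate=0.0, has_fallback=True):
    """Simulate one request and return True on success.

    An injected dependency failure is absorbed only when a fallback exists.
    """
    if rng.random() < base_error_rate:
        return False
    if rng.random() < injected_failure_rate:
        return has_fallback
    return True


def error_rate(rng, requests, **kwargs):
    failures = sum(not handle_request(rng, **kwargs) for _ in range(requests))
    return failures / requests


def run_experiment(seed=7, requests=2000, tolerance=0.02, has_fallback=True):
    """Return (control_rate, experimental_rate, weakness_found)."""
    rng = random.Random(seed)
    # Control group: untouched traffic that defines the steady state.
    control = error_rate(rng, requests, base_error_rate=0.01)
    # Experimental group: the same kind of traffic with failures injected.
    experimental = error_rate(
        rng,
        requests,
        base_error_rate=0.01,
        injected_failure_rate=0.3,
        has_fallback=has_fallback,
    )
    weakness_found = (experimental - control) > tolerance
    return control, experimental, weakness_found


if __name__ == "__main__":
    for fallback in (True, False):
        control, experimental, weak = run_experiment(has_fallback=fallback)
        print(f"fallback={fallback} control={control:.3f} "
              f"experimental={experimental:.3f} weakness_found={weak}")
```

Keeping `requests` small corresponds to the advice in the note above: limit how much traffic an experiment touches, and widen it only once the system has shown it can cope.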
A distributed system usually tends to have more failure points due to its complexity and large-scale nature.
Chaos engineering tries to discover those failure points and identify what will happen in the case of resource or object unavailability.
This is a very suitable practice in modern software development approaches like DevOps and Microservices Architecture.
Today, not just Netflix but many large organizations use it to check that a system can withstand breakdowns, and then fix the issues uncovered during chaos experiments.
DevOps is all about continuous improvement and frequent releases.
Chaos principles are a good way to test a system’s ability to withstand failures in DevOps-driven software development. System architects and testers are often in a hurry to release software, and chaos engineering helps uncover ‘unknown’ conditions in distributed, continuously changing, and complex systems.
We have seen drastic changes in software development frameworks and methods in the last few years. Monolithic architecture is giving way to cloud and microservice architecture so that software can be built at high velocity.
Here too, chaos engineering works well, since it can identify dependency failures and failures at the points where services connect, both of which are common in a microservice-based system.
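A dependency failure in a microservice setup can be rehearsed with a small fault-injection wrapper. The sketch below is illustrative only: `RecommendationService`, `FaultInjector`, `home_page`, and the fallback titles are hypothetical names invented for this example. It checks whether a caller degrades gracefully when the service it depends on becomes unavailable.

```python
class DependencyUnavailable(Exception):
    """Raised when a (simulated) downstream service cannot be reached."""


class RecommendationService:
    """Hypothetical downstream dependency."""

    def recommend(self, user_id):
        return [f"title-for-{user_id}-1", f"title-for-{user_id}-2"]


class FaultInjector:
    """Wraps a dependency and fails every method call while `enabled` is True."""

    def __init__(self, target):
        self._target = target
        self.enabled = False

    def __getattr__(self, name):
        # Only reached for attributes not defined on the wrapper itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def wrapper(*args, **kwargs):
            if self.enabled:
                raise DependencyUnavailable(f"{name} is unavailable (injected)")
            return attr(*args, **kwargs)

        return wrapper


def home_page(user_id, recommendations):
    """Caller under test: falls back to a static list if the dependency fails."""
    try:
        return {"status": "ok", "titles": recommendations.recommend(user_id)}
    except DependencyUnavailable:
        return {"status": "degraded", "titles": ["popular-1", "popular-2"]}


if __name__ == "__main__":
    service = FaultInjector(RecommendationService())
    print(home_page("u1", service))
    service.enabled = True  # inject the dependency failure
    print(home_page("u1", service))
```

If `home_page` had no `except` branch, the injected failure would propagate to the user, which is exactly the kind of weakness a chaos experiment is meant to surface.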
Failure is a success if we learn from it.
This quote sums up the idea behind chaos principles. You need to learn from failure to improve your system, make it more resilient, and increase your confidence in its capabilities.
There are many tools available for chaos engineering, and many organizations are experimenting with different techniques and tools to make it a more mature and useful approach. By intentionally creating chaos in the system, an organization can achieve long-term software resiliency. Resiliency and quality are considered important factors when we talk about distributed systems with faster release cycles.
Did you find ‘chaos engineering’ interesting? What are your thoughts? If you want to know more about chaos engineering and failure-as-a-service, stay tuned!