Planning for Failure using Chaos Engineering
Failure is a fact of running software in production, but you can prepare for it in a controlled environment using chaos engineering techniques. In this talk, you will learn to use chaos engineering practices to plan for and assess your production readiness, so that you are better prepared for failure scenarios when, not if, they happen.
Introducing possible breakage scenarios into your infrastructure in a controlled fashion gives you an opportunity to assess your production readiness.
What happens if your cloud provider's entire availability zone goes down?
When did you last try restoring from your backup?
Did the expected alerts fire? Were the alerts tuned correctly, or would they wake your engineer up at 3:00 AM for no reason?
Did you have dashboards set up for the most important metrics? Did you even have metrics for the thing you care about?
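As a rough illustration of the idea, the sketch below injects failures into a stand-in dependency call at a controlled rate and checks whether an error-rate alert would fire. It is a toy model, not a tool from the talk: the names `with_injected_failures`, `run_experiment`, and `fetch_profile`, along with the failure rate and alert threshold, are all hypothetical choices made for this example.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during an experiment."""


def with_injected_failures(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls fail (hypothetical helper)."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated dependency outage")
        return func(*args, **kwargs)
    return wrapper


def run_experiment(call, requests, alert_threshold):
    """Send requests through call and report whether the alert would fire."""
    errors = 0
    for _ in range(requests):
        try:
            call()
        except InjectedFailure:
            errors += 1
    error_rate = errors / requests
    return {"error_rate": error_rate, "alert_fired": error_rate >= alert_threshold}


def fetch_profile():
    # Stand-in for a real downstream call (assumed for this sketch).
    return {"user": "example"}


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the experiment is repeatable
    flaky = with_injected_failures(fetch_profile, failure_rate=0.3, rng=rng)
    print(run_experiment(flaky, requests=1000, alert_threshold=0.05))
```

In a real experiment the injected fault would target actual infrastructure, and the alert check would come from your monitoring system rather than a local counter. The shape stays the same: state what you expect to happen, inject the failure in a controlled way, and compare the outcome with the expectation.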
By practicing chaos engineering in your software infrastructure, you are essentially planning for failure: you assess your readiness for failure scenarios so that you are better equipped to handle them when, not if, they happen.
Amit Saha is a senior site reliability engineer at Atlassian based in Sydney, Australia. He’s the author of two books, including Doing Math with Python, and several other publications.