Chaos Engineering

Abstract:

Modern software-based services are implemented as distributed systems with complex behavior and failure modes. Many large tech organizations are using experimentation to verify such systems' reliability. Netflix engineers call this approach chaos engineering. They've determined several principles underlying it and have used it to run experiments. This article is part of a theme issue on DevOps.
Published in: IEEE Software (Volume: 33, Issue: 3, May-June 2016)
Page(s): 35 - 41
Date of Publication: 18 March 2016

Shifting to a System Perspective

At the heart of chaos engineering are two premises. First, engineers should view a collection of services running in production as a single system. Second, we can better understand this system's behavior by injecting real-world inputs (for example, transient network failures) and observing what happens at the system boundary.
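
As an illustration of the second premise, the following is a minimal sketch of injecting a transient network failure and observing the outcome only at the system boundary. It is not taken from the article or from any Netflix tool: the names FaultInjector, recommendations_service, handle_request, and run_experiment are hypothetical, and the "system" is a single in-process function standing in for a collection of production services.

```python
import random
from dataclasses import dataclass


@dataclass
class FaultInjector:
    """Hypothetical injector: fails a fraction of calls to a dependency."""
    failure_rate: float
    rng: random.Random

    def call(self, fn):
        # Simulate a transient network failure before reaching the dependency.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected transient network failure")
        return fn()


def recommendations_service():
    # Stand-in for an internal dependency (assumed, not from the article).
    return ["title-a", "title-b"]


def handle_request(injector, use_fallback):
    """System boundary: True if the user received a usable response."""
    try:
        injector.call(recommendations_service)
        return True
    except ConnectionError:
        # With a fallback, the request degrades gracefully instead of failing.
        return use_fallback


def run_experiment(failure_rate, use_fallback, requests=1000, seed=0):
    """Return the fraction of requests that succeeded at the boundary."""
    injector = FaultInjector(failure_rate, random.Random(seed))
    ok = sum(handle_request(injector, use_fallback) for _ in range(requests))
    return ok / requests


if __name__ == "__main__":
    print("control (no faults):     ", run_experiment(0.0, use_fallback=False))
    print("faults, no fallback:     ", run_experiment(0.2, use_fallback=False))
    print("faults, with fallback:   ", run_experiment(0.2, use_fallback=True))
```

The sketch compares a control run with runs that receive injected failures and measures only the boundary-level success rate, not the internal state of any one service. That mirrors the idea of treating the services as a single system.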

