Is It Safe-to-Fail for DevOps?



For organizations like yours, the upside of DevOps is clear: increased productivity, faster time to deployment and accelerated remediation. As these stats from Puppet’s 2016 State of DevOps report show, high-achieving IT organizations understand the need for speed.

  • High-performing IT departments deploy 200 times more frequently, with lead times that are 2,555 times faster.
  • Their change failure rates are three times lower, and their recovery times are 24 times faster.
  • They spend half as much time remediating security issues and 22 percent less time on unplanned work and rework.

Yet DevOps isn’t all rainbows and unicorns. New ways of doing things unleash new challenges. Increased speed requires you to tackle application and service performance management in a whole new way.

Production Pressures

Continuous deployment and continuous delivery are two different beasts, according to Jacob Ukelson of Lecture Monkey. Continuous deployment deals only with the technical aspects of moving workloads efficiently from one environment in the pipeline to the next, ensuring the new environment is properly set up and configured.

Continuous delivery, on the other hand, also factors in business impact. For example, if new digital services reach production faster but their quality is poor, then the overall level of continuous delivery is degrading rather than improving. Accelerating the continuous deployment pipeline is therefore not necessarily effective from a business perspective. Unless the production environment is truly safe-to-fail, a faster pipeline would not help businesses improve their agility. It would simply shift the bottleneck in delivering value to customers from app development to the production environment.


By delivering applications into a safe-to-fail production environment, developers can accelerate their velocity because they don’t need to worry about creating fail-safe applications. As David Snowden of Cognitive Edge defines it, safe-to-fail means that no matter what happens, you can survive the consequences and recover.

But in a chaotic production environment, acting without deep visibility into your system and applications would hardly qualify as safe-to-fail.

Continuous System-Level Assurance

Production environments are exposed to human error. After all, it’s not economical to maintain a separate production environment just to test your deployment automation processes. That is why you need system-level visibility into your environment. Or, as we say at NETSCOUT, you need the ability to “see the whole of the moon.”

System-level visibility covers all the subsystems, including the applications, that are instrumental in delivering services to consumers. Our continuous system-level performance management platform monitors all the wire data traversing the critical service delivery links of your infrastructure. It proactively detects service degradation and gives you insight into the subsystems and their interdependencies. You get load, latency and failure telemetry across all the subsystems, including servers, applications, service enablers, networks and databases. Instead of just seeing what’s going wrong at the application level, you can get to the root cause at the system level. This insight helps slash mean time to repair (MTTR) by tracing multiple problems to a single root cause, and it helps establish a production environment that is truly safe-to-fail.
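To make the idea of tracing multiple problems to a single root cause concrete, here is a minimal Python sketch. It is an illustration only, not how any particular monitoring platform works: the dependency map, the subsystem names and the function names are all hypothetical. It assumes you already know which subsystems are degraded and which subsystems each one depends on, and it reports the degraded subsystems that have no degraded dependency of their own.

```python
from collections import deque

# Hypothetical dependency map: each subsystem lists what it relies on.
DEPENDS_ON = {
    "web-app": ["app-server", "dns"],
    "app-server": ["database", "auth-service"],
    "auth-service": ["database"],
    "database": [],
    "dns": [],
}


def upstream(subsystem, depends_on):
    """Return everything the subsystem depends on, directly or indirectly."""
    seen = set()
    queue = deque([subsystem])
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def likely_root_causes(degraded, depends_on):
    """Return degraded subsystems whose own dependencies all look healthy."""
    degraded = set(degraded)
    causes = []
    for subsystem in degraded:
        degraded_dependencies = upstream(subsystem, depends_on) & degraded
        if not degraded_dependencies:
            causes.append(subsystem)
    return sorted(causes)


if __name__ == "__main__":
    alerts = ["web-app", "app-server", "auth-service", "database"]
    print(likely_root_causes(alerts, DEPENDS_ON))
```

In this sketch, four separate alerts collapse to a single candidate, the database, because every other degraded subsystem depends on it. Real environments would also weigh load, latency and failure metrics rather than a simple degraded or healthy flag.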

As Snowden reminds us, “Don’t start any experiment, safe-to-fail or otherwise, unless you can monitor its impact in real time.”
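Snowden’s rule can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a production rollout tool: the function name, the callbacks (apply_change, rollback, read_error_rate) and the 5 percent error threshold are all hypothetical stand-ins for whatever your deployment tooling and monitoring actually provide.

```python
import time


def run_safe_to_fail(apply_change, rollback, read_error_rate,
                     max_error_rate=0.05, checks=5, interval_s=1.0):
    """Apply a change, watch a failure metric, and roll back if it degrades.

    Returns True if the change survived every check, False if it was
    rolled back. All callbacks are supplied by the caller (hypothetical).
    """
    apply_change()
    for _ in range(checks):
        # Monitor the impact while the experiment is running.
        if read_error_rate() > max_error_rate:
            rollback()  # survive the consequences and recover
            return False
        time.sleep(interval_s)
    return True


if __name__ == "__main__":
    events = []
    readings = iter([0.01, 0.02, 0.20])  # simulated error rates
    survived = run_safe_to_fail(
        apply_change=lambda: events.append("apply"),
        rollback=lambda: events.append("rollback"),
        read_error_rate=lambda: next(readings),
        interval_s=0,
    )
    print(survived, events)
```

The point of the sketch is the ordering: the metric is read after every step of the experiment, and the rollback path exists before the change is applied. Without the monitoring callback, there would be nothing to trigger the recovery.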

See how organizations like yours are bringing order to chaos with continuous system-level assurance.