

Enhancing DevOps Agility and Protecting the Deployment Pipeline:

With Business Assurance Solutions Designed for the Digital Era


The Big Picture

In the Digital Era, digital services are at the center of disruptive innovation. The agility of continuous planning, delivery, integration, testing, and deployment of applications and services marks the difference between winners and losers in this highly competitive business environment. While automation of these processes offers tremendous benefits for Continuous Delivery, it shifts the constraint to the production environment, which now gates the overall flow of the value stream to customers. Unfortunately, the application-level telemetry that DevOps teams use for the feedback loop is inefficient, since many of the constraints are at the system level. These constraints span all architectural subsystems associated with service delivery as well as the application itself. NETSCOUT’s system-level visibility, telemetry and triage capabilities, based on continuous monitoring and analysis of IP Traffic, empower DevOps organizations to become more agile and efficient and help their businesses achieve a competitive advantage in the digital battlefield.

The key challenges that DevOps faces and the respective NETSCOUT solutions described on this page are summarized in Table 1 below.

Table 1: Addressing DevOps Challenges

DevOps Challenges | NETSCOUT Value Proposition
Become more productive and agile by delivering services faster and with fewer resources | Enable DevOps to accomplish more with fewer resources with a service-level telemetry platform and common situational awareness
Minimize the impact of disruption | Reduce the mean time to repair (MTTR)
Continuously analyze the global IT resource capacity and readiness to deploy new services | Effective feedback loop based on real-time and continuous monitoring and analysis of the global service delivery resources capacity
Assure service quality, security, and availability | Business Assurance solutions to mitigate risks associated with service quality, security, and availability

The DevOps Agility Conundrum

DevOps is at the epicenter of this disruptive innovation, and its performance is directly linked to corporate business outcomes. Unfortunately, the faster the deployment pipeline runs, the more resistance it encounters from the growing “chaos” created by the increased speed of operation. While automation and adoption of agile leadership principles help control this chaos for Continuous Delivery, they also increase the chaos in the production environment. As a result, the Operations teams are now at risk of becoming the next bottleneck that restricts the overall flow of the value stream to customers. The first step the Operations team has to take to address this challenge is to gain continuous, real-time visibility based on system-level telemetry. The Ops team needs to use this insight to reduce the MTTR and establish an effective feedback loop with Dev, QA, Sec and Ops. This objective is extremely difficult, if not impossible, to accomplish with traditional application performance management (APM) technologies such as agents and bytecode instrumentation. The reason is that APM tools reveal only the crescent, the application-level telemetry, as opposed to the entire system-level, end-to-end telemetry. NETSCOUT Business Assurance (BA) solutions utilize IP Traffic-based technologies to address this challenge and help DevOps gain system-level visibility to protect the deployment pipeline at increased speed of operation. As illustrated in Figure 1 below, this approach is more agile and efficient, since any instrumentation point of IP Traffic offers insight across all applications and their respective metrics without the need for bytecode instrumentation of each individual application.

Figure 1: Assuring the Deployment Pipeline with System-Level Visibility

Seeing “the whole of the moon”[1]

NETSCOUT Business Assurance solutions empower DevOps organizations to see “the whole of the moon” by continuously monitoring the IP Traffic that traverses the service delivery infrastructure, proactively detecting service degradations, and providing actionable insight into all the service interdependencies necessary to reduce the MTTR and resolve issues before users are affected.

This end-to-end system-level visibility includes telemetry of load, latency and failure metrics for all service delivery systems and the interdependencies among network, server, service enablers, databases and applications. This insight not only helps improve the speed of continuous planning, delivery, integration, testing, and deployment, but also helps optimize DevOps efficiencies and achieve a competitive advantage in the digital battlefield.

Optimizing DevOps Operational Efficiencies

While seamless communication between Dev and Ops teams is a prerequisite to increased DevOps productivity, it is not enough. Even if, hypothetically speaking, the DevOps team could achieve fully transparent common situational awareness across the Dev and Ops teams, the efficiency of that awareness would still depend on how precisely the “situation” is analyzed. If the situation analysis could quickly identify the root cause at the system level, across all relevant IT systems and the application, it would not only drastically reduce the MTTR but also serve as a force multiplier that enables DevOps to accomplish more work with fewer resources. Furthermore, it would be much more efficient to achieve this MTTR reduction without the need for the Dev team to perform bytecode instrumentation for each and every application.

For example, since only a portion of service delivery problems are related to a specific application, developers’ productivity is optimized if they are engaged only when the root cause is related to their specific application. With application-level visibility, on the other hand, the Development team would not know whether the root cause of an application performance issue lies in their application code or in another IT system. As a result, the Dev team wastes valuable time helping to troubleshoot issues that are not application related. The Ops team also ends up spending more time troubleshooting the root cause due to the lack of visibility into the interdependencies across the IT systems and the applications. This creates “Inefficiency Zones” for both Dev and Ops due to the wasted time and effort, as illustrated in Figure 2 below.

Figure 2: The Benefits of Migrating to System-Level Telemetry: High Agility and Enhanced Efficiency

The system-level telemetry approach utilizes effective system triage based on end-to-end visibility across all service delivery interdependencies to quickly identify the root cause of service issues. The mean time it takes an IT organization to complete the triage process is called mean time to knowledge (MTTK), and according to ZK Research, MTTK accounts for 90% of the overall mean time to repair (MTTR) a service performance problem.

The system-level telemetry approach relies on performance metrics across the entire service delivery infrastructure, which spans physical and virtual, on-premises and off-premises, and private and public clouds. It offers a unique ability to analyze performance, traffic indicators, load and failures, as well as contextual workflows to quickly triage and find the root cause of issues causing application performance degradation. Effective service triage can reduce MTTR by up to 80%, which allows the Development teams to spend most of their time and effort productively on delivering new applications and reduces the overhead on operations associated with break-fix activities. This includes reduced time spent in the war room and reduced operations and support cost and complexity. The bottom line is that with system-level telemetry, DevOps organizations can see “the whole of the moon,” [2] improve speed and optimize efficiencies.
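To make the arithmetic behind these two figures concrete, the sketch below splits a repair time into triage time (MTTK) and the remaining fix time, using the 90% share quoted above. The 10-hour incident and all function names are illustrative assumptions, not NETSCOUT or ZK Research data, and the model assumes that faster triage leaves the fix time itself unchanged.

```python
def split_mttr(mttr_hours, mttk_share=0.90):
    """Split MTTR into triage time (MTTK) and the remaining fix time."""
    mttk = mttr_hours * mttk_share
    return mttk, mttr_hours - mttk


def mttr_after_triage_speedup(mttr_hours, triage_reduction, mttk_share=0.90):
    """MTTR after triage time shrinks by `triage_reduction` (0..1).

    Assumption: the fix time itself does not change.
    """
    mttk, fix = split_mttr(mttr_hours, mttk_share)
    return mttk * (1.0 - triage_reduction) + fix


def triage_reduction_needed(mttr_reduction, mttk_share=0.90):
    """Fraction by which triage must shrink to reach a target MTTR reduction."""
    return mttr_reduction / mttk_share


# Hypothetical 10-hour incident: 9 hours of triage, 1 hour of actual repair.
mttk, fix = split_mttr(10.0)
halved_triage_mttr = mttr_after_triage_speedup(10.0, 0.5)
needed = triage_reduction_needed(0.80)
```

Under these assumptions, triage alone can remove at most 90% of MTTR, so the "up to 80%" figure corresponds to cutting triage time by roughly 89%.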


System-Level Telemetry Foundation: Smart Data and Superior Analytics

Smart Data

While achieving system-level visibility across all applications, service delivery systems and their interdependencies may sound like a tall order, it is doable with smart data and superior analytics. IP Traffic data is the foundation of smart data and is used to generate highly scalable metadata that delivers real-time and historical telemetry of all system components, including physical and virtual networks, n-tier applications, workloads, protocols, servers, databases, users, and devices. The key benefits of using IP Traffic data include:

  • System-Level and Real-Time Telemetry – since every action and transaction is encapsulated in IP packets that traverse the physical and virtual infrastructure, IP Traffic data offers the best vantage point for end-to-end visibility
  • Actionable Intelligence – IP Traffic contains all the data[3] necessary to gain in-depth understanding of application and system performance management issues
  • Application Agnostic Insight – IP Traffic data can be used to monitor any traditional, mobile, custom, or standard application independent of the source code and with no need for agents or bytecode instrumentation
  • Highest Scalability – the standards-based IP technology is well structured and therefore well suited to scalable system triage, which requires continuously collecting, normalizing, correlating, organizing, and analyzing large volumes of data in a system-contextual fashion
Figure 3: System-Level Telemetry Foundation: Combining Smart Data with Superior Analytics

Superior Analytics

When smart data is combined with superior analytics, it can reveal important insight into application and service performance metrics such as application traffic volumes, application server response times, server throughput, aggregate error counts, and error codes specific to application servers and domains. Furthermore, smart data can reveal all application dependencies and support contextual transitions across multiple layers of analysis, facilitating efficient hand-off of incident response tasks across the different IT functional groups throughout the root cause triage process. As such, the handoff to the respective Development team becomes necessary only if the root cause is associated with the specific application they delivered.
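As a rough illustration of how metrics like these could be derived from traffic metadata, the sketch below aggregates per-server traffic volume, average response time, error counts and error codes. The `Transaction` record, its fields, and the rule that a status of 400 or above counts as an error (an HTTP-style convention) are all hypothetical assumptions for this example; they do not describe the ASI data format or any NETSCOUT API.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One transaction reconstructed from traffic (hypothetical record)."""
    server: str
    app: str
    response_ms: float
    bytes_total: int
    status: int  # assumption: HTTP-style status code


def summarize(transactions):
    """Aggregate volume, latency and errors per (app, server) pair."""
    acc = defaultdict(
        lambda: {"count": 0, "bytes": 0, "resp_sum": 0.0, "errors": Counter()}
    )
    for t in transactions:
        s = acc[(t.app, t.server)]
        s["count"] += 1
        s["bytes"] += t.bytes_total
        s["resp_sum"] += t.response_ms
        if t.status >= 400:  # assumption: 4xx/5xx codes are errors
            s["errors"][t.status] += 1
    return {
        key: {
            "transactions": s["count"],
            "bytes": s["bytes"],
            "avg_response_ms": s["resp_sum"] / s["count"],
            "error_count": sum(s["errors"].values()),
            "error_codes": dict(s["errors"]),
        }
        for key, s in acc.items()
    }
```

Because the summary is keyed by application and server together, a spike in errors on one database server shows up separately from the application tier, which is the kind of distinction that decides whether an incident is handed to a Development team at all.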

The final stage of DevOps optimization can be accomplished with predictive analysis that proactively detects service degradations before multiple users are affected. Once performance baselines are established automatically, alerts can be generated based on either predefined thresholds or baseline deviations. Deviations include rising and falling link utilization, application transaction failure rates, and responsiveness. The analytics engine also needs to adjust the baselines automatically over time to adapt to gradual changes in service utilization while still delivering timely alerts on performance anomalies. By utilizing these predictive analytics, DevOps organizations can obtain visibility into emerging service performance issues before they impact multiple users, and contextually triage and analyze alert evidence and underlying causes. The overall result of utilizing smart data and superior analytics is a drastic reduction in MTTK and MTTR, as illustrated in Figure 4 below.

Figure 4: Effective System Triage Utilizing System-Level Telemetry and Superior Analytics
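The baseline-driven alerting described in this section can be sketched in a few lines. The example below is a minimal illustration, assuming an exponentially weighted moving average (EWMA) as the adaptive baseline; this is a swapped-in textbook technique chosen for brevity, not a description of how nGeniusONE computes baselines. The class name and all parameters are hypothetical.

```python
import math


class AdaptiveBaseline:
    """EWMA baseline that flags threshold breaches and baseline deviations."""

    def __init__(self, alpha=0.1, k=3.0, static_threshold=None,
                 warmup=5, min_std=1e-9):
        self.alpha = alpha                # how quickly the baseline adapts
        self.k = k                        # alert when |deviation| > k * std
        self.static_threshold = static_threshold
        self.warmup = warmup              # samples to learn before alerting
        self.min_std = min_std            # floor to avoid a zero-width band
        self.mean = None
        self.var = 0.0
        self.samples = 0

    def observe(self, value):
        """Return alert reasons for `value`, then fold it into the baseline."""
        alerts = []
        if self.static_threshold is not None and value > self.static_threshold:
            alerts.append("threshold")
        if self.mean is None:
            self.mean = value
        else:
            deviation = value - self.mean
            std = max(math.sqrt(self.var), self.min_std)
            if self.samples >= self.warmup and abs(deviation) > self.k * std:
                alerts.append("rising" if deviation > 0 else "falling")
            # Incremental EWMA update of mean and variance.
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        self.samples += 1
        return alerts
```

A caller would keep one `AdaptiveBaseline` per metric (for example, response time per application server) and call `observe` on each new measurement. Note that this sketch folds every sample, including anomalous ones, into the baseline; a production design would likely dampen or exclude outliers so that a single spike does not widen the alert band.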

Additional benefits of system-level telemetry and triage include better service availability and user experience, and the ability to scale services to support millions of users in a production environment.

NETSCOUT Solutions to DevOps Needs

Operational Efficiencies Solutions

The NETSCOUT nGeniusONE Service Assurance platform drastically reduces the MTTR and MTTK and serves as a force multiplier that enables Ops to accomplish more with fewer resources by reducing unplanned work to a minimum. This capability also maximizes developers’ productivity by reducing the overhead of dealing with problems related to other IT systems. Core efficiencies are realized with nGeniusONE by:

  • Proactively detecting service degradations either based on deviations from performance baselines or predefined thresholds
  • Supporting intuitive top-down system triage workflows that effectively reduce the MTTK by detecting the root cause across all systems, including a variety of n-tier applications, IT infrastructure systems and all their respective interdependencies
  • Monitoring any legacy and new applications and infrastructure systems

Agile Continuous Deployment Solutions

The NETSCOUT nGeniusONE platform offers real-time and trend-analysis telemetry[4] and analytics to provide a feedback loop that protects the deployment pipeline and increases DevOps agility. These capabilities include:

  • Granular insight into all service delivery systems, and root cause analysis that factors in all the interdependencies across applications and on-premises and cloud-based infrastructure
    • The Adaptive Service Intelligence (ASI) Plus technology running on InfiniStreamNG appliances offers real-time visibility into DevOps systems and interdependencies including physical and virtual networks, n-tier applications, workloads, protocols, servers, databases, users, and devices
    • The ASI Plus technology utilizes IP Traffic as the source of smart data, which is ideal for monitoring micro-services and complex on-premises or cloud environments
    • ASI eXtender (ASI-X) enables quick instrumentation and monitoring of any custom application
  • Customizable dashboards, reports and service dependency maps that help establish a common situational awareness across Dev, QA and Ops teams and streamline the feedback loop
  • nGeniusPULSE, which complements the nGeniusONE capabilities with visibility for application service assurance across the diverse combinations of private, hybrid, SaaS and public cloud architectures that enterprises are deploying today, and is therefore critical to the deployment pipeline

DevOps Planning

The nGeniusONE platform empowers the Ops teams to reduce the service delivery infrastructure reliability risk associated with continuous deployment by:

  • Providing real-time and continuous analysis of the global service delivery resources capacity before the Ops team accepts work from Dev
    • This includes an automated service dependency map that offers insight into load, latency and failures across the entire service delivery infrastructure, as well as infrastructure capacity at the network, link and server levels
  • Adding nGeniusPULSE, whose ability to test cloud-based services for availability, responsiveness, and adherence to service levels gives DevOps insight into the reliability risk associated with continuous deployment
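The idea of checking capacity before Ops accepts work from Dev can be pictured as a simple readiness gate. The sketch below is hypothetical throughout: the `Resource` record, the resource names, the projected load figures and the 80% utilization ceiling are assumptions chosen for illustration, and in practice the inputs would come from whatever capacity telemetry the monitoring platform exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A network, link or server resource (hypothetical record)."""
    name: str         # e.g. "link:core" or "server:web-1"
    capacity: float   # capacity in the resource's own unit
    peak_load: float  # observed peak load in the same unit


def readiness_report(resources, added_load, max_utilization=0.80):
    """Return (ready, blockers) for a planned deployment.

    `added_load` maps resource names to the extra load the new
    service is expected to add; missing names add no load.
    """
    blockers = []
    for r in resources:
        projected = (r.peak_load + added_load.get(r.name, 0.0)) / r.capacity
        if projected > max_utilization:
            blockers.append((r.name, round(projected, 3)))
    return (not blockers, blockers)


resources = [
    Resource("link:core", capacity=1000, peak_load=600),
    Resource("server:web-1", capacity=100, peak_load=50),
]
ready, blockers = readiness_report(
    resources, {"link:core": 100, "server:web-1": 40}
)
```

In this example the link stays at 70% projected utilization, but the server would reach 90%, so the gate reports the deployment as not ready and names the server as the blocker.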

Business Risks Mitigation

NETSCOUT Business Assurance solutions help mitigate business risks and achieve desirable outcomes. This is accomplished with:

  • A suite of nGeniusONE and ASI-based service assurance solutions that help reduce the MTTR and increase service quality and availability
  • Arbor Networks, the security division of NETSCOUT, which helps protect service integrity and availability with distributed denial of service (DDoS) and advanced persistent threat (APT) protection

Summary

In the Digital Era, the DevOps organization can make the difference between corporate success and failure. The key DevOps success factors are agility, operational efficiency, and the ability to reduce business risks that may prevent the corporation from achieving the desired business outcomes. While automation and the implementation of agile principles by DevOps have improved the speed and efficiency of Continuous Delivery, the production environment has become the new constraint in the flow of the value stream to customers. This constraint cannot be effectively relaxed by Operations teams that rely on application-level visibility and Development teams that need to perform bytecode instrumentation for each application. NETSCOUT’s system-level visibility, telemetry and triage capabilities, based on continuous monitoring and analysis of IP Traffic, empower DevOps organizations to become more agile and efficient and help their businesses achieve a competitive advantage in the digital battlefield.



[1] From The Whole of the Moon lyrics – The Waterboys
[2] From The Whole of the Moon lyrics – The Waterboys
[3] Open Systems Interconnection (OSI) model Layers 2 through 7
[4] Telemetry is the terminology used by DevOps and includes business, application, and infrastructure metrics required to monitor how systems operate in production environments