Why SRE Pipelines Break at Remote Sites (and How to Fix Them)

Strengthening observability beyond MELT and Golden Signals


If you know what “o11y” is, you probably know what a site reliability engineer (SRE) does. Maybe you are one. Observability pipelines help SREs monitor and understand system behavior by using data such as metrics, events, logs, and traces (MELT) and Golden Signals. But as information technology (IT) infrastructure grows more complex and shifts toward the edge, what happens when observability surfaces issues without revealing the cause?

“If you can’t monitor a service, you don’t know what’s happening, and if you’re blind to what’s happening, you can’t be reliable.”
— “Site Reliability Engineering: How Google Runs Production Systems,” by Betsy Beyer et al.

MELT data helps SREs support key functions such as deployment pipeline automation, infrastructure and application monitoring, performance analysis, and reporting. Each part plays a different role:

  • Metrics provide time-series numerical data, such as CPU utilization, request latency, and throughput
  • Events capture changes in system state, such as deployments, crashes, or failovers
  • Logs contain detailed, time-stamped output from applications and infrastructure components
  • Traces follow the path of individual requests through services, showing how long each step takes and where failures occur
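As a rough sketch of the four MELT signal types above, each might be represented as a structured record like the following. The field names here are illustrative assumptions, loosely following common telemetry conventions, not a specific vendor schema:

```python
import time

# Metric: a time-series numerical sample (e.g., CPU utilization)
metric = {"name": "cpu.utilization", "value": 0.73, "ts": time.time()}

# Event: a discrete change in system state (e.g., a deployment)
event = {"type": "deployment", "service": "checkout", "version": "v42"}

# Log: detailed, time-stamped output from an application or component
log = {"ts": time.time(), "level": "ERROR", "msg": "connection refused"}

# Trace span: one hop of a request's path, with timing and status
trace_span = {
    "trace_id": "abc123", "span_id": "s1", "parent_id": None,
    "operation": "GET /cart", "duration_ms": 48.2, "status": "ok",
}
```

A trace is then a tree of such spans sharing one `trace_id`, with `parent_id` links showing which service called which.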

SREs rely on MELT to track service-level indicators (SLIs) and service-level objectives (SLOs), support incident response, and validate system performance.
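As a minimal illustration of how MELT data feeds SLI/SLO tracking, an availability SLI and its remaining error budget can be computed from request counts. The function names and the 99.5% SLO target below are assumptions for the sketch, not values from any real service:

```python
def availability_sli(successful: int, total: int) -> float:
    """SLI: fraction of requests served successfully."""
    return successful / total if total else 1.0

def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent (negative = SLO breached)."""
    allowed = 1.0 - slo   # error rate the SLO permits
    burned = 1.0 - sli    # error rate actually observed
    return 1.0 - burned / allowed if allowed else 0.0

# 9,990 of 10,000 requests succeeded against a 99.5% availability SLO
sli = availability_sli(successful=9_990, total=10_000)
remaining = error_budget_remaining(sli, slo=0.995)
print(sli, remaining)  # 0.999 of requests succeeded; 80% of the budget is left
```

The error-budget framing is what makes SLOs actionable: a shrinking budget signals that reliability work should take priority over feature work.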

But is that still true? Or are MELT-based strategies starting to “melt” under pressure from modern, decentralized systems?

In edge environments, where infrastructure is not always monitored on-site and is harder to instrument, MELT data may be incomplete.

At a remote retail location, for instance, point-of-sale (POS) systems might experience lag due to Domain Name System (DNS) resolution issues or packet loss. There may be no logs, no clear traces, and metrics may appear normal.

This limitation isn’t unique to MELT. Golden Signals, part of the observability stack, face similar challenges.

Golden Signals and How SREs Use Them

The concept of Golden Signals was introduced by Google’s SRE team in their 2016 book “Site Reliability Engineering: How Google Runs Production Systems.” The book defines four key indicators to assess whether a service is operating as expected: latency, traffic, errors, and saturation. Latency measures response time. Traffic reflects volume and load. Errors capture failed requests or invalid responses. Saturation indicates how close the system is to capacity.
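As a rough illustration (the record fields, the 60-second window, and the capacity figure are assumptions for the sketch), all four Golden Signals can be derived from a batch of request records:

```python
# Hypothetical request records observed over one 60-second window
requests = [
    {"latency_ms": 120, "status": 200},
    {"latency_ms": 340, "status": 200},
    {"latency_ms": 95,  "status": 500},
    {"latency_ms": 210, "status": 200},
]
window_s = 60
capacity_rps = 2.0  # assumed sustainable capacity in requests/second

traffic = len(requests) / window_s                                # requests per second
error_rate = sum(r["status"] >= 500 for r in requests) / len(requests)
latency_p50 = sorted(r["latency_ms"] for r in requests)[len(requests) // 2]
saturation = traffic / capacity_rps                               # fraction of capacity in use
```

In practice, latency is usually tracked at high percentiles (p95/p99) rather than the median, since tail latency is what users notice first.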

However, Golden Signals rely on the same assumption as MELT: that telemetry data is complete, accurate, and available. When that assumption fails, especially in remote or distributed environments, neither framework can fully explain what’s happening. Missing details can include host-level conversations, infrastructure-specific errors, and quality-of-service (QoS) misconfigurations.

Even Good Data Can Be Undermined by Flaws in the Pipeline

The problem isn’t always the data. Sometimes, it’s the pipeline. MELT and Golden Signals are foundational, but in distributed environments, they can’t always reflect real conditions at the edge, even if artificial intelligence (AI) is used to support decision-making. Teams also need additional sources of insight, including:

  • Packet-level visibility that reveals retransmissions, time-outs, misconfigurations, application-specific details, and service dependencies not visible in MELT data
  • Synthetic testing at the edge that simulates user activity to proactively detect emerging performance issues
  • Correlation across core, cloud, and edge to trace issues across service boundaries and isolate root causes quickly
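As one hedged example of the synthetic-testing idea above, a lightweight edge probe could time DNS resolution the way the POS terminals in the earlier retail example would, and flag slow lookups before users complain. The 200 ms threshold and all names below are assumptions, not part of any product:

```python
import socket
import time

DNS_SLOW_MS = 200.0  # assumed alert threshold for DNS resolution time

def timed_dns_lookup(hostname: str) -> float:
    """Resolve a hostname and return the resolution time in milliseconds."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, 443)
    return (time.perf_counter() - start) * 1000.0

def classify(resolution_ms: float, threshold_ms: float = DNS_SLOW_MS) -> str:
    """Label a measurement for alerting: 'degraded' if above threshold."""
    return "degraded" if resolution_ms > threshold_ms else "ok"
```

Run periodically from the edge site itself, such a probe catches the DNS-lag scenario even when application metrics back at the core still look normal.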

This approach strengthens the observability pipeline by delivering insights that reflect real user experience more accurately.

Building a Robust Observability Pipeline That Works at the Edge

SRE teams often struggle to identify and resolve issues at the edge when observability lacks context. NETSCOUT’s nGenius Edge Sensors enhance observability with NETSCOUT Smart Data and synthetic testing, creating a more complete and accurate pipeline that improves performance and reduces resolution time across the entire IT infrastructure.

Read this case study to see how an SRE helped a manufacturer reduce mean time to resolution (MTTR) at remote factory sites.