Monitoring Tells You When
Monitoring checks known metrics against fixed thresholds: CPU over 80%, error rate over 1%, latency over 500ms. It tells you that something is wrong, but not why. Observability answers the why by letting you explore unknown unknowns.
Monitoring is for known failure modes: disk full, service down, queue backed up. Observability is for unknown failure modes: why is latency spiking on Tuesdays? Why does the checkout flow fail for 0.1% of users? Why does memory usage grow slowly over weeks? These questions cannot be answered by predefined metrics.
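The threshold checks above can be sketched as a simple rule evaluator. This is a minimal illustration, not a real alerting system; the metric names and limits are the hypothetical ones from the text.

```python
# Minimal sketch of threshold-based monitoring: each rule pairs a
# known metric with a fixed limit, mirroring the examples above.
THRESHOLDS = {
    "cpu_percent": 80.0,   # CPU over 80%
    "error_rate": 0.01,    # error rate over 1%
    "latency_ms": 500.0,   # latency over 500ms
}

def check_thresholds(metrics: dict) -> list:
    """Return the names of metrics that breach their threshold."""
    return [
        name
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0.0) > limit
    ]

# A reading with high CPU but acceptable error rate and latency:
print(check_thresholds({"cpu_percent": 92.0, "error_rate": 0.002, "latency_ms": 120.0}))
# → ['cpu_percent']
```

The evaluator can only flag breaches of rules you wrote in advance, which is exactly the limitation the next section addresses.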
The Three Pillars
- Metrics: Aggregate measurements over time. CPU, memory, request rates. Efficient for dashboards and alerts. Pre-aggregated and optimised for time-series queries. Limitation: you must know what to measure in advance.
- Logs: Discrete events with context. Essential for debugging specific issues but expensive at scale. High-cardinality and rich in detail, but unstructured unless you enforce a schema. Limitation: volume. A busy service can generate terabytes of logs daily.
- Traces: Request paths through distributed systems. The key to understanding latency and dependencies in microservices. Limitation: implementation effort. Every service must be instrumented.
The three pillars are complementary. Metrics provide the overview. Logs provide the detail. Traces provide the context. A trace ID in a log message connects the log to the trace. A metric tagged with a service name connects the metric to the service.
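A minimal sketch of those connections as data, with each signal carrying a shared identifier. The field names are illustrative, loosely following common conventions, and the values are invented for the example.

```python
import uuid

trace_id = uuid.uuid4().hex  # one ID ties the log to the trace

# Metric: aggregated measurement, tagged with the producing service.
metric = {"name": "http_request_duration_ms", "value": 412, "service": "checkout"}

# Log: a discrete event carrying the trace ID for correlation.
log = {"level": "ERROR", "message": "payment declined",
       "service": "checkout", "trace_id": trace_id}

# Trace span: the request-path context, sharing the same trace ID.
span = {"trace_id": trace_id, "span_name": "charge_card", "duration_ms": 398}

# The trace ID connects the log to the trace; the service tag
# connects the metric to the service.
assert log["trace_id"] == span["trace_id"]
assert metric["service"] == log["service"]
```

Nothing more than consistent identifiers is needed: the correlation lives in the data, not in any one tool.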
Distributed Tracing Implementation
Start with OpenTelemetry. Instrument your services with auto-instrumentation libraries. Send traces to Jaeger or Tempo for storage and querying. Correlate traces with logs using trace IDs. The effort is front-loaded, but the debugging capability is transformative.
Trace-to-log correlation is the final piece. Every log message should include the trace ID, span ID, and span name. When you find an error in a trace, you can search logs by trace ID to find all related log messages. This correlation reduces debugging time from hours to minutes.
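A sketch of that correlation using only the standard logging module. The trace context here is a plain context variable with invented IDs; in an OpenTelemetry setup, the IDs would come from the current span instead.

```python
# Sketch of trace-to-log correlation: a logging.Filter copies the
# current trace context onto every log record so the formatter can
# emit trace ID, span ID, and span name alongside the message.
import contextvars
import io
import logging

trace_ctx = contextvars.ContextVar(
    "trace_ctx", default={"trace_id": "-", "span_id": "-", "span_name": "-"})

class TraceContextFilter(logging.Filter):
    """Attach the current trace context to each log record."""
    def filter(self, record):
        ctx = trace_ctx.get()
        record.trace_id = ctx["trace_id"]
        record.span_id = ctx["span_id"]
        record.span_name = ctx["span_name"]
        return True

stream = io.StringIO()  # stand-in for a real log destination
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace=%(trace_id)s span=%(span_id)s %(span_name)s %(message)s"))
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Illustrative IDs; in practice these come from the active span.
trace_ctx.set({"trace_id": "4bf92f35", "span_id": "00f067aa",
               "span_name": "charge_card"})
logger.error("payment declined")

print(stream.getvalue().strip())
```

Searching the log store for `trace=4bf92f35` now returns every message emitted under that trace, which is what turns an error seen in a trace into the full debugging story.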
Our Recommendation
Implement metrics first, logs second, traces third. Traces provide the highest value in distributed systems but require the most setup. Start tracing when you have more than three services or when debugging latency becomes painful.
Metrics are the foundation: easy to implement, cheap to store, immediate value. Logs are the next layer: structured logging and centralised collection provide essential debugging detail. Traces are the top layer: instrumentation across all services provides the distributed context that makes debugging complex systems possible.