Monitoring Tells You When
Monitoring checks known metrics against fixed thresholds: CPU over 80%, error rate over 1%, latency over 500ms. It tells you that something is wrong, but not why. Observability answers the why by letting you explore unknown unknowns.
Monitoring is for known failure modes: disk full, service down, queue backed up. Observability is for unknown failure modes: why is latency spiking on Tuesdays? Why does the checkout flow fail for 0.1% of users? Why does memory usage grow slowly over weeks? These questions cannot be answered by predefined metrics.
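The threshold checks above can be sketched as a simple rule evaluator. This is a minimal illustration, not a real alerting system; the metric names and limits are the hypothetical ones from the text.

```python
# Minimal sketch of threshold-based monitoring: each rule pairs a
# known metric with a fixed limit, mirroring the examples above.
THRESHOLDS = {
    "cpu_percent": 80.0,   # CPU over 80%
    "error_rate": 0.01,    # error rate over 1%
    "latency_ms": 500.0,   # latency over 500ms
}

def check_thresholds(metrics: dict) -> list:
    """Return the names of metrics that breach their threshold."""
    return [
        name
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0.0) > limit
    ]

# A reading with high CPU but acceptable error rate and latency:
print(check_thresholds({"cpu_percent": 92.0, "error_rate": 0.002, "latency_ms": 120.0}))
# → ['cpu_percent']
```

The evaluator can only flag breaches of rules you wrote in advance, which is exactly the limitation the next section addresses.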
The Three Pillars
- Metrics: Aggregate measurements over time. CPU, memory, request rates. Efficient for dashboards and alerts. Pre-aggregated and optimised for time-series queries. Limitation: you must know what to measure in advance.
- Logs: Discrete events with context. Essential for debugging specific issues but expensive at scale. High-cardinality and rich in detail, but unstructured unless you enforce a schema. Limitation: volume. A busy service can generate terabytes of logs daily.
- Traces: Request paths through distributed systems. The key to understanding latency and dependencies in microservices. Limitation: implementation effort. Every service must be instrumented.
The three pillars are complementary. Metrics provide the overview. Logs provide the detail. Traces provide the context. A trace ID in a log message connects the log to the trace. A metric tagged with a service name connects the metric to the service.
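A minimal sketch of those connections as data, with each signal carrying a shared identifier. The field names are illustrative, loosely following common conventions, and the values are invented for the example.

```python
import uuid

trace_id = uuid.uuid4().hex  # one ID ties the log to the trace

# Metric: aggregated measurement, tagged with the producing service.
metric = {"name": "http_request_duration_ms", "value": 412, "service": "checkout"}

# Log: a discrete event carrying the trace ID for correlation.
log = {"level": "ERROR", "message": "payment declined",
       "service": "checkout", "trace_id": trace_id}

# Trace span: the request-path context, sharing the same trace ID.
span = {"trace_id": trace_id, "span_name": "charge_card", "duration_ms": 398}

# The trace ID connects the log to the trace; the service tag
# connects the metric to the service.
assert log["trace_id"] == span["trace_id"]
assert metric["service"] == log["service"]
```

Nothing more than consistent identifiers is needed: the correlation lives in the data, not in any one tool.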
Distributed Tracing Implementation
Start with OpenTelemetry. Instrument your services with auto-instrumentation libraries. Send traces to Jaeger or Tempo for storage and querying. Correlate traces with logs using trace IDs. The effort is front-loaded, but the debugging capability is transformative.
Trace-to-log correlation is the final piece. Every log message should include the trace ID, span ID, and span name. When you find an error in a trace, you can search logs by trace ID to find all related log messages. This correlation reduces debugging time from hours to minutes.
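A sketch of that correlation using only the standard logging module. The trace context here is a plain context variable with invented IDs; in an OpenTelemetry setup, the IDs would come from the current span instead.

```python
# Sketch of trace-to-log correlation: a logging.Filter copies the
# current trace context onto every log record so the formatter can
# emit trace ID, span ID, and span name alongside the message.
import contextvars
import io
import logging

trace_ctx = contextvars.ContextVar(
    "trace_ctx", default={"trace_id": "-", "span_id": "-", "span_name": "-"})

class TraceContextFilter(logging.Filter):
    """Attach the current trace context to each log record."""
    def filter(self, record):
        ctx = trace_ctx.get()
        record.trace_id = ctx["trace_id"]
        record.span_id = ctx["span_id"]
        record.span_name = ctx["span_name"]
        return True

stream = io.StringIO()  # stand-in for a real log destination
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace=%(trace_id)s span=%(span_id)s %(span_name)s %(message)s"))
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Illustrative IDs; in practice these come from the active span.
trace_ctx.set({"trace_id": "4bf92f35", "span_id": "00f067aa",
               "span_name": "charge_card"})
logger.error("payment declined")

print(stream.getvalue().strip())
```

Searching the log store for `trace=4bf92f35` now returns every message emitted under that trace, which is what turns an error seen in a trace into the full debugging story.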
Our Recommendation
Implement metrics first, logs second, traces third. Traces provide the highest value in distributed systems but require the most setup. Start tracing when you have more than three services or when debugging latency becomes painful.
Metrics are the foundation: easy to implement, cheap to store, immediate value. Logs are the next layer: structured logging and centralised collection provide essential debugging detail. Traces are the top layer: instrumentation across all services provides the distributed context that makes debugging complex systems possible.