API Observability: Logging, Metrics, and Distributed Tracing
An API that works in development and fails in production in ways you cannot diagnose is worse than an API that fails loudly. Production is where assumptions meet reality, and the gap between them only becomes visible if you can see what is happening inside the system. Observability is the discipline of making distributed systems understandable — building in enough visibility that when something goes wrong, you can find out what, where, and why without guessing.
The three pillars are logging, metrics, and distributed tracing. Each answers different questions and each is necessary.
Logging: What Happened
Logs are timestamped records of discrete events. They answer “what happened?” — which requests arrived, what errors occurred, what decisions the system made. Logs are the raw material of debugging.
Structured logging — JSON-formatted log lines with consistent fields rather than free-text strings — is the prerequisite for logs that are actually queryable. A log line like:
{
  "timestamp": "2026-05-02T10:30:15.423Z",
  "level": "error",
  "request_id": "req_f4e8b2d1",
  "user_id": "usr_abc123",
  "method": "POST",
  "path": "/api/payments",
  "status": 500,
  "duration_ms": 342,
  "error_code": "upstream_timeout",
  "service": "payment-service"
}
is queryable on every field. A query for all 500 errors from the payment service in the last hour is trivial. A query for all requests from a specific user is trivial. A query for requests slower than 500ms on a specific endpoint is trivial.
Free-text log lines require regex parsing and guesswork. Structured log lines are first-class data.
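To make "first-class data" concrete, here is a minimal sketch in Python of querying structured log lines with ordinary code instead of regexes. The file name is illustrative; the fields are the ones from the example above.

import json

# Load one JSON object per line from a structured log file
with open("api.log") as log:
    events = [json.loads(line) for line in log]

# "All 500s from the payment service" is a filter, not a regex
payment_errors = [
    e for e in events
    if e["service"] == "payment-service" and e["status"] == 500
]

# "Requests slower than 500ms on a specific endpoint" is equally direct
slow_payments = [
    e for e in events
    if e["path"] == "/api/payments" and e["duration_ms"] > 500
]

In practice these queries run in a log aggregation system rather than a script, but the principle is the same: every field is addressable.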
Log at the right level. At minimum: every request (method, path, status, duration, request ID), every error (with enough context to reproduce the conditions), and significant business events (payment processed, user created, job completed). Do not log sensitive data — request bodies containing credentials, personal data, or payment information should not appear in logs. Log the request ID that a customer can provide when reporting an issue.
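A minimal sketch of emitting such lines with Python's standard logging module (the field allowlist and logger name are illustrative, not a prescribed schema):

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    FIELDS = ("request_id", "user_id", "method", "path",
              "status", "duration_ms", "error_code", "service")

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Only allowlisted fields are serialized, which keeps credentials
        # and payment data out of the logs by construction.
        for key in self.FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Structured fields ride along via `extra` and land as top-level JSON keys
logger.info("request completed", extra={
    "request_id": "req_f4e8b2d1", "method": "POST", "path": "/api/payments",
    "status": 200, "duration_ms": 342, "service": "payment-service",
})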
Retain logs long enough to investigate incidents. A support ticket filed three weeks after the fact requires three weeks of log retention to investigate. Define retention policies based on investigation needs and regulatory requirements.
Metrics: How the System Is Behaving
Metrics are aggregated numerical measurements over time. They answer “how is the system behaving?” — request rate, error rate, response time distribution, database query latency, cache hit ratio, queue depth. Metrics are the material of dashboards, alerts, and capacity planning.
The three essential API metrics, known as the RED metrics, are:
Rate: requests per second, by endpoint and status code. This is the baseline signal for traffic patterns. Spikes or drops are the first indicator of something anomalous.
Errors: the proportion of requests resulting in 5xx responses. A sustained increase in error rate is the most direct signal that something is broken. Separate 4xx from 5xx — client errors are not server failures and should not trigger the same alerts.
Duration: response time distribution, expressed as percentiles. p50 (median) tells you typical performance. p95 tells you what most users experience. p99 tells you the tail — the worst 1% of requests. Mean response time hides outliers; percentiles reveal them. A p99 of 10 seconds on an endpoint with a p50 of 100ms is a problem that the mean conceals; a worked demonstration follows this list.
Saturation: how full the system's capacity is, worth tracking as a fourth metric alongside the RED three. Database connection pool utilization, thread pool exhaustion, memory pressure, CPU usage. These are leading indicators of future problems: they rise before error rates do.
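To make the Duration point concrete, a worked example using Python's statistics module: ninety-nine fast requests and one pathological one, with the mean looking healthy while p99 exposes the tail.

import statistics

# 99 requests at 100 ms plus one 10-second outlier
latencies_ms = [100] * 99 + [10_000]

print(statistics.fmean(latencies_ms))                  # 199.0 -- the mean looks fine
print(statistics.median(latencies_ms))                 # 100.0 -- p50, typical experience
print(statistics.quantiles(latencies_ms, n=100)[98])   # 9901.0 -- p99 reveals the 10 s tail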
Instrument your API to emit these metrics on every request, store them in a time-series database (Prometheus is the standard in self-hosted environments; cloud providers offer equivalent managed services), and build dashboards that show them at a glance.
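A minimal sketch of that instrumentation using the Python prometheus_client library (the metric names, label sets, and scrape port are illustrative choices, not a fixed convention):

import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests served",
    ["method", "path", "status"],
)
DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["method", "path"],
)

def record_request(method: str, path: str, status: int, started_at: float) -> None:
    """Call once per request, after the response has been written."""
    REQUESTS.labels(method=method, path=path, status=str(status)).inc()
    DURATION.labels(method=method, path=path).observe(time.monotonic() - started_at)

start_http_server(9100)  # exposes /metrics for Prometheus to scrape

The Histogram's buckets are what make percentile estimates (p50, p95, p99) available at dashboard and query time.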
Alerts should be on symptoms, not causes. Alert on error rate exceeding 1%, on p99 latency exceeding a defined threshold, on traffic dropping to near zero (often more alarming than a spike). Do not alert on CPU usage hitting 70% — alert on the user-visible symptom that high CPU causes. The goal of alerting is to be woken up when users are experiencing problems, not when a server metric crosses an internal threshold.
Distributed Tracing: Following a Request Across Services
In a multi-service architecture, a single user request may pass through an API gateway, an authentication service, a business logic service, a database, an external API, and back. When that request is slow or fails, logs from each service are disconnected. You know something went wrong; you do not know which service in the chain was the problem or how much time each step took.
Distributed tracing solves this by assigning every request a unique trace ID that propagates through every service in the chain. Each service adds a span to the trace — a record of when it received the request, what it did, and when it responded. The trace aggregator assembles spans into a complete picture of the request’s journey through the system.
The trace for a slow API request might show: 15ms in the gateway, 5ms in auth, 820ms in the payment service (with a sub-span showing 810ms waiting for an external payment processor API call), 10ms back through the gateway. The 820ms problem is immediately isolated to the payment service and specifically to the external API call. Without tracing, you have five services’ logs to correlate manually.
Implement tracing by generating a trace ID on ingress, including it as a header in every outbound request (the standard header is traceparent, per the W3C Trace Context specification), and attaching it to every log line. With that in place, a single trace ID links a specific user-facing failure to the exact spans in every service that handled it.
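A minimal sketch of that propagation in Python (the helper names are illustrative; the header layout of version, 32-hex-character trace ID, 16-hex-character span ID, and flags comes from the W3C specification):

import os

def new_traceparent(trace_id: str | None = None) -> str:
    """Build a traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or os.urandom(16).hex()  # 32 hex chars, shared by the whole chain
    span_id = os.urandom(8).hex()                # 16 hex chars, fresh for each hop
    return f"00-{trace_id}-{span_id}-01"         # trailing 01 = sampled

def trace_id_on_ingress(headers: dict) -> str:
    """Continue the caller's trace if present, otherwise start a new one."""
    incoming = headers.get("traceparent")
    if incoming:
        return incoming.split("-")[1]  # reuse the existing trace ID
    return os.urandom(16).hex()

# Outbound: forward the trace so the next service joins the same trace
outbound_headers = {"traceparent": new_traceparent(trace_id_on_ingress({}))}

Every log line the service emits then carries that trace ID as a field, exactly like the request ID in the logging example above.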
OpenTelemetry is the current standard for instrumentation — a vendor-neutral SDK for emitting traces, metrics, and logs that can be sent to any compatible backend (Jaeger, Zipkin, Honeycomb, Datadog, and others). Instrumenting with OpenTelemetry avoids lock-in to a specific observability vendor.
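A minimal sketch with the OpenTelemetry Python SDK, exporting to the console for illustration (the span and attribute names are made up for the example; a real deployment would use an OTLP exporter pointed at one of the backends above):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire the SDK once at startup; swapping the exporter changes the backend,
# not the instrumentation code
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("payment-service")

def charge(amount_cents: int) -> None:
    with tracer.start_as_current_span("charge") as span:
        span.set_attribute("payment.amount_cents", amount_cents)
        # Nested spans produce exactly the breakdown described above:
        # the slow external call shows up as its own timed sub-span
        with tracer.start_as_current_span("external-processor-call"):
            ...  # call out to the payment processor here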
Request IDs: The Foundation
Every API request should receive a unique request ID generated at ingress and included in every log line, every response header, and every error message. The request ID is the thread that connects a support report to the logs, a customer complaint to the trace, an error in one service to the original request that caused it.
X-Request-ID: req_f4e8b2d1a3c7e9b2
If the caller provides a request ID (some clients do for their own tracking), include it in logs alongside the server-generated one. Treat the request ID as load-bearing infrastructure — once you have it, you build everything else around it.
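A minimal sketch of that ingress step as a framework-agnostic WSGI middleware (the environ keys and ID format are illustrative):

import uuid

def request_id_middleware(app):
    """Generate a request ID at ingress, expose it to handlers and loggers,
    and echo it back to the caller in the response headers."""
    def wrapper(environ, start_response):
        req_id = "req_" + uuid.uuid4().hex[:16]
        environ["api.request_id"] = req_id
        client_id = environ.get("HTTP_X_REQUEST_ID")
        if client_id:
            environ["api.client_request_id"] = client_id  # log alongside the server ID

        def start_response_with_id(status, headers, exc_info=None):
            return start_response(status, list(headers) + [("X-Request-ID", req_id)], exc_info)

        return app(environ, start_response_with_id)
    return wrapper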
Observability is not an after-the-fact concern. The cost of retrofitting observability into a running system is significantly higher than building it in from the start. An API instrumented correctly from day one gives you the visibility to understand what it is doing before the first production incident reveals what you cannot see.