Distributed Systems

Jaeger Tracing for Cross-Service Debugging

Jaeger turns a slow request across many services into one visual trace. How distributed tracing works, what to instrument, and the sampling tradeoff that bites.

Part of Observability for Distributed Systems
Jaeger tracing, shown as a distributed-trace waterfall of spans with one amber slow span

Jaeger tracing exists to answer one question that logs and metrics cannot: when a request crossed eight services and was slow, which service was slow? It stitches per-service timing data into a single end-to-end trace, so a problem that used to take hours of cross-team log archaeology becomes one glance at a waterfall. The catch nobody mentions up front is sampling, which quietly decides whether the trace you need was ever recorded.

In a monolith, a profiler tells you where time went. In a distributed system, the time is scattered across services, machines, and teams. Distributed tracing reassembles that scattered timing into one picture, and it is close to indispensable once you pass a handful of services.

Why distributed tracing matters

Three signals make up observability: logs (what happened), metrics (how much, how often), and traces (where the time went across services). Most teams have the first two and discover they cannot answer the third when a request is mysteriously slow.

The failure pattern is familiar: a dashboard shows p99 latency spiked, but the latency is spread across a call chain, and each service’s logs only show its own slice. Without tracing, you are correlating timestamps across teams by hand. With it, you open the trace and the slow span is right there. This post is part of the Observability series and pairs with timeout budgets across service chains, which the traces here help you debug.

What problem does Jaeger tracing solve?

Jaeger answers “where did the time go” for a request spanning many services. Every service attaches a shared trace ID and records its own span; Jaeger collects those spans and assembles them into one end-to-end timeline. Instead of guessing which service caused a slow request, you see the exact hop and how long it took.

This is the capability that does not exist without tracing. Logs are per-service and per-line; metrics are aggregate. Neither reconstructs the path of a single request across the system. The trace does, and that reconstruction is what turns a multi-team debugging session into a single lookup.

What is the difference between a trace and a span?

A span is one unit of work within one service, carrying a name, a start time, a duration, and metadata (tags, logs, status). A trace is the tree of spans for a single request as it travels across services. The span is one stop; the trace is the whole journey, and the parent-child links between spans show the call structure.

Concretely: a request hits your API gateway (span 1), which calls an auth service (span 2), which calls a database (span 3), then the gateway calls a pricing service (span 4). Those four spans, linked by one trace ID with parent-child relationships, render as a waterfall where the width of each bar is its duration. A parent’s bar includes the time spent in its children, so the root span is always the widest; your problem is the widest bar whose time is not explained by the spans beneath it.
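A minimal sketch of that four-span trace as data may make the structure easier to see. The `Span` fields, the timings, and the helper names are all hypothetical illustrations, not Jaeger's or OpenTelemetry's actual data model. Because a parent span's duration includes its children, the sketch ranks spans by self time (duration minus direct children) rather than raw width.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    duration_ms: int


# Hypothetical timings for the gateway -> auth -> database, gateway -> pricing example.
TRACE = [
    Span("t1", "1", None, "api-gateway", 0, 200),
    Span("t1", "2", "1", "auth-service", 5, 150),
    Span("t1", "3", "2", "database", 10, 140),
    Span("t1", "4", "1", "pricing-service", 160, 30),
]


def self_time_ms(span: Span, spans: List[Span]) -> int:
    """Duration minus time spent in direct children (assumes children run sequentially)."""
    child_total = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return span.duration_ms - child_total


def slowest_by_self_time(spans: List[Span]) -> Span:
    """The span that spent the most time doing its own work."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


def render_waterfall(spans: List[Span], ms_per_char: int = 5) -> str:
    """Text waterfall: one line per span, offset by start time, bar width = duration."""
    lines = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        offset = " " * (s.start_ms // ms_per_char)
        bar = "#" * max(1, s.duration_ms // ms_per_char)
        lines.append(f"{s.name:<16}|{offset}{bar}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_waterfall(TRACE))
    print("slowest by self time:", slowest_by_self_time(TRACE).name)
```

With these made-up timings the gateway has the widest bar, but almost all of its time is spent waiting on the auth service, which in turn is waiting on the database.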

Should I use Jaeger or OpenTelemetry?

Use both, because they do different jobs. OpenTelemetry is the vendor-neutral standard for instrumenting your code and exporting telemetry; Jaeger is a backend that ingests, stores, and visualizes traces. The modern pattern is to instrument with OpenTelemetry and export to Jaeger, which gives you Jaeger’s UI without coupling your code to it.

The practical implication is to standardize on OpenTelemetry for instrumentation across every service and language, then treat the tracing backend as a swappable detail. In a polyglot system this matters even more, because OpenTelemetry gives you one instrumentation model across all your languages instead of a different client per runtime.

What should you instrument?

Trace the boundaries first: every inbound request, every outbound call to another service, every database query, and every external API call. Those edges are where time is spent and where one service hands off to another, so they are where a trace earns its value. Propagating the trace context across each boundary is what keeps the spans stitched into one trace.

The non-negotiable is context propagation. A trace only works if the trace ID travels with the request across every hop, which means each service must read the incoming trace context and pass it onward. This is the same cross-service plumbing that, done wrong, produces the boundary failures in Why Language Boundaries Break Polyglot Microservices; propagation that silently drops at one language boundary leaves you with broken, half-length traces.
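As a rough illustration of what "read the incoming trace context and pass it onward" involves, here is a sketch built on the W3C Trace Context `traceparent` header, the format OpenTelemetry propagates by default. The function names `extract` and `inject` are hypothetical, the header dictionary is assumed to use lowercase keys, and only version `00` of the header is handled. In practice the OpenTelemetry SDK's propagators do this for you; the sketch only shows the mechanics.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# traceparent = version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_span_id, flags) from incoming headers, or None."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, flags


def inject(incoming_headers: Dict[str, str]) -> Tuple[Dict[str, str], str]:
    """Build headers for an outbound call, continuing the incoming trace if there is one.

    Returns the outbound headers and the span ID this service generated.
    """
    ctx = extract(incoming_headers)
    if ctx is None:
        # No usable context: start a new trace (flags "01" marks it as sampled).
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, _, flags = ctx
    span_id = secrets.token_hex(8)  # this service's own span becomes the new parent
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}, span_id


if __name__ == "__main__":
    incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
    outbound, _ = inject(incoming)
    print(outbound["traceparent"])  # same trace ID, new parent span ID
```

The failure mode from the text is visible in the `ctx is None` branch: a service that never receives or never forwards the header starts a fresh trace ID, and the original trace simply ends at the previous hop.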

How much should I sample traces?

Sample based on what you need to see, not a flat percentage. Tracing every request is expensive in storage and overhead at scale, but naive head-based sampling (decide at the start, keep 1%) will usually throw away the rare slow or failed request you most need. Tail-based sampling decides after the full trace is collected, so you can keep all the errors and slow traces and drop the boring fast ones.

The sampling tradeoff is the part teams underestimate, and it directly determines whether tracing helps during an incident.

  • Head sampling (fixed rate): cheap and simple, but it decides before it knows whether the trace is interesting. The 1-in-100 you kept is probably a normal request; the slow one got dropped.
  • Tail sampling: buffers the whole trace, then keeps it if it errored, exceeded a latency threshold, or is otherwise notable. More infrastructure, far better signal.

For a system where you are tracing to catch the rare bad request, tail-based sampling is usually worth the extra moving parts, because the traces that survive are the ones you are likely to open: errors and slow requests, provided the collector can buffer each trace until it completes.
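The difference between the two strategies can be sketched in a few lines. This is a toy model under stated assumptions, not the sampler from Jaeger or the OpenTelemetry Collector: `FinishedTrace`, the 1,000 synthetic traces, the 1000 ms threshold, and the hash-based bucketing are all made up for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FinishedTrace:
    trace_id: str
    duration_ms: int
    has_error: bool


def _bucket(trace_id: str) -> float:
    """Map a trace ID to a stable number in [0, 1), so the decision is repeatable."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def head_sample(trace_id: str, rate: float) -> bool:
    """Head sampling: decide from the trace ID alone, before the request has run."""
    return _bucket(trace_id) < rate


def tail_sample(trace: FinishedTrace, slow_ms: int, baseline_rate: float) -> bool:
    """Tail sampling: decide after the trace is complete.

    Keeps every errored or slow trace, plus a small baseline of normal ones.
    """
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    return head_sample(trace.trace_id, baseline_rate)


# Synthetic traffic: mostly fast requests, with a few slow or failed ones mixed in.
TRACES: List[FinishedTrace] = [
    FinishedTrace(
        trace_id=f"{i:032x}",
        duration_ms=2500 if i % 200 == 0 else 20,
        has_error=(i % 333 == 0),
    )
    for i in range(1000)
]

if __name__ == "__main__":
    interesting = [t for t in TRACES if t.has_error or t.duration_ms >= 1000]
    kept_head = [t for t in interesting if head_sample(t.trace_id, 0.01)]
    kept_tail = [t for t in interesting if tail_sample(t, slow_ms=1000, baseline_rate=0.01)]
    print(f"interesting traces: {len(interesting)}")
    print(f"kept by 1% head sampling: {len(kept_head)}")
    print(f"kept by tail sampling: {len(kept_tail)}")
```

The structural point is that `head_sample` never looks at duration or error status, so whether it keeps an interesting trace is down to chance, while `tail_sample` keeps every one by construction. What the sketch leaves out is the hard part of tail sampling in production: holding all spans of a trace in one place until the trace is finished.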

What is the performance overhead of tracing?

Small per request, but real in aggregate, which is exactly why sampling exists. Creating and exporting spans adds a little CPU and memory per request, and shipping every trace at high traffic adds meaningful network and storage cost. The instrumentation overhead is rarely the problem; the volume of trace data is.

This reframes overhead as two separate concerns. The in-process cost of generating spans is typically negligible relative to the work the request is already doing, so you do not avoid tracing to save CPU. The export and storage cost, however, scales with how many traces you keep, and that is where naive “trace everything” gets expensive fast at high request rates.
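A back-of-the-envelope calculation shows why the storage side dominates. Every number below is an assumption chosen for illustration (5,000 requests per second, eight spans per trace, 500 bytes per stored span); substitute your own measurements.

```python
def daily_trace_storage_gb(
    requests_per_sec: float,
    spans_per_trace: int,
    bytes_per_span: int,
    keep_fraction: float,
) -> float:
    """Approximate raw span storage per day, in gigabytes (10^9 bytes).

    Ignores indexing overhead, compression, and replication, all of which
    change the real figure.
    """
    seconds_per_day = 86_400
    kept_traces = requests_per_sec * seconds_per_day * keep_fraction
    return kept_traces * spans_per_trace * bytes_per_span / 1e9


if __name__ == "__main__":
    everything = daily_trace_storage_gb(5000, 8, 500, keep_fraction=1.0)
    one_percent = daily_trace_storage_gb(5000, 8, 500, keep_fraction=0.01)
    print(f"keep everything: {everything:,.0f} GB/day")
    print(f"keep 1%: {one_percent:,.2f} GB/day")
```

Under these assumptions, keeping every trace comes to roughly 1,728 GB per day and keeping 1% comes to about 17 GB, which is why the keep fraction matters far more than the per-span CPU cost.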

The resolution is the sampling decision from the previous section, plus exporting asynchronously so tracing never sits in the request’s critical path. Batch and export spans off the hot path, sample intelligently so you store the interesting traces and drop the rest, and the overhead becomes a rounding error against the debugging time it saves. Tracing you turned off to save a few percent CPU is tracing you do not have during the incident that would have paid for it many times over.
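The "export off the hot path" idea can be sketched as a bounded queue drained by a background thread. `BatchExporter` and its `send` callback are hypothetical names for this sketch; OpenTelemetry SDKs ship their own batching span processors, and this is not their implementation.

```python
import queue
import threading
from typing import Any, Callable, List


class BatchExporter:
    """Buffers finished spans and ships them in batches from a background thread."""

    def __init__(
        self,
        send: Callable[[List[Any]], None],  # hypothetical: delivers one batch to the backend
        max_batch: int = 64,
        max_queue: int = 2048,
        flush_interval_s: float = 1.0,
    ) -> None:
        self._send = send
        self._max_batch = max_batch
        self._flush_interval_s = flush_interval_s
        self._queue: "queue.Queue[Any]" = queue.Queue(maxsize=max_queue)
        self._stop = threading.Event()
        self.dropped = 0  # approximate counter; not synchronized across threads
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, span: Any) -> None:
        """Called on the request path: never blocks, drops the span if the buffer is full."""
        try:
            self._queue.put_nowait(span)
        except queue.Full:
            self.dropped += 1

    def _drain(self) -> List[Any]:
        """Take up to max_batch spans from the queue without waiting."""
        batch: List[Any] = []
        while len(batch) < self._max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def _flush(self) -> None:
        """Send batches until the queue is empty. A real exporter would also handle send failures."""
        while True:
            batch = self._drain()
            if not batch:
                return
            self._send(batch)

    def _run(self) -> None:
        while not self._stop.is_set():
            self._stop.wait(self._flush_interval_s)  # wakes early on shutdown
            self._flush()

    def shutdown(self) -> None:
        """Stop the worker and flush anything still buffered."""
        self._stop.set()
        self._worker.join()
        self._flush()


if __name__ == "__main__":
    batches: List[List[Any]] = []
    exporter = BatchExporter(batches.append, max_batch=10, flush_interval_s=0.05)
    for i in range(25):
        exporter.submit({"span_id": i})
    exporter.shutdown()
    print(sum(len(b) for b in batches), "spans exported in", len(batches), "batches")
```

The design choice worth noting is in `submit`: when the buffer is full it drops the span rather than making the request wait, which keeps tracing out of the critical path at the cost of losing some data under load.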

A tracing setup checklist

Before you rely on Jaeger in an incident:

  • Instrumentation is OpenTelemetry-based, not backend-specific.
  • Trace context propagates across every service boundary, verified end to end (no broken traces).
  • Inbound requests, outbound calls, DB queries, and external APIs are all spanned.
  • Sampling keeps errors and slow traces (tail-based if you can run it), not just a flat percentage.
  • Spans carry useful tags (status codes, key IDs) without leaking sensitive data.
  • Trace retention and storage cost are sized deliberately; traces are voluminous.
  • The team knows how to go from a latency alert to the relevant trace quickly.
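The "no broken traces" item can be checked mechanically against a sample of exported spans. The sketch below assumes spans are available as plain dictionaries with hypothetical keys (`trace_id`, `span_id`, `parent_id`, `service`, `name`) and that you know which services are legitimate entry points; it is not a query against Jaeger's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def find_broken_traces(spans: Iterable[dict], entry_services: Set[str]) -> Dict[str, List[str]]:
    """Return {trace_id: [problem, ...]} for traces that look incomplete.

    Flags three symptoms: no root span at all, a trace that starts at a service
    that is not an entry point (context was probably dropped upstream), and a
    span whose parent was never recorded.
    """
    by_trace: Dict[str, List[dict]] = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)

    problems: Dict[str, List[str]] = {}
    for trace_id, members in by_trace.items():
        known_ids = {s["span_id"] for s in members}
        roots = [s for s in members if s["parent_id"] is None]
        issues: List[str] = []
        if not roots:
            issues.append("no root span recorded")
        for root in roots:
            if root["service"] not in entry_services:
                issues.append(f"trace starts at {root['service']}: context likely dropped upstream")
        for s in members:
            if s["parent_id"] is not None and s["parent_id"] not in known_ids:
                issues.append(f"span {s['name']} references missing parent {s['parent_id']}")
        if issues:
            problems[trace_id] = issues
    return problems


if __name__ == "__main__":
    sample = [
        {"trace_id": "a", "span_id": "1", "parent_id": None, "service": "gateway", "name": "GET /cart"},
        {"trace_id": "a", "span_id": "2", "parent_id": "1", "service": "auth", "name": "verify"},
        {"trace_id": "b", "span_id": "7", "parent_id": None, "service": "pricing", "name": "quote"},
        {"trace_id": "c", "span_id": "1", "parent_id": None, "service": "gateway", "name": "GET /cart"},
        {"trace_id": "c", "span_id": "3", "parent_id": "9", "service": "db", "name": "SELECT"},
    ]
    for trace_id, issues in find_broken_traces(sample, entry_services={"gateway"}).items():
        print(trace_id, issues)
```

Running a check like this against real traffic, rather than a demo request, is one way to find the service boundary where propagation silently stops.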

What I’d do differently

The mistake I have made is treating tracing as a checkbox: install it, see a few traces in the demo, move on. Then an incident hits, you open Jaeger, and the trace you need was sampled away, or it ends abruptly because context propagation broke at one service. Tracing that you have not validated under real conditions is tracing you cannot trust when it counts.

If I were rolling out tracing again, I would validate two things before declaring it done: that a trace survives end to end across every service without breaking, and that the sampling strategy actually keeps slow and failed requests. Get those right and Jaeger becomes the first place you look during a latency incident. Get them wrong and it becomes a dashboard you stop trusting. The dashboards that complement it are the subject of Grafana Dashboards for Operators, Not Executives.

Frequently asked questions

What problem does Jaeger tracing solve?

Jaeger answers “where did the time go” for a request that crosses many services. Logs and metrics tell you a request was slow; a distributed trace shows you which hop in the chain was slow, by stitching per-service spans into one end-to-end timeline keyed by a shared trace ID.

What is the difference between a trace and a span?

A span is one unit of work in one service, with a start time, duration, and metadata. A trace is the tree of spans for a single request as it flows across services. The trace shows the whole journey; each span shows one stop on it.

Should I use Jaeger or OpenTelemetry?

They are complementary. OpenTelemetry is the vendor-neutral standard for instrumenting your code and exporting trace data; Jaeger is a backend that stores and visualizes it. Instrument with OpenTelemetry, export to Jaeger, and you avoid lock-in while getting Jaeger’s UI.

How much should I sample traces?

Sampling is a cost-versus-visibility tradeoff. Tracing every request is expensive at scale, but aggressive head sampling can drop the rare slow request you most need. Tail-based sampling, which decides after seeing the whole trace, lets you keep the interesting traces (errors, slow ones) and drop the boring ones.
