The first time you really need observability is not when you're calmly looking at a dashboard. It's when a user writes "checkout is slow", the error graph looks normal, and all you find in the logs is a string of disconnected messages.
OpenTelemetry was created to avoid that moment: not to have more charts, but to connect the pieces. A request enters the API, calls a database, goes through an external provider, enqueues a job, and maybe fails three services later. Without distributed tracing, you reconstruct that story by hand. With OpenTelemetry, at least you have a map.
The point isn't the trace, it's the story
A trace is a sequence of spans. Put like that, it sounds cold. In practice, each span is a piece of the story: POST /checkout, SELECT inventory, call the payment provider, publish order.created.
The value comes when you start answering real questions:
- which external service is slowing down?
- do the errors come from a specific version?
- does the problem affect everyone or just one tenant?
- is a retry hiding a timeout?
- does the asynchronous job start and then die somewhere else?
These questions cannot be answered by a console.log added in a hurry. In fact, the log you add in an emergency often helps you today and becomes noise tomorrow.
How I would set this up in a Node.js app
The healthiest setup is simple: the app produces telemetry, the Collector decides where to send it.
Node.js app -> OpenTelemetry Collector -> observability backend
Why not export directly to the vendor? Because at first it seems faster; then you realize that each service has different configurations, different retries, different filters, and no central point where you can scrub sensitive data or switch destinations.
The Collector is boring in all the right ways. It receives OTLP, does batching, can filter, can do sampling, can add common attributes and can export to multiple systems.
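As a rough idea of what that looks like, here is a minimal Collector config sketch; the endpoint, the memory limit, and the attribute being scrubbed are placeholders to adapt:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  attributes:
    actions:
      - key: user.email   # placeholder: drop whatever must not leave your infra
        action: delete
  batch:

exporters:
  otlphttp:
    endpoint: https://observability.example.com  # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [otlphttp]
```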
Auto-instrumentation: good, but not enough
In Node.js I would start with auto-instrumentation. It gives you immediate visibility into HTTP, supported frameworks, databases and common libraries.
```bash
npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http
```
Then you initialize the SDK before the rest of the app:
```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```
However, this sees the framework, not your product. It knows you ran a query, but it doesn't know whether that query was part of "create order" or "renew subscription". For that you need manual spans at the points where the domain matters.
```typescript
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('checkout'); // scope name is illustrative

const span = tracer.startSpan('checkout.create_order');
try {
  span.setAttribute('cart.items_count', input.items.length);
  const order = await createOrder(input);
  span.setAttribute('order.id', order.id);
  return order;
} catch (error) {
  span.recordException(error as Error);
  throw error;
} finally {
  span.end();
}
```
I wouldn't put manual spans everywhere. I would put them where, at three in the morning, I would want to understand what happened without reading half the code base.
Three rules that avoid a lot of chaos
First rule: each service must have service.name, environment and version. It seems trivial, but without these attributes a trace is much less useful. When a deploy breaks something, you want to filter by version in two seconds.
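In the Node SDK these live on the resource. A minimal sketch, assuming a version of @opentelemetry/resources that exposes the Resource class (newer releases use a resourceFromAttributes helper instead); the service name and env variables are placeholders:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { Resource } from '@opentelemetry/resources';

const sdk = new NodeSDK({
  // Plain string keys, following the OpenTelemetry resource
  // semantic conventions, to stay version-agnostic.
  resource: new Resource({
    'service.name': 'checkout-api',
    'service.version': process.env.APP_VERSION ?? 'unknown',
    'deployment.environment': process.env.NODE_ENV ?? 'development',
  }),
  // ...exporter and instrumentations as above
});
```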
Second rule: don't put sensitive data in the attributes. Emails, tokens, full payloads, and addresses should not end up in an observability backend by accident. If you need to identify a user, consider internal IDs, hashing, or less sensitive fields.
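If you do need a stable user reference on spans, one option is a salted hash. A sketch; userRef, USER_REF_SALT, and customer.email are hypothetical names:

```typescript
import { createHash } from 'node:crypto';

// Hypothetical helper: a salted, truncated hash gives a stable,
// non-reversible reference you can attach to spans instead of an email.
function userRef(email: string): string {
  return createHash('sha256')
    .update(`${process.env.USER_REF_SALT}:${email}`)
    .digest('hex')
    .slice(0, 16);
}

span.setAttribute('user.ref', userRef(customer.email));
```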
Third rule: pay attention to cardinality. user.id as a span attribute can make sense. As a metric label it can destroy your costs and performance.
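Concretely, the same identifier behaves very differently in the two places; the meter, the metric, and the plan field below are illustrative:

```typescript
import { metrics, trace } from '@opentelemetry/api';

const meter = metrics.getMeter('checkout');
const requestDuration = meter.createHistogram('http.server.duration', { unit: 'ms' });

function recordRequest(userId: string, plan: string, elapsedMs: number) {
  // Fine on a span: high cardinality, but each value lives inside one trace.
  trace.getActiveSpan()?.setAttribute('user.id', userId);

  // Risky as a metric label: one time series per user. Bounded
  // dimensions like plan, region, or status code are safer.
  requestDuration.record(elapsedMs, { plan });
}
```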
Metrics: few, but good
I would start with very practical metrics (a code sketch for a couple of them follows the list):
- rates, errors and duration of requests;
- latency of external dependencies;
- number of timeouts and retries;
- queue depth;
- job duration;
- percentage of errors per version.
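A minimal sketch of dependency latency and error counting with the metrics API; the meter name, metric names, and the payment-provider stand-ins are illustrative:

```typescript
import { metrics } from '@opentelemetry/api';

// Illustrative stand-ins for the app's own types and payment client.
type Order = { id: string };
declare function callPaymentProvider(order: Order): Promise<void>;

const meter = metrics.getMeter('checkout');

// Histogram for external-dependency latency, counter for its errors.
const providerLatency = meter.createHistogram('payment_provider.duration', {
  unit: 'ms',
  description: 'Latency of calls to the payment provider',
});
const providerErrors = meter.createCounter('payment_provider.errors');

async function chargeWithTelemetry(order: Order): Promise<void> {
  const start = Date.now();
  try {
    await callPaymentProvider(order);
  } catch (error) {
    providerErrors.add(1); // counts failed calls; a retry wrapper could also add here
    throw error;
  } finally {
    providerLatency.record(Date.now() - start);
  }
}
```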
The rest is added when needed. Dashboards full of graphs that no one looks at are furniture, not observability.
Logs: still useful, but linked
Logs don't disappear. They simply become much more useful when they carry trace_id and span_id. So you can start from an error log and open the trace, or start from a slow trace and read only the logs produced in that path.
Without correlation, you are looking for needles. With correlation, at least you know which drawer to look in.
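A minimal sketch of the idea with plain JSON logs; logWithTrace is a hypothetical helper, and structured loggers like pino or winston have OpenTelemetry integrations that do this for you:

```typescript
import { trace } from '@opentelemetry/api';

// Hypothetical helper: stamp every log line with the active trace context.
function logWithTrace(message: string, fields: Record<string, unknown> = {}) {
  const ctx = trace.getActiveSpan()?.spanContext();
  console.log(
    JSON.stringify({
      message,
      ...fields,
      trace_id: ctx?.traceId,
      span_id: ctx?.spanId,
    }),
  );
}

logWithTrace('payment provider timed out', { attempt: 2 });
```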
The checklist I would use before saying "we're covered"
- Traces actually cross multiple services.
- Logs include trace_id and span_id.
- The Collector is configured with batching and memory limits.
- Errors are recorded on spans.
- There is a sampling policy.
- Metrics have controlled cardinality.
- Sensitive data is filtered.
- Alerts start from user symptoms, not random graphs.
Conclusion
OpenTelemetry does not solve production problems on its own. But the way you deal with them changes. Instead of blindly adding logs, you start following the actual path of a request.
For me the sign that it's working is simple: when something happens, the team stops asking "where are we looking?" and starts asking "why is that piece slow?". That's where observability becomes a tool, not a collection of dashboards.