# OpenTelemetry in production: Stop debugging in the dark

The first time you really need observability is not when you're calmly looking at a dashboard. It's when a user writes "checkout is slow", the error graph looks normal, and the logs show only a string of disconnected messages.

OpenTelemetry was created to avoid that moment: not to give you more graphs, but to connect the pieces. A request enters the API, calls a database, goes through an external provider, publishes a queued job, and maybe fails three services later. Without distributed tracing, you reconstruct that story by hand. With OpenTelemetry, at least you have a map.

## The point isn't the trace, it's the story

A trace is a sequence of spans. Put like that, it sounds cold. In practice, each span is a piece of the story: `POST /checkout`, `SELECT inventory`, `call payment provider`, `publish order.created`.

The value comes when you start answering real questions:

- which external service is slowing things down?
- do the errors come from a specific version?
- does the problem affect everyone or just one tenant?
- is a retry hiding a timeout?
- does the asynchronous job start and then die somewhere else?

These questions cannot be answered by a `console.log` added in a hurry. In fact, the log you add in an emergency often helps you today and becomes noise tomorrow.

## How I would set this up in a Node.js app

The healthiest setup is simple: the app produces telemetry, the Collector decides where to send it.

```text
Node.js app -> OpenTelemetry Collector -> observability backend
```

Why not export directly to the vendor? At first it seems faster. Then you realize that each service has its own configuration, its own retries and its own filters, and that there is no central point where you can remove sensitive data or change the destination.

The Collector is boring in all the right ways.
It receives OTLP, batches it, and can filter, sample, add common attributes and export to multiple systems.

## Auto-instrumentation: good, but not enough

In Node.js I would start with auto-instrumentation. It gives you immediate visibility into HTTP, supported frameworks, databases and common libraries.

```bash
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http
```

Then you initialize the SDK before the rest of the app:

```typescript
// Load this file before any other application module, so the
// instrumentations can patch libraries as they are imported.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    // Points at the Collector's OTLP/HTTP traces endpoint.
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

However, this sees the framework, not your product. It knows you ran a query, but it doesn't know whether that query belonged to "create order" or "renew subscription". For that you need manual spans at the points where the domain matters.

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

// The tracer name is illustrative; CreateOrderInput and createOrder
// come from the application code.
const tracer = trace.getTracer('checkout');

async function createOrderTraced(input: CreateOrderInput) {
  // startActiveSpan makes the span current, so auto-instrumented calls
  // made inside createOrder are recorded as its children.
  return tracer.startActiveSpan('checkout.create_order', async (span) => {
    try {
      span.setAttribute('cart.items_count', input.items.length);
      const order = await createOrder(input);
      span.setAttribute('order.id', order.id);
      return order;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

I wouldn't put manual spans everywhere. I would put them where, at three in the morning, I would want to understand what happened without reading half the codebase.

## Three rules that avoid a lot of chaos

First rule: each service must have `service.name`, environment and version. It seems trivial, but without these attributes a trace is much less useful.
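
One way to supply those three attributes without touching the SDK setup is the standard `OTEL_RESOURCE_ATTRIBUTES` environment variable, which takes comma-separated `key=value` pairs. The sketch below builds that string. The helper name `toResourceAttributes`, the `ServiceIdentity` type and the example values are hypothetical, and the environment key is an assumption: newer semantic conventions use `deployment.environment.name`, older ones `deployment.environment`, so check which one your backend expects.

```typescript
// Hypothetical helper, not part of any OpenTelemetry package.
interface ServiceIdentity {
  name: string;        // logical service name, e.g. "checkout-api"
  version: string;     // release tag or build number
  environment: string; // e.g. "production" or "staging"
}

// Builds a value for OTEL_RESOURCE_ATTRIBUTES: comma-separated
// key=value pairs, with values percent-encoded so that "," and "="
// inside a value cannot break the format.
function toResourceAttributes(id: ServiceIdentity): string {
  const attrs: Record<string, string> = {
    'service.name': id.name,
    'service.version': id.version,
    // Assumed key; older conventions use 'deployment.environment'.
    'deployment.environment.name': id.environment,
  };
  return Object.entries(attrs)
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join(',');
}

// Example values are made up for illustration.
const resourceAttributes = toResourceAttributes({
  name: 'checkout-api',
  version: '1.4.2',
  environment: 'production',
});
```

The resulting string would be exported as `OTEL_RESOURCE_ATTRIBUTES` in the deploy environment, before the process starts, so every span carries the same identity.
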
When a deploy breaks something, you want to filter by version in two seconds.

Second rule: don't put sensitive data in the attributes. Emails, tokens, full payloads and addresses should not end up in an observability backend by accident. If you need to identify a user, consider internal IDs, hashing, or less sensitive fields.

Third rule: pay attention to cardinality. `user.id` as a span attribute can make sense. As a metric label it can wreck your costs and performance.

## Metrics: few, but good

I would start with very practical metrics:

- rate, errors and duration of requests;
- latency of external dependencies;
- number of timeouts and retries;
- queue depth;
- job duration;
- percentage of errors per version.

The rest is added when needed. Dashboards full of graphs that no one looks at are furniture, not observability.

## Logs: still useful, but linked

Logs don't disappear. They simply become much more useful when they carry `trace_id` and `span_id`. That way you can start from an error log and open the trace, or start from a slow trace and read only the logs produced along that path.

Without correlation, you are looking for a needle in a haystack. With correlation, at least you know which drawer to look in.

## The checklist I would use before saying "we're covered"

- Traces actually cross multiple services.
- Logs include `trace_id` and `span_id`.
- The Collector is configured with batching and memory limits.
- Errors are recorded on spans.
- There is a sampling policy.
- Metrics have controlled cardinality.
- Sensitive data is filtered.
- Alerts start from user symptoms, not random graphs.

## Conclusion

OpenTelemetry does not solve production problems on its own. But it changes the way you deal with them.
Instead of blindly adding logs, you start following the actual path of a request.

For me the sign that it's working is simple: when something happens, the team stops asking "where do we look?" and starts asking "why is that piece slow?". That's where observability becomes a tool, not a collection of dashboards.

## Sources

- [OpenTelemetry: Overview](https://opentelemetry.io/docs/specs/otel/overview/)
- [OpenTelemetry Collector: Configuration](https://opentelemetry.io/docs/collector/configuration/)
- [OpenTelemetry JavaScript: Node.js getting started](https://opentelemetry.io/docs/languages/js/getting-started/nodejs/)
- [OpenTelemetry Semantic Conventions](https://opentelemetry.io/docs/concepts/semantic-conventions/)