OpenTelemetry in production: Stop debugging in the dark

The first time you really need observability is not when you're calmly looking at a dashboard. It's when a user writes "checkout is slow", the error graph looks normal, and all you find in the logs is a string of disconnected messages.

OpenTelemetry was created to avoid that moment: not to give you more charts, but to connect the pieces. A request enters the API, calls a database, goes through an external provider, publishes a queued job, and maybe fails three services later. Without distributed tracing, you reconstruct that story by hand. With OpenTelemetry, at least you have a map.

## The point isn't the trace, it's the story

A trace is a tree of spans. Put like that, it sounds cold. In practice, each span is a piece of the story: `POST /checkout`, `SELECT inventory`, `call payment provider`, `publish order.created`.

The value comes when you start answering real questions:

- which external service is slowing things down?
- do the errors come from a specific version?
- does the problem affect everyone or just one tenant?
- is a retry hiding a timeout?
- does the asynchronous job start but then die somewhere else?

These questions cannot be answered by a `console.log` added in a hurry. In fact, the log you add in an emergency often helps today and becomes noise tomorrow.

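To make the "which piece is slow?" question concrete, here is a minimal sketch of how a trace can be read as data. The `SpanRecord` shape and the `slowestBySelfTime` helper are hypothetical and heavily simplified (real spans carry trace IDs, attributes, status and events), and the timings are invented for illustration.

```typescript
// Hypothetical, simplified span shape; not the OpenTelemetry data model.
interface SpanRecord {
  spanId: string;
  parentSpanId?: string;
  name: string;
  startMs: number;
  endMs: number;
}

// Self time = a span's duration minus the duration of its direct children.
// Assumes children run one after another; overlapping children would be double-counted.
function slowestBySelfTime(spans: SpanRecord[]): { name: string; selfMs: number } | undefined {
  const selfMs = new Map<string, number>();
  for (const s of spans) {
    selfMs.set(s.spanId, s.endMs - s.startMs);
  }
  for (const s of spans) {
    if (s.parentSpanId !== undefined && selfMs.has(s.parentSpanId)) {
      selfMs.set(s.parentSpanId, selfMs.get(s.parentSpanId)! - (s.endMs - s.startMs));
    }
  }
  let slowest: { name: string; selfMs: number } | undefined;
  for (const s of spans) {
    const ms = selfMs.get(s.spanId)!;
    if (!slowest || ms > slowest.selfMs) {
      slowest = { name: s.name, selfMs: ms };
    }
  }
  return slowest;
}

// Invented timings for the checkout story above.
const checkoutTrace: SpanRecord[] = [
  { spanId: 'a', name: 'POST /checkout', startMs: 0, endMs: 900 },
  { spanId: 'b', parentSpanId: 'a', name: 'SELECT inventory', startMs: 10, endMs: 40 },
  { spanId: 'c', parentSpanId: 'a', name: 'call payment provider', startMs: 50, endMs: 850 },
  { spanId: 'd', parentSpanId: 'a', name: 'publish order.created', startMs: 860, endMs: 880 },
];

console.log(slowestBySelfTime(checkoutTrace));
// { name: 'call payment provider', selfMs: 800 }
```

A tracing backend does this kind of analysis for you; the point is only that the parent/child links are what turn four timings into an answer.
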
## How I would set this up in a Node.js app

The healthiest setup is simple: the app produces telemetry, the Collector decides where to send it.

```text
Node.js app -> OpenTelemetry Collector -> observability backend
```

Why not export directly to the vendor? It seems faster at first, but then you realize that each service has its own configuration, its own retries and its own filters, and that there is no central point to remove sensitive data or change the destination.

The Collector is boring in all the right ways. It receives OTLP, batches it, and can filter, sample, add common attributes and export to multiple systems.

## Auto-instrumentation: good, but not enough

In Node.js I would start with auto-instrumentation. It gives you immediate visibility into HTTP, supported frameworks, databases and common libraries.

```bash
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http
```

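Then you initialize the SDK before the rest of the app is loaded:

```typescript
// Must run before the application's own modules are imported,
// otherwise the instrumentations cannot patch them.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    // When unset, the exporter falls back to http://localhost:4318/v1/traces.
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush pending spans before the process exits.
process.on('SIGTERM', () => {
  void sdk.shutdown().finally(() => process.exit(0));
});
```
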
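However, auto-instrumentation sees the framework, not your product. It knows you ran a query, but it doesn't know whether that query belonged to "create order" or "renew subscription". For that you need manual spans at the points where the domain matters.

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('checkout');

// `CheckoutInput` and `createOrder` stand in for your own domain code.
async function checkout(input: CheckoutInput) {
  // startActiveSpan makes this span the parent of spans created inside the
  // callback, so the auto-instrumented queries nest under it.
  return tracer.startActiveSpan('checkout.create_order', async (span) => {
    try {
      span.setAttribute('cart.items_count', input.items.length);
      const order = await createOrder(input);
      span.setAttribute('order.id', order.id);
      return order;
    } catch (error) {
      span.recordException(error as Error);
      // recordException only adds an event; the error status is set separately.
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

I wouldn't put manual spans everywhere. I would put them where, at three in the morning, I would want to understand what happened without reading half the codebase.
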
## Three rules that avoid a lot of chaos

First rule: every service must set `service.name`, environment and version. It seems trivial, but without these attributes a trace is much less useful. When a deploy breaks something, you want to filter by version in two seconds.

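One way to set these without touching application code is the standard `OTEL_RESOURCE_ATTRIBUTES` environment variable, which takes comma-separated `key=value` pairs. The sketch below only builds that string; `buildResourceAttributes` is a hypothetical helper, the service name and version are made up, and the environment key is an assumption (`deployment.environment` here; newer semantic conventions use `deployment.environment.name`, so check what your backend expects).

```typescript
// Hypothetical helper: formats attributes in the `key=value,key=value` form
// used by the OTEL_RESOURCE_ATTRIBUTES environment variable.
// Values are percent-encoded so commas or spaces cannot break the format.
function buildResourceAttributes(attrs: Record<string, string>): string {
  return Object.entries(attrs)
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join(',');
}

// Example values are invented for illustration.
const resourceAttributes = buildResourceAttributes({
  'service.name': 'checkout-api',
  'service.version': '1.4.2',
  'deployment.environment': 'production',
});

console.log(resourceAttributes);
// service.name=checkout-api,service.version=1.4.2,deployment.environment=production
```
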
Second rule: don't put sensitive data in attributes. Emails, tokens, full payloads and addresses should not end up in an observability backend by accident. If you need to identify a user, consider internal IDs, hashing, or less sensitive fields.

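A minimal sketch of what that can look like in the app, before attributes ever reach a span. The key names in `BLOCKED_KEYS` and `HASHED_KEYS` are assumptions for illustration, and `sanitizeAttributes` is a hypothetical helper, not part of OpenTelemetry. A plain SHA-256 of a short ID can be brute-forced, so a keyed hash (HMAC) would be the stronger choice if the IDs themselves are sensitive.

```typescript
import { createHash } from 'node:crypto';

type AttributeValue = string | number | boolean;

// Assumed key names; adapt both sets to your own data model.
const BLOCKED_KEYS = new Set(['user.email', 'auth.token', 'http.request.body']);
const HASHED_KEYS = new Set(['user.id']);

// Hypothetical helper: drops blocked keys and replaces hashed keys with a
// truncated SHA-256 digest, leaving everything else untouched.
function sanitizeAttributes(
  attrs: Record<string, AttributeValue>,
): Record<string, AttributeValue> {
  const safe: Record<string, AttributeValue> = {};
  for (const [key, value] of Object.entries(attrs)) {
    if (BLOCKED_KEYS.has(key)) continue;
    safe[key] = HASHED_KEYS.has(key)
      ? createHash('sha256').update(String(value)).digest('hex').slice(0, 16)
      : value;
  }
  return safe;
}

const safeAttributes = sanitizeAttributes({
  'user.email': 'jane@example.com',
  'user.id': '42',
  'cart.items_count': 3,
});
// safeAttributes has no 'user.email', a hashed 'user.id', and 'cart.items_count' unchanged.
```
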
Third rule: pay attention to cardinality. `user.id` as a span attribute can make sense. As a metric label it can destroy your costs and performance.

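The arithmetic behind that warning is easy to sketch: a metric produces one time series per unique combination of label values. The `seriesCount` helper and the generated events below are hypothetical, purely to show the difference in scale.

```typescript
type Labels = Record<string, string>;

// Counts distinct time series: one per unique combination of label values.
function seriesCount(events: Labels[], labelKeys: string[]): number {
  const combos = new Set<string>();
  for (const event of events) {
    combos.add(labelKeys.map((key) => `${key}=${event[key] ?? ''}`).join('|'));
  }
  return combos.size;
}

// Invented data: 1000 requests from 1000 different users across two routes.
const requestEvents: Labels[] = Array.from({ length: 1000 }, (_, i) => ({
  'http.route': i % 2 === 0 ? '/cart' : '/checkout',
  'user.id': `u${i}`,
}));

console.log(seriesCount(requestEvents, ['http.route'])); // 2
console.log(seriesCount(requestEvents, ['http.route', 'user.id'])); // 1000
```

Two series versus a thousand, for the same traffic. On a span the same `user.id` costs nothing extra, because spans are individual events rather than aggregated series.
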
## Metrics: few, but good

I would start with very practical metrics:

- request rate, errors and duration;
- latency of external dependencies;
- number of timeouts and retries;
- queue depth;
- job duration;
- percentage of errors per version.

The rest gets added when needed. Dashboards full of graphs that no one looks at are furniture, not observability.

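As an illustration of the last item in the list, here is a sketch of what "percentage of errors per version" means as a computation. In practice the metrics backend does this aggregation from counters; `RequestOutcome` and `errorRateByVersion` are hypothetical names and the numbers are invented.

```typescript
interface RequestOutcome {
  version: string;
  ok: boolean;
}

// Returns the error percentage (0 to 100) for each service version.
function errorRateByVersion(outcomes: RequestOutcome[]): Map<string, number> {
  const totals = new Map<string, { total: number; errors: number }>();
  for (const { version, ok } of outcomes) {
    const entry = totals.get(version) ?? { total: 0, errors: 0 };
    entry.total += 1;
    if (!ok) entry.errors += 1;
    totals.set(version, entry);
  }
  const rates = new Map<string, number>();
  totals.forEach(({ total, errors }, version) => {
    rates.set(version, (errors * 100) / total);
  });
  return rates;
}

// Invented data: 1 failure in 100 requests on 1.4.1, 10 in 50 on 1.4.2.
const outcomes: RequestOutcome[] = [
  ...Array.from({ length: 100 }, (_, i) => ({ version: '1.4.1', ok: i !== 0 })),
  ...Array.from({ length: 50 }, (_, i) => ({ version: '1.4.2', ok: i >= 10 })),
];

const rates = errorRateByVersion(outcomes);
console.log(rates.get('1.4.1'), rates.get('1.4.2')); // 1 20
```
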
## Logs: still useful, but linked

Logs don't disappear. They simply become much more useful when they carry `trace_id` and `span_id`. That way you can start from an error log and open the trace, or start from a slow trace and read only the logs produced along that path.

Without correlation, you are looking for needles. With correlation, at least you know which drawer to look in.

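A minimal sketch of a correlated log line, assuming structured JSON logs. `TraceContext` is a structural subset of the span context you would normally read from the active span (for example via `trace.getActiveSpan()?.spanContext()` in `@opentelemetry/api`); `formatLogLine` is a hypothetical helper and the IDs are sample values.

```typescript
// Structural subset of an OpenTelemetry span context.
interface TraceContext {
  traceId: string;
  spanId: string;
}

// Hypothetical helper: serializes a log record, adding correlation fields
// only when a trace context is available.
function formatLogLine(
  level: 'info' | 'warn' | 'error',
  message: string,
  ctx?: TraceContext,
): string {
  return JSON.stringify({
    level,
    message,
    ...(ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {}),
  });
}

// Sample IDs for illustration only.
const logLine = formatLogLine('error', 'payment provider timed out', {
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
});

console.log(logLine);
// {"level":"error","message":"payment provider timed out","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}
```

Many logging libraries have instrumentations that inject these fields automatically, so in a real service I would check for that before writing a helper like this.
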
## The checklist I would use before saying "we're covered"

- Traces actually cross multiple services.
- Logs include `trace_id` and `span_id`.
- The Collector is configured with batching and memory limits.
- Errors are recorded on spans.
- There is a sampling policy.
- Metrics have controlled cardinality.
- Sensitive data is filtered.
- Alerts start from user symptoms, not random graphs.

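On the sampling item: the idea behind ratio-based head sampling is that the decision is derived from the trace ID, so every service that sees the same trace makes the same choice. The sketch below is a simplified illustration of that idea, not the SDK's actual algorithm; `shouldSample` is a hypothetical function.

```typescript
// Simplified ratio-based sampling: maps the trace ID to a number in
// [0, 2^32) and keeps the trace if it falls below ratio * 2^32.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the first 8 hex characters (32 bits) of the trace ID as an integer.
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < ratio * 0x100000000;
}

// Sample trace ID for illustration; 0x4bf92f35 sits at roughly 30% of the range.
const sampleTraceId = '4bf92f3577b34da6a3ce929d0e0e4736';

console.log(shouldSample(sampleTraceId, 0.5)); // true
console.log(shouldSample(sampleTraceId, 0.25)); // false
```

In a real setup I would rely on the samplers that ship with the SDK, or on sampling in the Collector, rather than hand-rolling this.
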
## Conclusion

OpenTelemetry does not solve production problems on its own, but it changes the way you deal with them. Instead of blindly adding logs, you start following the actual path of a request.

For me the sign that it's working is simple: when something happens, the team stops asking "where do we look?" and starts asking "why is that piece slow?". That's when observability becomes a tool, not a collection of dashboards.

## Sources

- [OpenTelemetry: Overview](https://opentelemetry.io/docs/specs/otel/overview/)
- [OpenTelemetry Collector: Configuration](https://opentelemetry.io/docs/collector/configuration/)
- [OpenTelemetry JavaScript: Node.js getting started](https://opentelemetry.io/docs/languages/js/getting-started/nodejs/)
- [OpenTelemetry Semantic Conventions](https://opentelemetry.io/docs/concepts/semantic-conventions/)