|
| 1 | +# Telemetry conventions |
| 2 | + |
| 3 | +How to add telemetry so every instrumentation reads the same way. The framework |
| 4 | +lives in `src/telemetry`; the per-domain instrumentation business code talks to |
| 5 | +lives here in `src/instrumentation`. See `cli.ts` (spans + phases) and `ssh.ts` |
| 6 | +(point-in-time logs) as references. |
| 7 | + |
| 8 | +## Checklist |
| 9 | + |
| 10 | +- One instrumentation class per domain (`FooTelemetry`) wrapping |
| 11 | + `TelemetryService`; business code imports that, never a raw span. |
| 12 | +- Event name is `domain.snake_case`; point-in-time logs use past tense. |
| 13 | +- Event names and attribute keys follow OTel: lowercase, `.` for hierarchy, `_` |
| 14 | + to split words, never camelCase. Enumerated values are typed `snake_case` |
| 15 | + unions, never bare `string`. |
| 16 | +- Numbers go in `measurements` (raw), never pre-bucketed into string properties. |
| 17 | +- Set attributes imperatively with `setProperty`/`setMeasurement`; never add a |
| 18 | + return value that exists only to be logged. |
| 19 | +- No secrets, tokens, query strings, or unbounded values in properties; routes |
| 20 | + go through `normalizeRoute`. |
| 21 | +- Let the framework set `result`; add a domain `outcome` only when an operation |
| 22 | + has several success modes. Failures go to a typed union; non-error early exits |
| 23 | + call `markAborted`. |
| 24 | + |
| 25 | +## Layers |
| 26 | + |
| 27 | +- **Framework** (`src/telemetry`): `TelemetryService` (`trace`/`log`/`logError`) |
| 28 | + hands out `Span` handles and owns IDs, timing, `result`, level-gating, and the |
| 29 | + wire format. Telemetry-off is handled here (`NOOP_SPAN`), so instrumentation |
| 30 | + never checks whether telemetry is enabled. |
| 31 | +- **Instrumentation** (`src/instrumentation/*`): one typed class per domain, the |
| 32 | + only telemetry surface business code sees. |
| 33 | + |
| 34 | +## Threading |
| 35 | + |
| 36 | +Spans are passed **explicitly** as a callback argument; there is no |
| 37 | +ambient/active-span context. Two patterns keep telemetry out of business logic: |
| 38 | + |
| 39 | +1. **Imperative attributes** — `span.setProperty("outcome", "cache_hit")` at the |
| 40 | + point the value is known. This is the standard OpenTelemetry model. |
| 41 | +2. **Typed phases** — wrap an async step in `span.phase(...)` and read one |
| 42 | + property off its _natural_ return value, e.g. |
| 43 | + `trace.versionCheck(() => this.checkBinary(...))`. Extraction stays out of |
| 44 | + the business function. |
| 45 | + |
| 46 | +Never return a value purely so a caller can log it; that couples the return type |
| 47 | +to observability. Returning is fine when the business uses the value too. |
| 48 | + |
| 49 | +## Spans, phases, logs |
| 50 | + |
| 51 | +- `trace(name, fn, props?, meas?)` — a span with framework-set `result` and |
| 52 | + `durationMs`. Use for an operation with a start and end. |
| 53 | +- `span.phase(name, fn, ...)` — the same, nested (composed as `parent.child`); |
| 54 | + child names cannot contain `.`. |
| 55 | +- `span.log(name, ...)` / `span.logError(name, error, ...)` — point-in-time |
| 56 | + events under a span, no `result`/`durationMs`. Use for discrete signals. |
| 57 | + |
| 58 | +## Naming |
| 59 | + |
| 60 | +| Thing | Convention | Examples | |
| 61 | +| -------------------------- | ------------------------------------------- | --------------------------------------------------------- | |
| 62 | +| Event name | `domain.snake_case` | `cli.resolve`, `remote.setup`, `connection.dropped` | |
| 63 | +| Point-in-time log | past tense | `connection.dropped`, `ssh.process.lost` | |
| 64 | +| Child phase | bare `snake_case` | `cache_lookup`, `version_check`, `connection_handoff` | |
| 65 | +| Property / measurement key | lowercase; `_` splits words, `.` for groups | `cache_source`, `failure_category`, `status.from` | |
| 66 | +| Enumerated value | typed `snake_case` union | `"cache_hit"`, `"session_token"`, `"unrecoverable_close"` | |
| 67 | + |
| 68 | +This is the [OTel attribute convention](https://opentelemetry.io/docs/specs/semconv/general/naming/): |
| 69 | +`.` is the namespace delimiter, `_` joins words within a segment, never |
| 70 | +camelCase. Default to a flat `snake_case` key; use `.` only to group genuinely |
| 71 | +related attributes (a `status.from` / `status.to` pair). Keep phase names |
| 72 | +subject-first within a domain (`agent_resolve`, not `resolve_agent`). |
| 73 | + |
| 74 | +**Grouping.** Group related events under a shared dotted namespace so a prefix |
| 75 | +query returns the whole family: `workspace.update.triggered` and |
| 76 | +`workspace.update.prompted` both sit under `workspace.update.*`, as do |
| 77 | +`auth.token_refresh.completed` / `.deduped` under `auth.token_refresh.*`. Keep |
| 78 | +the namespace node a pure prefix; don't also emit an event with that exact name, |
| 79 | +since dotted names otherwise read as span phases (`parent.child`). This is |
| 80 | +[OTel namespacing](https://opentelemetry.io/docs/specs/semconv/general/naming/), |
| 81 | +which exists precisely so related signals query together. |
| 82 | + |
| 83 | +**Namespacing.** OTel suggests prefixing custom attributes (`com.acme.*`) to |
| 84 | +avoid clashing with future semconv. We don't namespace event-level attributes: |
| 85 | +each is already scoped by its (namespaced) event name and only flows into |
| 86 | +Coder's own pipeline, so a bare `cache_source` can't collide with a future OTel |
| 87 | +`cache.source`. Resource and provenance attributes stay namespaced |
| 88 | +(`coder.deployment.url`, `coder.event.*`). |
| 89 | + |
| 90 | +## Properties vs measurements |
| 91 | + |
| 92 | +- **Properties** are low-cardinality string dimensions (the framework stringifies |
| 93 | + `string | number | boolean`). Use them for what you group or filter by. |
| 94 | +- **Measurements** are raw numbers. Don't pre-bucket into string labels: both |
| 95 | + export as record attributes, and a query can bucket the raw number at read |
| 96 | + time. `result` and `durationMs` are framework-managed and cannot be set. |
| 97 | +- **Units.** There is no unit field at emit time, so put the unit in the |
| 98 | + measurement key as a `_ms` / `_mbits` suffix, the same way for every event. |
| 99 | + The OTLP exporter then resolves it per signal: for metric events |
| 100 | + (`http.requests`, `ssh.network.sampled`) it moves the suffix into the OTLP |
| 101 | + `unit` field and drops it from the metric name (`latency_ms` exports as metric |
| 102 | + `latency`, unit `ms`, which Prometheus then suffixes itself); log and span |
| 103 | + attributes keep the key as written. You always author `latency_ms`; only the |
| 104 | + exported metric name changes. |
| 105 | + |
| 106 | +## Outcomes, failures, aborts |
| 107 | + |
| 108 | +- The framework sets `result` (`success` / `error` / `aborted`) on every span; |
| 109 | + don't duplicate it. |
| 110 | +- Add a domain `outcome` property only when an operation has several success |
| 111 | + modes worth distinguishing (e.g. `cli.resolve`: `cache_hit`, `downloaded`). |
| 112 | +- Classify failures into a typed union (`failure_category`, or a domain |
| 113 | + `reason`) via a `categorize*Failure` helper rather than emitting raw error |
| 114 | + strings; the |
| 115 | + framework already captures the error block. |
| 116 | +- For a non-error early exit (cancelled, not-found), call `span.markAborted()` |
| 117 | + rather than throwing. |
| 118 | + |
| 119 | +## Safety |
| 120 | + |
| 121 | +Never put tokens, credentials, full URLs with query strings, or unbounded user |
| 122 | +input into properties. Routes go through `normalizeRoute` |
| 123 | +(`src/logging/routeNormalization.ts`). Prefer a closed union over a free-form |
| 124 | +string for any property a dashboard groups by. |
0 commit comments