Proposal: User telemetry, crash reporting, and session replay #203

miniMaddy · 2026-06-12T15:30:21Z

Status: draft, awaiting community input. This is a research-backed shape, not a design we have committed to ship. The "Open decisions" at the bottom are the most useful place to weigh in — those answers materially change the implementation. Comments on the rest are very welcome too.

ReverbCode currently captures zero user analytics. We want to fix that, but want to do it in a way that's defensible from a privacy standpoint and consistent with the rest of the architecture (loopback-only daemon, durable facts + derived reads, ports/adapters, CDC via DB triggers). This post lays out a shape based on how four similar tools handle the same problem.

Why this exists

We want to understand:

Where the user is going — which features get reached, which paths get abandoned, what the typical lifecycle of a session looks like.
Where the user is getting stuck — drop-off in onboarding, failures that leave the user without a path forward, repeated attempts on the same action.
What crashed and when — daemon panics, adapter failures, frontend exceptions, with enough context to triage without a back-and-forth.
What a user does — high-fidelity stream of user actions (CLI invocations, UI clicks, spawns, sends, kills) without capturing the content of their work.
The ability to replay the recorded actions back, both for debugging and for reasoning about regressions.

Today the backend has zero user-analytics. It does have plenty of system observation: backend/internal/observe/ is the SCM/tracker poll loop, the change_log table already records every durable domain mutation via DB triggers, and the CDC poller broadcasts those to in-process subscribers. None of that surfaces user behavior.

Reference designs

We read four similar tools end-to-end before writing this. The relevant mechanisms are summarised below; the URLs under "References" at the bottom point to the exact sections.

Tool	Mechanism	What we should borrow
OpenAI Codex (advanced config)	Two-tier: `[analytics]` (anonymous, opt-out) + `[otel]` (rich, opt-in, `log_user_prompt = false` default). Dot-namespaced events (`codex.api_request`, `codex.tool.call`). Counter + histogram pairs per event. Project-local config cannot override telemetry keys.	Two-tier opt-in, dot-namespaced events, default-redacted prompts, trust boundary that locks down telemetry config to user/system scope.
Conductor (privacy doc)	PostHog. Captures `workspace created`, `model selected`, `message sent` (metadata only), provider errors with message, unexpected errors with message + stack. Explicit "no session recordings". `enterprise_data_privacy` toggle in `.conductor/settings.toml`.	Concrete event taxonomy; the no-session-recordings stance; repo-level enterprise toggle pattern.
OpenCode (share docs, config docs)	`share` setting: `"manual"`/`"auto"`/`"disabled"` controls cloud-sync of conversations. MDM-deployable managed configs on macOS (`.mobileconfig` via Jamf/Kandji/FleetDM) sit at top priority.	Tri-state user-facing setting; managed-config layer for enterprise enforcement (even before we have Mac signed builds).
Apache Superset (`superset/utils/log.py`)	`AbstractEventLogger` interface; `EVENT_LOGGER` config swaps backend at boot (`DBEventLogger`, `StdOutEventLogger`, statsd). Decorator-based instrumentation. Curated payload allowlist — only allowlisted keys reach the sink.	The sink interface, the allowlist-by-construction, the decorator pattern.

What we already have to build on

change_log (DB-triggered CDC, in backend/internal/storage/sqlite/migrations): every durable domain mutation is already captured chronologically with a monotonic sequence. This is half of "replay" already, for free.
backend/internal/cdc poller + in-process broadcaster: a tested fan-out primitive we can reuse to deliver telemetry events to multiple sinks.
Structured slog everywhere, with request IDs threaded through the chi router (in backend/internal/httpd/api.go).
Loopback-only HTTP API with an OpenAPI spec, regenerated from the apispec package — natural place to expose read endpoints over telemetry without re-inventing surface.
Single-writer SQLite pool: any telemetry persistence must respect this so triggers and reads stay consistent.
CLI is a thin client: the CLI cannot speak directly to a telemetry sink — it has to go through daemon HTTP, same as everything else.

Architectural constraints we must not violate

These are restated from AGENTS.md and docs/architecture.md because each rules out a "convenient" design that we'd otherwise reach for:

Daemon is loopback-only. Telemetry export to a remote collector must be an outbound connection that the daemon itself initiates; nothing telemetry-related may listen on, or be reachable from, anything beyond 127.0.0.1.
CLI is a thin client. No direct sink writes from the CLI. CLI emits telemetry by calling a daemon endpoint, or via slog that the daemon picks up.
Don't store derived/display status. Same rule applies to events: persist durable facts, derive aggregates at read time. No daily-counter tables maintained by application code; that's a recipe for drift.
CDC events come from DB triggers into change_log. Do not bypass the trigger mechanism to write a parallel telemetry stream from store methods. New telemetry tables get their own triggers or a separate insert path that doesn't touch domain tables.
context.Context first for I/O. Sinks must be context-cancellable so shutdown is bounded.
No hand-edited sqlite/gen/*. Any new tables go through migrations/ + queries/ + npm run sqlc.

Proposal

Mental model: a fourth lane

The current model is OBSERVE → UPDATE → DERIVE / ACT. Telemetry adds a fourth lane:

OBSERVE external facts → UPDATE durable facts → DERIVE display status / ACT
                                              ↘ EMIT user/system events → SINK(s)

The lane is parallel to "ACT" and reads from the same sources (lifecycle/PR managers, CLI command runners, HTTP handlers). It never writes back to the domain.

One sink interface, several backends

Add a single port in backend/internal/ports/telemetry.go:

// EventSink consumes structured telemetry events. Implementations must be
// non-blocking from the caller's perspective: a slow or failing sink must
// never stall a user action.
type EventSink interface {
    Emit(ctx context.Context, ev Event) // best-effort, non-blocking
    Close(ctx context.Context) error    // drain on shutdown
}

Implementations live under backend/internal/adapters/telemetry/:

noop — default. Discards. Zero cost.
localsqlite — appends to a new telemetry_event table behind the single-writer pool. Bounded retention (rolling N days, hard cap by row count). Read-only HTTP surface for the CLI and a future debug dashboard.
otlp — OTLP/HTTP exporter, batched and async. Modeled on Codex's [otel] shape. Mapped fields: events become OTel logs, durations become histograms.
posthog — optional, only if we decide to take the Conductor route. Mapped 1:1 from Event → PostHog capture(). Strict allowlist; no PII.
fanout — composes multiple sinks; used by the daemon wiring to fan to both localsqlite (always, when telemetry is enabled at all) and the user's chosen remote sink.

This is the Superset/Codex pattern: behaviour and policy in the wiring layer, not in the call sites. Call sites only see EventSink.

Two-tier user control (Codex pattern, adapted)

Codex splits telemetry into two settings because the privacy posture of anonymous counters is fundamentally different from that of rich event traces. We should do the same, with a separate remote switch that controls whether anything leaves the machine. The defaults below are the privacy-first reading; the opposite is a viable position (see "Open decisions").

Tier	Default	What it includes	Where it goes
`metrics`	off	anonymous counters + durations only, no per-user fields	`localsqlite` only unless remote is also enabled
`events`	off	full event records including user-action shape (still no content)	`localsqlite` only unless remote is also enabled
`remote`	off	upload `metrics` and/or `events` to a configured exporter	exporter URL must be explicitly set

Configuration lives in the existing env-only config layer:

AO_TELEMETRY_METRICS=off|on        # default off
AO_TELEMETRY_EVENTS=off|on         # default off
AO_TELEMETRY_REMOTE=off|otlp|posthog
AO_TELEMETRY_OTLP_ENDPOINT=https://otel.example.com/v1/logs
AO_TELEMETRY_OTLP_HEADERS_JSON={"x-otlp-api-key":"…"}
AO_TELEMETRY_REDACT_BRANCH_NAMES=true   # default true; project/branch names are sensitive

Per existing convention (no AO_HOST, etc.) these are env vars on the daemon, not flags. They can be inspected via ao doctor so the user can see what is actually configured.
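A loader for these variables could look roughly like the sketch below. It is an illustration under assumptions: TelemetryConfig, parseOnOff, and LoadTelemetryConfig are hypothetical names, and rejecting AO_TELEMETRY_REMOTE=otlp without an endpoint is one reading of "exporter URL must be explicitly set". AO_TELEMETRY_OTLP_HEADERS_JSON is left out for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// TelemetryConfig is a hypothetical in-memory form of the AO_TELEMETRY_* variables.
type TelemetryConfig struct {
	Metrics           bool
	Events            bool
	Remote            string // "off", "otlp", or "posthog"
	OTLPEndpoint      string
	RedactBranchNames bool
}

// parseOnOff accepts only the documented values; unset falls back to def.
func parseOnOff(name, raw string, def bool) (bool, error) {
	switch raw {
	case "":
		return def, nil
	case "on":
		return true, nil
	case "off":
		return false, nil
	}
	return false, fmt.Errorf("%s: want on or off, got %q", name, raw)
}

// LoadTelemetryConfig reads settings through getenv (os.Getenv in the daemon,
// a map-backed stub in tests) and rejects anything it does not recognise.
func LoadTelemetryConfig(getenv func(string) string) (TelemetryConfig, error) {
	cfg := TelemetryConfig{Remote: "off", RedactBranchNames: true}
	var err error
	if cfg.Metrics, err = parseOnOff("AO_TELEMETRY_METRICS", getenv("AO_TELEMETRY_METRICS"), false); err != nil {
		return TelemetryConfig{}, err
	}
	if cfg.Events, err = parseOnOff("AO_TELEMETRY_EVENTS", getenv("AO_TELEMETRY_EVENTS"), false); err != nil {
		return TelemetryConfig{}, err
	}
	if raw := getenv("AO_TELEMETRY_REMOTE"); raw != "" {
		switch raw {
		case "off", "otlp", "posthog":
			cfg.Remote = raw
		default:
			return TelemetryConfig{}, fmt.Errorf("AO_TELEMETRY_REMOTE: unknown sink %q", raw)
		}
	}
	cfg.OTLPEndpoint = getenv("AO_TELEMETRY_OTLP_ENDPOINT")
	if cfg.Remote == "otlp" && cfg.OTLPEndpoint == "" {
		return TelemetryConfig{}, errors.New("AO_TELEMETRY_REMOTE=otlp requires AO_TELEMETRY_OTLP_ENDPOINT")
	}
	if raw := getenv("AO_TELEMETRY_REDACT_BRANCH_NAMES"); raw != "" {
		if cfg.RedactBranchNames, err = strconv.ParseBool(raw); err != nil {
			return TelemetryConfig{}, fmt.Errorf("AO_TELEMETRY_REDACT_BRANCH_NAMES: %w", err)
		}
	}
	return cfg, nil
}
```

Failing closed on unknown values means a typo such as AO_TELEMETRY_METRICS=true is reported at startup instead of silently enabling or disabling a tier.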

Curated payload allowlist (Superset pattern)

Every event is a typed Go struct, not a map[string]any. The payload schema is the surface area we audit. A new event = a new struct + a new entry in the event-name constants. Free-form extra fields are not permitted at the call site:

type ProjectAddedEvent struct {
    ProjectIDHash string // sha256(project_id), not the id itself
    HasGitRemote  bool
    DurationMs    int64
}

The hash plays the same role as PostHog's "distinct_id": a stable, opaque identifier that the daemon computes locally, so the raw ID never leaves the machine. The backend stores the raw value alongside the hash so it can join for local debugging, but only the hash leaves the daemon.
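As a rough illustration of the typed-struct-plus-hash idea, the sketch below builds a ProjectAddedEvent through a constructor so the raw ID never appears in the payload. hashID and newProjectAddedEvent are hypothetical helper names; the proposal only specifies sha256(project_id).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// hashID returns the hex-encoded SHA-256 digest of id.
func hashID(id string) string {
	sum := sha256.Sum256([]byte(id))
	return hex.EncodeToString(sum[:])
}

// ProjectAddedEvent repeats the struct from the proposal.
type ProjectAddedEvent struct {
	ProjectIDHash string // sha256(project_id), not the id itself
	HasGitRemote  bool
	DurationMs    int64
}

// newProjectAddedEvent (hypothetical) is the only place the raw ID is seen;
// call sites cannot attach extra fields because the struct has none.
func newProjectAddedEvent(projectID string, hasGitRemote bool, durationMs int64) ProjectAddedEvent {
	return ProjectAddedEvent{
		ProjectIDHash: hashID(projectID),
		HasGitRemote:  hasGitRemote,
		DurationMs:    durationMs,
	}
}
```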

Trust boundary for telemetry config

Following Codex: telemetry settings (AO_TELEMETRY_* and any future managed-config equivalent) are user-scope only. A project's checked-in .ao/settings file (if/when we add one) cannot turn on remote export or change the endpoint. This prevents a hostile repo from leaking events from anyone who clones it.
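If a project-level settings file is ever added, the boundary could be enforced at merge time. The sketch below is purely hypothetical: it assumes settings are flat key/value maps, that telemetry keys share a "telemetry." prefix, and that project values otherwise override user values. None of that is decided in this proposal.

```go
package main

import (
	"sort"
	"strings"
)

// mergeSettings overlays project settings on user settings, except that any
// key under the (assumed) "telemetry." prefix is taken from user scope only.
// It returns the merged map and the sorted list of project keys it ignored.
func mergeSettings(user, project map[string]string) (map[string]string, []string) {
	merged := make(map[string]string, len(user)+len(project))
	for k, v := range user {
		merged[k] = v
	}
	var ignored []string
	for k, v := range project {
		if strings.HasPrefix(k, "telemetry.") {
			ignored = append(ignored, k) // a checked-in file cannot touch telemetry
			continue
		}
		merged[k] = v
	}
	sort.Strings(ignored)
	return merged, ignored
}
```

Returning the ignored keys lets the daemon log a warning, so a user can tell that a repository tried to change telemetry settings.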

Crash bundles instead of always-on crash reporting

Conductor auto-uploads crash logs to PostHog with stack traces. That requires us to ship a stable identity, an upload endpoint, and a retention policy on day one. We can defer all of that and still solve the "what crashed and when" question with a CLI command:

ao bug-report           # default: last 24h of events + config snapshot + redacted
ao bug-report --since=7d --include-prompts=false

This produces a single .zip in the cwd. The contents:

All events from telemetry_event in the window
change_log rows in the window (already durable, already redacted of content)
A redacted snapshot of running.json and the daemon version
The last N lines of the daemon's slog output
A manifest listing exactly what's included so the user can inspect before attaching to an issue

The user attaches the zip to a GitHub issue. We never auto-upload anything without an explicit opt-in.

Replay means event playback, not screen recording

Conductor is explicit: "we don't capture or store any session recordings." That is the right line for us too. The replay capability is event playback, not terminal/UI capture, for these reasons:

Terminal capture inevitably contains agent output, file diffs, prompts, and source code. Sending that anywhere — even to ourselves — is a hard problem we should not take on yet.
change_log + the new telemetry_event table together already give us a chronological, durable, replayable history of what the user did and what the system observed.
Replay against a fresh DB in a test harness is straightforward when the events are durable facts; impossible when they are pixel buffers.

The replay tool is a separate, small thing:

ao replay <bug-report.zip>     # spins up an isolated daemon against a temp DB
                                # and feeds the recorded events through the same
                                # ingest paths.

This is achievable because everything in the backend already flows through the ports/adapters boundary, so injecting fakes for the runtime/workspace/agent adapters is the existing test pattern.

Event taxonomy (mapped to the five questions)

The names below are the initial set. Each has a typed struct; each is a distinct line item we can debate. All event names are dot-namespaced under ao.<domain>.<verb>.

"Where the user is going" (navigation + funnel)

Event	Trigger
`ao.daemon.started`	daemon boot
`ao.cli.invoked`	every CLI command run (name only, never argv content)
`ao.onboarding.first_project_added`	first time a project row is created on this install
`ao.onboarding.first_session_spawned`	first ever session spawn
`ao.onboarding.first_pr_observed`	first PR row written by the PR manager
`ao.onboarding.first_merge`	first session that observes a merged PR

These are exactly the lifecycle waypoints the docs already call out. Aggregated, they answer "how far do new users get."

"Where the user is getting stuck"

Event	Trigger
`ao.cli.exit_2`	usage error path (we already exit 2 for these)
`ao.cli.repeated_failure`	same command fails ≥3× within 5min
`ao.daemon.error_envelope`	every API error response (status, code, request_id; no body)
`ao.spawn.failed`	session_manager.Spawn returns non-nil
`ao.adapter.unavailable`	runtime/workspace/agent adapter probe fails (kind + reason)
`ao.lifecycle.session_terminated_unexpected`	reaper marks a session terminated without an explicit kill

Pattern matches Conductor's "provider returned an error" + "unexpected error" shape.

"What crashed and when"

Event	Trigger
`ao.daemon.panic`	panic in the chi `Recoverer` middleware or in any tracked goroutine
`ao.daemon.shutdown_unclean`	next start finds a runfile with a dead PID (we already detect this)
`ao.adapter.panic`	adapter goroutine panicked (currently logged; we'd also emit)

Stack traces are included only when the events tier is on, and only for daemon code (never user/agent code).
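For tracked goroutines, the emit-on-panic behaviour can be a small wrapper around recover. This is a sketch with hypothetical names (PanicEvent, runTracked); it gates the stack on the events tier but does not attempt to filter frames down to daemon code, which would need separate work.

```go
package main

import "runtime/debug"

// PanicEvent is a hypothetical payload for ao.daemon.panic.
type PanicEvent struct {
	Component string // fixed label chosen by daemon code, never user input
	Stack     string // empty unless the events tier is on
}

// runTracked runs fn, and if fn panics it emits a PanicEvent instead of
// letting the panic propagate. The recovered value is deliberately left out
// of the event because panic messages can embed user content.
func runTracked(component string, eventsTier bool, emit func(PanicEvent), fn func()) (panicked bool) {
	defer func() {
		if rec := recover(); rec != nil {
			ev := PanicEvent{Component: component}
			if eventsTier {
				ev.Stack = string(debug.Stack())
			}
			emit(ev)
			panicked = true
		}
	}()
	fn()
	return false
}
```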

"What a user does"

The CLI verbs are the natural unit. One event per verb. The payload is a typed struct with allowlisted fields.

Event	Fields
`ao.project.added`	`project_id_hash`, `has_git_remote`, `duration_ms`
`ao.session.spawned`	`session_id_hash`, `agent_kind`, `runtime_kind`, `from_pr_branch` (bool)
`ao.session.killed`	`session_id_hash`, `reason ∈ {user,reaper,merged}`
`ao.session.restored`	`session_id_hash`
`ao.send`	`session_id_hash`, `body_len_chars` (length only, never text)
`ao.terminal.opened`	`session_id_hash`
`ao.doctor.run`	`failing_checks_count`, `os`, `arch`, `daemon_version`

"Replay it back"

Covered by ao bug-report + ao replay above. No additional events.

What we will not capture

Stating these explicitly so they don't quietly creep in via PR review:

Prompts, agent output, terminal scrollback. Length and counts only.
File paths, diff contents. A session's identifying fact off-machine is the hash, never the working-tree path or branch name.
Project / branch / PR titles. All redacted by default (AO_TELEMETRY_REDACT_BRANCH_NAMES=true). An enterprise user who wants names for self-hosted dashboards can opt back in for their own collector, but that is never the default.
Anything that travels between the user and their AI provider. Same line Conductor draws: we are not a network proxy for that traffic and we don't observe it.
IP addresses or hostnames at the application layer. PostHog/OTLP will see the source IP of the daemon's outbound HTTP request; that's unavoidable and must be documented.

Phasing (each step is a separately reviewable PR)

Plumbing only, default off. Add ports.EventSink, the noop and localsqlite adapters, the telemetry_event table + sqlc queries, the AO_TELEMETRY_* env config, and the new fourth lane wired into the daemon composition root. Instrument exactly two paths as a smoke test: daemon start/stop and ao.cli.invoked. No remote sinks yet. No CLI surface yet. This is the smallest "real but not load-bearing" PR.
Bug-report bundle. Implement ao bug-report over the daemon HTTP surface (new read endpoint that streams a zip). No upload — just the download. This is immediately useful for our own support workflow even if no events are wired beyond the smoke set.
Full event taxonomy + funnel events. Wire every event listed above through the existing services (session_manager, lifecycle, pr, doctor) at the points where the durable fact is already being written. Add tests that assert the event fires exactly once per fact (mirrors the change_log test style).
Remote sinks behind explicit opt-in. Add the otlp adapter, gated by AO_TELEMETRY_REMOTE=otlp + a non-empty endpoint. Optionally add posthog if the answer to Open Decision #2 ("Remote sink: OTLP vs PostHog") below is PostHog.
Replay command. ao replay <bug-report.zip> against an isolated daemon instance with fake adapters. Useful for our own regression work; ship it later, no rush.

Open decisions (these are where input is most useful)

Default state. Should metrics default to on (Codex / Conductor — more data, harder enterprise sell) or off (privacy-first — slower product feedback loop)? Current lean is off until we have a published privacy notice; Codex's hybrid (metrics on, events off) is a reasonable middle ground if we can stand up a notice quickly.
Remote sink: OTLP vs PostHog. OTLP is vendor-neutral and matches the self-hosted-friendly posture of the project, but we get nothing for free — we have to stand up a collector and a dashboard. PostHog is turnkey and is what Conductor uses, but it's a vendor relationship with an attached privacy policy we'd have to publish. We can support both behind the same sink interface; the question is which one we wire as "blessed."
Scope: backend-only or renderer too? The frontend is still a placeholder. Backend-only is the lowest-risk first slice. Adding a renderer-side analytics.ts later is independent and can reuse the same event names over an existing daemon route.
Replay scope. Confirm we are explicitly choosing event playback only and not full terminal/UI capture. Conductor went the same way and called it out as a feature, not a limitation. Current lean: same.
Crash auto-upload. Bug-report bundles cover the "user files an issue" case. Do we additionally want the daemon to auto-upload ao.daemon.panic events when remote is enabled? Codex does (under [otel]); Conductor does (under PostHog). Worth a separate decision because the answer changes whether we can ever drop the manual bug-report path.

References:

Codex telemetry config — https://developers.openai.com/codex/config-advanced (search for "Observability and telemetry")
Conductor privacy & data — https://www.conductor.build/docs/reference/privacy
OpenCode share/MDM model — https://opencode.ai/docs/share/, https://opencode.ai/docs/config/
Superset AbstractEventLogger — https://github.com/apache/superset/blob/master/superset/utils/log.py
