feature(monitor): capture-vs-capture diff endpoint + compare UI #186
Merged
Conversation
- Add empty apps/api/src/monitor/ NestJS module wired into app.module.ts
- MonitorDevPreviewGuard returns 404 unless MONITOR_DEV_PREVIEW=true
- GET /api/monitor/_ping returns { ok: true } when gate is open
- Frontend nav item + /monitor route conditionally rendered behind
VITE_MONITOR_DEV_PREVIEW build flag; placeholder Monitor page added
- Document both env vars in .env.example
- Unit tests for guard (allow / unset / non-true variants) and controller
First slice of the MONITOR feature per
docs/plans/specs/monitor-command/plan-implementation.md (PR 1 of 25).
All subsequent monitor work lands hidden behind these flags until launch.
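For context, the gate itself is only a few lines of NestJS. A minimal sketch, assuming the flag is read straight from the environment (a later PR in this stack moves monitor config into validated env):

```ts
import { CanActivate, ExecutionContext, Injectable, NotFoundException } from '@nestjs/common';

// Sketch: throws 404 (not 403) so the gated routes are indistinguishable
// from non-existent ones while the feature is unreleased.
@Injectable()
export class MonitorDevPreviewGuard implements CanActivate {
  canActivate(_context: ExecutionContext): boolean {
    if (process.env.MONITOR_DEV_PREVIEW !== 'true') {
      throw new NotFoundException();
    }
    return true;
  }
}
```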
…ndpoint
- Add StoredCaptureSession, CaptureSessionStatus, CaptureSessionSource and
  CaptureSessionQueryOptions types in @betterdb/shared
- Extend StoragePort with saveCaptureSession, getCaptureSession,
  getCaptureSessions (sketched below)
- Implement schema (capture_sessions + capture_chunks with indexes) and the
  three methods in sqlite, postgres, and memory adapters
- New MonitorCaptureService thin wrapper around the storage methods
- GET /monitor/sessions returns [] when no sessions exist; supports
  connectionId / limit / offset query params; gated by MonitorDevPreviewGuard
- Adapter round-trip tests cover sqlite + memory (filter / pagination / order)
- Controller test verifies query-param forwarding to the service
Schema columns mirror docs/plans/specs/monitor-command/spec-monitor-command.md
under "Persistence". capture_chunks is created now to keep schema work in one
place; CaptureWriter populates it in PR 5.
Verification: GET /monitor/sessions returns []; sqlite ".schema
capture_sessions" / ".schema capture_chunks" show both tables with the
documented columns and indexes.
Part of PR 2 of 25 in docs/plans/specs/monitor-command/plan-implementation.md.
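A minimal sketch of the storage surface this adds; the status and source unions are taken from later PRs in this stack, the remaining field shapes are assumptions:

```ts
type CaptureSessionStatus = 'running' | 'completed' | 'truncated' | 'failed' | 'skipped';
type CaptureSessionSource = 'manual' | 'trigger' | 'schedule'; // values assumed

interface StoredCaptureSession {
  id: string;
  connectionId: string;
  status: CaptureSessionStatus;
  source: CaptureSessionSource;
  startedAt: number;
  endedAt?: number;
  byteCount: number;
  lineCount: number;
}

interface CaptureSessionQueryOptions {
  connectionId?: string;
  limit?: number;
  offset?: number;
}

// The three methods added to StoragePort, shown as a standalone interface.
interface CaptureSessionStorage {
  saveCaptureSession(session: StoredCaptureSession): Promise<void>;
  getCaptureSession(id: string): Promise<StoredCaptureSession | null>;
  getCaptureSessions(opts?: CaptureSessionQueryOptions): Promise<StoredCaptureSession[]>;
}
```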
…fset
Self-review fix:
- listSessions now 400s when connectionId is missing instead of returning
  capture sessions across all connections (cross-tenant leak per the
  multi-connection convention in CLAUDE.md).
- limit/offset go through parsePositiveInt so ?limit=abc can no longer bind
  NaN into the storage query. Cap limit at 1000 to bound payload size.
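A sketch of what a parsePositiveInt along these lines looks like; the exact signature is an assumption:

```ts
// Rejects NaN and negatives, falls back to a default, and caps the result
// so ?limit=abc or ?limit=999999 can never reach the storage adapter raw.
function parsePositiveInt(raw: string | undefined, fallback: number, max = 1000): number {
  const n = Number.parseInt(raw ?? '', 10);
  if (!Number.isInteger(n) || n < 0) return fallback;
  return Math.min(n, max);
}
```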
The HealthGate decides whether an automated MONITOR capture (anomaly-
triggered or scheduled) should run when its trigger fires, or be skipped
because the instance is already in distress. Manual sessions are not
gated; they only get a pre-flight warning.
- New pure module health-gate.ts: evaluateHealthGate(signals, thresholds)
→ { allow, skipReason?, signals, thresholds } with reason ordering
memory > recent OOM > failover > replication lag
- thresholdsFromEnv reads MONITOR_MEMORY_PCT_THRESHOLD (integer percent)
and MONITOR_REPLICATION_LAG_BYTES; defaults 85% / 10 MB
- Composing HealthGateService pulls a fresh INFO snapshot via
ConnectionRegistry, computes memoryPct and replication lag inline,
and counts recent OOM-correlated and replication-role anomaly events
via StoragePort.getAnomalyEvents (no proprietary coupling)
- Diagnostic endpoint GET /monitor/_diag/health-gate?connectionId=X
returns the full result (allow + reason + signals + thresholds);
follows the apps/api/src/system/system.controller.ts shape
- Full unit coverage: pure-module tests across all four signals and
reason ordering; service tests across each signal source with mocked
client + storage
Verification: with healthy local Valkey,
GET /monitor/_diag/health-gate?connectionId=env-default
→ { allow: true, signals: { memoryPct: 0, ... } }
With MONITOR_MEMORY_PCT_THRESHOLD=0 and a forced maxmemory limit on
Valkey, the same endpoint returns
{ allow: false, skipReason: "memory_above_threshold", ... }.
Part of PR 3 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md.
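A sketch of the pure gate under the reason ordering above; 'memory_above_threshold' appears in the verification, the other skip-reason strings and field names are assumptions:

```ts
interface HealthSignals {
  memoryPct: number;
  recentOomEvents: number;
  recentFailoverEvents: number;
  replicationLagBytes: number;
}
interface HealthThresholds {
  memoryPct: number;           // default 85
  replicationLagBytes: number; // default 10 * 1024 * 1024
}
interface HealthGateResult {
  allow: boolean;
  skipReason?: string;
  signals: HealthSignals;
  thresholds: HealthThresholds;
}

// Reason ordering: memory > recent OOM > failover > replication lag.
function evaluateHealthGate(signals: HealthSignals, thresholds: HealthThresholds): HealthGateResult {
  const deny = (skipReason: string): HealthGateResult => ({ allow: false, skipReason, signals, thresholds });
  if (signals.memoryPct >= thresholds.memoryPct) return deny('memory_above_threshold');
  if (signals.recentOomEvents > 0) return deny('recent_oom');
  if (signals.recentFailoverEvents > 0) return deny('recent_failover');
  if (signals.replicationLagBytes >= thresholds.replicationLagBytes) return deny('replication_lag');
  return { allow: true, signals, thresholds };
}
```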
Replace silent module-load env parsing in health-gate.ts/service.ts with
Zod-validated entries in env.schema.ts. Misconfigured values (non-integer,
out-of-range, negative) now fail at boot via validateEnv instead of silently
falling back to defaults.
- Add MONITOR_RECENT_OOM_WINDOW_MS, MONITOR_RECENT_FAILOVER_WINDOW_MS,
  MONITOR_MEMORY_PCT_THRESHOLD, MONITOR_REPLICATION_LAG_BYTES to the env
  schema with z.coerce.number().int() validation (sketched below).
- Drop thresholdsFromEnv, parsePercent, parsePositiveInt from health-gate.ts;
  the pure module no longer touches env.
- HealthGateService injects ConfigService and resolves windows and thresholds
  in the constructor.
- Drop the thresholdsFromEnv test block; pure-function tests stay.
- Mock ConfigService in the service spec.
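A sketch of the schema entries, assuming the documented defaults (85% / 10 MB) and illustrative 15-minute window defaults:

```ts
import { z } from 'zod';

// Invalid values (non-integer, out-of-range, negative) fail at boot.
const monitorEnv = {
  MONITOR_MEMORY_PCT_THRESHOLD: z.coerce.number().int().min(0).max(100).default(85),
  MONITOR_REPLICATION_LAG_BYTES: z.coerce.number().int().nonnegative().default(10 * 1024 * 1024),
  MONITOR_RECENT_OOM_WINDOW_MS: z.coerce.number().int().positive().default(15 * 60 * 1000),
  MONITOR_RECENT_FAILOVER_WINDOW_MS: z.coerce.number().int().positive().default(15 * 60 * 1000),
};
```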
Pre-flight surfaces everything an operator needs to know before starting a
MONITOR session: provider-specific restrictions, current ACL state with the
exact remediation snippet if +monitor is missing, the health-gate decision
for the moment, and a projected throughput / capture size for the chosen
duration.
- ProviderDetector (pure): host suffix first (.cache.amazonaws.com,
  .upstash.io, .redislabs.com, etc.), INFO server fields second, falls back
  to 'self-hosted' rather than 'unknown' so dev instances do not show scary
  warnings. Restrictions are documentation strings, not blockers.
- AclChecker: probes ACL WHOAMI then ACL GETUSER, recognises +monitor /
  +@ALL / +@dangerous / allcommands grants and -monitor revocations, returns
  the exact ACL SETUSER snippet when MONITOR is missing.
- PreflightService composes ProviderDetector + AclChecker + HealthGateService
  + a single INFO snapshot for throughput numbers. Average MONITOR-line size
  is a conservative 120 B; documented inline.
- POST /monitor/sessions/preflight {connectionId, durationMs?} returns the
  composed report. Default durationMs is 30000 (matches the future
  start-session modal default).
- Full unit coverage: ProviderDetector (host + INFO + precedence + empty
  inputs), AclChecker (every grant pattern, RESP2/RESP3 shapes, fallback
  paths), PreflightService (composition + duration default + missing-stats
  edge case).
Verification: against the local dev Valkey, POST /monitor/sessions/preflight
{connectionId:"env-default"} returns hasMonitor:true (+@ALL). After
ACL SETUSER default -monitor the same call returns hasMonitor:false with
setUserSnippet: "ACL SETUSER default +monitor" and the rawRules show
"+@ALL -monitor".
Bug found and fixed during live verification: the AclChecker initially called
client.call('ACL', 'WHOAMI') with positional varargs but the DatabasePort
signature is call(command, args[]); the live server returned garbage and we
conservatively fell back to hasMonitor:false. Fixed and covered by the
corrected mock in the unit tests.
Part of PR 4 of 25 in docs/plans/specs/monitor-command/plan-implementation.md.
Self-review fix: matchByHost iterates HOST_SUFFIXES with endsWith and short-circuits on the first match. Listing the shorter .cache.amazonaws.com before .serverless.cache.amazonaws.com made the serverless entry unreachable. Today both map to aws-elasticache so no behavioural bug, but if serverless-specific restrictions are ever added they would silently never apply.
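A sketch of the corrected ordering invariant; provider ids for the non-AWS hosts are assumptions:

```ts
// Longest (most specific) suffixes must come first, because matchByHost
// short-circuits on the first endsWith hit. Both AWS entries currently
// resolve to the same provider, per the note above.
const HOST_SUFFIXES: ReadonlyArray<readonly [string, string]> = [
  ['.serverless.cache.amazonaws.com', 'aws-elasticache'],
  ['.cache.amazonaws.com', 'aws-elasticache'],
  ['.upstash.io', 'upstash'],
  ['.redislabs.com', 'redis-cloud'],
];

function matchByHost(host: string): string | undefined {
  const hit = HOST_SUFFIXES.find(([suffix]) => host.endsWith(suffix));
  return hit?.[1];
}
```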
PR 5a of the split PR 5. Lands the writer engine + the storage-side methods
it needs. No Valkey integration and no endpoints yet; those land in PR 5b,
which builds on this.
CaptureWriter contract:
- Source-agnostic: consumes any MonitorSource that emits 'line' / 'error' /
  'end' events. Tests pass an EventEmitter; the iovalkey adapter in PR 5b
  will wrap a MONITOR connection in the same shape.
- Buffers lines into a "current chunk" and flushes when either the line
  threshold (default 5000) or the flush interval (default 1000ms) is hit.
  Flushes go through a serialized write queue (sketched below) so the
  source-line callback never awaits storage; this keeps lines moving while
  disk is slow.
- Enforces caps server-side: byteCap / lineCap → status='truncated' with
  reason 'byte_cap' or 'line_cap'; durationMs → status='completed' with
  reason 'duration_cap'; external stop() → 'completed' with the supplied
  reason; source 'error' → 'failed' with the error message.
- Maintains an in-memory ring buffer (default 10 000 lines) for tail
  readers; viewers read snapshots and never block the writer.
- On terminate, flushes pending data, drains the write queue, and patches
  the session row with status / endedAt / final counters / termination
  reason.
Storage extensions (StoragePort + sqlite + postgres + memory):
- saveCaptureChunk(chunk) inserts one capture_chunks row
- updateCaptureSession(id, patch) does a partial update of mutable fields
  (status, endedAt, durationMs, byteCount, lineCount, terminationReason)
- getCaptureChunks(sessionId) returns chunks ordered chunk_index ASC
Shared types: StoredCaptureChunk + CaptureSessionPatch.
Verification (unit only; endpoint live verification lands with PR 5b):
- 19-case CaptureWriter spec covers happy path, all 4 cap modes, chunk
  ordering under slow storage, ring-buffer eviction,
  viewer-doesn't-block-writer concurrency, error and idempotent-stop paths,
  and finalization despite saveCaptureChunk rejection.
- capture-sessions adapter spec extended with update + chunk round-trip
  cases for sqlite + memory adapters (postgres covered manually).
Total monitor + capture suite: 113 tests across 10 suites, all green.
Part of PR 5 of 25 in docs/plans/specs/monitor-command/plan-implementation.md,
split into PR 5a (this) + PR 5b (Valkey integration + endpoints + concurrency
+ demo-guard wire-up).
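The serialized write queue can be sketched in a few lines; this is an illustration of the pattern, not the actual CaptureWriter internals:

```ts
// Each flush chains onto the previous one, so chunks persist in order while
// the 'line' callback stays synchronous and never awaits storage.
class WriteQueue {
  private tail: Promise<void> = Promise.resolve();

  enqueue(task: () => Promise<void>): void {
    // Swallow rejections so one failed flush cannot poison the chain; the
    // real writer records the error and finalizes the session instead.
    this.tail = this.tail.then(task).catch(() => undefined);
  }

  /** Awaited on terminate so pending chunks land before the session row is patched. */
  drain(): Promise<void> {
    return this.tail;
  }
}
```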
…iring
PR 5b of the split PR 5. Builds on PR 5a's CaptureWriter to deliver an
end-to-end MONITOR capture flow exposed via REST.
- iovalkey-monitor-source.ts wraps iovalkey's MONITOR mode in our
source-agnostic MonitorSource contract: client.monitor() opens a NEW
dedicated connection (the originating client stays usable), events are
formatted to the standard `<time> [<db> <addr>] "<arg>" ...` text
shape, and stop() disconnects only the dedicated connection.
- MonitorCaptureService grows startSession / stopSession / getSession
on top of an in-memory active-session map keyed by connectionId.
Concurrency contract: at most one active session per connection;
duplicate startSession on the same connection throws ConflictException
(HTTP 409). On writer completion the active map clears automatically.
setMonitorSourceFactory is a test seam that lets specs inject a fake
source instead of opening a real Valkey connection.
- Controller endpoints:
- POST /monitor/sessions {connectionId, durationMs?, byteCap?,
lineCap?, requestedBy?} → 201 with the session row (status='running')
- GET /monitor/sessions/:id → 200 with the session row, or 404
- DELETE /monitor/sessions/:id → 200 with the finalized row, or 404
- DemoModeGuard adds /monitor/sessions to DENIED_MUTATION_PREFIXES so
POST/DELETE are blocked on the demo host while GET stays allowed for
read-only browsing of seeded sessions. Three new guard tests cover
the prefixes.
- New service spec: 11 cases covering insert + active registration,
409 on conflict, allow-after-stop, byteCap/lineCap overrides, the
monitor_open_failed failure path (writes status='failed' and clears
the active map), stopSession finalize, getSession round-trip, and
active-writer lifetime.
Verification (live, non-demo host):
- POST /monitor/sessions {connectionId:"env-default", durationMs:5000}
→ {id, status:"running", byteCap:52428800, lineCap:5000000}
- valkey-cli -r 100 SET key value (in another terminal)
- POST again on same connection → 409 Conflict
- After 5s, GET /monitor/sessions/:id →
{status:"completed", lineCount:109, byteCount:6242,
terminationReason:"duration_cap"}
- sqlite3 capture_chunks count for that session: 5
- DELETE while running terminates with reason "manual_stop"
- POST without connectionId → 400; GET unknown id → 404
Demo-host verification is unit-only because the live cloud-auth flow
returns 302 for unauthenticated requests, which prevents reaching the
demo gate without real session cookies. The DemoModeGuard's behaviour
for /monitor/sessions is fully covered in
proprietary/cloud-auth/demo-mode.guard.spec.ts.
Total monitor + capture + demo suites: 137 tests across 12 suites,
all green.
Part of PR 5 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md (5b of 5).
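For reference, a sketch of the wrapper with a simplified MonitorSource shape, assuming iovalkey keeps ioredis's monitor() contract (a promise resolving to a dedicated monitor-mode connection that emits 'monitor' events):

```ts
import Redis from 'iovalkey';

// Simplified source contract; the real one emits 'line' / 'error' / 'end'.
interface MonitorSource {
  onLine(cb: (line: string) => void): void;
  stop(): void;
}

async function createIovalkeyMonitorSource(client: Redis): Promise<MonitorSource> {
  const monitor = await client.monitor(); // opens a NEW dedicated connection
  return {
    onLine(cb) {
      monitor.on('monitor', (time: string, args: string[], source: string, database: string) => {
        // Format to the standard `<time> [<db> <addr>] "<arg>" ...` text shape.
        const quoted = args.map((a) => `"${a.replace(/"/g, '\\"')}"`).join(' ');
        cb(`${time} [${database} ${source}] ${quoted}`);
      });
    },
    stop() {
      monitor.disconnect(); // tears down only the dedicated connection
    },
  };
}
```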
…ncated)
Three new community-tier webhook event types fire on the capture session
lifecycle so external systems (incident tooling, Slack, etc.) can react to
capture activity without polling.
- WebhookEventType.MONITOR_SESSION_STARTED
- WebhookEventType.MONITOR_SESSION_COMPLETED
- WebhookEventType.MONITOR_SESSION_TRUNCATED
Registered in FREE_EVENTS and WEBHOOK_EVENT_TIERS (community).
Dispatch wired into MonitorCaptureService:
- session.started fires after the row is persisted at start time; scoped to
  the originating connectionId so webhooks can subscribe per-instance via
  the existing scope mechanism
- session.completed fires when the writer ends with status='completed'
  (duration cap, manual stop, source ended naturally)
- session.truncated fires when the writer ends with status='truncated'
  (byte cap or line cap)
- 'failed' status does NOT dispatch a community event; PR 16 will add
  monitor.session.skipped for the related Pro+ case
Payload includes sessionId, source, optional triggerId / scheduleId,
requestedBy, startedAt, endedAt, durationMs, byteCount, lineCount,
terminationReason, byteCap, lineCap. Connection enrichment is handled by
the existing dispatcher.
Side fixes:
- DEFAULT_BYTE_CAP and DEFAULT_LINE_CAP are now overridable via
  MONITOR_DEFAULT_BYTE_CAP / MONITOR_DEFAULT_LINE_CAP env vars (so
  truncation can be exercised end-to-end without a 50 MB capture);
  community defaults preserved when unset.
- ActiveSession.donePromise now stores the FULL chained promise
  (start → dispatch → cleanup), so stopSession's await reliably waits for
  the active map to clear and for the webhook to fire before returning.
Verification (live, with a localhost catcher on port 4567):
- 3s session against an idle Valkey → received monitor.session.started then
  monitor.session.completed with terminationReason='duration_cap'
- 30s session with MONITOR_DEFAULT_BYTE_CAP=4096 + 200 SETs → received
  monitor.session.started then monitor.session.truncated with
  terminationReason='byte_cap'
Total monitor + capture suite: 141 tests across 12 suites, all green
(4 new dispatch cases).
Part of PR 6 of 25 in docs/plans/specs/monitor-command/plan-implementation.md.
…ubscriptions
End-to-end live tail for capture sessions, mirroring the established
CliGateway pattern at the new /monitor/ws route. Covers active sessions
(stream from the writer's ring buffer + new lines) and historical
sessions (replay persisted chunks).
CaptureWriter additions:
- subscribe(cb): independent per-viewer line listeners; throwing in a
subscriber never affects the writer or other subscribers.
- onEnd(cb): one-shot termination notification; fires asynchronously if
the writer has already terminated when subscribed.
- subscribers cleared on terminate so closed viewers do not leak.
TailGateway:
- /monitor/ws upgrade routed in main.ts alongside /cli/ws.
- handleUpgrade enforces the dev-preview gate AND rejects connections
whose Host header matches DEMO_HOSTNAME (the HTTP DemoModeGuard does
not run on WS upgrades, so the gateway must do it itself).
- Active session: streams the existing ring-buffer backlog so a viewer
joining mid-session has immediate context, then subscribes for new
lines. On writer end, sends {type:'status', status:'session_ended'}
and closes.
- Historical session: streams persisted chunks line-by-line then sends
{type:'status', status:'historical_complete'} and closes.
- Per-viewer pause / resume: pause buffers lines server-side; resume
drains in original order. Buffer is bounded (50 000 lines) — oldest
lines drop if a paused viewer falls hopelessly behind, preventing a
runaway-memory bug from one disconnected viewer.
- Multiple concurrent viewers receive the same stream independently;
closing one viewer does NOT affect the writer or other viewers.
Verification (live):
- Two parallel WebSocket viewers on a 6s session both receive 41
identical lines and a 'session_ended' status frame.
- Pause → 5 SETs against Valkey → resume yields 7 lines including the
paused-traffic SETs in original order.
- A WS connection to a fully-completed session streams 10 persisted
lines and a 'historical_complete' status.
- WS connection without sessionId is rejected at handshake.
- Demo-host live test is unit-only (per-PR-5b precedent — live cloud
auth needs real session cookies); two new gateway-spec cases cover
it. tail.gateway.spec.ts covers all six contract points (gates,
session-not-found, historical replay, live backlog + lines + close,
pause-buffer drain in order, unsubscribe on close, invalid JSON
message).
Total monitor + capture suite: 155 tests across 13 suites, all green.
Part of PR 7 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md.
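A sketch of the bounded per-viewer pause buffer described above; class and method names are illustrative:

```ts
const PAUSE_BUFFER_MAX = 50_000;

class ViewerState {
  paused = false;
  private buffer: string[] = [];

  constructor(private readonly send: (line: string) => void) {}

  onLine(line: string): void {
    if (!this.paused) {
      this.send(line);
      return;
    }
    // Bounded: oldest lines drop first, so one stalled viewer can never
    // grow server memory without limit.
    if (this.buffer.length >= PAUSE_BUFFER_MAX) this.buffer.shift();
    this.buffer.push(line);
  }

  resume(): void {
    this.paused = false;
    // Drain in original order, then clear.
    for (const line of this.buffer.splice(0)) this.send(line);
  }
}
```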
First Phase 2 slice. The /monitor route renders a read-only Sessions
table scoped to the currently selected connection, polling
GET /monitor/sessions every 5s so curl-started sessions appear without
a manual refresh. The placeholder Monitor.tsx from PR 1 is replaced
with the real page.
- New apps/web/src/api/monitor.ts API client (listSessions only;
start/stop wiring lands in PR 9 with the start-session modal).
- Sessions table columns: started timestamp, status badge, source,
duration, line count, byte count, termination reason, requestedBy.
Sizes and durations human-formatted; status badge colour-codes
running / completed / truncated / failed / skipped to match the
spec semantics.
- One-component-per-file convention: split into
pages/Monitor.tsx (page shell), pages/monitor/sessions-table.tsx,
pages/monitor/session-status-badge.tsx.
- WebhookForm.tsx EVENT_LABELS map extended with the three new
monitor.session.* event types from PR 6 (typecheck was failing
without them because Record<WebhookEventType, string> now requires
these keys).
Verification (live, with Playwright):
- /monitor route lists 8 prior sessions from earlier PRs' testing,
with correct status badges, byte/line counts, and termination
reasons (duration_cap, byte_cap, manual_stop)
- Started a fresh 3s session via curl with requestedBy=playwright-test
→ row appeared in the table within the 5s poll window without a
manual refresh
- With VITE_MONITOR_DEV_PREVIEW unset: MONITOR nav item absent from
sidebar; /monitor route renders an empty <main> ("No routes
matched location" in vite logs)
- Screenshot at docs/assets/pr8-monitor-sessions-list.png
Part of PR 8 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md.
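A sketch of the polling hook, assuming TanStack Query (the PR relies on query invalidation) and the listSessions client from apps/web/src/api/monitor.ts:

```ts
import { useQuery } from '@tanstack/react-query';
import { monitorApi } from '../api/monitor';

function useMonitorSessions(connectionId: string) {
  return useQuery({
    queryKey: ['monitor-sessions', connectionId], // cached per connection
    queryFn: () => monitorApi.listSessions({ connectionId }),
    refetchInterval: 5_000, // curl-started sessions appear without a refresh
  });
}
```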
…firmation
A modal launched from the Monitor page lets the operator start a capture
session with full pre-flight context: provider restrictions, ACL state
(with the exact ACL SETUSER snippet when +monitor is missing), the
health-gate verdict for the moment, and a throughput / capture-size
estimate based on current ops/sec × duration.
- New components (one-per-file convention):
- pages/monitor/start-session-modal.tsx (form + lifecycle)
- pages/monitor/preflight-panel.tsx (4-section read-only display)
- API client gains preflight() and startSession()
- Monitor.tsx grows a Start session button in the header that opens
the modal; on success, invalidates the sessions list query so the
new row shows immediately rather than waiting for the next 5s poll
- Form fields:
- duration: integer + unit selector (seconds / minutes); minimum 1
- requestedBy: optional free-text, sent to the API and shown in the
sessions table
- 5-minute confirmation guard: if the chosen duration exceeds 5 min,
the first Submit click swaps the panel to a yellow warning ("Sessions
over 5 minutes can produce significant load. Confirm to proceed.")
and changes the primary button to "Yes, start session". Second click
fires the start.
- Pre-flight refreshes whenever the duration changes (so the size
estimate stays accurate as the user adjusts), via a useEffect with
cancelled-flag cleanup to avoid races.
- State reset on close: duration/unit/requestedBy/preflight/confirming/
error all clear when the modal closes, so reopening always shows
defaults. (Caught during live testing — a pre-fix run had carry-over
state from a previous open.)
- Confirmation auto-clears when duration drops back below 5 minutes.
Verification (Playwright, live):
- Click Start session → modal opens, pre-flight panel populates within
~1s with provider 'self-hosted', ACL hasMonitor:true, health:healthy,
estimated 3 lines / 144 B for the default 30s.
- Duration to 6 minutes → first Submit shows the amber confirmation
panel with the exact spec wording; primary button becomes
"Yes, start session".
- Reset to 3s + requestedBy=pr9-final → Submit → modal closes, new
row appears in the table within the 5s poll window with the
requestedBy value, status running → completed.
- Reopening the modal after Cancel shows defaults (30s, no
requestedBy) — fixes a state-carry-over bug spotted during testing.
Screenshots:
- docs/assets/pr9-start-session-modal.png (initial pre-flight)
- docs/assets/pr9-confirmation-dialog.png (5-min guard)
Part of PR 9 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md.
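A sketch of the cancelled-flag refresh effect, shown as a small hook; the PreflightReport type and state-setter names are assumptions:

```ts
import { useEffect, useState } from 'react';
import { monitorApi, type PreflightReport } from '../api/monitor';

// Illustrative hook shape; the real code lives inside the modal component.
function usePreflight(connectionId: string, durationMs: number) {
  const [preflight, setPreflight] = useState<PreflightReport | null>(null);
  const [error, setError] = useState<string | null>(null);

  useEffect(() => {
    let cancelled = false;
    monitorApi
      .preflight({ connectionId, durationMs })
      .then((report) => { if (!cancelled) setPreflight(report); })
      .catch((err: unknown) => { if (!cancelled) setError(String(err)); });
    // A response from a superseded duration resolves after this cleanup has
    // run, sees cancelled=true, and cannot clobber newer state.
    return () => { cancelled = true; };
  }, [connectionId, durationMs]);

  return { preflight, error };
}
```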
The Sessions table rows are now clickable links to a per-session detail
page that streams the live MONITOR output via the TailGateway. Lines
appear in real time, the Pause button freezes the local view while the
gateway buffers server-side, and Resume drains the buffered window in
order.
- New useMonitorTail(sessionId, bufferSize=5000) hook:
- Opens ws://localhost:3001/monitor/ws?sessionId=X (dev) or
/api/monitor/ws (prod, same-origin) per env.
- Accumulates incoming MONITOR text lines in a ref-buffer; flushes
a snapshot to React state at most once per animation frame so the
UI batches updates at ~60 Hz under thousands of lines/sec.
- Bounds the buffer at 5 000 lines — when it overflows the oldest
drop and bufferTrimmed flips true.
- Maintains a separate totalReceived counter (NOT bounded) so the
UI can show "N lines received · showing last K (older lines
dropped)".
- pause()/resume() send {type:'pause'|'resume'} control messages;
the server-side per-viewer buffer drains in original order.
- Status state tracks connecting → streaming → paused → session_ended
| historical_complete | closed | error.
- New TailView component: status badge, line counter, dropped-lines
notice, pause/resume controls (only when live), monospace scrollable
panel that auto-scrolls to bottom unless the user scrolls up
(followBottomRef ref).
- New MonitorSession page (route: /monitor/sessions/:id) with header
showing session id, started timestamp, source, line count, and
termination reason; live tail panel below.
- SessionsTable rows are now clickable and navigate to the detail page.
- ESLint disable for react-hooks/set-state-in-effect on the connection-
  reset block: those setStates are pre-WS-open initial-state restoration;
  React 18 batches them into a single commit, so no cascade is possible.
  Documented inline.
Verification (Playwright, live):
- Started a 5-min session, navigated to /monitor/sessions/:id, sent
80 SETs from another terminal → all 80 SET lines streamed into the
view in real time alongside ongoing INFO polling; total reached 173
lines.
- Clicked Pause → status badge flipped to amber "Paused"; tail froze
at 37 lines. Sent 100 SETs against Valkey while paused → still 37
shown. Clicked Resume → instantly drained to 278 lines, all 100
paused SETs included in original order.
Screenshots:
- docs/assets/pr10-live-tail.png (live tail mid-session)
- docs/assets/pr10-tail-after-resume.png (post-resume drain)
Part of PR 10 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md.
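A sketch of the once-per-frame flush inside a hook like useMonitorTail; variable names are illustrative:

```ts
import { useRef, useState } from 'react';

function useLineBuffer(bufferSize = 5000) {
  const bufferRef = useRef<string[]>([]);
  const frameRef = useRef<number | null>(null);
  const [lines, setLines] = useState<string[]>([]);

  // Called from ws.onmessage: appends to a ref and schedules at most one
  // rAF flush, so state commits once per frame (~60 Hz) even under
  // thousands of lines/sec.
  function pushLine(line: string) {
    const buf = bufferRef.current;
    buf.push(line);
    if (buf.length > bufferSize) buf.splice(0, buf.length - bufferSize); // drop oldest
    if (frameRef.current === null) {
      frameRef.current = requestAnimationFrame(() => {
        frameRef.current = null;
        setLines([...bufferRef.current]); // one snapshot per frame
      });
    }
  }

  return { lines, pushLine };
}
```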
Self-review fixes:
- useMonitorTail cleanup detaches onopen/onmessage/onerror/onclose
before calling ws.close(). close() is async, so without nulling
the handlers the old socket's onclose can fire after a new socket
has been created and flip the new connection's status to 'closed'.
- MonitorSession used listSessions({ limit: 100 }) and filtered
client-side, which silently failed for older sessions or sessions
on a non-current connection. Switched to monitorApi.getSession(id)
(added) which hits GET /monitor/sessions/:id.
Closes Phase 2 frontend MVP. The session-detail page gains a Filters &
export panel that filters the captured stream by command, client,
key glob, and time window — then downloads the filtered slice as JSON
or CSV via a new server endpoint.
Backend:
- monitor-line.parser.ts: pure module
- parseMonitorLine(text) → {ts, tsRaw, db, addr, cmd, args, key, raw}
Handles backslash-escaped quotes, IPv6 bracket addresses, keyless
commands (PING etc.).
- matchesFilters(line, {command, client, key, afterTs, beforeTs}):
case-insensitive command exact match, addr substring match, glob
match on the first arg / key (* and ?), inclusive timestamp window.
- lineToCsvRow + CSV_HEADER: RFC 4180 escaping for commas, quotes,
newlines.
- GET /monitor/sessions/:id/export?format=json|csv (+ filter query
params) streams persisted chunks, parses each line, applies filters,
and emits either {sessionId, count, lines[]} JSON or a CSV with a
header row. Content-Disposition: attachment; filename=
monitor-session-<id>.<fmt>. 404 on unknown session. Default format
is json when an unknown value is supplied.
Frontend:
- New FiltersAndExport component on the session-detail page (below
the live tail).
- 5 inputs: Command, Client, Key glob, After (datetime-local),
Before (datetime-local).
- In-page preview count derived from the live-tail buffer (parser
inlined to avoid a server round-trip on every keystroke). Export
buttons hit the server endpoint which sees the full session.
- Two <a download> buttons styled as outline Buttons — clean
browser-native download with the URL carrying current filters.
- useMonitorTail lifted from TailView up to MonitorSession so the
Filters panel and the TailView share one buffer.
Tests:
- monitor-line.parser.spec.ts: 22 cases covering all parser edge cases
(escaped quotes, backslashes, IPv6, keyless commands, malformed
input), all filter axes including glob wildcards and AND semantics,
and CSV escaping.
- monitor.controller.spec.ts extended with 5 export cases (404,
unfiltered JSON, command-filtered, CSV header, format fallback).
Total backend suite: 184 tests across 14 suites, all green (1 caught
during testing: IPv6 bracket address parsing — the initial header
regex was too greedy and stopped at the inner `]`; fixed with a
non-greedy capture anchored on `\]\s+"`).
Verification (Playwright, live):
- Started a 5s session, sent 30 SETs against foo + 30 GETs against
user:* → session captured 70 lines (mix of AUTH/INFO/PING + SETs
+ GETs).
- Direct curl to /export?format=json → count: 70, cmds:
['AUTH', 'GET', 'INFO', 'PING', 'SET'].
- Filter ?command=GET → count: 30.
- Filter ?key=user:* → count: 30, all GET user:N.
- CSV export starts with header `ts,ts_raw,db,addr,cmd,args,key`.
- UI: typed `GET` and `user:*` into the form → preview text
"Buffer match: 30 of 70 lines. Export uses the full session,
server-side." matched the API. Export-link URLs updated with the
filter query string in real time.
Screenshot at docs/assets/pr11-filters-export.png.
Part of PR 11 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md (closes
Phase 2 frontend MVP).
Self-review fixes:
- Filter inputs (command/client/key) now go through trimmedOrUndefined so
  the buffer-preview client behaviour (which trims) and the export endpoint
  behaviour stay in sync. Without this, 'GET ' could match in the preview
  but return 0 rows in the export.
- Key glob filters are now bounded to 128 characters via cappedKeyFilter to
  defuse catastrophic backtracking via patterns like '*a*a*a*a*a*';
  globToRegex compiles to a non-anchored .* chain that is ReDoS-prone
  against equally long captured keys.
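A sketch of the capped glob pipeline; whether cappedKeyFilter truncates or rejects over-long patterns is an assumption (truncation shown), and this version anchors the regex:

```ts
const KEY_FILTER_MAX = 128;

function trimmedOrUndefined(raw: string | undefined): string | undefined {
  const t = raw?.trim();
  return t ? t : undefined;
}

function cappedKeyFilter(raw: string | undefined): string | undefined {
  const t = trimmedOrUndefined(raw);
  return t ? t.slice(0, KEY_FILTER_MAX) : undefined;
}

// * and ? are the only glob metacharacters; everything else is escaped.
function globToRegex(glob: string): RegExp {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*').replace(/\?/g, '.') + '$');
}
```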
The diagnostic differentiator the spec calls out. Diffs a completed
capture against the connection's recent history along four axes —
new command shapes, hot-key delta, slowlog regressions, and audit
entries — using only data already persisted by the rest of the
platform (slowlog, commandlog, client snapshots, audit trail).
CrossReferenceEngine is a single Injectable class with one public
method so it can be reused for the Pro+ capture-vs-capture diff in
PR 23 by swapping the baseline source.
Four dimensions:
- newShapes: shapes seen in capture but not in the union of slowlog +
commandlog + client-analytics.cmd over the baseline window.
- Non-scripted shape = "VERB:arity" (e.g. "LPUSH:2").
- Scripted commands (EVAL / EVALSHA / FCALL / FCALL_RO) preserve
the script SHA or function name in the shape so newly-deployed
scripts surface as new shapes.
- Client-snapshot.cmd records only the verb; encoded as "VERB:*"
so a verb seen in client analytics covers any arity from the
capture side.
- hotKeyDelta: top-50 first-arg keys in capture vs top-50 from
slowlog args in the baseline window.
- newInTopK: keys present in capture top-K, absent from baseline
top-K.
- rankChanges: keys present in both with a different rank.
- slowlogRegressions: per captured verb, observed slowlog rate
during the session window vs the 95th-percentile of per-verb rates
in the baseline window. Flagged when observed > p95. Empty-baseline
edge case: any session slowlog appearance flags.
- aclDeltas: audit-trail entry count within the session window
(drives the UI for the audit module when in use). INFO counter
deltas (acl_access_denied_auth, rejected_connections) are
placeholders that need start/end snapshots — flagged for a
follow-up.
Baseline windows: 6h | 24h | 7d | same-hour-last-week, default 24h.
New endpoint:
GET /monitor/sessions/:id/cross-reference?baseline=24h
- 400 on unknown baseline (must be one of the four values).
- 404 on unknown session.
Tests (33 cases, all green):
- All four baseline windows (computeBaselineRange).
- Shape encoding rules including scripted SHA/function-name
preservation, 0-arg commands, and shapeOfStringArray symmetry.
- All four cross-reference dimensions in isolation (computeNewShapes
/ computeHotKeyDelta / computeSlowlogRegressions) including the
empty-baseline edge case.
- Engine integration with mocked storage covering: 404, spec
verification example, all four baseline windows, client-snapshot
VERB:* coverage, audit-entry surfacing.
Total backend suite: 217 tests across 15 suites.
Verification (live, against running dev API + dev Valkey):
- Seeded a slowlog row with command=["GET","history-key"], timestamp
10 minutes before session start.
- Captured a session with one GET foo + one LPUSH x v.
- GET /monitor/sessions/:id/cross-reference?baseline=24h returns:
- newShapes: [AUTH:1, LPUSH:2, PING:0] (GET:1 correctly absent
because baseline slowlog covers it; AUTH/PING are real
background commands from the BetterDB poller, also not in
baseline)
- hotKeyDelta: foo + x appear in newInTopK
- baseline: { window: 24h, startTs, endTs }
- GET /cross-reference?baseline=15m → 400 with the exact error
message.
- GET /cross-reference for an unknown session id → 404.
Part of PR 12 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md (Phase 3).
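A sketch of the shape encoding and the baseline-coverage rule; the scripted-shape separator is an assumption:

```ts
const SCRIPTED = new Set(['EVAL', 'EVALSHA', 'FCALL', 'FCALL_RO']);

// "VERB:arity" normally; scripted commands fold in the SHA / function name
// so a newly deployed script surfaces as a new shape.
function shapeOf(cmd: string, args: string[]): string {
  const verb = cmd.toUpperCase();
  if (SCRIPTED.has(verb) && args.length > 0) {
    return `${verb}:${args[0]}:${args.length - 1}`; // e.g. "EVALSHA:<sha>:2"
  }
  return `${verb}:${args.length}`; // e.g. "LPUSH:2"
}

// Client-analytics records only the verb; "VERB:*" covers any arity.
function coveredByBaseline(shape: string, baseline: Set<string>): boolean {
  if (baseline.has(shape)) return true;
  const verb = shape.split(':')[0];
  return baseline.has(`${verb}:*`);
}
```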
Wires the cross-reference engine from PR 12 into the session-detail
page. Four-section panel displays new command shapes, hot-key delta,
slowlog regressions, and ACL / audit deltas. A baseline selector
swaps between 6h / 24h / 7d / same-hour-last-week and triggers a
fresh compute via react-query.
- New CrossReferencePanel component (one per file).
- 4 Section cards in a 2x2 grid:
- New command shapes: counted, with a script badge for EVALSHA /
FCALL / FCALL_RO rows; empty-state copy "every captured command
was seen in baseline".
- Hot-key delta: two sub-lists — newInTopK (rank #N) and
rankChanges (#baseline ↑/↓ #capture with arrow); empty-state
"No hot-key shifts".
- Slowlog regressions: per verb, observed/s vs baseline p95/s;
empty-state explains the meaning.
- ACL / audit deltas: audit count + INFO counter deltas, with a
note that the counter delta is pending session-boundary
snapshots (follow-up PR).
- Baseline selector is a <select> bound to react-query so caching
works per (sessionId, baseline) tuple — switching back and forth
is instant after the first compute. A small "refreshing…"
indicator shows when the query is refetching in the background.
- API client extended with crossReference(sessionId, baseline) and
the full CrossReferenceResult shape mirrored from the backend.
Verification (Playwright, live):
- Seeded slowlog with a GET row 10 min before session start.
- Captured a session with GET foo (×5) + LPUSH x v (×5).
- /monitor/sessions/:id renders all 4 sections.
- 24h baseline → newShapes shows LPUSH:2, AUTH:1, PING:0 (GET:1
correctly absent — baseline covers it). Hot keys foo + x flagged
in newInTopK.
- Switching baseline to same-hour-last-week → GET:1 flips back into
newShapes (the seeded row is 10 min ago, not in last week's
one-hour window).
Screenshots:
- docs/assets/pr13-cross-reference-24h.png
- docs/assets/pr13-cross-reference-same-hour-last-week.png
Part of PR 13 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md (Phase 3,
closes the cross-reference frontend).
PR 14a of the split Phase 4. Lands per-node selection end-to-end:
the start-session modal queries cluster topology and renders a
node dropdown when the target is a cluster; the API opens a
dedicated MONITOR connection to the chosen node via the existing
ClusterDiscoveryService.
Fan-out + partial-failure handling is split into a follow-up PR
14b. Reason: fan-out requires multi-writer orchestration, a
node_id column on capture_chunks, and per-node status persistence
on capture_sessions — each non-trivial. Per-node alone is what
unblocks the spec's main UX promise (the dropdown) and gives us a
working cluster path without rewriting CaptureWriter's 1:1
source-to-writer assumption.
Backend:
- StoredCaptureSession gains optional targetNode (host:port string,
null for non-cluster sessions).
- capture_sessions schema: new target_node TEXT column; idempotent
ALTER added for existing sqlite + postgres deployments.
- StartSessionInput accepts optional targetNodeId; MonitorCaptureService
  resolves it to the node address for persistence and
hands it to the monitorSourceFactory.
- monitorSourceFactory grows a second targetNodeId parameter; the
default factory uses ClusterDiscoveryService.getNodeConnection
to open a dedicated iovalkey client to that node, then wraps in
the existing iovalkey-monitor-source.
- New GET /monitor/connections/:id/nodes endpoint: returns
{ isCluster, nodes: [{id, address, role, healthy}] }. Reports
isCluster:false (with empty nodes) for both single-instance
connections AND for cluster-discovery failures, so the frontend
hides the dropdown gracefully when CLUSTER NODES is not
supported.
Frontend:
- start-session-modal.tsx queries /monitor/connections/:id/nodes
when the modal opens. If the response reports isCluster:true,
renders a Cluster node <select> auto-defaulted to the first
master node. The targetNodeId is passed in the POST body.
- Selector hidden completely for non-cluster connections; the
rest of the modal is unchanged.
- A small caption under the selector tells the user that MONITOR
is per-node and that fan-out lands in a follow-up.
Tests (3 new service cases + 4 new controller cases, all green):
- Service records target_node when a cluster node id is supplied
and resolved via discovery.
- Service passes targetNodeId through to the monitor source
factory.
- Service falls back to the supplied id when discovery throws.
- Controller exposes nodes as {isCluster:false, nodes:[]} for an
empty/no-cluster discovery; {isCluster:true} with descriptor
list otherwise; treats discovery error as non-cluster.
- Controller forwards targetNodeId through to the service.
Total backend suite: 224 tests across 15 suites.
Verification (live, dev compose is single-Valkey):
- GET /monitor/connections/env-default/nodes → {isCluster:false,
nodes:[]} (single-Valkey is correctly reported as non-cluster).
- Browser: click Start session → modal opens with NO cluster-node
selector (only Duration + Requested by). Verified via
document.getElementById('targetNode') === null.
- Existing single-instance start flow continues to work
unchanged.
A 3-node cluster live test will land alongside PR 14b — fan-out
needs the same docker compose extension and partial-failure
verification depends on it, so I'm grouping them.
Screenshot at docs/assets/pr14-modal-non-cluster.png.
Part of PR 14 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md (Phase 4,
per-node half).
PR 14b of the split Phase 4. Closes the cluster story: a single
session can fan out to one CaptureWriter per cluster primary, all
writing into one logical sessionId with per-node attribution.
One node disconnecting mid-capture marks that segment failed and
the other writers keep running.
Schema:
- capture_sessions.node_segments (TEXT JSON / JSONB) — array of
{nodeId, address, status, byteCount, lineCount, endedAt,
terminationReason}. Idempotent ALTER ADD COLUMN for both
sqlite + postgres.
- capture_chunks.node_id (TEXT, nullable) — per-chunk node
attribution. NOT in PK; namespace collision is avoided by
per-writer chunk_index ranges (see below).
CaptureWriter:
- New options: nodeId (chunk attribution), startChunkIndex
(per-writer chunk_index namespace), skipSessionFinalize
(fan-out writers don't finalize the session row; the
orchestrator aggregates).
MonitorCaptureService:
- StartSessionInput gains fanOut: boolean.
- When fanOut && cluster: resolveFanOutNodes() returns healthy
primaries via ClusterDiscoveryService, the service opens one
writer per node with chunk_index ranges
[0, NS), [NS, 2*NS), [2*NS, 3*NS), ... where NS = 10_000_000.
Each writer gets skipSessionFinalize:true; the orchestrator
collects per-writer results and aggregates.
- Partial-failure: a writer that fails (source error, factory
reject, or terminate('failed', ...)) updates only its own
node_segments entry; other writers keep going.
- Aggregate: session status = worst-case across segments
(any failed → failed; any truncated → truncated; else
completed). Termination reason rolls up matching the status.
- Single-node path unchanged.
- ActiveSession now holds writers[] instead of a single writer;
stopSession iterates and stops each.
Frontend (start-session-modal):
- New "Fan-out across all primaries (N nodes)" checkbox, shown
only when isCluster. Disables the node selector when on.
- Helper text: "One MONITOR connection per primary. Per-node
status is recorded; one node failing mid-capture does not
stop the others."
MonitorSession page:
- Header gains a "Fan-out segments (N nodes)" panel listing
each node's address, status badge, line count, and
termination reason when nodeSegments is present.
Tests (4 new fan-out cases, all green):
- Opens N writers, attributes chunks per node, aggregates
segments on terminate with the right line/byte totals.
- Partial failure: node B source errors mid-capture → segment
marked failed with source_error reason; node A still
completes; aggregate status = failed.
- monitor_open_failed on one node up front: factory rejects;
segment marked failed; other writers run normally; aggregate
status = failed.
- Non-cluster fanOut request falls back to single-node start
(nodeSegments stays undefined).
Total backend suite: 228 tests across 15 suites.
Live verification: blocked on macOS because
docker-compose.test.yml uses network_mode: host (Linux-only).
The test cluster is reachable from a Linux runner; once a
docker-compose port-mapping change lands, the spec's verification
(docker kill node-2 mid-fan-out) can run from the dev box. Unit
tests cover every fan-out behaviour in full; flagged for the
follow-up that adds the compose port mapping.
Part of PR 14 of 25 (14b of 14) in
docs/plans/specs/monitor-command/plan-implementation.md (Phase 4
complete pending the live test).
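A sketch of the chunk_index namespacing and the worst-case status roll-up described above:

```ts
const NS = 10_000_000; // chunk_index namespace width per writer

// writer 0 → [0, NS), writer 1 → [NS, 2*NS), writer 2 → [2*NS, 3*NS), ...
// so chunks from different nodes can never collide even though node_id is
// not part of the primary key.
function startChunkIndexFor(writerOrdinal: number): number {
  return writerOrdinal * NS;
}

type SegmentStatus = 'completed' | 'truncated' | 'failed';

// Aggregate session status is worst-case across segments.
function aggregateStatus(segments: SegmentStatus[]): SegmentStatus {
  if (segments.includes('failed')) return 'failed';
  if (segments.includes('truncated')) return 'truncated';
  return 'completed';
}
```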
Pro+ anomaly-driven captures. CaptureTriggerRegistry is a deep module:
subscribes to anomaly events (polling), matches against configured triggers,
fires via MonitorCaptureService through the health gate, queues when busy,
expires on schedule.
- New capture_triggers table on sqlite + postgres + memory
- StoredCaptureTrigger / CaptureTriggerStatus types in @betterdb/shared
- Feature.MONITOR_ANOMALY_TRIGGER (Pro+) added to TIER_FEATURES
- POST/GET/DELETE /monitor/triggers, license-gated
- /monitor/triggers added to DemoModeGuard.DENIED_MUTATION_PREFIXES
- Full unit suite: dedup, cancel, fire, single-fire, queue-when-busy,
  health-gate skip, expiry, concurrency, controller endpoint contracts
Self-review fixes for the trigger registry:
- Constructor seeds lastAnomalyAt to Date.now() so a process restart no
  longer replays historical anomalies as if they just arrived;
  resetAnomalyWatermark() exposed as a test seam.
- processNewAnomalies now drains paginated results (offset walk, page size
  200, hard cap 5000/tick) and iterates oldest-first so a partial drain
  still leaves a coherent watermark (sketched below); this fixes the
  'newest 200, silently drop the gap' drift under load.
And on the triggers list endpoint:
- limit/offset go through parsePositiveInt to guard against NaN binding,
  with the limit capped at 1000.
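A sketch of the drain loop; the getAnomalyEvents options shape is an assumption, while the page size, cap, ordering, and watermark rule come from the fix above:

```ts
const PAGE_SIZE = 200;
const MAX_PER_TICK = 5000;

async function drainNewAnomalies(
  getAnomalyEvents: (opts: { afterTs: number; limit: number; offset: number }) => Promise<Array<{ ts: number }>>,
  lastAnomalyAt: number,
  handle: (evt: { ts: number }) => void,
): Promise<number> {
  let watermark = lastAnomalyAt;
  for (let offset = 0; offset < MAX_PER_TICK; offset += PAGE_SIZE) {
    // Oldest-first: even a partial drain advances the watermark coherently.
    const page = await getAnomalyEvents({ afterTs: lastAnomalyAt, limit: PAGE_SIZE, offset });
    for (const evt of page) {
      handle(evt); // match against configured triggers here
      watermark = Math.max(watermark, evt.ts);
    }
    if (page.length < PAGE_SIZE) break;
  }
  return watermark;
}
```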
Two new lifecycle events fired by CaptureTriggerRegistry:
- monitor.trigger.created (Pro tier): dispatched after a successful
  createTrigger; payload includes trigger id, metric/anomaly type, expiry,
  and createdBy
- monitor.session.skipped (community tier): dispatched when the health gate
  denies a triggered fire; payload includes the skip reason returned by
  HealthGateService
Dispatch is fire-and-forget so webhook failures cannot block the trigger
lifecycle; errors are logged and swallowed.
- Registered both events in WebhookEventType + tier lists
- Wired WebhookDispatcherService into CaptureTriggerRegistry
- 5 new tests covering trigger.created, session.skipped, and the no-emit
  cases (successful fire, queue-when-busy, dispatcher error)
Pro+ feature, gated by MONITOR_ANOMALY_TRIGGER:
- New TriggersTable component: created at, status badge, source
  metric/anomaly, expiry (relative), fired session id, created by, Delete
  button on active rows
- Monitor.tsx wraps Sessions in a Tabs container; adds a Triggers tab
  visible only when the license grants MONITOR_ANOMALY_TRIGGER
- monitorApi gains listTriggers / createTrigger / cancelTrigger
- WebhookForm gets labels for monitor.session.skipped and
  monitor.trigger.created (introduced in PR 16)
- Polling refreshes triggers every 5s while the tab/page is open; cancel
  invalidates the query so the row disappears on success
Pro+ feature (gated by MONITOR_ANOMALY_TRIGGER + VITE_MONITOR_DEV_PREVIEW):
- New CaptureOnNextModal: confirms a trigger create call with a prefilled
  metric/anomaly summary, surfaces success and errors, invalidates the
  triggers query on success
- Row action button on each individual anomaly event in the Recent Anomalies
  list and on each anomaly inside an expanded correlated group. Opens the
  modal prefilled with that anomaly's metric + spike/drop, scoped to the
  current connection
- Tightened up pre-existing eslint hygiene (typed buffers, scoped Date.now
  to a useMemo, dropped an inline any) per CLAUDE.md
Pro+ scheduled MONITOR captures via @nestjs/schedule SchedulerRegistry:
- new scheduled_captures table on sqlite + postgres + memory
- StoredScheduledCapture / ScheduledCaptureStatus types in @betterdb/shared
- Feature.MONITOR_SCHEDULED_CAPTURES (Pro+) added to TIER_FEATURES
- CaptureScheduler deep module:
- createSchedule / deleteSchedule / listSchedules / getSchedule
- fireOnce test seam for synchronous tick verification
- per-row Nest interval (single-active-session-per-instance and
health-gate skip enforced upstream by MonitorCaptureService)
- restores enabled rows from storage on module init
- POST/GET/DELETE /monitor/schedules, license-gated
- /monitor/schedules added to DemoModeGuard.DENIED_MUTATION_PREFIXES
- Full unit suite covering create/delete, validation, health-gate
skip, session-already-active skip, start_failed, list ordering,
and controller endpoint contracts
Self-review fix: limit/offset on GET /monitor/schedules now go through the same parsePositiveInt helper used by sessions and triggers. NaN inputs no longer reach the storage adapter, and limit is capped at 1000 to bound payload size.
Backend:
- Add nullable cron_expression column to scheduled_captures (sqlite +
  postgres + memory). A CHECK constraint enforces exactly one of
  intervalSeconds or cronExpression
- Idempotent migrations bring PR 19 deployments forward
- StoredScheduledCapture / ScheduledCapturePatch carry optional
  cronExpression
- CaptureScheduler registers a CronJob via SchedulerRegistry.addCronJob for
  cron rows; setInterval-backed timer for interval rows (sketched below)
- POST /monitor/schedules accepts either intervalSeconds or cronExpression
  and rejects when both or neither are supplied
Frontend:
- New Scheduled tab on /monitor (gated by MONITOR_SCHEDULED_CAPTURES) with
  SchedulesTable showing cadence, duration, last fire info, and delete
  action
- CreateScheduleModal: amount + unit interval picker by default; Advanced
  toggle reveals a cron expression field; both pass through
  monitorApi.createSchedule
Tests:
- capture-scheduler.spec: cron register/unregister, validation rejection for
  missing/conflicting/invalid specs
- monitor.controller.spec: BadRequest when both interval and cron are
  missing; cronExpression forwarding to the scheduler
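A sketch of the per-row timer registration, using @nestjs/schedule's SchedulerRegistry and the cron package's CronJob as the PR describes; the surrounding function is illustrative:

```ts
import { SchedulerRegistry } from '@nestjs/schedule';
import { CronJob } from 'cron';

interface ScheduleRow {
  id: string;
  intervalSeconds?: number;
  cronExpression?: string; // exactly one of the two is set (DB CHECK)
}

function registerSchedule(registry: SchedulerRegistry, row: ScheduleRow, fire: () => void): void {
  if (row.cronExpression) {
    const job = new CronJob(row.cronExpression, fire);
    registry.addCronJob(row.id, job); // named so deleteSchedule can unregister
    job.start();
  } else if (row.intervalSeconds) {
    registry.addInterval(row.id, setInterval(fire, row.intervalSeconds * 1000));
  }
}
```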
Pro+ feature (Feature.MONITOR_CAPTURE_DIFF):
- CrossReferenceEngine.computeCaptureDiff(sessionA, sessionB) reuses the
  existing pure helpers (computeNewShapes + computeHotKeyDelta) with B's
  parsed lines as the baseline source; slowlog and ACL axes return empty
  payloads (those dimensions scope to connection history, not a single
  capture)
- New pure helpers: collectShapesFromLines, collectKeyCountsFromLines
- CrossReferenceResult.baseline.window now accepts 'capture' and a
  baseline.sessionId field surfaces the diffed-against capture id
- GET /monitor/sessions/:id/diff?vs=:otherId is license-gated; 400 on
  self-diff or missing vs; 404 when either session is missing
- Web: new CompareCapturesPanel mounted on MonitorSession behind the feature
  flag. Picks any other completed capture on the same connection from a
  dropdown, fetches the diff lazily, and renders the shared
  CrossReferenceSections (now exported)
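A self-contained sketch of the baseline swap (hot-key delta omitted for brevity); the helper body is a minimal stand-in for the real collectShapesFromLines:

```ts
type ParsedLine = { cmd: string; args: string[]; key?: string };

// Minimal stand-in for the helper named above.
function collectShapesFromLines(lines: ParsedLine[]): Set<string> {
  return new Set(lines.map((l) => `${l.cmd.toUpperCase()}:${l.args.length}`));
}

function computeCaptureDiffSketch(a: ParsedLine[], b: ParsedLine[], bSessionId: string) {
  const baseline = collectShapesFromLines(b); // B's lines ARE the baseline
  const newShapes = [...collectShapesFromLines(a)].filter((s) => !baseline.has(s));
  return {
    newShapes,
    slowlogRegressions: [], // scoped to connection history, not one capture
    aclDeltas: [],          // likewise empty for capture-vs-capture
    baseline: { window: 'capture' as const, sessionId: bSessionId },
  };
}
```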
jamby77 added a commit that referenced this pull request on May 14, 2026.
KIvanow approved these changes on May 14, 2026.
# Conflicts:
#   apps/api/src/monitor/__tests__/cross-reference.engine.spec.ts
#   apps/api/src/monitor/__tests__/monitor.controller.spec.ts
#   apps/api/src/monitor/capture-writer.ts
#   apps/api/src/monitor/cross-reference.engine.ts
#   apps/api/src/monitor/monitor.controller.ts
#   apps/web/src/api/monitor.ts
#   apps/web/src/pages/MonitorSession.tsx
#   apps/web/src/pages/monitor/cross-reference-panel.tsx
#   packages/shared/src/license/types.ts
Summary
PR 21 of the MONITOR stack. Adds the Pro+ capture-vs-capture diff so users can answer "what changed between these two captures?" without leaving the session detail page.
Backend
Frontend
Tests
How to verify
Requires `MONITOR_DEV_PREVIEW=true VITE_MONITOR_DEV_PREVIEW=true` and a Pro+ license.
Test plan