feature(monitor): capture-vs-capture diff endpoint + compare UI #186
Merged
Conversation
- Add empty apps/api/src/monitor/ NestJS module wired into app.module.ts
- MonitorDevPreviewGuard returns 404 unless MONITOR_DEV_PREVIEW=true
- GET /api/monitor/_ping returns { ok: true } when gate is open
- Frontend nav item + /monitor route conditionally rendered behind
VITE_MONITOR_DEV_PREVIEW build flag; placeholder Monitor page added
- Document both env vars in .env.example
- Unit tests for guard (allow / unset / non-true variants) and controller
First slice of the MONITOR feature per
docs/plans/specs/monitor-command/plan-implementation.md (PR 1 of 25).
All subsequent monitor work lands hidden behind these flags until launch.
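For context, the gate itself is only a few lines of NestJS. A minimal sketch, assuming the flag is read straight from the environment (a later PR in this stack moves monitor config into validated env):

```ts
import { CanActivate, ExecutionContext, Injectable, NotFoundException } from '@nestjs/common';

// Sketch: throws 404 (not 403) so the gated routes are indistinguishable
// from non-existent ones while the feature is unreleased.
@Injectable()
export class MonitorDevPreviewGuard implements CanActivate {
  canActivate(_context: ExecutionContext): boolean {
    if (process.env.MONITOR_DEV_PREVIEW !== 'true') {
      throw new NotFoundException();
    }
    return true;
  }
}
```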
…ndpoint
- Add StoredCaptureSession, CaptureSessionStatus, CaptureSessionSource and
  CaptureSessionQueryOptions types in @betterdb/shared
- Extend StoragePort with saveCaptureSession, getCaptureSession,
  getCaptureSessions (sketched below)
- Implement schema (capture_sessions + capture_chunks with indexes) and the
  three methods in sqlite, postgres, and memory adapters
- New MonitorCaptureService thin wrapper around the storage methods
- GET /monitor/sessions returns [] when no sessions exist; supports
  connectionId / limit / offset query params; gated by MonitorDevPreviewGuard
- Adapter round-trip tests cover sqlite + memory (filter / pagination / order)
- Controller test verifies query-param forwarding to the service
Schema columns mirror docs/plans/specs/monitor-command/spec-monitor-command.md
under "Persistence". capture_chunks is created now to keep schema work in one
place; CaptureWriter populates it in PR 5.
Verification: GET /monitor/sessions returns []; sqlite ".schema
capture_sessions" / ".schema capture_chunks" show both tables with the
documented columns and indexes.
Part of PR 2 of 25 in docs/plans/specs/monitor-command/plan-implementation.md.
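A minimal sketch of the storage surface this adds; the status and source unions are taken from later PRs in this stack, the remaining field shapes are assumptions:

```ts
type CaptureSessionStatus = 'running' | 'completed' | 'truncated' | 'failed' | 'skipped';
type CaptureSessionSource = 'manual' | 'trigger' | 'schedule'; // values assumed

interface StoredCaptureSession {
  id: string;
  connectionId: string;
  status: CaptureSessionStatus;
  source: CaptureSessionSource;
  startedAt: number;
  endedAt?: number;
  byteCount: number;
  lineCount: number;
}

interface CaptureSessionQueryOptions {
  connectionId?: string;
  limit?: number;
  offset?: number;
}

// The three methods added to StoragePort, shown as a standalone interface.
interface CaptureSessionStorage {
  saveCaptureSession(session: StoredCaptureSession): Promise<void>;
  getCaptureSession(id: string): Promise<StoredCaptureSession | null>;
  getCaptureSessions(opts?: CaptureSessionQueryOptions): Promise<StoredCaptureSession[]>;
}
```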
…fset
Self-review fix:
- listSessions now 400s when connectionId is missing instead of returning
  capture sessions across all connections (cross-tenant leak per the
  multi-connection convention in CLAUDE.md).
- limit/offset go through parsePositiveInt so ?limit=abc can no longer bind
  NaN into the storage query. Cap limit at 1000 to bound payload size.
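A sketch of what a parsePositiveInt along these lines looks like; the exact signature is an assumption:

```ts
// Rejects NaN and negatives, falls back to a default, and caps the result
// so ?limit=abc or ?limit=999999 can never reach the storage adapter raw.
function parsePositiveInt(raw: string | undefined, fallback: number, max = 1000): number {
  const n = Number.parseInt(raw ?? '', 10);
  if (!Number.isInteger(n) || n < 0) return fallback;
  return Math.min(n, max);
}
```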
The HealthGate decides whether an automated MONITOR capture (anomaly-
triggered or scheduled) should run when its trigger fires, or be skipped
because the instance is already in distress. Manual sessions are not
gated; they only get a pre-flight warning.
- New pure module health-gate.ts: evaluateHealthGate(signals, thresholds)
→ { allow, skipReason?, signals, thresholds } with reason ordering
memory > recent OOM > failover > replication lag
- thresholdsFromEnv reads MONITOR_MEMORY_PCT_THRESHOLD (integer percent)
and MONITOR_REPLICATION_LAG_BYTES; defaults 85% / 10 MB
- Composing HealthGateService pulls a fresh INFO snapshot via
ConnectionRegistry, computes memoryPct and replication lag inline,
and counts recent OOM-correlated and replication-role anomaly events
via StoragePort.getAnomalyEvents (no proprietary coupling)
- Diagnostic endpoint GET /monitor/_diag/health-gate?connectionId=X
returns the full result (allow + reason + signals + thresholds);
follows the apps/api/src/system/system.controller.ts shape
- Full unit coverage: pure-module tests across all four signals and
reason ordering; service tests across each signal source with mocked
client + storage
Verification: with healthy local Valkey,
GET /monitor/_diag/health-gate?connectionId=env-default
→ { allow: true, signals: { memoryPct: 0, ... } }
With MONITOR_MEMORY_PCT_THRESHOLD=0 and a forced maxmemory limit on
Valkey, the same endpoint returns
{ allow: false, skipReason: "memory_above_threshold", ... }.
Part of PR 3 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md.
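A sketch of the pure gate under the reason ordering above; 'memory_above_threshold' appears in the verification, the other skip-reason strings and field names are assumptions:

```ts
interface HealthSignals {
  memoryPct: number;
  recentOomEvents: number;
  recentFailoverEvents: number;
  replicationLagBytes: number;
}
interface HealthThresholds {
  memoryPct: number;           // default 85
  replicationLagBytes: number; // default 10 * 1024 * 1024
}
interface HealthGateResult {
  allow: boolean;
  skipReason?: string;
  signals: HealthSignals;
  thresholds: HealthThresholds;
}

// Reason ordering: memory > recent OOM > failover > replication lag.
function evaluateHealthGate(signals: HealthSignals, thresholds: HealthThresholds): HealthGateResult {
  const deny = (skipReason: string): HealthGateResult => ({ allow: false, skipReason, signals, thresholds });
  if (signals.memoryPct >= thresholds.memoryPct) return deny('memory_above_threshold');
  if (signals.recentOomEvents > 0) return deny('recent_oom');
  if (signals.recentFailoverEvents > 0) return deny('recent_failover');
  if (signals.replicationLagBytes >= thresholds.replicationLagBytes) return deny('replication_lag');
  return { allow: true, signals, thresholds };
}
```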
Replace silent module-load env parsing in health-gate.ts/service.ts with
Zod-validated entries in env.schema.ts. Misconfigured values (non-integer,
out-of-range, negative) now fail at boot via validateEnv instead of silently
falling back to defaults.
- Add MONITOR_RECENT_OOM_WINDOW_MS, MONITOR_RECENT_FAILOVER_WINDOW_MS,
  MONITOR_MEMORY_PCT_THRESHOLD, MONITOR_REPLICATION_LAG_BYTES to the env
  schema with z.coerce.number().int() validation (sketched below).
- Drop thresholdsFromEnv, parsePercent, parsePositiveInt from health-gate.ts;
  the pure module no longer touches env.
- HealthGateService injects ConfigService and resolves windows and thresholds
  in the constructor.
- Drop the thresholdsFromEnv test block; pure-function tests stay.
- Mock ConfigService in the service spec.
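A sketch of the schema entries, assuming the documented defaults (85% / 10 MB) and illustrative 15-minute window defaults:

```ts
import { z } from 'zod';

// Invalid values (non-integer, out-of-range, negative) fail at boot.
const monitorEnv = {
  MONITOR_MEMORY_PCT_THRESHOLD: z.coerce.number().int().min(0).max(100).default(85),
  MONITOR_REPLICATION_LAG_BYTES: z.coerce.number().int().nonnegative().default(10 * 1024 * 1024),
  MONITOR_RECENT_OOM_WINDOW_MS: z.coerce.number().int().positive().default(15 * 60 * 1000),
  MONITOR_RECENT_FAILOVER_WINDOW_MS: z.coerce.number().int().positive().default(15 * 60 * 1000),
};
```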
Pre-flight surfaces everything an operator needs to know before starting a
MONITOR session: provider-specific restrictions, current ACL state with the
exact remediation snippet if +monitor is missing, the health-gate decision
for the moment, and a projected throughput / capture size for the chosen
duration.
- ProviderDetector (pure): host suffix first (.cache.amazonaws.com,
  .upstash.io, .redislabs.com, etc.), INFO server fields second, falls back
  to 'self-hosted' rather than 'unknown' so dev instances do not show scary
  warnings. Restrictions are documentation strings, not blockers.
- AclChecker: probes ACL WHOAMI then ACL GETUSER, recognises +monitor /
  +@ALL / +@dangerous / allcommands grants and -monitor revocations, returns
  the exact ACL SETUSER snippet when MONITOR is missing.
- PreflightService composes ProviderDetector + AclChecker + HealthGateService
  + a single INFO snapshot for throughput numbers. Average MONITOR-line size
  is a conservative 120 B; documented inline.
- POST /monitor/sessions/preflight {connectionId, durationMs?} returns the
  composed report. Default durationMs is 30000 (matches the future
  start-session modal default).
- Full unit coverage: ProviderDetector (host + INFO + precedence + empty
  inputs), AclChecker (every grant pattern, RESP2/RESP3 shapes, fallback
  paths), PreflightService (composition + duration default + missing-stats
  edge case).
Verification: against the local dev Valkey, POST /monitor/sessions/preflight
{connectionId:"env-default"} returns hasMonitor:true (+@ALL). After
ACL SETUSER default -monitor the same call returns hasMonitor:false with
setUserSnippet: "ACL SETUSER default +monitor" and the rawRules show
"+@ALL -monitor".
Bug found and fixed during live verification: the AclChecker initially called
client.call('ACL', 'WHOAMI') with positional varargs but the DatabasePort
signature is call(command, args[]); the live server returned garbage and we
conservatively fell back to hasMonitor:false. Fixed and covered by the
corrected mock in the unit tests.
Part of PR 4 of 25 in docs/plans/specs/monitor-command/plan-implementation.md.
Self-review fix: matchByHost iterates HOST_SUFFIXES with endsWith and short-circuits on the first match. Listing the shorter .cache.amazonaws.com before .serverless.cache.amazonaws.com made the serverless entry unreachable. Today both map to aws-elasticache so no behavioural bug, but if serverless-specific restrictions are ever added they would silently never apply.
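A sketch of the corrected ordering invariant; provider ids for the non-AWS hosts are assumptions:

```ts
// Longest (most specific) suffixes must come first, because matchByHost
// short-circuits on the first endsWith hit. Both AWS entries currently
// resolve to the same provider, per the note above.
const HOST_SUFFIXES: ReadonlyArray<readonly [string, string]> = [
  ['.serverless.cache.amazonaws.com', 'aws-elasticache'],
  ['.cache.amazonaws.com', 'aws-elasticache'],
  ['.upstash.io', 'upstash'],
  ['.redislabs.com', 'redis-cloud'],
];

function matchByHost(host: string): string | undefined {
  const hit = HOST_SUFFIXES.find(([suffix]) => host.endsWith(suffix));
  return hit?.[1];
}
```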
PR 5a of the split PR 5. Lands the writer engine + the storage-side methods
it needs. No Valkey integration and no endpoints yet; those land in PR 5b,
which builds on this.
CaptureWriter contract:
- Source-agnostic: consumes any MonitorSource that emits 'line' / 'error' /
  'end' events. Tests pass an EventEmitter; the iovalkey adapter in PR 5b
  will wrap a MONITOR connection in the same shape.
- Buffers lines into a "current chunk" and flushes when either the line
  threshold (default 5000) or the flush interval (default 1000ms) is hit.
  Flushes go through a serialized write queue (sketched below) so the
  source-line callback never awaits storage; this keeps lines moving while
  disk is slow.
- Enforces caps server-side: byteCap / lineCap → status='truncated' with
  reason 'byte_cap' or 'line_cap'; durationMs → status='completed' with
  reason 'duration_cap'; external stop() → 'completed' with the supplied
  reason; source 'error' → 'failed' with the error message.
- Maintains an in-memory ring buffer (default 10 000 lines) for tail
  readers; viewers read snapshots and never block the writer.
- On terminate, flushes pending data, drains the write queue, and patches
  the session row with status / endedAt / final counters / termination
  reason.
Storage extensions (StoragePort + sqlite + postgres + memory):
- saveCaptureChunk(chunk) inserts one capture_chunks row
- updateCaptureSession(id, patch) does a partial update of mutable fields
  (status, endedAt, durationMs, byteCount, lineCount, terminationReason)
- getCaptureChunks(sessionId) returns chunks ordered chunk_index ASC
Shared types: StoredCaptureChunk + CaptureSessionPatch.
Verification (unit only; endpoint live verification lands with PR 5b):
- 19-case CaptureWriter spec covers happy path, all 4 cap modes, chunk
  ordering under slow storage, ring-buffer eviction,
  viewer-doesn't-block-writer concurrency, error and idempotent-stop paths,
  and finalization despite saveCaptureChunk rejection.
- capture-sessions adapter spec extended with update + chunk round-trip
  cases for sqlite + memory adapters (postgres covered manually).
Total monitor + capture suite: 113 tests across 10 suites, all green.
Part of PR 5 of 25 in docs/plans/specs/monitor-command/plan-implementation.md,
split into PR 5a (this) + PR 5b (Valkey integration + endpoints + concurrency
+ demo-guard wire-up).
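The serialized write queue can be sketched in a few lines; this is an illustration of the pattern, not the actual CaptureWriter internals:

```ts
// Each flush chains onto the previous one, so chunks persist in order while
// the 'line' callback stays synchronous and never awaits storage.
class WriteQueue {
  private tail: Promise<void> = Promise.resolve();

  enqueue(task: () => Promise<void>): void {
    // Swallow rejections so one failed flush cannot poison the chain; the
    // real writer records the error and finalizes the session instead.
    this.tail = this.tail.then(task).catch(() => undefined);
  }

  /** Awaited on terminate so pending chunks land before the session row is patched. */
  drain(): Promise<void> {
    return this.tail;
  }
}
```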
…iring
PR 5b of the split PR 5. Builds on PR 5a's CaptureWriter to deliver an
end-to-end MONITOR capture flow exposed via REST.
- iovalkey-monitor-source.ts wraps iovalkey's MONITOR mode in our
source-agnostic MonitorSource contract: client.monitor() opens a NEW
dedicated connection (the originating client stays usable), events are
formatted to the standard `<time> [<db> <addr>] "<arg>" ...` text
shape, and stop() disconnects only the dedicated connection.
- MonitorCaptureService grows startSession / stopSession / getSession
on top of an in-memory active-session map keyed by connectionId.
Concurrency contract: at most one active session per connection;
duplicate startSession on the same connection throws ConflictException
(HTTP 409). On writer completion the active map clears automatically.
setMonitorSourceFactory is a test seam that lets specs inject a fake
source instead of opening a real Valkey connection.
- Controller endpoints:
- POST /monitor/sessions {connectionId, durationMs?, byteCap?,
lineCap?, requestedBy?} → 201 with the session row (status='running')
- GET /monitor/sessions/:id → 200 with the session row, or 404
- DELETE /monitor/sessions/:id → 200 with the finalized row, or 404
- DemoModeGuard adds /monitor/sessions to DENIED_MUTATION_PREFIXES so
POST/DELETE are blocked on the demo host while GET stays allowed for
read-only browsing of seeded sessions. Three new guard tests cover
the prefixes.
- New service spec: 11 cases covering insert + active registration,
409 on conflict, allow-after-stop, byteCap/lineCap overrides, the
monitor_open_failed failure path (writes status='failed' and clears
the active map), stopSession finalize, getSession round-trip, and
active-writer lifetime.
Verification (live, non-demo host):
- POST /monitor/sessions {connectionId:"env-default", durationMs:5000}
→ {id, status:"running", byteCap:52428800, lineCap:5000000}
- valkey-cli -r 100 SET key value (in another terminal)
- POST again on same connection → 409 Conflict
- After 5s, GET /monitor/sessions/:id →
{status:"completed", lineCount:109, byteCount:6242,
terminationReason:"duration_cap"}
- sqlite3 capture_chunks count for that session: 5
- DELETE while running terminates with reason "manual_stop"
- POST without connectionId → 400; GET unknown id → 404
Demo-host verification is unit-only because the live cloud-auth flow
returns 302 for unauthenticated requests, which prevents reaching the
demo gate without real session cookies. The DemoModeGuard's behaviour
for /monitor/sessions is fully covered in
proprietary/cloud-auth/demo-mode.guard.spec.ts.
Total monitor + capture + demo suites: 137 tests across 12 suites,
all green.
Part of PR 5 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md (5b of 5).
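For reference, a sketch of the wrapper with a simplified MonitorSource shape, assuming iovalkey keeps ioredis's monitor() contract (a promise resolving to a dedicated monitor-mode connection that emits 'monitor' events):

```ts
import Redis from 'iovalkey';

// Simplified source contract; the real one emits 'line' / 'error' / 'end'.
interface MonitorSource {
  onLine(cb: (line: string) => void): void;
  stop(): void;
}

async function createIovalkeyMonitorSource(client: Redis): Promise<MonitorSource> {
  const monitor = await client.monitor(); // opens a NEW dedicated connection
  return {
    onLine(cb) {
      monitor.on('monitor', (time: string, args: string[], source: string, database: string) => {
        // Format to the standard `<time> [<db> <addr>] "<arg>" ...` text shape.
        const quoted = args.map((a) => `"${a.replace(/"/g, '\\"')}"`).join(' ');
        cb(`${time} [${database} ${source}] ${quoted}`);
      });
    },
    stop() {
      monitor.disconnect(); // tears down only the dedicated connection
    },
  };
}
```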
…ncated)
Three new community-tier webhook event types fire on the capture session
lifecycle so external systems (incident tooling, Slack, etc.) can react to
capture activity without polling.
- WebhookEventType.MONITOR_SESSION_STARTED
- WebhookEventType.MONITOR_SESSION_COMPLETED
- WebhookEventType.MONITOR_SESSION_TRUNCATED
Registered in FREE_EVENTS and WEBHOOK_EVENT_TIERS (community).
Dispatch wired into MonitorCaptureService:
- session.started fires after the row is persisted at start time; scoped to
  the originating connectionId so webhooks can subscribe per-instance via
  the existing scope mechanism
- session.completed fires when the writer ends with status='completed'
  (duration cap, manual stop, source ended naturally)
- session.truncated fires when the writer ends with status='truncated'
  (byte cap or line cap)
- 'failed' status does NOT dispatch a community event; PR 16 will add
  monitor.session.skipped for the related Pro+ case
Payload includes sessionId, source, optional triggerId / scheduleId,
requestedBy, startedAt, endedAt, durationMs, byteCount, lineCount,
terminationReason, byteCap, lineCap. Connection enrichment is handled by
the existing dispatcher.
Side fixes:
- DEFAULT_BYTE_CAP and DEFAULT_LINE_CAP are now overridable via
  MONITOR_DEFAULT_BYTE_CAP / MONITOR_DEFAULT_LINE_CAP env vars (so
  truncation can be exercised end-to-end without a 50 MB capture);
  community defaults preserved when unset.
- ActiveSession.donePromise now stores the FULL chained promise
  (start → dispatch → cleanup), so stopSession's await reliably waits for
  the active map to clear and for the webhook to fire before returning.
Verification (live, with a localhost catcher on port 4567):
- 3s session against an idle Valkey → received monitor.session.started then
  monitor.session.completed with terminationReason='duration_cap'
- 30s session with MONITOR_DEFAULT_BYTE_CAP=4096 + 200 SETs → received
  monitor.session.started then monitor.session.truncated with
  terminationReason='byte_cap'
Total monitor + capture suite: 141 tests across 12 suites, all green
(4 new dispatch cases).
Part of PR 6 of 25 in docs/plans/specs/monitor-command/plan-implementation.md.
…ubscriptions
End-to-end live tail for capture sessions, mirroring the established
CliGateway pattern at the new /monitor/ws route. Covers active sessions
(stream from the writer's ring buffer + new lines) and historical
sessions (replay persisted chunks).
CaptureWriter additions:
- subscribe(cb): independent per-viewer line listeners; throwing in a
subscriber never affects the writer or other subscribers.
- onEnd(cb): one-shot termination notification; fires asynchronously if
the writer has already terminated when subscribed.
- subscribers cleared on terminate so closed viewers do not leak.
TailGateway:
- /monitor/ws upgrade routed in main.ts alongside /cli/ws.
- handleUpgrade enforces the dev-preview gate AND rejects connections
whose Host header matches DEMO_HOSTNAME (the HTTP DemoModeGuard does
not run on WS upgrades, so the gateway must do it itself).
- Active session: streams the existing ring-buffer backlog so a viewer
joining mid-session has immediate context, then subscribes for new
lines. On writer end, sends {type:'status', status:'session_ended'}
and closes.
- Historical session: streams persisted chunks line-by-line then sends
{type:'status', status:'historical_complete'} and closes.
- Per-viewer pause / resume: pause buffers lines server-side; resume
drains in original order. Buffer is bounded (50 000 lines) — oldest
lines drop if a paused viewer falls hopelessly behind, preventing a
runaway-memory bug from one disconnected viewer.
- Multiple concurrent viewers receive the same stream independently;
closing one viewer does NOT affect the writer or other viewers.
Verification (live):
- Two parallel WebSocket viewers on a 6s session both receive 41
identical lines and a 'session_ended' status frame.
- Pause → 5 SETs against Valkey → resume yields 7 lines including the
paused-traffic SETs in original order.
- A WS connection to a fully-completed session streams 10 persisted
lines and a 'historical_complete' status.
- WS connection without sessionId is rejected at handshake.
- Demo-host live test is unit-only (per-PR-5b precedent — live cloud
auth needs real session cookies); two new gateway-spec cases cover
it. tail.gateway.spec.ts covers all six contract points (gates,
session-not-found, historical replay, live backlog + lines + close,
pause-buffer drain in order, unsubscribe on close, invalid JSON
message).
Total monitor + capture suite: 155 tests across 13 suites, all green.
Part of PR 7 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md.
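A sketch of the bounded per-viewer pause buffer described above; class and method names are illustrative:

```ts
const PAUSE_BUFFER_MAX = 50_000;

class ViewerState {
  paused = false;
  private buffer: string[] = [];

  constructor(private readonly send: (line: string) => void) {}

  onLine(line: string): void {
    if (!this.paused) {
      this.send(line);
      return;
    }
    // Bounded: oldest lines drop first, so one stalled viewer can never
    // grow server memory without limit.
    if (this.buffer.length >= PAUSE_BUFFER_MAX) this.buffer.shift();
    this.buffer.push(line);
  }

  resume(): void {
    this.paused = false;
    // Drain in original order, then clear.
    for (const line of this.buffer.splice(0)) this.send(line);
  }
}
```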
First Phase 2 slice. The /monitor route renders a read-only Sessions
table scoped to the currently selected connection, polling
GET /monitor/sessions every 5s so curl-started sessions appear without
a manual refresh. The placeholder Monitor.tsx from PR 1 is replaced
with the real page.
- New apps/web/src/api/monitor.ts API client (listSessions only;
start/stop wiring lands in PR 9 with the start-session modal).
- Sessions table columns: started timestamp, status badge, source,
duration, line count, byte count, termination reason, requestedBy.
Sizes and durations human-formatted; status badge colour-codes
running / completed / truncated / failed / skipped to match the
spec semantics.
- One-component-per-file convention: split into
pages/Monitor.tsx (page shell), pages/monitor/sessions-table.tsx,
pages/monitor/session-status-badge.tsx.
- WebhookForm.tsx EVENT_LABELS map extended with the three new
monitor.session.* event types from PR 6 (typecheck was failing
without them because Record<WebhookEventType, string> now requires
these keys).
Verification (live, with Playwright):
- /monitor route lists 8 prior sessions from earlier PRs' testing,
with correct status badges, byte/line counts, and termination
reasons (duration_cap, byte_cap, manual_stop)
- Started a fresh 3s session via curl with requestedBy=playwright-test
→ row appeared in the table within the 5s poll window without a
manual refresh
- With VITE_MONITOR_DEV_PREVIEW unset: MONITOR nav item absent from
sidebar; /monitor route renders an empty <main> ("No routes
matched location" in vite logs)
- Screenshot at docs/assets/pr8-monitor-sessions-list.png
Part of PR 8 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md.
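A sketch of the polling hook, assuming TanStack Query (the PR relies on query invalidation) and the listSessions client from apps/web/src/api/monitor.ts:

```ts
import { useQuery } from '@tanstack/react-query';
import { monitorApi } from '../api/monitor';

function useMonitorSessions(connectionId: string) {
  return useQuery({
    queryKey: ['monitor-sessions', connectionId], // cached per connection
    queryFn: () => monitorApi.listSessions({ connectionId }),
    refetchInterval: 5_000, // curl-started sessions appear without a refresh
  });
}
```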
…firmation
A modal launched from the Monitor page lets the operator start a capture
session with full pre-flight context: provider restrictions, ACL state
(with the exact ACL SETUSER snippet when +monitor is missing), the
health-gate verdict for the moment, and a throughput / capture-size
estimate based on current ops/sec × duration.
- New components (one-per-file convention):
- pages/monitor/start-session-modal.tsx (form + lifecycle)
- pages/monitor/preflight-panel.tsx (4-section read-only display)
- API client gains preflight() and startSession()
- Monitor.tsx grows a Start session button in the header that opens
the modal; on success, invalidates the sessions list query so the
new row shows immediately rather than waiting for the next 5s poll
- Form fields:
- duration: integer + unit selector (seconds / minutes); minimum 1
- requestedBy: optional free-text, sent to the API and shown in the
sessions table
- 5-minute confirmation guard: if the chosen duration exceeds 5 min,
the first Submit click swaps the panel to a yellow warning ("Sessions
over 5 minutes can produce significant load. Confirm to proceed.")
and changes the primary button to "Yes, start session". Second click
fires the start.
- Pre-flight refreshes whenever the duration changes (so the size
estimate stays accurate as the user adjusts), via a useEffect with
cancelled-flag cleanup to avoid races.
- State reset on close: duration/unit/requestedBy/preflight/confirming/
error all clear when the modal closes, so reopening always shows
defaults. (Caught during live testing — a pre-fix run had carry-over
state from a previous open.)
- Confirmation auto-clears when duration drops back below 5 minutes.
Verification (Playwright, live):
- Click Start session → modal opens, pre-flight panel populates within
~1s with provider 'self-hosted', ACL hasMonitor:true, health:healthy,
estimated 3 lines / 144 B for the default 30s.
- Duration to 6 minutes → first Submit shows the amber confirmation
panel with the exact spec wording; primary button becomes
"Yes, start session".
- Reset to 3s + requestedBy=pr9-final → Submit → modal closes, new
row appears in the table within the 5s poll window with the
requestedBy value, status running → completed.
- Reopening the modal after Cancel shows defaults (30s, no
requestedBy) — fixes a state-carry-over bug spotted during testing.
Screenshots:
- docs/assets/pr9-start-session-modal.png (initial pre-flight)
- docs/assets/pr9-confirmation-dialog.png (5-min guard)
Part of PR 9 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md.
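A sketch of the cancelled-flag refresh effect, shown as a small hook; the PreflightReport type and state-setter names are assumptions:

```ts
import { useEffect, useState } from 'react';
import { monitorApi, type PreflightReport } from '../api/monitor';

// Illustrative hook shape; the real code lives inside the modal component.
function usePreflight(connectionId: string, durationMs: number) {
  const [preflight, setPreflight] = useState<PreflightReport | null>(null);
  const [error, setError] = useState<string | null>(null);

  useEffect(() => {
    let cancelled = false;
    monitorApi
      .preflight({ connectionId, durationMs })
      .then((report) => { if (!cancelled) setPreflight(report); })
      .catch((err: unknown) => { if (!cancelled) setError(String(err)); });
    // A response from a superseded duration resolves after this cleanup has
    // run, sees cancelled=true, and cannot clobber newer state.
    return () => { cancelled = true; };
  }, [connectionId, durationMs]);

  return { preflight, error };
}
```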
The Sessions table rows are now clickable links to a per-session detail
page that streams the live MONITOR output via the TailGateway. Lines
appear in real time, the Pause button freezes the local view while the
gateway buffers server-side, and Resume drains the buffered window in
order.
- New useMonitorTail(sessionId, bufferSize=5000) hook:
- Opens ws://localhost:3001/monitor/ws?sessionId=X (dev) or
/api/monitor/ws (prod, same-origin) per env.
- Accumulates incoming MONITOR text lines in a ref-buffer; flushes
a snapshot to React state at most once per animation frame so the
UI batches updates at ~60 Hz under thousands of lines/sec.
- Bounds the buffer at 5 000 lines — when it overflows the oldest
drop and bufferTrimmed flips true.
- Maintains a separate totalReceived counter (NOT bounded) so the
UI can show "N lines received · showing last K (older lines
dropped)".
- pause()/resume() send {type:'pause'|'resume'} control messages;
the server-side per-viewer buffer drains in original order.
- Status state tracks connecting → streaming → paused → session_ended
| historical_complete | closed | error.
- New TailView component: status badge, line counter, dropped-lines
notice, pause/resume controls (only when live), monospace scrollable
panel that auto-scrolls to bottom unless the user scrolls up
(followBottomRef ref).
- New MonitorSession page (route: /monitor/sessions/:id) with header
showing session id, started timestamp, source, line count, and
termination reason; live tail panel below.
- SessionsTable rows are now clickable and navigate to the detail page.
- ESLint disable for react-hooks/set-state-in-effect on the connection-
  reset block: those setStates are pre-WS-open initial-state restoration;
  React 18 batches them into a single commit, so no cascade is possible.
  Documented inline.
Verification (Playwright, live):
- Started a 5-min session, navigated to /monitor/sessions/:id, sent
80 SETs from another terminal → all 80 SET lines streamed into the
view in real time alongside ongoing INFO polling; total reached 173
lines.
- Clicked Pause → status badge flipped to amber "Paused"; tail froze
at 37 lines. Sent 100 SETs against Valkey while paused → still 37
shown. Clicked Resume → instantly drained to 278 lines, all 100
paused SETs included in original order.
Screenshots:
- docs/assets/pr10-live-tail.png (live tail mid-session)
- docs/assets/pr10-tail-after-resume.png (post-resume drain)
Part of PR 10 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md.
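A sketch of the once-per-frame flush inside a hook like useMonitorTail; variable names are illustrative:

```ts
import { useRef, useState } from 'react';

function useLineBuffer(bufferSize = 5000) {
  const bufferRef = useRef<string[]>([]);
  const frameRef = useRef<number | null>(null);
  const [lines, setLines] = useState<string[]>([]);

  // Called from ws.onmessage: appends to a ref and schedules at most one
  // rAF flush, so state commits once per frame (~60 Hz) even under
  // thousands of lines/sec.
  function pushLine(line: string) {
    const buf = bufferRef.current;
    buf.push(line);
    if (buf.length > bufferSize) buf.splice(0, buf.length - bufferSize); // drop oldest
    if (frameRef.current === null) {
      frameRef.current = requestAnimationFrame(() => {
        frameRef.current = null;
        setLines([...bufferRef.current]); // one snapshot per frame
      });
    }
  }

  return { lines, pushLine };
}
```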
Self-review fixes:
- useMonitorTail cleanup detaches onopen/onmessage/onerror/onclose
before calling ws.close(). close() is async, so without nulling
the handlers the old socket's onclose can fire after a new socket
has been created and flip the new connection's status to 'closed'.
- MonitorSession used listSessions({ limit: 100 }) and filtered
client-side, which silently failed for older sessions or sessions
on a non-current connection. Switched to monitorApi.getSession(id)
(added) which hits GET /monitor/sessions/:id.
Closes Phase 2 frontend MVP. The session-detail page gains a Filters &
export panel that filters the captured stream by command, client,
key glob, and time window — then downloads the filtered slice as JSON
or CSV via a new server endpoint.
Backend:
- monitor-line.parser.ts: pure module
- parseMonitorLine(text) → {ts, tsRaw, db, addr, cmd, args, key, raw}
Handles backslash-escaped quotes, IPv6 bracket addresses, keyless
commands (PING etc.).
- matchesFilters(line, {command, client, key, afterTs, beforeTs}):
case-insensitive command exact match, addr substring match, glob
match on the first arg / key (* and ?), inclusive timestamp window.
- lineToCsvRow + CSV_HEADER: RFC 4180 escaping for commas, quotes,
newlines.
- GET /monitor/sessions/:id/export?format=json|csv (+ filter query
params) streams persisted chunks, parses each line, applies filters,
and emits either {sessionId, count, lines[]} JSON or a CSV with a
header row. Content-Disposition: attachment; filename=
monitor-session-<id>.<fmt>. 404 on unknown session. Default format
is json when an unknown value is supplied.
Frontend:
- New FiltersAndExport component on the session-detail page (below
the live tail).
- 5 inputs: Command, Client, Key glob, After (datetime-local),
Before (datetime-local).
- In-page preview count derived from the live-tail buffer (parser
inlined to avoid a server round-trip on every keystroke). Export
buttons hit the server endpoint which sees the full session.
- Two <a download> buttons styled as outline Buttons — clean
browser-native download with the URL carrying current filters.
- useMonitorTail lifted from TailView up to MonitorSession so the
Filters panel and the TailView share one buffer.
Tests:
- monitor-line.parser.spec.ts: 22 cases covering all parser edge cases
(escaped quotes, backslashes, IPv6, keyless commands, malformed
input), all filter axes including glob wildcards and AND semantics,
and CSV escaping.
- monitor.controller.spec.ts extended with 5 export cases (404,
unfiltered JSON, command-filtered, CSV header, format fallback).
Total backend suite: 184 tests across 14 suites, all green (1 caught
during testing: IPv6 bracket address parsing — the initial header
regex was too greedy and stopped at the inner `]`; fixed with a
non-greedy capture anchored on `\]\s+"`).
Verification (Playwright, live):
- Started a 5s session, sent 30 SETs against foo + 30 GETs against
user:* → session captured 70 lines (mix of AUTH/INFO/PING + SETs
+ GETs).
- Direct curl to /export?format=json → count: 70, cmds:
['AUTH', 'GET', 'INFO', 'PING', 'SET'].
- Filter ?command=GET → count: 30.
- Filter ?key=user:* → count: 30, all GET user:N.
- CSV export starts with header `ts,ts_raw,db,addr,cmd,args,key`.
- UI: typed `GET` and `user:*` into the form → preview text
"Buffer match: 30 of 70 lines. Export uses the full session,
server-side." matched the API. Export-link URLs updated with the
filter query string in real time.
Screenshot at docs/assets/pr11-filters-export.png.
Part of PR 11 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md (closes
Phase 2 frontend MVP).
Self-review fixes:
- Filter inputs (command/client/key) now go through trimmedOrUndefined so
  the buffer-preview client behaviour (which trims) and the export endpoint
  behaviour stay in sync. Without this, 'GET ' could match in the preview
  but return 0 rows in the export.
- Key glob filters are now bounded to 128 characters via cappedKeyFilter to
  defuse catastrophic backtracking via patterns like '*a*a*a*a*a*';
  globToRegex compiles to a non-anchored .* chain that is ReDoS-prone
  against equally long captured keys.
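A sketch of the capped glob pipeline; whether cappedKeyFilter truncates or rejects over-long patterns is an assumption (truncation shown), and this version anchors the regex:

```ts
const KEY_FILTER_MAX = 128;

function trimmedOrUndefined(raw: string | undefined): string | undefined {
  const t = raw?.trim();
  return t ? t : undefined;
}

function cappedKeyFilter(raw: string | undefined): string | undefined {
  const t = trimmedOrUndefined(raw);
  return t ? t.slice(0, KEY_FILTER_MAX) : undefined;
}

// * and ? are the only glob metacharacters; everything else is escaped.
function globToRegex(glob: string): RegExp {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*').replace(/\?/g, '.') + '$');
}
```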
The diagnostic differentiator the spec calls out. Diffs a completed
capture against the connection's recent history along four axes —
new command shapes, hot-key delta, slowlog regressions, and audit
entries — using only data already persisted by the rest of the
platform (slowlog, commandlog, client snapshots, audit trail).
CrossReferenceEngine is a single Injectable class with one public
method so it can be reused for the Pro+ capture-vs-capture diff in
PR 23 by swapping the baseline source.
Four dimensions:
- newShapes: shapes seen in capture but not in the union of slowlog +
commandlog + client-analytics.cmd over the baseline window.
- Non-scripted shape = "VERB:arity" (e.g. "LPUSH:2").
- Scripted commands (EVAL / EVALSHA / FCALL / FCALL_RO) preserve
the script SHA or function name in the shape so newly-deployed
scripts surface as new shapes.
- Client-snapshot.cmd records only the verb; encoded as "VERB:*"
so a verb seen in client analytics covers any arity from the
capture side.
- hotKeyDelta: top-50 first-arg keys in capture vs top-50 from
slowlog args in the baseline window.
- newInTopK: keys present in capture top-K, absent from baseline
top-K.
- rankChanges: keys present in both with a different rank.
- slowlogRegressions: per captured verb, observed slowlog rate
during the session window vs the 95th-percentile of per-verb rates
in the baseline window. Flagged when observed > p95. Empty-baseline
edge case: any session slowlog appearance flags.
- aclDeltas: audit-trail entry count within the session window
(drives the UI for the audit module when in use). INFO counter
deltas (acl_access_denied_auth, rejected_connections) are
placeholders that need start/end snapshots — flagged for a
follow-up.
Baseline windows: 6h | 24h | 7d | same-hour-last-week, default 24h.
New endpoint:
GET /monitor/sessions/:id/cross-reference?baseline=24h
- 400 on unknown baseline (must be one of the four values).
- 404 on unknown session.
Tests (33 cases, all green):
- All four baseline windows (computeBaselineRange).
- Shape encoding rules including scripted SHA/function-name
preservation, 0-arg commands, and shapeOfStringArray symmetry.
- All four cross-reference dimensions in isolation (computeNewShapes
/ computeHotKeyDelta / computeSlowlogRegressions) including the
empty-baseline edge case.
- Engine integration with mocked storage covering: 404, spec
verification example, all four baseline windows, client-snapshot
VERB:* coverage, audit-entry surfacing.
Total backend suite: 217 tests across 15 suites.
Verification (live, against running dev API + dev Valkey):
- Seeded a slowlog row with command=["GET","history-key"], timestamp
10 minutes before session start.
- Captured a session with one GET foo + one LPUSH x v.
- GET /monitor/sessions/:id/cross-reference?baseline=24h returns:
- newShapes: [AUTH:1, LPUSH:2, PING:0] (GET:1 correctly absent
because baseline slowlog covers it; AUTH/PING are real
background commands from the BetterDB poller, also not in
baseline)
- hotKeyDelta: foo + x appear in newInTopK
- baseline: { window: 24h, startTs, endTs }
- GET /cross-reference?baseline=15m → 400 with the exact error
message.
- GET /cross-reference for an unknown session id → 404.
Part of PR 12 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md (Phase 3).
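A sketch of the shape encoding and the baseline-coverage rule; the scripted-shape separator is an assumption:

```ts
const SCRIPTED = new Set(['EVAL', 'EVALSHA', 'FCALL', 'FCALL_RO']);

// "VERB:arity" normally; scripted commands fold in the SHA / function name
// so a newly deployed script surfaces as a new shape.
function shapeOf(cmd: string, args: string[]): string {
  const verb = cmd.toUpperCase();
  if (SCRIPTED.has(verb) && args.length > 0) {
    return `${verb}:${args[0]}:${args.length - 1}`; // e.g. "EVALSHA:<sha>:2"
  }
  return `${verb}:${args.length}`; // e.g. "LPUSH:2"
}

// Client-analytics records only the verb; "VERB:*" covers any arity.
function coveredByBaseline(shape: string, baseline: Set<string>): boolean {
  if (baseline.has(shape)) return true;
  const verb = shape.split(':')[0];
  return baseline.has(`${verb}:*`);
}
```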
Wires the cross-reference engine from PR 12 into the session-detail
page. Four-section panel displays new command shapes, hot-key delta,
slowlog regressions, and ACL / audit deltas. A baseline selector
swaps between 6h / 24h / 7d / same-hour-last-week and triggers a
fresh compute via react-query.
- New CrossReferencePanel component (one per file).
- 4 Section cards in a 2x2 grid:
- New command shapes: counted, with a script badge for EVALSHA /
FCALL / FCALL_RO rows; empty-state copy "every captured command
was seen in baseline".
- Hot-key delta: two sub-lists — newInTopK (rank #N) and
rankChanges (#baseline ↑/↓ #capture with arrow); empty-state
"No hot-key shifts".
- Slowlog regressions: per verb, observed/s vs baseline p95/s;
empty-state explains the meaning.
- ACL / audit deltas: audit count + INFO counter deltas, with a
note that the counter delta is pending session-boundary
snapshots (follow-up PR).
- Baseline selector is a <select> bound to react-query so caching
works per (sessionId, baseline) tuple — switching back and forth
is instant after the first compute. A small "refreshing…"
indicator shows when the query is refetching in the background.
- API client extended with crossReference(sessionId, baseline) and
the full CrossReferenceResult shape mirrored from the backend.
Verification (Playwright, live):
- Seeded slowlog with a GET row 10 min before session start.
- Captured a session with GET foo (×5) + LPUSH x v (×5).
- /monitor/sessions/:id renders all 4 sections.
- 24h baseline → newShapes shows LPUSH:2, AUTH:1, PING:0 (GET:1
correctly absent — baseline covers it). Hot keys foo + x flagged
in newInTopK.
- Switching baseline to same-hour-last-week → GET:1 flips back into
newShapes (the seeded row is 10 min ago, not in last week's
one-hour window).
Screenshots:
- docs/assets/pr13-cross-reference-24h.png
- docs/assets/pr13-cross-reference-same-hour-last-week.png
Part of PR 13 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md (Phase 3,
closes the cross-reference frontend).
PR 14a of the split Phase 4. Lands per-node selection end-to-end:
the start-session modal queries cluster topology and renders a
node dropdown when the target is a cluster; the API opens a
dedicated MONITOR connection to the chosen node via the existing
ClusterDiscoveryService.
Fan-out + partial-failure handling is split into a follow-up PR
14b. Reason: fan-out requires multi-writer orchestration, a
node_id column on capture_chunks, and per-node status persistence
on capture_sessions — each non-trivial. Per-node alone is what
unblocks the spec's main UX promise (the dropdown) and gives us a
working cluster path without rewriting CaptureWriter's 1:1
source-to-writer assumption.
Backend:
- StoredCaptureSession gains optional targetNode (host:port string,
null for non-cluster sessions).
- capture_sessions schema: new target_node TEXT column; idempotent
ALTER added for existing sqlite + postgres deployments.
- StartSessionInput accepts optional targetNodeId; MonitorCaptureService
  resolves it to the node address for persistence and
hands it to the monitorSourceFactory.
- monitorSourceFactory grows a second targetNodeId parameter; the
default factory uses ClusterDiscoveryService.getNodeConnection
to open a dedicated iovalkey client to that node, then wraps in
the existing iovalkey-monitor-source.
- New GET /monitor/connections/:id/nodes endpoint: returns
{ isCluster, nodes: [{id, address, role, healthy}] }. Reports
isCluster:false (with empty nodes) for both single-instance
connections AND for cluster-discovery failures, so the frontend
hides the dropdown gracefully when CLUSTER NODES is not
supported.
Frontend:
- start-session-modal.tsx queries /monitor/connections/:id/nodes
when the modal opens. If the response reports isCluster:true,
renders a Cluster node <select> auto-defaulted to the first
master node. The targetNodeId is passed in the POST body.
- Selector hidden completely for non-cluster connections; the
rest of the modal is unchanged.
- A small caption under the selector tells the user that MONITOR
is per-node and that fan-out lands in a follow-up.
Tests (3 new service cases + 4 new controller cases, all green):
- Service records target_node when a cluster node id is supplied
and resolved via discovery.
- Service passes targetNodeId through to the monitor source
factory.
- Service falls back to the supplied id when discovery throws.
- Controller exposes nodes as {isCluster:false, nodes:[]} for an
empty/no-cluster discovery; {isCluster:true} with descriptor
list otherwise; treats discovery error as non-cluster.
- Controller forwards targetNodeId through to the service.
Total backend suite: 224 tests across 15 suites.
Verification (live, dev compose is single-Valkey):
- GET /monitor/connections/env-default/nodes → {isCluster:false,
nodes:[]} (single-Valkey is correctly reported as non-cluster).
- Browser: click Start session → modal opens with NO cluster-node
selector (only Duration + Requested by). Verified via
document.getElementById('targetNode') === null.
- Existing single-instance start flow continues to work
unchanged.
A 3-node cluster live test will land alongside PR 14b — fan-out
needs the same docker compose extension and partial-failure
verification depends on it, so I'm grouping them.
Screenshot at docs/assets/pr14-modal-non-cluster.png.
Part of PR 14 of 25 in
docs/plans/specs/monitor-command/plan-implementation.md (Phase 4,
per-node half).
PR 14b of the split Phase 4. Closes the cluster story: a single
session can fan out to one CaptureWriter per cluster primary, all
writing into one logical sessionId with per-node attribution.
One node disconnecting mid-capture marks that segment failed and
the other writers keep running.
Schema:
- capture_sessions.node_segments (TEXT JSON / JSONB) — array of
{nodeId, address, status, byteCount, lineCount, endedAt,
terminationReason}. Idempotent ALTER ADD COLUMN for both
sqlite + postgres.
- capture_chunks.node_id (TEXT, nullable) — per-chunk node
attribution. NOT in PK; namespace collision is avoided by
per-writer chunk_index ranges (see below).
CaptureWriter:
- New options: nodeId (chunk attribution), startChunkIndex
(per-writer chunk_index namespace), skipSessionFinalize
(fan-out writers don't finalize the session row; the
orchestrator aggregates).
MonitorCaptureService:
- StartSessionInput gains fanOut: boolean.
- When fanOut && cluster: resolveFanOutNodes() returns healthy
primaries via ClusterDiscoveryService, the service opens one
writer per node with chunk_index ranges
[0, NS), [NS, 2*NS), [2*NS, 3*NS), ... where NS = 10_000_000.
Each writer gets skipSessionFinalize:true; the orchestrator
collects per-writer results and aggregates.
- Partial-failure: a writer that fails (source error, factory
reject, or terminate('failed', ...)) updates only its own
node_segments entry; other writers keep going.
- Aggregate: session status = worst-case across segments
(any failed → failed; any truncated → truncated; else
completed). Termination reason rolls up matching the status.
- Single-node path unchanged.
- ActiveSession now holds writers[] instead of a single writer;
stopSession iterates and stops each.
Frontend (start-session-modal):
- New "Fan-out across all primaries (N nodes)" checkbox, shown
only when isCluster. Disables the node selector when on.
- Helper text: "One MONITOR connection per primary. Per-node
status is recorded; one node failing mid-capture does not
stop the others."
MonitorSession page:
- Header gains a "Fan-out segments (N nodes)" panel listing
each node's address, status badge, line count, and
termination reason when nodeSegments is present.
Tests (4 new fan-out cases, all green):
- Opens N writers, attributes chunks per node, aggregates
segments on terminate with the right line/byte totals.
- Partial failure: node B source errors mid-capture → segment
marked failed with source_error reason; node A still
completes; aggregate status = failed.
- monitor_open_failed on one node up front: factory rejects;
segment marked failed; other writers run normally; aggregate
status = failed.
- Non-cluster fanOut request falls back to single-node start
(nodeSegments stays undefined).
Total backend suite: 228 tests across 15 suites.
Live verification: blocked on macOS because
docker-compose.test.yml uses network_mode: host (Linux-only).
The test cluster is reachable from a Linux runner; once a
docker-compose port-mapping change lands, the spec's verification
(docker kill node-2 mid-fan-out) can run from the dev box. Unit
tests cover every fan-out behaviour in full; flagged for the
follow-up that adds the compose port mapping.
Part of PR 14 of 25 (14b of 14) in
docs/plans/specs/monitor-command/plan-implementation.md (Phase 4
complete pending the live test).
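A sketch of the chunk_index namespacing and the worst-case status roll-up described above:

```ts
const NS = 10_000_000; // chunk_index namespace width per writer

// writer 0 → [0, NS), writer 1 → [NS, 2*NS), writer 2 → [2*NS, 3*NS), ...
// so chunks from different nodes can never collide even though node_id is
// not part of the primary key.
function startChunkIndexFor(writerOrdinal: number): number {
  return writerOrdinal * NS;
}

type SegmentStatus = 'completed' | 'truncated' | 'failed';

// Aggregate session status is worst-case across segments.
function aggregateStatus(segments: SegmentStatus[]): SegmentStatus {
  if (segments.includes('failed')) return 'failed';
  if (segments.includes('truncated')) return 'truncated';
  return 'completed';
}
```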
Pro+ anomaly-driven captures. CaptureTriggerRegistry is a deep module:
subscribes to anomaly events (polling), matches against configured triggers,
fires via MonitorCaptureService through the health gate, queues when busy,
expires on schedule.
- New capture_triggers table on sqlite + postgres + memory
- StoredCaptureTrigger / CaptureTriggerStatus types in @betterdb/shared
- Feature.MONITOR_ANOMALY_TRIGGER (Pro+) added to TIER_FEATURES
- POST/GET/DELETE /monitor/triggers, license-gated
- /monitor/triggers added to DemoModeGuard.DENIED_MUTATION_PREFIXES
- Full unit suite: dedup, cancel, fire, single-fire, queue-when-busy,
  health-gate skip, expiry, concurrency, controller endpoint contracts
Self-review fixes for the trigger registry:
- Constructor seeds lastAnomalyAt to Date.now() so a process restart no
  longer replays historical anomalies as if they just arrived;
  resetAnomalyWatermark() exposed as a test seam.
- processNewAnomalies now drains paginated results (offset walk, page size
  200, hard cap 5000/tick) and iterates oldest-first so a partial drain
  still leaves a coherent watermark (sketched below); this fixes the
  'newest 200, silently drop the gap' drift under load.
And on the triggers list endpoint:
- limit/offset go through parsePositiveInt to guard against NaN binding,
  with the limit capped at 1000.
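A sketch of the drain loop; the getAnomalyEvents options shape is an assumption, while the page size, cap, ordering, and watermark rule come from the fix above:

```ts
const PAGE_SIZE = 200;
const MAX_PER_TICK = 5000;

async function drainNewAnomalies(
  getAnomalyEvents: (opts: { afterTs: number; limit: number; offset: number }) => Promise<Array<{ ts: number }>>,
  lastAnomalyAt: number,
  handle: (evt: { ts: number }) => void,
): Promise<number> {
  let watermark = lastAnomalyAt;
  for (let offset = 0; offset < MAX_PER_TICK; offset += PAGE_SIZE) {
    // Oldest-first: even a partial drain advances the watermark coherently.
    const page = await getAnomalyEvents({ afterTs: lastAnomalyAt, limit: PAGE_SIZE, offset });
    for (const evt of page) {
      handle(evt); // match against configured triggers here
      watermark = Math.max(watermark, evt.ts);
    }
    if (page.length < PAGE_SIZE) break;
  }
  return watermark;
}
```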
Two new lifecycle events fired by CaptureTriggerRegistry:
- monitor.trigger.created (Pro tier): dispatched after a successful
  createTrigger; payload includes trigger id, metric/anomaly type, expiry,
  and createdBy
- monitor.session.skipped (community tier): dispatched when the health gate
  denies a triggered fire; payload includes the skip reason returned by
  HealthGateService
Dispatch is fire-and-forget so webhook failures cannot block the trigger
lifecycle; errors are logged and swallowed.
- Registered both events in WebhookEventType + tier lists
- Wired WebhookDispatcherService into CaptureTriggerRegistry
- 5 new tests covering trigger.created, session.skipped, and the no-emit
  cases (successful fire, queue-when-busy, dispatcher error)
Pro+ feature, gated by MONITOR_ANOMALY_TRIGGER:
- New TriggersTable component: created at, status badge, source
  metric/anomaly, expiry (relative), fired session id, created by, Delete
  button on active rows
- Monitor.tsx wraps Sessions in a Tabs container; adds a Triggers tab
  visible only when the license grants MONITOR_ANOMALY_TRIGGER
- monitorApi gains listTriggers / createTrigger / cancelTrigger
- WebhookForm gets labels for monitor.session.skipped and
  monitor.trigger.created (introduced in PR 16)
- Polling refreshes triggers every 5s while the tab/page is open; cancel
  invalidates the query so the row disappears on success
Pro+ feature (gated by MONITOR_ANOMALY_TRIGGER + VITE_MONITOR_DEV_PREVIEW):
- New CaptureOnNextModal: confirms a trigger create call with a prefilled
  metric/anomaly summary, surfaces success and errors, invalidates the
  triggers query on success
- Row action button on each individual anomaly event in the Recent Anomalies
  list and on each anomaly inside an expanded correlated group. Opens the
  modal prefilled with that anomaly's metric + spike/drop, scoped to the
  current connection
- Tightened up pre-existing eslint hygiene (typed buffers, scoped Date.now
  to a useMemo, dropped an inline any) per CLAUDE.md
Pro+ scheduled MONITOR captures via @nestjs/schedule SchedulerRegistry:
- new scheduled_captures table on sqlite + postgres + memory
- StoredScheduledCapture / ScheduledCaptureStatus types in @betterdb/shared
- Feature.MONITOR_SCHEDULED_CAPTURES (Pro+) added to TIER_FEATURES
- CaptureScheduler deep module:
- createSchedule / deleteSchedule / listSchedules / getSchedule
- fireOnce test seam for synchronous tick verification
- per-row Nest interval (single-active-session-per-instance and
health-gate skip enforced upstream by MonitorCaptureService)
- restores enabled rows from storage on module init
- POST/GET/DELETE /monitor/schedules, license-gated
- /monitor/schedules added to DemoModeGuard.DENIED_MUTATION_PREFIXES
- Full unit suite covering create/delete, validation, health-gate
skip, session-already-active skip, start_failed, list ordering,
and controller endpoint contracts
Self-review fix: limit/offset on GET /monitor/schedules now go through the same parsePositiveInt helper used by sessions and triggers. NaN inputs no longer reach the storage adapter, and limit is capped at 1000 to bound payload size.
Backend:
- Add nullable cron_expression column to scheduled_captures (sqlite +
  postgres + memory). A CHECK constraint enforces exactly one of
  intervalSeconds or cronExpression
- Idempotent migrations bring PR 19 deployments forward
- StoredScheduledCapture / ScheduledCapturePatch carry optional
  cronExpression
- CaptureScheduler registers a CronJob via SchedulerRegistry.addCronJob for
  cron rows; setInterval-backed timer for interval rows (sketched below)
- POST /monitor/schedules accepts either intervalSeconds or cronExpression
  and rejects when both or neither are supplied
Frontend:
- New Scheduled tab on /monitor (gated by MONITOR_SCHEDULED_CAPTURES) with
  SchedulesTable showing cadence, duration, last fire info, and delete
  action
- CreateScheduleModal: amount + unit interval picker by default; Advanced
  toggle reveals a cron expression field; both pass through
  monitorApi.createSchedule
Tests:
- capture-scheduler.spec: cron register/unregister, validation rejection for
  missing/conflicting/invalid specs
- monitor.controller.spec: BadRequest when both interval and cron are
  missing; cronExpression forwarding to the scheduler
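A sketch of the per-row timer registration, using @nestjs/schedule's SchedulerRegistry and the cron package's CronJob as the PR describes; the surrounding function is illustrative:

```ts
import { SchedulerRegistry } from '@nestjs/schedule';
import { CronJob } from 'cron';

interface ScheduleRow {
  id: string;
  intervalSeconds?: number;
  cronExpression?: string; // exactly one of the two is set (DB CHECK)
}

function registerSchedule(registry: SchedulerRegistry, row: ScheduleRow, fire: () => void): void {
  if (row.cronExpression) {
    const job = new CronJob(row.cronExpression, fire);
    registry.addCronJob(row.id, job); // named so deleteSchedule can unregister
    job.start();
  } else if (row.intervalSeconds) {
    registry.addInterval(row.id, setInterval(fire, row.intervalSeconds * 1000));
  }
}
```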
Pro+ feature (Feature.MONITOR_CAPTURE_DIFF):
- CrossReferenceEngine.computeCaptureDiff(sessionA, sessionB) reuses the
  existing pure helpers (computeNewShapes + computeHotKeyDelta) with B's
  parsed lines as the baseline source; slowlog and ACL axes return empty
  payloads (those dimensions scope to connection history, not a single
  capture)
- New pure helpers: collectShapesFromLines, collectKeyCountsFromLines
- CrossReferenceResult.baseline.window now accepts 'capture' and a
  baseline.sessionId field surfaces the diffed-against capture id
- GET /monitor/sessions/:id/diff?vs=:otherId is license-gated; 400 on
  self-diff or missing vs; 404 when either session is missing
- Web: new CompareCapturesPanel mounted on MonitorSession behind the feature
  flag. Picks any other completed capture on the same connection from a
  dropdown, fetches the diff lazily, and renders the shared
  CrossReferenceSections (now exported)
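A self-contained sketch of the baseline swap (hot-key delta omitted for brevity); the helper body is a minimal stand-in for the real collectShapesFromLines:

```ts
type ParsedLine = { cmd: string; args: string[]; key?: string };

// Minimal stand-in for the helper named above.
function collectShapesFromLines(lines: ParsedLine[]): Set<string> {
  return new Set(lines.map((l) => `${l.cmd.toUpperCase()}:${l.args.length}`));
}

function computeCaptureDiffSketch(a: ParsedLine[], b: ParsedLine[], bSessionId: string) {
  const baseline = collectShapesFromLines(b); // B's lines ARE the baseline
  const newShapes = [...collectShapesFromLines(a)].filter((s) => !baseline.has(s));
  return {
    newShapes,
    slowlogRegressions: [], // scoped to connection history, not one capture
    aclDeltas: [],          // likewise empty for capture-vs-capture
    baseline: { window: 'capture' as const, sessionId: bSessionId },
  };
}
```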
jamby77 added a commit that referenced this pull request on May 14, 2026.
KIvanow approved these changes on May 14, 2026.
# Conflicts:
#   apps/api/src/monitor/__tests__/cross-reference.engine.spec.ts
#   apps/api/src/monitor/__tests__/monitor.controller.spec.ts
#   apps/api/src/monitor/capture-writer.ts
#   apps/api/src/monitor/cross-reference.engine.ts
#   apps/api/src/monitor/monitor.controller.ts
#   apps/web/src/api/monitor.ts
#   apps/web/src/pages/MonitorSession.tsx
#   apps/web/src/pages/monitor/cross-reference-panel.tsx
#   packages/shared/src/license/types.ts
Summary
PR 21 of the MONITOR stack. Adds the Pro+ capture-vs-capture diff so users can answer "what changed between these two captures?" without leaving the session detail page.
Backend
Frontend
Tests
How to verify
Requires `MONITOR_DEV_PREVIEW=true VITE_MONITOR_DEV_PREVIEW=true` and a Pro+ license.
Test plan