feature(monitor): cluster fan-out with partial-failure handling (PR 14b of 14) #178

Open

jamby77 wants to merge 1 commit into feature/monitor-cluster-per-node from feature/monitor-cluster-fanout

Conversation

@jamby77 (Collaborator) commented May 11, 2026

Summary

PR 14b of the split Phase 4 in docs/plans/specs/monitor-command/plan-implementation.md. Stacked on top of #177 (PR 14a — per-node selector). Closes the cluster story. A single session can fan out to one CaptureWriter per cluster primary, all writing into one logical sessionId with per-node attribution. One node disconnecting mid-capture marks that segment failed and the other writers keep running.

Schema

  • capture_sessions.node_segments (TEXT JSON / JSONB) — array of {nodeId, address, status, byteCount, lineCount, endedAt, terminationReason}. Idempotent ALTER ADD COLUMN for both sqlite + postgres (a migration sketch follows this list).
  • capture_chunks.node_id (TEXT, nullable) — per-chunk node attribution. NOT in PK; namespace collision is avoided by per-writer chunk_index ranges (each writer gets [N*10_000_000, (N+1)*10_000_000)).
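
For concreteness, a minimal sketch of what the idempotent migration could look like, assuming a generic `db.exec` handle and string-matched duplicate-column errors — neither is the PR's actual adapter code, and the PR touches three adapters where this shows only the shared idea:

```ts
// Hypothetical sketch — the adapter handle and the duplicate-column guard
// are assumptions, not this PR's migration code.
async function addFanOutColumns(db: { exec(sql: string): Promise<void> }): Promise<void> {
  const statements = [
    // sqlite: TEXT holding a JSON array; the postgres adapter would use JSONB
    `ALTER TABLE capture_sessions ADD COLUMN node_segments TEXT`,
    `ALTER TABLE capture_chunks ADD COLUMN node_id TEXT`,
  ];
  for (const sql of statements) {
    try {
      await db.exec(sql);
    } catch (err) {
      // Idempotency: tolerate "duplicate column" so re-running the migration is safe
      if (!/duplicate column|already exists/i.test(String(err))) throw err;
    }
  }
}
```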

CaptureWriter

  • New options: nodeId (chunk attribution), startChunkIndex (chunk_index namespace), skipSessionFinalize (fan-out writers don't finalize the session row; the orchestrator aggregates).
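
Roughly, the extended option surface could look like this — only the three new fields are confirmed by this PR; everything else is illustrative:

```ts
// Sketch of the extended options — nodeId, startChunkIndex and
// skipSessionFinalize are from this PR; the other fields are assumed.
interface CaptureWriterOptions {
  sessionId: string;             // illustrative; presumably already present
  nodeId?: string;               // stamped onto each capture_chunks row
  startChunkIndex?: number;      // first chunk_index this writer may use
  skipSessionFinalize?: boolean; // fan-out: the orchestrator writes the final session row
}
```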

MonitorCaptureService

  • StartSessionInput.fanOut: boolean.
  • When fanOut && cluster: resolveFanOutNodes() returns healthy primaries via ClusterDiscoveryService, and the service opens one writer per node with its own chunk_index range (see the sketch after this list).
  • Partial-failure: a writer that fails (source error, factory reject, or terminate('failed', ...)) updates only its own node_segments entry; other writers keep going.
  • Aggregate: session status = worst-case across segments (any failed → failed; any truncated → truncated; else completed). Termination reason rolls up matching the status.
  • Single-node path unchanged.
  • ActiveSession now holds writers[] instead of a single writer; stopSession iterates and stops each.
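
A sketch of how the fan-out open could be wired. The names `resolveFanOutNodes` and the writer-per-node shape come from the description above; the stub signatures here are assumptions, not the service's actual API:

```ts
// Hypothetical orchestration sketch — stubs stand in for everything the
// PR description doesn't spell out.
type FanOutNode = { nodeId: string; address: string };
type Writer = { stop(): Promise<void> };
declare function resolveFanOutNodes(): Promise<FanOutNode[]>;
declare function openCaptureWriter(opts: {
  sessionId: string;
  nodeId: string;
  startChunkIndex: number;
  skipSessionFinalize: boolean;
}): Promise<Writer>;

const CHUNK_NAMESPACE = 10_000_000;

async function startFanOut(sessionId: string): Promise<Writer[]> {
  const nodes = await resolveFanOutNodes(); // healthy primaries via ClusterDiscoveryService
  return Promise.all(
    nodes.map((node, i) =>
      openCaptureWriter({
        sessionId,
        nodeId: node.nodeId,
        startChunkIndex: i * CHUNK_NAMESPACE, // writer i owns [i*10M, (i+1)*10M)
        skipSessionFinalize: true,            // final session row is aggregated centrally
      }),
    ),
  ); // ActiveSession holds this writers[] array; stopSession stops each
}
```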

Frontend

  • Start-session modal: new "Fan-out across all primaries (N nodes)" checkbox, shown only when isCluster; it disables the node selector when on. Helper text explains the per-node failure semantics (a rough shape follows this list).
  • Session-detail page: header gains a "Fan-out segments (N nodes)" panel listing each node's address, status badge, line count, and termination reason when nodeSegments is present.
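
A rough shape for the toggle — component and prop names here are hypothetical, not the PR's JSX:

```tsx
// Illustrative sketch only — names are assumptions, not the actual component.
function FanOutToggle(props: {
  isCluster: boolean;
  nodeCount: number;
  fanOut: boolean;
  onChange: (v: boolean) => void;
}) {
  if (!props.isCluster) return null; // checkbox only renders for cluster targets
  return (
    <label>
      <input
        type="checkbox"
        checked={props.fanOut}
        onChange={(e) => props.onChange(e.target.checked)}
      />
      Fan-out across all primaries ({props.nodeCount} nodes)
    </label>
  );
}
```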

Test plan

  • SKIP_DOCKER_SETUP=true pnpm --filter api test -- --testPathPatterns "monitor|capture-sessions|health-gate|provider-detector|acl-checker|preflight|capture-writer|demo-mode|tail|monitor-line|cross-reference" → 228 tests across 15 suites, all pass
  • pnpm --filter api exec tsc --noEmit → exit 0
  • pnpm --filter web exec tsc --noEmit → exit 0
  • 4 new fan-out service cases pass: writers opened per node + chunks attributed; per-node partial failure (source error); partial failure (factory reject); non-cluster fan-out falls back to single-node

Live verification deferred — see "Live verification gap" below.

Notes for reviewers

  • Live verification gap (and how to close it). docker-compose.test.yml puts the 3-node cluster in network_mode: host, which on macOS Docker Desktop does NOT expose ports to the host. From a Linux runner the cluster is reachable at localhost:6401/6402/6403 and the spec's docker kill node-2 scenario can run end-to-end; from macOS it cannot without a compose change. Two ways to close this:

    • Add explicit ports: ["6401:6401", "6402:6402", "6403:6403"] to each cluster service in docker-compose.test.yml (would work cross-platform; sketched below).
    • Run the live test in CI (Linux) as part of the integration suite.

    I've left the compose file alone in this PR to keep scope tight; happy to follow up with the port-mapping change if you'd prefer.
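
    For the first option, the compose change would be along these lines. The service names are guesses extrapolated from the spec's `docker kill node-2` example; only the port numbers come from this PR:

    ```yaml
    # Sketch — service names are assumptions. Adding ports implies dropping
    # network_mode: host, since compose rejects the combination.
    services:
      node-1:
        ports: ["6401:6401"]
      node-2:
        ports: ["6402:6402"]
      node-3:
        ports: ["6403:6403"]
    ```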

  • Chunk index namespacing (10M per writer) avoids a (session_id, chunk_index) PK collision across fan-out writers without changing the PK. A capture would have to push 10M chunks from one node to overflow the budget — far beyond any realistic session. If a future user surfaces that limit, the natural fix is to add node_id to the PK and migrate; the column already exists.

  • skipSessionFinalize: true is the cleanest split between writer and orchestrator I could find. The writer still owns its own chunk-flush + terminate lifecycle; only the final session-row UPDATE is centralized so per-writer races don't clobber the aggregate.

  • Aggregate status rules: any failed segment → failed; any truncated segment → truncated; else completed. Documented in aggregateSegmentStatus (exported pure function, testable in isolation).
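
The rule is small enough to sketch. The function name is from this PR; the exact signature and the status union are assumed from the node_segments shape above:

```ts
// Sketch of the worst-case roll-up — signature assumed, rule as documented.
type SegmentStatus = 'completed' | 'truncated' | 'failed';

function aggregateSegmentStatus(segments: Array<{ status: SegmentStatus }>): SegmentStatus {
  if (segments.some((s) => s.status === 'failed')) return 'failed';
  if (segments.some((s) => s.status === 'truncated')) return 'truncated';
  return 'completed';
}
```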

  • Net diff: 592 lines. Heavier than the budget, but the composition is hard to reduce: schema in 3 adapters (~120) + CaptureWriter changes (~50) + service multi-writer orchestration (~200) + 4 fan-out tests (~140) + frontend toggle + session-header per-node panel (~50). Splitting further would leave intermediate PRs without a verifiable surface.

  • getActiveWriter(connectionId) currently returns the first writer for fan-out sessions; the tail UI doesn't yet interleave lines across nodes. Cross-node line ordering belongs in a future tail-UI iteration (when the tail page gets a node filter).

Stacked PR

Base branch is feature/monitor-cluster-per-node (#177), so the diff shown is ONLY PR 14b changes. Together with #177, Phase 4 is complete pending the live cluster verification.

jamby77 force-pushed the feature/monitor-cluster-per-node branch from 68e5d64 to d1b2376 on May 12, 2026 11:12
jamby77 force-pushed the feature/monitor-cluster-fanout branch from 6f1f007 to e4e2c8a on May 12, 2026 11:12
jamby77 force-pushed the feature/monitor-cluster-per-node branch from d1b2376 to b90faee on May 12, 2026 12:39
jamby77 closed this on May 12, 2026
jamby77 force-pushed the feature/monitor-cluster-fanout branch from e4e2c8a to b90faee on May 12, 2026 12:39
github-actions bot locked and limited conversation to collaborators on May 12, 2026
jamby77 reopened this on May 13, 2026
jamby77 force-pushed the feature/monitor-cluster-per-node branch from c9288a1 to f401307 on May 13, 2026 12:35
PR 14b of the split Phase 4. Closes the cluster story: a single
session can fan out to one CaptureWriter per cluster primary, all
writing into one logical sessionId with per-node attribution.
One node disconnecting mid-capture marks that segment failed and
the other writers keep running.

Schema:
- capture_sessions.node_segments (TEXT JSON / JSONB) — array of
  {nodeId, address, status, byteCount, lineCount, endedAt,
   terminationReason}. Idempotent ALTER ADD COLUMN for both
   sqlite + postgres.
- capture_chunks.node_id (TEXT, nullable) — per-chunk node
  attribution. NOT in PK; namespace collision is avoided by
  per-writer chunk_index ranges (see below).

CaptureWriter:
- New options: nodeId (chunk attribution), startChunkIndex
  (per-writer chunk_index namespace), skipSessionFinalize
  (fan-out writers don't finalize the session row; the
  orchestrator aggregates).

MonitorCaptureService:
- StartSessionInput gains fanOut: boolean.
- When fanOut && cluster: resolveFanOutNodes() returns healthy
  primaries via ClusterDiscoveryService, the service opens one
  writer per node with chunk_index ranges
  [0, NS), [NS, 2*NS), [2*NS, 3*NS), ... where NS = 10_000_000.
  Each writer gets skipSessionFinalize:true; the orchestrator
  collects per-writer results and aggregates.
- Partial-failure: a writer that fails (source error, factory
  reject, or terminate('failed', ...)) updates only its own
  node_segments entry; other writers keep going.
- Aggregate: session status = worst-case across segments
  (any failed → failed; any truncated → truncated; else
  completed). Termination reason rolls up matching the status.
- Single-node path unchanged.
- ActiveSession now holds writers[] instead of a single writer;
  stopSession iterates and stops each.

Frontend (start-session-modal):
- New "Fan-out across all primaries (N nodes)" checkbox, shown
  only when isCluster. Disables the node selector when on.
- Helper text: "One MONITOR connection per primary. Per-node
  status is recorded; one node failing mid-capture does not
  stop the others."

MonitorSession page:
- Header gains a "Fan-out segments (N nodes)" panel listing
  each node's address, status badge, line count, and
  termination reason when nodeSegments is present.

Tests (4 new fan-out cases, all green):
- Opens N writers, attributes chunks per node, aggregates
  segments on terminate with the right line/byte totals.
- Partial failure: node B source errors mid-capture → segment
  marked failed with source_error reason; node A still
  completes; aggregate status = failed.
- monitor_open_failed on one node up front: factory rejects;
  segment marked failed; other writers run normally; aggregate
  status = failed.
- Non-cluster fanOut request falls back to single-node start
  (nodeSegments stays undefined).

Total backend suite: 228 tests across 15 suites.

Live verification: blocked on macOS because
docker-compose.test.yml uses network_mode: host (Linux-only).
The test cluster is reachable from a Linux runner; once a
docker-compose port-mapping change lands, the spec's verification
(docker kill node-2 mid-fan-out) can run from the dev box. Unit
tests cover every fan-out behaviour in full; flagged for the
follow-up that adds the compose port mapping.

Part of PR 14 of 25 (14b of 14) in
docs/plans/specs/monitor-command/plan-implementation.md (Phase 4
complete pending the live test).
@jamby77 jamby77 force-pushed the feature/monitor-cluster-fanout branch from bd141cb to 5c06fd8 Compare May 13, 2026 12:36
