feature(monitor): cluster fan-out with partial-failure handling (PR 14b of 14) #178

Open

jamby77 wants to merge 1 commit into feature/monitor-cluster-per-node from feature/monitor-cluster-fanout

Conversation

@jamby77 (Collaborator) commented May 11, 2026

Summary

PR 14b of the split Phase 4 in docs/plans/specs/monitor-command/plan-implementation.md. Stacked on top of #177 (PR 14a — per-node selector). Closes the cluster story. A single session can fan out to one CaptureWriter per cluster primary, all writing into one logical sessionId with per-node attribution. One node disconnecting mid-capture marks that segment failed and the other writers keep running.

Schema

  • capture_sessions.node_segments (TEXT JSON / JSONB) — array of {nodeId, address, status, byteCount, lineCount, endedAt, terminationReason}. Idempotent ALTER ADD COLUMN for both sqlite + postgres (a migration sketch follows this list).
  • capture_chunks.node_id (TEXT, nullable) — per-chunk node attribution. NOT in PK; namespace collision is avoided by per-writer chunk_index ranges (each writer gets [N*10_000_000, (N+1)*10_000_000)).
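
For concreteness, a minimal sketch of what the idempotent migration could look like, assuming a generic `db.exec` handle and string-matched duplicate-column errors — neither is the PR's actual adapter code, and the PR touches three adapters where this shows only the shared idea:

```ts
// Hypothetical sketch — the adapter handle and the duplicate-column guard
// are assumptions, not this PR's migration code.
async function addFanOutColumns(db: { exec(sql: string): Promise<void> }): Promise<void> {
  const statements = [
    // sqlite: TEXT holding a JSON array; the postgres adapter would use JSONB
    `ALTER TABLE capture_sessions ADD COLUMN node_segments TEXT`,
    `ALTER TABLE capture_chunks ADD COLUMN node_id TEXT`,
  ];
  for (const sql of statements) {
    try {
      await db.exec(sql);
    } catch (err) {
      // Idempotency: tolerate "duplicate column" so re-running the migration is safe
      if (!/duplicate column|already exists/i.test(String(err))) throw err;
    }
  }
}
```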

CaptureWriter

  • New options: nodeId (chunk attribution), startChunkIndex (chunk_index namespace), skipSessionFinalize (fan-out writers don't finalize the session row; the orchestrator aggregates).
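
Roughly, the extended option surface could look like this — only the three new fields are confirmed by this PR; everything else is illustrative:

```ts
// Sketch of the extended options — nodeId, startChunkIndex and
// skipSessionFinalize are from this PR; the other fields are assumed.
interface CaptureWriterOptions {
  sessionId: string;             // illustrative; presumably already present
  nodeId?: string;               // stamped onto each capture_chunks row
  startChunkIndex?: number;      // first chunk_index this writer may use
  skipSessionFinalize?: boolean; // fan-out: the orchestrator writes the final session row
}
```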

MonitorCaptureService

  • StartSessionInput.fanOut: boolean.
  • When fanOut && cluster: resolveFanOutNodes() returns healthy primaries via ClusterDiscoveryService, and the service opens one writer per node with its own chunk_index range (see the sketch after this list).
  • Partial-failure: a writer that fails (source error, factory reject, or terminate('failed', ...)) updates only its own node_segments entry; other writers keep going.
  • Aggregate: session status = worst-case across segments (any failed → failed; any truncated → truncated; else completed). Termination reason rolls up matching the status.
  • Single-node path unchanged.
  • ActiveSession now holds writers[] instead of a single writer; stopSession iterates and stops each.
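
A sketch of how the fan-out open could be wired. The names `resolveFanOutNodes` and the writer-per-node shape come from the description above; the stub signatures here are assumptions, not the service's actual API:

```ts
// Hypothetical orchestration sketch — stubs stand in for everything the
// PR description doesn't spell out.
type FanOutNode = { nodeId: string; address: string };
type Writer = { stop(): Promise<void> };
declare function resolveFanOutNodes(): Promise<FanOutNode[]>;
declare function openCaptureWriter(opts: {
  sessionId: string;
  nodeId: string;
  startChunkIndex: number;
  skipSessionFinalize: boolean;
}): Promise<Writer>;

const CHUNK_NAMESPACE = 10_000_000;

async function startFanOut(sessionId: string): Promise<Writer[]> {
  const nodes = await resolveFanOutNodes(); // healthy primaries via ClusterDiscoveryService
  return Promise.all(
    nodes.map((node, i) =>
      openCaptureWriter({
        sessionId,
        nodeId: node.nodeId,
        startChunkIndex: i * CHUNK_NAMESPACE, // writer i owns [i*10M, (i+1)*10M)
        skipSessionFinalize: true,            // final session row is aggregated centrally
      }),
    ),
  ); // ActiveSession holds this writers[] array; stopSession stops each
}
```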

Frontend

  • Start-session modal: new "Fan-out across all primaries (N nodes)" checkbox, shown only when isCluster; it disables the node selector when on. Helper text explains the per-node failure semantics (a rough shape follows this list).
  • Session-detail page: header gains a "Fan-out segments (N nodes)" panel listing each node's address, status badge, line count, and termination reason when nodeSegments is present.
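
A rough shape for the toggle — component and prop names here are hypothetical, not the PR's JSX:

```tsx
// Illustrative sketch only — names are assumptions, not the actual component.
function FanOutToggle(props: {
  isCluster: boolean;
  nodeCount: number;
  fanOut: boolean;
  onChange: (v: boolean) => void;
}) {
  if (!props.isCluster) return null; // checkbox only renders for cluster targets
  return (
    <label>
      <input
        type="checkbox"
        checked={props.fanOut}
        onChange={(e) => props.onChange(e.target.checked)}
      />
      Fan-out across all primaries ({props.nodeCount} nodes)
    </label>
  );
}
```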

Test plan

  • SKIP_DOCKER_SETUP=true pnpm --filter api test -- --testPathPatterns "monitor|capture-sessions|health-gate|provider-detector|acl-checker|preflight|capture-writer|demo-mode|tail|monitor-line|cross-reference" → 228 tests across 15 suites, all pass
  • pnpm --filter api exec tsc --noEmit → exit 0
  • pnpm --filter web exec tsc --noEmit → exit 0
  • 4 new fan-out service cases pass: writers opened per node + chunks attributed; per-node partial failure (source error); partial failure (factory reject); non-cluster fan-out falls back to single-node

Live verification deferred — see "Live verification gap" below.

Notes for reviewers

  • Live verification gap (and how to close it). docker-compose.test.yml puts the 3-node cluster in network_mode: host, which on macOS Docker Desktop does NOT expose ports to the host. From a Linux runner the cluster is reachable at localhost:6401/6402/6403 and the spec's docker kill node-2 scenario can run end-to-end; from macOS it cannot without a compose change. Two ways to close this:

    • Add explicit ports: ["6401:6401", "6402:6402", "6403:6403"] to each cluster service in docker-compose.test.yml (would work cross-platform; sketched below).
    • Run the live test in CI (Linux) as part of the integration suite.

    I've left the compose file alone in this PR to keep scope tight; happy to follow up with the port-mapping change if you'd prefer.
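
    For the first option, the compose change would be along these lines. The service names are guesses extrapolated from the spec's `docker kill node-2` example; only the port numbers come from this PR:

    ```yaml
    # Sketch — service names are assumptions. Adding ports implies dropping
    # network_mode: host, since compose rejects the combination.
    services:
      node-1:
        ports: ["6401:6401"]
      node-2:
        ports: ["6402:6402"]
      node-3:
        ports: ["6403:6403"]
    ```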

  • Chunk index namespacing (10M per writer) avoids a (session_id, chunk_index) PK collision across fan-out writers without changing the PK. A capture would have to push 10M chunks from one node to overflow the budget — far beyond any realistic session. If a future user surfaces that limit, the natural fix is to add node_id to the PK and migrate; the column already exists.

  • skipSessionFinalize: true is the cleanest split between writer and orchestrator I could find. The writer still owns its own chunk-flush + terminate lifecycle; only the final session-row UPDATE is centralized so per-writer races don't clobber the aggregate.

  • Aggregate status rules: any failed segment → failed; any truncated segment → truncated; else completed. Documented in aggregateSegmentStatus (exported pure function, testable in isolation).
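
The rule is small enough to sketch. The function name is from this PR; the exact signature and the status union are assumed from the node_segments shape above:

```ts
// Sketch of the worst-case roll-up — signature assumed, rule as documented.
type SegmentStatus = 'completed' | 'truncated' | 'failed';

function aggregateSegmentStatus(segments: Array<{ status: SegmentStatus }>): SegmentStatus {
  if (segments.some((s) => s.status === 'failed')) return 'failed';
  if (segments.some((s) => s.status === 'truncated')) return 'truncated';
  return 'completed';
}
```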

  • Net diff: 592 lines. Heavier than the budget, but the composition is hard to reduce: schema in 3 adapters (~120) + CaptureWriter changes (~50) + service multi-writer orchestration (~200) + 4 fan-out tests (~140) + frontend toggle + session-header per-node panel (~50). Splitting further would leave intermediate PRs without a verifiable surface.

  • getActiveWriter(connectionId) currently returns the first writer for fan-out sessions; the tail UI doesn't yet interleave lines across nodes. Cross-node line ordering belongs in a future tail-UI iteration (when the tail page gets a node filter).

Stacked PR

Base branch is feature/monitor-cluster-per-node (#177), so the diff shown is ONLY PR 14b changes. Together with #177, Phase 4 is complete pending the live cluster verification.

jamby77 force-pushed the feature/monitor-cluster-per-node branch from 68e5d64 to d1b2376 on May 12, 2026 11:12
jamby77 force-pushed the feature/monitor-cluster-fanout branch from 6f1f007 to e4e2c8a on May 12, 2026 11:12
jamby77 force-pushed the feature/monitor-cluster-per-node branch from d1b2376 to b90faee on May 12, 2026 12:39
jamby77 closed this on May 12, 2026
jamby77 force-pushed the feature/monitor-cluster-fanout branch from e4e2c8a to b90faee on May 12, 2026 12:39
github-actions bot locked and limited conversation to collaborators on May 12, 2026
jamby77 reopened this on May 13, 2026
jamby77 force-pushed the feature/monitor-cluster-per-node branch from c9288a1 to f401307 on May 13, 2026 12:35
PR 14b of the split Phase 4. Closes the cluster story: a single
session can fan out to one CaptureWriter per cluster primary, all
writing into one logical sessionId with per-node attribution.
One node disconnecting mid-capture marks that segment failed and
the other writers keep running.

Schema:
- capture_sessions.node_segments (TEXT JSON / JSONB) — array of
  {nodeId, address, status, byteCount, lineCount, endedAt,
   terminationReason}. Idempotent ALTER ADD COLUMN for both
   sqlite + postgres.
- capture_chunks.node_id (TEXT, nullable) — per-chunk node
  attribution. NOT in PK; namespace collision is avoided by
  per-writer chunk_index ranges (see below).

CaptureWriter:
- New options: nodeId (chunk attribution), startChunkIndex
  (per-writer chunk_index namespace), skipSessionFinalize
  (fan-out writers don't finalize the session row; the
  orchestrator aggregates).

MonitorCaptureService:
- StartSessionInput gains fanOut: boolean.
- When fanOut && cluster: resolveFanOutNodes() returns healthy
  primaries via ClusterDiscoveryService, the service opens one
  writer per node with chunk_index ranges
  [0, NS), [NS, 2*NS), [2*NS, 3*NS), ... where NS = 10_000_000.
  Each writer gets skipSessionFinalize:true; the orchestrator
  collects per-writer results and aggregates.
- Partial-failure: a writer that fails (source error, factory
  reject, or terminate('failed', ...)) updates only its own
  node_segments entry; other writers keep going.
- Aggregate: session status = worst-case across segments
  (any failed → failed; any truncated → truncated; else
  completed). Termination reason rolls up matching the status.
- Single-node path unchanged.
- ActiveSession now holds writers[] instead of a single writer;
  stopSession iterates and stops each.

Frontend (start-session-modal):
- New "Fan-out across all primaries (N nodes)" checkbox, shown
  only when isCluster. Disables the node selector when on.
- Helper text: "One MONITOR connection per primary. Per-node
  status is recorded; one node failing mid-capture does not
  stop the others."

MonitorSession page:
- Header gains a "Fan-out segments (N nodes)" panel listing
  each node's address, status badge, line count, and
  termination reason when nodeSegments is present.

Tests (4 new fan-out cases, all green):
- Opens N writers, attributes chunks per node, aggregates
  segments on terminate with the right line/byte totals.
- Partial failure: node B source errors mid-capture → segment
  marked failed with source_error reason; node A still
  completes; aggregate status = failed.
- monitor_open_failed on one node up front: factory rejects;
  segment marked failed; other writers run normally; aggregate
  status = failed.
- Non-cluster fanOut request falls back to single-node start
  (nodeSegments stays undefined).

Total backend suite: 228 tests across 15 suites.

Live verification: blocked on macOS because
docker-compose.test.yml uses network_mode: host (Linux-only).
The test cluster is reachable from a Linux runner; once a
docker-compose port-mapping change lands, the spec's verification
(docker kill node-2 mid-fan-out) can run from the dev box. Unit
tests cover every fan-out behaviour in full; flagged for the
follow-up that adds the compose port mapping.

Part of PR 14 of 25 (14b of 14) in
docs/plans/specs/monitor-command/plan-implementation.md (Phase 4
complete pending the live test).
@jamby77 jamby77 force-pushed the feature/monitor-cluster-fanout branch from bd141cb to 5c06fd8 Compare May 13, 2026 12:36
