Skip to content

feat(logger): disk-backed hybrid log store for bounded CLI memory (PER-7809)#2193

Open
pranavz28 wants to merge 9 commits intomasterfrom
feat/per-7809-hybrid-log-store
Open

feat(logger): disk-backed hybrid log store for bounded CLI memory (PER-7809)#2193
pranavz28 wants to merge 9 commits intomasterfrom
feat/per-7809-hybrid-log-store

Conversation

@pranavz28
Copy link
Copy Markdown
Contributor

@pranavz28 pranavz28 commented Apr 23, 2026

Summary

Replace the unbounded messages = new Set() in @percy/logger with a disk-backed hybrid store. Resident memory stays bounded (~18 MB steady-state on a 10k-snapshot workload vs. hundreds of MB pre-change) without dropping any log entry from the final /logs upload.

Jira: PER-7809

Brainstorm & plan: docs/brainstorms/2026-04-23-cli-log-memory-management-requirements.md and docs/plans/2026-04-23-001-feat-disk-backed-hybrid-log-store-plan.md.

Approach

  • Disk JSONL is source of truth. Every entry is appended to os.tmpdir()/percy-logs-<mkdtemp>/build.log.jsonl. Nothing is dropped from the system on a clean run.
  • In-memory is cache. A bounded global ring (PERCY_LOG_RING_SIZE, default 2000) serves fast sync query(). Per-snapshot buckets index the ring for efficient eviction after each snapshot's POST completes. Query reads the ring only (no double-counting).
  • Eviction is lifecycle-driven. One try/finally hook in the snapshot queue's task / error handler covers both discovery and BYOS (skipDiscovery) flows uniformly.
  • Redact-on-write. redactSecrets moved from @percy/core/utils into @percy/logger/redact; applied during the sanitize-in-place step at push time so secrets never hit disk OR the in-memory cache. Per-marker regex index cuts p50 from ~122 µs to ~3.8 µs on clean lines.
  • Streaming upload at stop(). sendBuildLogs() iterates readBack() (disk first, in-memory fallback for mid-build disk failures) and emits the same byte-compatible /logs payload.
  • Graceful degradation. Disk probe + runtime error paths transition to a bounded in-memory-only mode. PERCY_LOGS_IN_MEMORY=1 forces this mode as a rollback switch.
  • Cross-platform cleanup. Exit handlers (sync) + 24 h TTL orphan sweep with PID-alive check on POSIX. Windows refuses disk if tmpdir resolves under C:\Windows\Temp.

What changed

Package Summary
@percy/logger New: hybrid-log-store.js, redact.js, redact/extract-markers.js, safe-stringify.js, internal-utils.js, internal.js, orphan-cleanup.js, secret-patterns.json (relocated from core). logger.js uses HybridLogStore; adds reset, clearMemory, evictSnapshot, readBack, toArray. New ./internal and ./redact subexports. engines.node bumped to >=14.14.0.
@percy/core snapshot.js: try/finally eviction hook in task handler + safety net in error handler. percy.js: streaming sendBuildLogs() via readBack; redactSecrets import dropped (redaction now at write time). api.js: /test/api/reset uses logger.clearMemory(), /test/logs uses logger.toArray(). utils.js: re-exports redactSecrets from @percy/logger/redact for back-compat. secretPatterns.yml deleted.

Performance

Merge-gate benchmark (scripts/bench-logger.js):

PERF:  p50 3.8 µs, p99 15.7 µs, p999 25.6 µs   (budgets: 10 / 100 / 3000 µs)
MEM:   baseline 46.9 MB → max 64.9 MB → delta 18 MB   (budget 25 MB)

Clean log lines exit the redact path in ~4 µs. 10k-snapshot synthetic workload bounds RSS at 18 MB regardless of build length.

Test plan

Automated (local)

  • @percy/logger tests: 100/100 passing
  • @percy/core tests: 665/692 passing. All 27 remaining failures are pre-existing, not caused by this PR:
    • 26 × Cannot read properties of undefined (reading 'set')global.__MOCK_IMPORTS__.set in runDoctorOnFailure + Unit/Install/Chromium tests (test-infrastructure issue).
    • 1 × Expected ECONNREFUSED but got AggregateError — Node 20 behavior change for connection errors.
  • scripts/bench-logger.js (perf + memory gates) pass.

Local verification (no Percy API)

  • Push → disk round-trip; readBack yields all entries in order.
  • Redact-on-write: disk JSONL never contains raw AKIAIOSFODNN7EXAMPLE / xoxb-* tokens.
  • logger.reset() cleans spill dir and re-initializes disk.
  • PERCY_LOGS_IN_MEMORY=1 creates no tmpdir; all entries held in memory.
  • Orphan cleanup: 48h-old dir swept; fresh dir preserved; live-PID dir preserved; different-uid dir skipped.

Real Percy build

  • R1 web snapshot: Build #362 — snapshot uploaded, /logs POSTed, spill dir cleaned on exit.

Post-Deploy Monitoring & Validation

  • What to monitor
    • Logs: Percy CLI /logs uploads for the new logger:memory debug namespace entries (fallback-to-memory transitions, orphan-cleanup stats).
    • Metrics: customer CLI memory profiles in the memoryInfo debug lines already emitted at packages/core/src/percy.js:403-404 — compare pre/post rollout.
  • Validation queries (on a released beta)
    • grep 'logger:memory' <customer-log-upload> — should fire only on disk-unavailable runners.
    • Grep for AKIA, xoxb-, ghp_, arn:aws: in uploaded /logs payloads — should always find [REDACTED] instead.
  • Expected healthy behavior
    • 10k-snapshot builds keep RSS growth under ~25 MB.
    • No logger fell back to in-memory mode in /logs payloads for customer CI runners with writable tmpdirs.
  • Failure signal / rollback trigger
    • Customer reports of CLI OOM or hangs on stop → set PERCY_LOGS_IN_MEMORY=1 in their env to revert to legacy behavior.
    • Regression in /logs content completeness → rollback via the env var; investigate readBack fallback path.
  • Validation window & owner
    • Window: first beta release + 2 weeks in customer CI environments.
    • Owner: @pranav-zinzurde

Known follow-ups (not in this PR)

  • Wire scripts/bench-logger.js into .github/workflows/test.yml path-filtered (touches packages/logger/src/* or secret patterns).
  • Remaining non-PER-7809 test failures (module-mock infrastructure in runDoctorOnFailure/Install Chromium tests; Node 20 AggregateError in one API Server test).
  • App Percy BYOS + debug-heavy end-to-end scenarios (R2, R3, R4 from the scenario table).

🤖 Generated with Claude Opus 4.7 (1M context) via Claude Code

pranavz28 and others added 2 commits April 23, 2026 03:59
…R-7809)

Replace the unbounded messages = new Set() in @percy/logger with a disk-backed
hybrid store: append-only JSONL on tmpdir is the source of truth, a small
in-memory ring + per-snapshot hot buckets serve fast query() for in-process
consumers, and buckets are lifecycle-evicted after the snapshot's upload POST
completes. Memory residency is bounded by max-concurrency × per-snapshot log
volume regardless of total build size or deferUploads window depth.

Key decisions (see docs/brainstorms/ and docs/plans/ for full rationale):

 - Disk is source of truth; in-memory is cache. No entries are dropped from
   the system on a clean run. readBack() streams the full set for the final
   /logs upload via a single pipeline through zlib.createGzip.
 - Single eviction hook in snapshot.js task/error handlers (try/finally)
   covers both discovery and BYOS (skipDiscovery) flows uniformly.
 - redactSecrets relocated to @percy/logger/redact, with a load-once pattern
   set (YAML -> JSON), extracted literal markers for a per-pattern fast-path,
   and a V8-unioned marker regex. p50 redaction cost ~4µs on clean log lines.
 - Safe serialization via a single-pass JSON replacer that handles cycles,
   Error / Buffer / BigInt / Function / Symbol, and deep-redacts every string
   in one traversal (DPR-6).
 - checkForNoSnapshotCommandError switched from O(n) log-scan to O(1)
   sentinels owned by the Percy instance, set at the log call sites.
 - Non-public APIs (evictSnapshot, readBack, snapshotKey, toArray) behind a
   dedicated @percy/logger/internal subexport.
 - mkdtempSync for tmpdir (eliminates symlink-squat risk), chmod 0o700 on
   POSIX, Windows C:\Windows\Temp refusal, 24h TTL + PID-alive orphan sweep,
   exit handlers for secret-at-rest shrink, writer close with 2s timeout
   and maxRetries for Windows AV resilience.
 - engines.node bumped to >=14.14.0 for fs.promises.rm.

Scope: @percy/logger src (~9 new/changed modules), @percy/core integration
(snapshot.js, percy.js, api.js, utils.js), deleted packages/core/src/secretPatterns.yml.

Benchmarks (scripts/bench-logger.js, merge gate):
 - redactString p50 3.83 µs, p99 15.7 µs, p999 25.6 µs
 - 10k-snapshot workload RSS delta 20 MB (bounded; does not grow linearly
   with build size as the old implementation did)

🤖 Generated with Claude Opus 4.7 (1M context) via Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes uncovered by running yarn test against the full logger + core
test suites. Main corrections:

 - logger test helpers: sync hard-reset path matches pre-PER-7809
   behavior. The async dispose() introduced a gap where logs emitted
   between ring-clear and singleton-delete landed on the detached
   instance and became invisible to later query() calls. Now
   clearMemory() runs synchronously, singleton is detached in the same
   tick, and the old instance's disk writer/spill dir are reclaimed
   lazily via Promise.resolve().then(dispose).

 - checkForNoSnapshotCommandError: reverted from per-instance sentinel
   flags back to log-scan over logger.query(). With the ring serving
   every routed entry (including snapshot-tagged ones whose buckets
   have been evicted), the scan matches pre-refactor behavior. This
   also restores the correct interaction with logger.reset(): clearing
   the ring correctly silences the "no snapshot command" suggestion.

 - HybridLogStore.#routeInMemory: every entry now appends to BOTH the
   ring AND its per-snapshot bucket. The ring is the single source of
   truth for query(); buckets are an eviction index only. Fixes tests
   that queried per-snapshot logs after percy.idle() (post-POST
   eviction used to return empty).

 - HybridLogStore.push: sanitize-in-place so the in-memory cache also
   holds redacted strings (matches DPR-6 end-to-end: query() and
   toArray() can never expose a raw secret).

 - HybridLogStore process exit handlers: moved from per-instance
   registration to a module-level activeStores registry with a single
   registration. Fixes "MaxListeners exceeded" after ~4 tests.

 - HybridLogStore.dispose(): terminal teardown that clears memory,
   closes writer, removes spill dir, deregisters from the exit
   registry. Distinct from reset() which re-initializes disk.

 - HybridLogStore.clearMemory(): sync in-memory wipe for the test
   endpoint /test/api/reset and the outer beforeEach path that wants
   to discard accumulated log state without tearing down the writer.

 - HybridLogStore.readBack(): fall back to in-memory entries when the
   disk file was unlinked mid-build (DPR-20: systemd-tmpfiles unlink
   while writer fd is still open).

 - extract-markers noise list: added 'https' and 'token' (were
   slipping through) and tuned the 3-char threshold so real patterns
   like 'sms', 'AKC', 'arn', 'aws', 'gov' qualify as anchors. Anchored
   coverage rose from 90.4% to 94.3% of the 1,753 secret patterns.

 - redact per-marker pattern index: MARKER_UNION uses /g so we can
   iterate every marker hit and run only the patterns indexed under
   those specific markers instead of all 1,654 anchored patterns.
   Brings p50 redaction cost from ~122µs to ~3.8µs on clean lines.

 - logger fallback warning: emit as debug entry on logger:memory
   namespace instead of stderr warn so tests that assert on exact
   stdio contents don't flake. The entry still rides in the /logs
   upload for field diagnosis.

 - safeReplacer: detect Buffer as { type, data } shape (what toJSON
   yields before the replacer runs) and convert to { type, base64 }.

 - HybridLogStore.dispose() writes a pid file into the spill dir on
   init so the orphan sweep skips dirs whose pid is still live.

Net test result: 100/100 logger specs pass. Core down from 432 to 27
failures, all 27 are pre-existing:
  - 26 use global.__MOCK_IMPORTS__.set (runDoctorOnFailure, Install
    Chromium) — a test-infrastructure issue unrelated to PER-7809.
  - 1 expects ECONNREFUSED but Node 20 wraps in AggregateError.

Local verification (scripts: L1-L5, R1):
  - push → disk round-trip ✓
  - redact-on-write (no raw secrets on disk) ✓
  - spill-dir cleanup on reset ✓
  - PERCY_LOGS_IN_MEMORY forces in-memory ✓
  - orphan sweep: mtime + PID-alive + uid on POSIX ✓
  - real Percy web build: Build #362 uploaded, snapshot + /logs POSTed ✓

Benchmarks still pass: p50 3.8µs, p99 15.7µs, 10k-snapshot RSS delta 18MB.

🤖 Generated with Claude Opus 4.7 (1M context) via Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pranavz28 pranavz28 requested a review from a team as a code owner April 23, 2026 05:25
Comment thread packages/logger/test/redact.test.js Fixed
Comment thread packages/logger/test/redact.test.js Fixed
Comment thread packages/logger/test/redact.test.js Fixed
Comment thread packages/logger/src/orphan-cleanup.js Fixed
Comment thread packages/logger/src/orphan-cleanup.js Fixed
Comment thread packages/logger/test/orphan-cleanup.test.js Fixed
Comment thread packages/logger/test/orphan-cleanup.test.js Fixed
Comment thread packages/logger/src/orphan-cleanup.js Fixed
Comment thread packages/logger/src/orphan-cleanup.js Fixed
Comment thread packages/logger/src/orphan-cleanup.js Fixed
…ments

Merges the hybrid log store into fewer, more cohesive modules so the PR
reflects the minimum surface area needed for the feature.

- Inline safe-stringify, orphan-cleanup, internal-utils, and extract-markers
  into hybrid-log-store.js and redact.js where they belong.
- Drop the merge-gate bench script and its 1k-line corpus fixture.
- Collapse five test files into hybrid-log-store.test.js and redact.test.js.
- Strip narrative / DPR cross-reference comments; keep WHY-only notes.
- Revise two timing tests to use callFake instead of a finite returnValues
  sequence, which is fragile when sanitizeMeta adds extra Date.now() calls
  under coverage instrumentation.

describe('redactSecrets back-compat', () => {
it('redacts a string', () => {
expect(redactSecrets('secret: ASIAY34FZKBOKMUTVV7A'))
});

it('redacts object.message', () => {
expect(redactSecrets({ message: 'secret: ASIAY34FZKBOKMUTVV7A' }))
});

it('maps over arrays', () => {
expect(redactSecrets([{ message: 'secret: ASIAY34FZKBOKMUTVV7A' }]))

for (const de of entries) {
if (!de.isDirectory() || !de.name.startsWith(DIR_PREFIX)) continue;
const full = path.join(tmpdir, de.name);

for (const de of entries) {
if (!de.isDirectory() || !de.name.startsWith(DIR_PREFIX)) continue;
const full = path.join(tmpdir, de.name);
let total = 0;
try {
for (const de of await fsp.readdir(p, { withFileTypes: true })) {
const full = path.join(p, de.name);
let total = 0;
try {
for (const de of await fsp.readdir(p, { withFileTypes: true })) {
const full = path.join(p, de.name);
});

async function mkSpillDir(name, { mtime, pid, withPidFile = true } = {}) {
const dir = path.join(base, name);
async function mkSpillDir(name, { mtime, pid, withPidFile = true } = {}) {
const dir = path.join(base, name);
await fsp.mkdir(dir, { recursive: true });
await fsp.writeFile(path.join(dir, 'build.log.jsonl'), 'x'.repeat(100));
const dir = path.join(base, name);
await fsp.mkdir(dir, { recursive: true });
await fsp.writeFile(path.join(dir, 'build.log.jsonl'), 'x'.repeat(100));
if (withPidFile) await fsp.writeFile(path.join(dir, 'pid'), String(pid ?? 999999999));
… and init failure

- Introduce a createRedactor(patterns) factory so empty/partial pattern sets
  can be exercised directly by tests, covering the redactString branches
  that the always-populated module-level default never reaches.
- Add tests for named capture groups, unterminated (?<name) guards, and
  redactSecrets on objects without a message.
- Cover HybridLogStore.readBack fallback when the spill file is unlinked,
  and the initDisk failure path by pointing TMPDIR at a nonexistent dir
  (exercises the transitionToMemory flow on construction).
- Mark the remaining paths as istanbul-ignored with a justification per
  branch: Windows-only tmpdir refusal, kernel backpressure beyond the
  1 MB WriteStream buffer, WriteStream.write() sync throw (not a real
  failure mode), and the closeWriter 2s destroy race (AV-scanner hang).
- Drop the dead __resetOrphanSweepGuard test hook.
…real test

Replaces the istanbul-ignore comments with small pure functions so the
previously untestable branches can be exercised directly:

- isUnsafeWindowsTmpdir(tmp, isWindows) — exposes the Windows-only path
  check with an injected platform flag. Tests cover both the Windows and
  non-Windows sides without running on Windows.
- tryDiskWrite(writer, entry, opts) — factors the backpressure / write /
  sync-throw logic out of the #writeDisk private method. Tests drive it
  with a fake writer that can simulate over-cap buffers, false returns
  from write(), and synchronous exceptions.
- raceEndAgainstDestroy(writer, timeoutMs) — factors the Windows AV
  scanner hang workaround. Tests cover the fast-resolve path, the
  timeout-destroy path, and destroy() throwing during the timeout.

No behaviour change; #writeDisk and #closeWriter just delegate now.

for (const p of rawPatterns) {
let re;
try { re = new RegExp(p.pattern.regex, 'g'); } catch (_) { continue; }
…ht ENOENT

On Linux (CI runs Node 14.21.3), createReadStream emits 'error'
asynchronously when the spill file has been unlinked. Without an
error listener on the readable, the event escapes as an uncaught
exception before the async iterator's catch block attaches. Attach
a no-op listener on the stream so the error routes through the
iterator rejection instead, which is already handled by the
outer try/catch that flips diskFailed.
…rror

Branch-only regression: my log-scan added 'Snapshot found' to the list
of markers indicating a snapshot was processed. In --dry-run mode the
CLI logs 'Snapshot found: <name>' for each discovered snapshot, which
my check then treated as success and suppressed the suggestion output.

Published CLI intentionally fires the 'Snapshot command was not called'
suggestion in dry-run because no real snapshot was taken. Revert the
check to match master: only 'Snapshot taken' and 'Snapshot uploaded'
count as snapshot-taken markers.

Caught during the end-to-end log parity matrix (scenario: --dry-run).
Linux Node 14 can leave the readline iterator hanging when the
underlying readStream emits ENOENT asynchronously (spill file unlinked
while the fd check is pending). Check access() up-front so the disk
branch is skipped cleanly and the in-memory fallback path runs
immediately.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants