Skip to content

brotli-compress .socket.facts.json on upload#1291

Draft
Benjamin Barslev Nielsen (barslev) wants to merge 2 commits intov1.xfrom
barslev/compress-socket-facts-file
Draft

brotli-compress .socket.facts.json on upload#1291
Benjamin Barslev Nielsen (barslev) wants to merge 2 commits intov1.xfrom
barslev/compress-socket-facts-file

Conversation

@barslev
Copy link
Copy Markdown
Contributor

@barslev Benjamin Barslev Nielsen (barslev) commented Apr 30, 2026

Summary

Brotli-compress .socket.facts.json at upload time. Coana keeps writing plain JSON; only the wire bytes between socket-cli and depscan change. ~85% reduction on the simple-npm fixture; production mono-repo payloads see 60+ MB → ~10 MB.

How

  • New compressSocketFactsForUpload(scanPaths) in src/utils/coana.mts — async + streaming (createReadStream → createBrotliCompress → createWriteStream via pipeline). For any .socket.facts.json in scanPaths, writes a sibling .socket.facts.json.br next to the source file. Returns swapped paths + cleanup callback.
  • handle-create-new-scan.mts wraps fetchCreateOrgFullScan in try { compress; upload; } finally { cleanup; }.

Why sibling-write (not os.tmpdir(), not a temp subdir)

The SDK passes path.relative(cwd, brPath) as the multipart name. depscan's addStreamEntry rejects entries with .. traversal — a tmpdir path resolves to ../../../var/folders/... and gets dropped to unmatchedFiles. Sibling-write keeps the relative path inside cwd, and keeps directory shape symmetric with the plain .socket.facts.json upload (depscan strips the .br suffix at ingest, so <dir>/.socket.facts.json.br and <dir>/.socket.facts.json resolve to the same storage path).

Concurrent scans against the same source dir already collide on .socket.facts.json itself (coana writes to a single path), so the sibling .br doesn't introduce a new race.

Coordination

Requires the matching depscan multipart-decompress patch on the receiving side; land that first.

Tests

  • 16 unit cases in src/utils/coana.test.mts (sibling-path round-trip, passthrough, idempotent cleanup that leaves the source intact, plus the existing extractTier1ReachabilityScanId / extractReachabilityErrors matrix).
  • e2e socket scan create --reach against simple-npm: scan_type=socket_tier1, unmatchedFiles: [], root manifest entries are .socket.facts.json and package.json (storage path matches the plain upload — dir-preservation symmetry holds).

Compress `.socket.facts.json` with brotli at upload time, just before
`fetchCreateOrgFullScan` POSTs the multipart `/full-scans` body. Coana
keeps writing plain JSON; the local readers
(`extractTier1ReachabilityScanId`, `extractReachabilityErrors`) keep
reading plain JSON; only the on-the-wire bytes between socket-cli and
depscan change. depscan transparently decompresses at the api-v0
multipart ingest boundary in a coordinated change.

Why:
* Mono-repo `.socket.facts.json` files routinely exceed 60MB. On a
  representative simple-npm fixture, brotli compresses 150,748 bytes
  to 19,971 (~86.8% reduction); production mono-repos see a much
  larger absolute saving.
* Coana has a second consumer downstream - teaching coana to write
  `.br` would break it. Brotli on the wire-only path keeps coana's
  contract invariant.
* `tier1ReachabilityScanId` and reachability-error reporting still
  read plain JSON locally; no brotli round-trip on those paths.
* Compression is a transport detail owned by the upload site; cleanup
  is one `finally` block.

Implementation:
* `src/utils/coana.mts` - new exported
  `compressSocketFactsForUpload(scanPaths)` returning `{ paths, cleanup }`.
  Per `.socket.facts.json` entry, `mkdtempSync` a fresh `.socket-br-XXX/`
  directory inside the source's parent dir (NOT under `os.tmpdir()` -
  see below), `brotliCompressSync` bytes to
  `${td}/.socket.facts.json.br`, swap the path. Other paths pass through.
  Missing facts paths pass through. Cleanup is idempotent with
  `force: true`.
* `src/commands/scan/handle-create-new-scan.mts` - wraps the
  `fetchCreateOrgFullScan` call in
  `try { compress; upload; } finally { cleanup(); }`. Cleanup runs on
  success, throw, or any abort path.

Why the temp lives next to the source:

The SDK computes `path.relative(cwd, brPath)` for the multipart name
field. depscan's multipart ingest (`addStreamEntry`) checks
`containsTraversal(unixified)` and bails on any `..` segment. A
tmpdir-based path resolves to `../../../var/folders/...`, gets dropped
to `unmatchedFiles`, and the SocketFacts content never lands in the
scan. Putting the temp inside `path.dirname(originalFactsPath)`
produces a relative path like `.socket-br-XXX/.socket.facts.json.br` -
traversal-free, ingests cleanly.

Tests:
* `src/utils/coana.test.mts` - 16 cases.
  - `compressSocketFactsForUpload` x 5: round-trip JSON via
    `brotliDecompressSync`, basename swap to `.socket.facts.json.br`,
    non-facts paths pass through, missing facts paths pass through,
    cleanup idempotent, mixed-entry order. Pins the contract that the
    temp lives in a subdir of the source's parent (traversal-free).
  - `extractTier1ReachabilityScanId` x 7: plain JSON, missing file,
    missing field, null, empty/blank, trim, numeric coercion.
  - `extractReachabilityErrors` x 4: extraction, missing file,
    no-components, no-inner-reachability.

This change requires the matching depscan multipart decompression
patch on the receiving side; that change ships first.
@jdalton
Copy link
Copy Markdown
Contributor

This is rad!

@barslev Benjamin Barslev Nielsen (barslev) changed the title refactor: brotli-compress .socket.facts.json on upload brotli-compress .socket.facts.json on upload May 1, 2026
Previous form wrapped each `.br` in a per-scan `mkdtempSync` subdir
under the source dir for concurrency isolation. That created a
directory-handling asymmetry on depscan's side: a wire path of
`dirA/.socket.facts.json.br` flattened to `.socket.facts.json` at
root via depscan's basename-strip, while plain `dirA/.socket.facts.json`
preserved the dir.

depscan dropped the basename-strip in the corresponding PR; switch
to writing `.br` as a sibling of the source so wire and storage
paths match for both branches. Net effect: a brotli upload from
`<source>/.socket.facts.json` lands at `<source>/.socket.facts.json`
in the manifest tarball - identical to plain.

Concurrent-scan note: coana already writes to a single source path,
so the source `.socket.facts.json` itself is racy under concurrent
runs against the same dir; the sibling `.br` doesn't introduce a
new race that wasn't already there.

* `src/utils/coana.mts`: `compressSocketFactsForUpload` writes
  `${p}.br` next to the source; `cleanup()` does
  `rm(brPath, { force: true })` per file. Drops `mkdtemp` import
  that's no longer used.
* `src/utils/coana.test.mts`: directory-shape assertion replaced
  with `swappedPath === ${input}.br` (sibling). First test now
  also asserts the source survives `cleanup()`. 16 cases.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants