Add lightweight raincloud loader, publish CLI, and wheel-installable build pipeline #15

Merged

mprammer merged 33 commits into develop from mp/datasets-loader on May 29, 2026

Contributor

mprammer commented May 29, 2026

This branch adds an importable raincloud package: raincloud.load("<slug>") returns a lazy Dataset over an already-prepared Parquet/Vortex artifact, resolving local cache → mirror → local build. The catalog is therefore usable from a lightweight pip install "raincloud @ git+https://github.com/spiraldb/raincloud" instead of a full checkout. Two additions come alongside it: a scripts.pipeline.publish CLI that syncs built artifacts to a mirror, gated on the snapshot's recorded sha256, and a configurable data-area root (scripts/pipeline/spec.py:data_root()) that lets the build pipeline run from a wheel install with no checkout, writing under ~/.cache/raincloud.
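The cache → mirror → build order described above can be read as a chain of fallbacks. The sketch below illustrates that pattern only; it is not the loader's code. The `resolve` function, the `Source` alias, and the three stand-in sources are hypothetical names chosen for the example, and the paths they return are made up.

```python
from pathlib import Path
from typing import Callable, Optional

# Hypothetical shape: each source returns a local path, or None on a miss.
Source = Callable[[str], Optional[Path]]


def resolve(slug: str, sources: list[tuple[str, Source]]) -> tuple[str, Path]:
    """Try each (name, source) pair in order and return the first hit."""
    for name, source in sources:
        path = source(slug)
        if path is not None:
            return name, path
    raise LookupError(f"no source could provide {slug!r}")


# Stand-in sources, for illustration only.
def from_cache(slug: str) -> Optional[Path]:
    return None  # pretend the local cache is empty


def from_mirror(slug: str) -> Optional[Path]:
    return Path("/tmp/mirror") / f"{slug}.parquet"  # pretend the mirror has it


def from_build(slug: str) -> Optional[Path]:
    return Path("/tmp/build") / f"{slug}.parquet"  # last resort: local build


ORDER = [("cache", from_cache), ("mirror", from_mirror), ("build", from_build)]
origin, path = resolve("wine", ORDER)
print(origin, path.name)  # mirror wine.parquet
```

Keeping the order as data makes it easy to test each tier in isolation, for example by passing a list that contains only the cache source.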

The install is now layered, which is a breaking change: a bare install pulls only the loader dependencies (pyarrow, numpy, vortex-data, fsspec), and the heavy build toolchain (duckdb, osmium, pyreadstat, …) moved behind the [build] extra, so build/handler work needs uv sync --extra build. Artifact integrity is size- and sha256-based with a default warn-and-adopt policy (set RAINCLOUD_STRICT_CHECKSUM=1 for a hard gate), and locally built artifacts are trusted by provenance rather than rebuilt on every load. Per-area detail (env vars, resolution order, integrity semantics) is in the CHANGELOG 0.2.0 entry.
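To make the warn-and-adopt versus strict distinction concrete, here is a minimal sketch of a checksum gate. Only the RAINCLOUD_STRICT_CHECKSUM variable comes from the description; `sha256_of`, `check_artifact`, the choice of `ValueError`, and the handling of a missing checksum are assumptions for illustration and may differ from the actual loader (which also uses byte size and provenance pins).

```python
import hashlib
import os
import warnings
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through sha256 so large artifacts are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_artifact(path: Path, expected_sha256: Optional[str]) -> bool:
    """Return True if the artifact may be adopted (hypothetical helper).

    Default policy: warn on mismatch and adopt anyway.
    Strict policy (RAINCLOUD_STRICT_CHECKSUM=1): raise on mismatch.
    """
    if expected_sha256 is None:
        # Assumption: nothing recorded, so there is nothing to compare against.
        return True
    if sha256_of(path) == expected_sha256:
        return True
    if os.environ.get("RAINCLOUD_STRICT_CHECKSUM") == "1":
        raise ValueError(f"sha256 mismatch for {path}")
    warnings.warn(f"sha256 mismatch for {path}; adopting anyway")
    return True
```

Reading the environment variable inside the function, rather than at import time, lets a caller switch policy per process without reloading the module.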

🤖 Generated with Claude Code

mprammer and others added 30 commits

May 27, 2026 13:34


          feat(loader): package skeleton, exceptions, lightweight packaging

418a055

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          feat(loader): catalog module reading snapshot + manifest

3a7d57e

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          feat(loader): checksum-gated local cache with atomic adopt

aee0652

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          feat(loader): fsspec transport-only fetch

1a6c0be

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          feat(loader): cache->mirror->build resolver

04abeed

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          feat(loader): lazy Dataset handle and load()/load_dataset

f9a0189

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          feat(docs): carry parquet/vortex sha256 in snapshot for the loader

10ea912

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          chore(loader): add SPDX headers to new loader files

8b1f255

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          feat(publish): mirror-sync CLI gated on snapshot sha256

11225fb

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          test(loader): hermetic file:// mirror e2e; document loader usage

a8175a0

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          docs(loader): document loader + publish, layered install, 0.2.0 changelog

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          fix(loader): clean BuildToolingMissing on loader-only install; doc + forward-compat fixups

12a6c66

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          docs(loader): install from GitHub, not PyPI (name squatted)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          feat(pipeline): configurable data_root() for wheel builds

48beac7

Adds data_root() (RAINCLOUD_HOME -> checkout -> ~/.cache/raincloud) plus
outputs_base/outputs_root/raw_downloads_root/workdir_root/display_path and a
packaged-manifest fallback in load_manifest. Removes DEFAULT_MANIFEST. Checkout
behavior is unchanged (data_root()==REPO_ROOT when sources.json is present).
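The resolution order in this commit message (RAINCLOUD_HOME, then checkout, then ~/.cache/raincloud) might look roughly like the sketch below. It is an illustration under assumptions: the real data_root() in spec.py appears to take no arguments, whereas this version accepts a hypothetical `repo_root` parameter so the example is self-contained.

```python
import os
from pathlib import Path
from typing import Optional


def data_root(repo_root: Optional[Path] = None) -> Path:
    """Sketch of the RAINCLOUD_HOME -> checkout -> ~/.cache/raincloud order."""
    override = os.environ.get("RAINCLOUD_HOME")
    if override:
        # An explicit override always wins.
        return Path(override).expanduser()
    # Per the commit message, a checkout is recognised by sources.json being present.
    if repo_root is not None and (repo_root / "sources.json").is_file():
        return repo_root
    # Wheel install with no checkout: fall back to the user cache directory.
    return Path.home() / ".cache" / "raincloud"
```

With no override set and a `repo_root` that contains sources.json, the function returns `repo_root`, which matches the stated "checkout behavior is unchanged" guarantee.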

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          feat(pipeline): root fetch/extract/build/status on data_root() helpers

734f91f

Remove import-time ORIGINALS_DIR (fetch.py) and WORKDIR (extract.py)
constants; route all callers through spec.raw_downloads_root() /
spec.workdir_root() so the paths honor RAINCLOUD_RAW_DOWNLOADS and
RAINCLOUD_WORKDIR env-vars at call time. Replace .relative_to(REPO_ROOT)
log calls with display_path() throughout. Update status.py to import
raw_downloads_root/workdir_root directly from spec; update build.py
clean-workdir block likewise. Redirect the fetch test monkeypatch to the
env-var approach and add a new test confirming both slug helpers honor
the redirected roots.
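The two changes in this commit (call-time root helpers and display_path() for logging) can be illustrated with a short sketch. The signatures are assumptions: the real helpers in spec.py likely take no `base` or `root` argument, and the default `base / "workdir"` location is made up for the example. The relevant fact is that `Path.relative_to` raises `ValueError` when the path is not under the given root, which is what a wheel build writing outside the repository would hit.

```python
import os
from pathlib import Path


def workdir_root(base: Path) -> Path:
    """Read RAINCLOUD_WORKDIR on every call so tests can redirect it via the env."""
    override = os.environ.get("RAINCLOUD_WORKDIR")
    # Assumed default location; the real default may differ.
    return Path(override) if override else base / "workdir"


def display_path(path: Path, root: Path) -> str:
    """Return path relative to root when possible, else the path unchanged."""
    try:
        return str(path.relative_to(root))
    except ValueError:
        # path lies outside root, e.g. a wheel install writing under ~/.cache
        return str(path)
```

Because the environment is consulted at call time, a test can set RAINCLOUD_WORKDIR with a monkeypatch and see the new root immediately, which an import-time constant would not allow.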

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          fix(pipeline): use display_path for logs so wheel builds don't ValueError

ff3f68b

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          fix(handlers): use display_path for output logs (wheel-build safe)

4f8b06b

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          fix(handlers): root scratch workdir on workdir_root() (wheel-safe)

11ef84e

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          feat(pipeline): ship + resolve sources.schema.json for wheel installs

86da060

Add sources.schema.json to the wheel's force-include so it lands in
raincloud/_data/ alongside sources.json and snapshot.json. Replace the
module-level SCHEMA_PATH constant in validate_manifest with a _schema_path()
resolver that prefers the checkout copy and falls back to _packaged_data(),
matching the pattern established for load_manifest in spec.py.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          docs(pipeline): document data-area env vars; note wheel build-fallback

5d23066

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          fix(loader): expose formats from manifest+snapshot union (seamless fallback)

cded8a8

Catalog.entry() now derives parquet from any manifest slug and vortex from
convert.vortex|snapshot, with checksums None for never-snapshotted slugs.
Removes the snapshot-gating that blocked load() of buildable-but-not-yet-
snapshotted manifest slugs.
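The union described here can be sketched as follows. The data shapes are assumptions made for the example: a manifest mapping each slug to a dict with an optional `convert.vortex` flag, and a snapshot mapping each slug to hypothetical `parquet_sha256` / `vortex_sha256` fields. The function name `entry_formats` is likewise illustrative and is not Catalog.entry() itself.

```python
from typing import Any, Optional


def entry_formats(
    slug: str,
    manifest: dict[str, Any],
    snapshot: dict[str, Any],
) -> dict[str, Optional[str]]:
    """Map each available format to its recorded sha256, or None if unknown."""
    if slug not in manifest:
        raise KeyError(slug)
    snap = snapshot.get(slug, {})  # never-snapshotted slugs get an empty record
    # Parquet is available for any manifest slug; its checksum may be unknown.
    formats: dict[str, Optional[str]] = {"parquet": snap.get("parquet_sha256")}
    # Vortex is exposed if the manifest requests it or the snapshot recorded it.
    wants_vortex = bool(manifest[slug].get("convert", {}).get("vortex"))
    if wants_vortex or "vortex_sha256" in snap:
        formats["vortex"] = snap.get("vortex_sha256")
    return formats


manifest = {"wine": {"convert": {"vortex": True}}, "synth": {}}
snapshot = {"wine": {"parquet_sha256": "aa", "vortex_sha256": "bb"}}
print(entry_formats("wine", manifest, snapshot))   # {'parquet': 'aa', 'vortex': 'bb'}
print(entry_formats("synth", manifest, snapshot))  # {'parquet': None}
```

A `None` checksum signals to the caller that the slug is buildable but has no recorded digest to verify against, rather than blocking the load.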

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          test: --run-wheel/--run-network gates + markers; CI lint-and-test --extra build

956a559

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          test(wheel): session-built wheel + venv helpers + smoke test

da9bd6d

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          test(wheel): install-tier regression guard (base lightweight, [build] heavy, extras)

f9262a1

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          test(wheel): loader API matrix against installed wheel (happy + errors + scan + to_pandas)

1f7b551

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          test(wheel): build-proof via load('synth') end-to-end in [build] venv

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          test(pipeline): always-on hermetic e2e build via build.run_one + handler smoke

68dcbab

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          test(pipeline): real network build via load(slug) across 3 tiny HTTP slugs

45bd111

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          ci: add wheel (blocking) + realbuild (non-blocking) jobs

0b15738

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          test(pipeline): pin RAINCLOUD_HOME hermeticity in real-build test

886f6f0

Adds an explicit on-disk assertion that the build subprocess wrote the
parquet under tmp_path/home/outputs/v1/<slug>/, not the repo's outputs/.
Without it, a regression that stopped honoring RAINCLOUD_HOME would let
the build silently overwrite tracked repo state while the row-count
assertion still passed (the loader cache would still adopt the wrong-
location artifact). Final-review follow-up.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          feat(loader): runnable examples, load/publish skills, integrity + doc hardening

5418f35

Completes the uncommitted tail of the datasets-loader work and folds in two
rounds of adversarial review of the loader / publish / wheel-build feature:

- Runnable examples (use_loader, kepler, nyc_taxi, olympic, wine) and the
  raincloud-load / raincloud-publish agent skills.
- Cache integrity: snapshot byte size as a corruption check for sha-less
  slugs (origin-gated); strict mode trusts a locally-built artifact via its
  origin=build provenance pin instead of rebuilding every load, rebuilding
  only when the snapshot pin changes.
- publish: temp-key + rename upload (no truncated object at the canonical key
  on a crash); env-aware snapshot via spec.default_snapshot(); reuse the
  loader's sha256_file / artifact_key.
- Typed errors (BuildFailed; missing-[build]-extra vs broken-toolchain
  message); read_pin rejects non-dict JSON; atomic pin writes.
- Tests: strict-build serve-from-cache + snapshot-revision staleness, sha-less
  size/pin paths, cache_root default/expanduser, publish atomic upload, path
  hermeticity; autouse loader-cache isolation; RAINCLOUD_* scrub in subprocess
  tests.
- Docs: accurate integrity/strict semantics (README/AGENTS/SKILLS/CHANGELOG),
  examples install via git+https, skills index + count (21).
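The "atomic pin writes" and "read_pin rejects non-dict JSON" items above follow a common pattern: write to a temporary file in the same directory, then rename it over the target. The sketch below shows that pattern on a local filesystem; `write_pin` and this `read_pin` body are illustrative assumptions, not the project's code. The publish CLI's temp-key + rename upload applies the same idea to a mirror, where the rename semantics depend on the storage backend.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def write_pin(path: Path, pin: dict[str, Any]) -> None:
    """Write JSON to a temp file next to path, then rename it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(pin, fh)
            fh.flush()
            os.fsync(fh.fileno())  # make sure the bytes are on disk before the rename
        os.replace(tmp_name, path)  # atomic when both paths are on the same filesystem
    except BaseException:
        # On failure, remove the temp file so no partial file is left behind.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise


def read_pin(path: Path) -> dict[str, Any]:
    """Load a pin and reject anything that is not a JSON object."""
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError(f"pin at {path} is not a JSON object")
    return data
```

A reader therefore sees either the old pin or the complete new one, never a truncated file, and a pin containing a list or a bare string is rejected instead of being misread.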

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>

github-advanced-security AI found potential problems

.github/workflows/ci.yml Fixed

.github/workflows/ci.yml Fixed

mprammer marked this pull request as ready for review

May 29, 2026 19:45

mprammer and others added 2 commits

May 29, 2026 15:55


          ci: scope wheel job to test_wheel.py + restrict GITHUB_TOKEN to contents:read

882a931

Two CI fixes for this branch's own workflow:

- wheel job: it syncs --extra dev only (by design — it tests the wheel's
  layered install in throwaway venvs), but pytest imports every collected
  module before applying -m wheel, and tests/test_manifest.py +
  tests/test_profile.py import jsonschema (a [build] dep) at module top, so
  collection ModuleNotFoundError'd before any wheel test ran. All wheel-marked
  tests live in tests/test_wheel.py; collect only that file. Reproduced and
  fixed against a dev-only venv: full-tree collect errors on the two modules,
  scoped collect yields 13 tests / exit 0.

- permissions: add a top-level `contents: read` block. Every job is read-only
  (checkout + deps + lint/tests); none writes to the repo. Resolves the two
  CodeQL "workflow does not contain permissions" findings on this PR.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>


          docs(changelog): stamp 0.2.0 release date

dd37ce3

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>

mprammer merged commit f3b4136 into develop

7 checks passed

mprammer deleted the mp/datasets-loader branch

May 29, 2026 20:08
