Add lightweight raincloud loader, publish CLI, and wheel-installable build pipeline#15

Merged
mprammer merged 33 commits into develop from mp/datasets-loader
May 29, 2026
@mprammer
Contributor

This branch adds an importable `raincloud` package: `raincloud.load("<slug>")` returns a lazy `Dataset` over an already-prepared Parquet/Vortex artifact, resolving it in the order local cache → mirror → local build. The catalog is therefore usable from a lightweight `pip install "raincloud @ git+https://github.com/spiraldb/raincloud"` instead of a full checkout. Alongside it come two supporting changes: a `scripts.pipeline.publish` CLI that syncs built artifacts to a mirror, gated on the snapshot's recorded sha256, and a configurable data-area root (`scripts/pipeline/spec.py:data_root()`) so the build pipeline can run from a wheel install with no checkout, writing under `~/.cache/raincloud`.
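The resolution order described above (local cache, then mirror, then local build) can be pictured as a first-hit chain of lookups. The following is only a minimal sketch under stated assumptions: `resolve`, `Source`, the three stand-in lookups, and the `<slug>.parquet` cache layout are all hypothetical names for illustration and are not taken from raincloud's code.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from typing import Callable, Optional

# Every name here is hypothetical; this is not raincloud's actual code.
Source = Callable[[str], Optional[Path]]


def resolve(slug: str, sources: list[tuple[str, Source]]) -> tuple[str, Path]:
    """Return (source_name, path) from the first source that has the artifact."""
    for name, lookup in sources:
        path = lookup(slug)
        if path is not None:
            return name, path
    raise FileNotFoundError(f"no source could provide {slug!r}")


def demo() -> list[str]:
    """Resolve the same slug twice and report which source served each call."""
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp)

        def from_cache(slug: str) -> Optional[Path]:
            candidate = cache / f"{slug}.parquet"
            return candidate if candidate.exists() else None

        def from_mirror(slug: str) -> Optional[Path]:
            return None  # stand-in for a mirror that lacks this artifact

        def from_build(slug: str) -> Optional[Path]:
            target = cache / f"{slug}.parquet"
            target.write_bytes(b"placeholder bytes, not real Parquet")
            return target

        sources = [("cache", from_cache), ("mirror", from_mirror), ("build", from_build)]
        return [resolve("wine", sources)[0] for _ in range(2)]
```

In this sketch the first call falls through to the build step, which populates the cache, so the second call is served from the cache without touching the mirror or the build.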

The install is now layered, which is a breaking change: a bare install pulls only the loader dependencies (`pyarrow`, `numpy`, `vortex-data`, `fsspec`), and the heavy build toolchain (`duckdb`, `osmium`, `pyreadstat`, …) moved behind the `[build]` extra, so build/handler work needs `uv sync --extra build`. Artifact integrity is size- and sha256-based with a default warn-and-adopt policy (`RAINCLOUD_STRICT_CHECKSUM=1` for a hard gate), and locally built artifacts are trusted by provenance rather than rebuilt on every load. Per-area detail (env vars, resolution order, integrity semantics) is in the CHANGELOG 0.2.0 entry.
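To make the warn-and-adopt versus hard-gate distinction concrete, here is a rough sketch of a size- and sha256-based check. Only the `RAINCLOUD_STRICT_CHECKSUM=1` switch comes from the description above; `sha256_of`, `verify`, their signatures, and the choice of `ValueError` are assumptions for illustration, and the real loader may structure this differently.

```python
import hashlib
import os
import warnings
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, size: Optional[int], sha256: Optional[str]) -> bool:
    """Return True on a match; on a mismatch, raise in strict mode, else warn."""
    problems = []
    if size is not None and path.stat().st_size != size:
        problems.append("size")
    if sha256 is not None and sha256_of(path) != sha256:
        problems.append("sha256")
    if not problems:
        return True
    message = f"{path.name}: mismatch on {', '.join(problems)}"
    if os.environ.get("RAINCLOUD_STRICT_CHECKSUM") == "1":
        raise ValueError(message)  # hypothetical error type for the hard gate
    warnings.warn(message)  # warn-and-adopt: the caller keeps using the artifact
    return False
```

Passing `None` for either expectation skips that check, which mirrors the idea that a slug without a recorded sha256 can still be checked by size alone.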

🤖 Generated with Claude Code

mprammer and others added 30 commits May 27, 2026 13:34
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…elog

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…forward-compat fixups

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Adds data_root() (RAINCLOUD_HOME -> checkout -> ~/.cache/raincloud) plus
outputs_base/outputs_root/raw_downloads_root/workdir_root/display_path and a
packaged-manifest fallback in load_manifest. Removes DEFAULT_MANIFEST. Checkout
behavior is unchanged (data_root()==REPO_ROOT when sources.json is present).
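
The precedence named in this commit message (RAINCLOUD_HOME, then the checkout, then ~/.cache/raincloud) could look roughly like the sketch below. The `checkout_dir` and `env` parameters are assumptions added so the example is self-contained and testable; the real `data_root()` in spec.py presumably takes no arguments and finds the checkout itself.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def data_root(checkout_dir: Path, env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the data-area root: env override, then checkout, then user cache."""
    env = os.environ if env is None else env
    override = env.get("RAINCLOUD_HOME")
    if override:
        return Path(override).expanduser()
    # A directory counts as a checkout when the manifest sits at its top level.
    if (checkout_dir / "sources.json").is_file():
        return checkout_dir
    return Path.home() / ".cache" / "raincloud"
```

With no override and a `sources.json` present, this returns the checkout directory itself, which matches the stated guarantee that checkout behavior is unchanged.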

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Remove import-time ORIGINALS_DIR (fetch.py) and WORKDIR (extract.py)
constants; route all callers through spec.raw_downloads_root() /
spec.workdir_root() so the paths honor RAINCLOUD_RAW_DOWNLOADS and
RAINCLOUD_WORKDIR env-vars at call time. Replace .relative_to(REPO_ROOT)
log calls with display_path() throughout. Update status.py to import
raw_downloads_root/workdir_root directly from spec; update build.py
clean-workdir block likewise. Redirect the fetch test monkeypatch to the
env-var approach and add a new test confirming both slug helpers honor
the redirected roots.
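
The motivation for dropping import-time constants can be shown in a few lines: a module-level constant freezes whatever the environment held at import, while a function re-reads it on every call. This is an illustrative sketch only; `_FALLBACK` is a made-up default, and the real `raw_downloads_root()` most likely derives its default from `data_root()` instead.

```python
import os
from pathlib import Path

_FALLBACK = Path("raincloud-demo") / "raw"  # hypothetical default for this sketch

# Import-time constant: evaluated once, so later env changes are never seen.
ORIGINALS_DIR = Path(os.environ.get("RAINCLOUD_RAW_DOWNLOADS", _FALLBACK))


def raw_downloads_root() -> Path:
    """Call-time resolver: honors RAINCLOUD_RAW_DOWNLOADS as it is right now."""
    return Path(os.environ.get("RAINCLOUD_RAW_DOWNLOADS", _FALLBACK))
```

A test that sets the variable after import (for example with pytest's `monkeypatch.setenv`) redirects `raw_downloads_root()` but not `ORIGINALS_DIR`, which is why the fetch test could move to the env-var approach once the constant was gone.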

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…rror

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Add sources.schema.json to the wheel's force-include so it lands in
raincloud/_data/ alongside sources.json and snapshot.json. Replace the
module-level SCHEMA_PATH constant in validate_manifest with a _schema_path()
resolver that prefers the checkout copy and falls back to _packaged_data(),
matching the pattern established for load_manifest in spec.py.
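
A checkout-first resolver with a packaged fallback might be sketched as follows. The function name, both parameters, and the assumption that the schema sits at the top of the checkout are illustrative; the actual `_schema_path()` and `_packaged_data()` helpers are not shown in this PR page.

```python
from pathlib import Path

SCHEMA_NAME = "sources.schema.json"


def schema_path(checkout_root: Path, packaged_dir: Path) -> Path:
    """Prefer the checkout's schema; fall back to the copy shipped in the wheel."""
    for base in (checkout_root, packaged_dir):
        candidate = base / SCHEMA_NAME
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{SCHEMA_NAME} not found in checkout or package data")
```

Resolving at call time rather than through a module-level constant keeps the import working in a wheel install, where no checkout copy exists.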

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…llback)

Catalog.entry() now derives parquet from any manifest slug and vortex from
convert.vortex|snapshot, with checksums None for never-snapshotted slugs.
Removes the snapshot-gating that blocked load() of buildable-but-not-yet-
snapshotted manifest slugs.
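
One way to read the rule in this commit message: every manifest slug gets a Parquet entry, Vortex is offered when either the manifest's `convert.vortex` setting or the snapshot says so, and checksums stay `None` until a snapshot records them. The sketch below is a guess at that logic; the dictionary shapes for `manifest` and `snapshot` and the returned structure are invented for illustration and do not reflect the real `Catalog.entry()`.

```python
def entry(slug: str, manifest: dict, snapshot: dict) -> dict:
    """Describe a slug's available formats and any recorded checksums."""
    if slug not in manifest:
        raise KeyError(slug)
    pinned = snapshot.get(slug, {})  # empty for never-snapshotted slugs
    formats = ["parquet"]  # derivable for any manifest slug
    wants_vortex = bool(manifest[slug].get("convert", {}).get("vortex"))
    if wants_vortex or "vortex" in pinned:
        formats.append("vortex")
    # Missing snapshot data yields None rather than blocking the entry.
    return {
        "slug": slug,
        "formats": formats,
        "checksums": {fmt: pinned.get(fmt) for fmt in formats},
    }
```

The point of the design is that a missing snapshot record no longer raises: the slug is still loadable, just without a checksum to verify against.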

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…xtra build

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
… heavy, extras)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…s + scan + to_pandas)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…ler smoke

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…slugs

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Adds an explicit on-disk assertion that the build subprocess wrote the
parquet under tmp_path/home/outputs/v1/<slug>/, not the repo's outputs/.
Without it, a regression that stopped honoring RAINCLOUD_HOME would let
the build silently overwrite tracked repo state while the row-count
assertion still passed (the loader cache would still adopt the wrong-
location artifact). Final-review follow-up.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
… hardening

Completes the uncommitted tail of the datasets-loader work and folds in two
rounds of adversarial review of the loader / publish / wheel-build feature:

- Runnable examples (use_loader, kepler, nyc_taxi, olympic, wine) and the
  raincloud-load / raincloud-publish agent skills.
- Cache integrity: snapshot byte size as a corruption check for sha-less
  slugs (origin-gated); strict mode trusts a locally-built artifact via its
  origin=build provenance pin instead of rebuilding every load, rebuilding
  only when the snapshot pin changes.
- publish: temp-key + rename upload (no truncated object at the canonical key
  on a crash); env-aware snapshot via spec.default_snapshot(); reuse the
  loader's sha256_file / artifact_key.
- Typed errors (BuildFailed; missing-[build]-extra vs broken-toolchain
  message); read_pin rejects non-dict JSON; atomic pin writes.
- Tests: strict-build serve-from-cache + snapshot-revision staleness, sha-less
  size/pin paths, cache_root default/expanduser, publish atomic upload, path
  hermeticity; autouse loader-cache isolation; RAINCLOUD_* scrub in subprocess
  tests.
- Docs: accurate integrity/strict semantics (README/AGENTS/SKILLS/CHANGELOG),
  examples install via git+https, skills index + count (21).
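
The "atomic pin writes" and "read_pin rejects non-dict JSON" items, and by analogy the temp-key + rename upload, rest on the same pattern: write to a temporary name, then rename onto the canonical one. A minimal local-filesystem sketch follows, assuming a JSON pin file; `write_pin` is a hypothetical name, and raising `ValueError` is an assumption about how `read_pin` rejects bad input.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_pin(path: Path, pin: dict) -> None:
    """Write JSON to a temp file beside the target, then rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(pin, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)  # never leave a half-written temp file behind
        raise


def read_pin(path: Path) -> dict:
    """Load a pin, rejecting valid JSON that is not an object."""
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError(f"{path.name}: expected a JSON object, got {type(data).__name__}")
    return data
```

A crash before `os.replace` leaves the previous pin untouched, so a reader never sees a truncated file at the canonical path. How the publish CLI achieves the equivalent on a remote mirror depends on the storage backend and is not covered by this sketch.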

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Comment thread .github/workflows/ci.yml Fixed
Comment thread .github/workflows/ci.yml Fixed
@mprammer mprammer marked this pull request as ready for review May 29, 2026 19:45
mprammer and others added 2 commits May 29, 2026 15:55
…nts:read

Two CI fixes for this branch's own workflow:

- wheel job: it syncs --extra dev only (by design — it tests the wheel's
  layered install in throwaway venvs), but pytest imports every collected
  module before applying -m wheel, and tests/test_manifest.py +
  tests/test_profile.py import jsonschema (a [build] dep) at module top, so
  collection ModuleNotFoundError'd before any wheel test ran. All wheel-marked
  tests live in tests/test_wheel.py; collect only that file. Reproduced and
  fixed against a dev-only venv: full-tree collect errors on the two modules,
  scoped collect yields 13 tests / exit 0.

- permissions: add a top-level `contents: read` block. Every job is read-only
  (checkout + deps + lint/tests); none writes to the repo. Resolves the two
  CodeQL "workflow does not contain permissions" findings on this PR.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
@mprammer mprammer merged commit f3b4136 into develop May 29, 2026
7 checks passed
@mprammer mprammer deleted the mp/datasets-loader branch May 29, 2026 20:08
2 participants