Add lightweight raincloud loader, publish CLI, and wheel-installable build pipeline #15
Merged
Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
…elog Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
…forward-compat fixups Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Adds data_root() (RAINCLOUD_HOME -> checkout -> ~/.cache/raincloud) plus outputs_base/outputs_root/raw_downloads_root/workdir_root/display_path and a packaged-manifest fallback in load_manifest. Removes DEFAULT_MANIFEST. Checkout behavior is unchanged (data_root()==REPO_ROOT when sources.json is present). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
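The resolution order named in this commit (RAINCLOUD_HOME, then the checkout, then ~/.cache/raincloud) could look roughly like the sketch below. It is not the code in scripts/pipeline/spec.py: the function name `resolve_data_root` and its `repo_root` and `env` parameters are hypothetical, added so the logic can be exercised without touching the real environment.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_root(repo_root: Path, env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the data-area root: env override, then checkout, then user cache."""
    env = os.environ if env is None else env

    # 1. An explicit RAINCLOUD_HOME always wins.
    override = env.get("RAINCLOUD_HOME")
    if override:
        return Path(override).expanduser()

    # 2. A checkout is recognised by the presence of sources.json.
    if (repo_root / "sources.json").is_file():
        return repo_root

    # 3. Otherwise fall back to the per-user cache directory.
    return Path.home() / ".cache" / "raincloud"
```

Under these assumptions, a checkout keeps resolving to its own root, while a wheel install with no sources.json next to the code lands in the user cache unless RAINCLOUD_HOME says otherwise.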
Remove import-time ORIGINALS_DIR (fetch.py) and WORKDIR (extract.py) constants; route all callers through spec.raw_downloads_root() / spec.workdir_root() so the paths honor RAINCLOUD_RAW_DOWNLOADS and RAINCLOUD_WORKDIR env-vars at call time. Replace .relative_to(REPO_ROOT) log calls with display_path() throughout. Update status.py to import raw_downloads_root/workdir_root directly from spec; update build.py clean-workdir block likewise. Redirect the fetch test monkeypatch to the env-var approach and add a new test confirming both slug helpers honor the redirected roots. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
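The difference between an import-time constant and a call-time resolver can be illustrated as follows. This is a minimal sketch: the default directory name and `FROZEN_RAW_DIR` are hypothetical, and the real `spec.raw_downloads_root()` may resolve its default differently.

```python
import os
from pathlib import Path

# Hypothetical default; the real default location is not stated in this commit.
_DEFAULT_RAW = "raw_downloads"

# Import-time constant: evaluated once, so later changes to the env var are ignored.
FROZEN_RAW_DIR = Path(os.environ.get("RAINCLOUD_RAW_DOWNLOADS", _DEFAULT_RAW))


def raw_downloads_root() -> Path:
    """Resolve the raw-downloads root every time it is called."""
    return Path(os.environ.get("RAINCLOUD_RAW_DOWNLOADS", _DEFAULT_RAW))
```

With the call-time form, a test only needs `monkeypatch.setenv("RAINCLOUD_RAW_DOWNLOADS", str(tmp_path))` to redirect every caller, instead of patching a constant in each module that imported it.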
…rror Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Add sources.schema.json to the wheel's force-include so it lands in raincloud/_data/ alongside sources.json and snapshot.json. Replace the module-level SCHEMA_PATH constant in validate_manifest with a _schema_path() resolver that prefers the checkout copy and falls back to _packaged_data(), matching the pattern established for load_manifest in spec.py. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
…llback) Catalog.entry() now derives parquet from any manifest slug and vortex from convert.vortex|snapshot, with checksums None for never-snapshotted slugs. Removes the snapshot-gating that blocked load() of buildable-but-not-yet-snapshotted manifest slugs. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
…xtra build Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
… heavy, extras) Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
…s + scan + to_pandas) Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
…ler smoke Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
…slugs Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Adds an explicit on-disk assertion that the build subprocess wrote the parquet under tmp_path/home/outputs/v1/<slug>/, not the repo's outputs/. Without it, a regression that stopped honoring RAINCLOUD_HOME would let the build silently overwrite tracked repo state while the row-count assertion still passed (the loader cache would still adopt the wrong-location artifact). Final-review follow-up. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
… hardening

Completes the uncommitted tail of the datasets-loader work and folds in two rounds of adversarial review of the loader / publish / wheel-build feature:

- Runnable examples (use_loader, kepler, nyc_taxi, olympic, wine) and the raincloud-load / raincloud-publish agent skills.
- Cache integrity: snapshot byte size as a corruption check for sha-less slugs (origin-gated); strict mode trusts a locally-built artifact via its origin=build provenance pin instead of rebuilding every load, rebuilding only when the snapshot pin changes.
- publish: temp-key + rename upload (no truncated object at the canonical key on a crash); env-aware snapshot via spec.default_snapshot(); reuse the loader's sha256_file / artifact_key.
- Typed errors (BuildFailed; missing-[build]-extra vs broken-toolchain message); read_pin rejects non-dict JSON; atomic pin writes.
- Tests: strict-build serve-from-cache + snapshot-revision staleness, sha-less size/pin paths, cache_root default/expanduser, publish atomic upload, path hermeticity; autouse loader-cache isolation; RAINCLOUD_* scrub in subprocess tests.
- Docs: accurate integrity/strict semantics (README/AGENTS/SKILLS/CHANGELOG), examples install via git+https, skills index + count (21).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
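The "atomic pin writes" and "read_pin rejects non-dict JSON" items follow a common pattern: write to a temporary file in the same directory, then rename over the target. The sketch below illustrates that pattern under stated assumptions. `write_pin_atomic` is a hypothetical name, the pin is assumed to be a JSON object on local disk, and the body of `read_pin` here is a guess at the behavior, not the repository's implementation.

```python
import json
import os
import tempfile
from pathlib import Path


def write_pin_atomic(path: Path, pin: dict) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the target directory so the rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(pin, fh)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace overwrites the destination in a single rename step.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the partial temp file; the canonical path is untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def read_pin(path: Path) -> dict:
    """Load a pin file and reject valid JSON that is not an object."""
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError(f"{path}: expected a JSON object, got {type(data).__name__}")
    return data
```

The publish upload described in the same commit applies the same idea to a mirror: upload under a temporary key, then rename to the canonical key, so a crash never leaves a truncated object where readers look.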
…nts:read

Two CI fixes for this branch's own workflow:

- wheel job: it syncs --extra dev only (by design: it tests the wheel's layered install in throwaway venvs), but pytest imports every collected module before applying -m wheel, and tests/test_manifest.py + tests/test_profile.py import jsonschema (a [build] dep) at module top, so collection ModuleNotFoundError'd before any wheel test ran. All wheel-marked tests live in tests/test_wheel.py; collect only that file. Reproduced and fixed against a dev-only venv: full-tree collect errors on the two modules, scoped collect yields 13 tests / exit 0.
- permissions: add a top-level `contents: read` block. Every job is read-only (checkout + deps + lint/tests); none writes to the repo. Resolves the two CodeQL "workflow does not contain permissions" findings on this PR.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
This branch adds an importable `raincloud` package: `raincloud.load("<slug>")` returns a lazy `Dataset` over an already-prepared Parquet/Vortex artifact, resolving local cache → mirror → local build, so the catalog is usable from a lightweight `pip install "raincloud @ git+https://github.com/spiraldb/raincloud"` instead of a full checkout. Alongside it: a `scripts.pipeline.publish` CLI that syncs built artifacts to a mirror gated on the snapshot's recorded sha256, and a configurable data-area root (`scripts/pipeline/spec.py:data_root()`) so the build pipeline can run from a wheel install with no checkout, writing under `~/.cache/raincloud`.

The install is now layered, which is breaking: a bare install pulls only the loader (pyarrow, numpy, vortex-data, fsspec), and the heavy build toolchain (duckdb, osmium, pyreadstat, …) moved behind the `[build]` extra; build/handler work needs `uv sync --extra build`. Artifact integrity is size- and sha256-based with a default warn-and-adopt policy (`RAINCLOUD_STRICT_CHECKSUM=1` for a hard gate), and locally-built artifacts are trusted by provenance rather than rebuilt on every load. Per-area detail (env vars, resolution order, integrity semantics) is in the CHANGELOG `0.2.0` entry.

🤖 Generated with Claude Code
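The size- and sha256-based integrity policy described above can be sketched as follows. This is an illustration under stated assumptions, not the loader's code: `check_artifact` and `ChecksumMismatch` are hypothetical names, and the exact conditions under which the real loader consults the size versus the hash may differ. Only the `RAINCLOUD_STRICT_CHECKSUM=1` switch and the warn-by-default behavior come from the description.

```python
import hashlib
import os
import warnings
from pathlib import Path
from typing import Optional


class ChecksumMismatch(Exception):
    """Hypothetical error type for a failed strict integrity check."""


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large artifacts stay out of memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_artifact(path: Path, expected_sha256: Optional[str], expected_size: Optional[int]) -> bool:
    """Return True if the artifact matches; warn by default, raise in strict mode."""
    problems = []
    if expected_size is not None and path.stat().st_size != expected_size:
        problems.append("size mismatch")
    if expected_sha256 is not None and sha256_file(path) != expected_sha256:
        problems.append("sha256 mismatch")
    if not problems:
        return True

    message = f"{path}: {', '.join(problems)}"
    if os.environ.get("RAINCLOUD_STRICT_CHECKSUM") == "1":
        # Hard gate: refuse the artifact outright.
        raise ChecksumMismatch(message)
    # Default policy: warn and let the caller adopt the artifact anyway.
    warnings.warn(message)
    return False
```

Passing `expected_sha256=None` models a slug with no recorded hash, where only the byte size can flag corruption.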