Thanks for your interest in Raincloud. This guide covers how to set up a dev
environment, run the test suite, and submit changes. For deeper dives into the
pipeline itself, see README.md, AGENTS.md, and SKILLS.md.
git clone git@github.com:spiraldb/raincloud.git
cd raincloud
uv sync --extra dev --inexact
--extra dev pulls in pytest. The base install is the lightweight loader
(pyarrow, numpy, vortex-data, fsspec); the heavy build toolchain
(duckdb, pandas, osmium, pyreadstat, openpyxl, …) now lives behind the build
extra, so add --extra build when working on the pipeline or handlers;
otherwise the build stages fail on a missing import. Add --extra kaggle or
--extra huggingface if your work touches those upstream types, or --extra all
for everything. Always pass --inexact: without it, each uv sync --extra X
removes the extras from the previous one (e.g. syncing --extra dev
after --extra huggingface uninstalls huggingface_hub).
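To make the --inexact rule concrete, here is a toy model of the behaviour described above. It is not uv's implementation; the `sync` function is hypothetical, and the extras are simplified to the single packages this guide names for them.

```python
# Toy model only: mimics the add/remove behaviour described in this guide,
# not uv's real resolver.
BASE = {"pyarrow", "numpy", "vortex-data", "fsspec"}
EXTRAS = {
    "dev": {"pytest"},                   # simplified: the guide names pytest
    "huggingface": {"huggingface_hub"},  # simplified: the guide names huggingface_hub
}


def sync(installed: set[str], extra: str, inexact: bool) -> set[str]:
    """Return the environment after a simulated `uv sync --extra <extra>`."""
    wanted = BASE | EXTRAS[extra]
    # Exact sync (the default) keeps only what was requested;
    # --inexact leaves previously installed packages in place.
    return installed | wanted if inexact else set(wanted)


env = sync(set(), "huggingface", inexact=False)
exact = sync(env, "dev", inexact=False)
inexact = sync(env, "dev", inexact=True)

print("huggingface_hub" in exact)    # False: the earlier extra was removed
print("huggingface_hub" in inexact)  # True: the earlier extra survived
```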
Three sub-second checks are the minimum gate (CI runs all three):
ruff check # lint (pyflakes + pycodestyle + isort)
python -m scripts.pipeline.validate_manifest # JSON Schema + cross-checks on sources.json
pytest # smoke regression net (manifest, schema, registry, examples, loader)
pytest now also covers the raincloud loader package (catalog resolution,
cache/mirror/build dispatch, sha256 integrity) alongside the manifest checks.
If you touched the build pipeline, install the build extra and run a small end-to-end build to make sure it still produces the expected output:
uv sync --extra build --inexact
python -m scripts.pipeline.build countries-of-the-world # ~200 ms, 227 rows
For larger builds, see SKILLS.md.
- New datasets: see SKILLS.md. Most entries copy templates/minimal_spec.json and pick an existing handler from docs/v1/handlers.md.
- New transform handlers: see SKILLS.md. One handler per upstream shape; register in scripts/pipeline/handlers/__init__.py.
- Bug fixes: start with a failing test where practical.
- Documentation: README/AGENTS/SKILLS edits welcome. The two derived docs (docs/datasets.md, docs/handlers.md) are machine-generated; don't hand-edit them. Fix the manifest or the registry and regenerate via python -m scripts.pipeline.docs.
Add a test alongside any new behaviour:
- New transform handler: a fixture-based test demonstrating the expected output shape (small in-memory pa.Table; see existing handler tests in tests/test_manifest.py for the pattern).
- New manifest field or schema rule: extend test_manifest.py to assert it validates as expected.
- New CLI flag: extend the relevant test_*.py (e.g. test_list_datasets.py for catalog-filter flags).
- Bug fix: a failing test that the fix turns green.
pytest is the minimum pre-PR gate (see Before you open a PR);
CI re-runs it on every PR via .github/workflows/ci.yml.
- Branch off develop. Branch names follow <initials>/<topic> (e.g. mp/add-fastlanes).
- Open PRs against develop.
- Commit messages: short imperative subject ("add X", "fix Y", "swap Z to W"), optional body explaining why the change is needed.
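If you want to check a branch name locally before pushing, a minimal sketch might look like the following. The regex is an assumption about what <initials>/<topic> means (two to four lower-case letters, a slash, then a lower-case hyphenated topic); the project does not define or enforce this exact pattern, and `is_valid_branch` is a hypothetical helper.

```python
import re

# Assumed reading of <initials>/<topic>; stricter than the guide itself states.
BRANCH_RE = re.compile(r"[a-z]{2,4}/[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_branch(name: str) -> bool:
    """Return True if the name matches the assumed <initials>/<topic> shape."""
    return BRANCH_RE.fullmatch(name) is not None


print(is_valid_branch("mp/add-fastlanes"))  # True
print(is_valid_branch("add-fastlanes"))     # False: no initials prefix
```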
Open an issue on GitHub Issues. Include the slug you were building, the command you ran, and any traceback.
For security-related issues, do not open a public issue; see
SECURITY.md for the private channel.
- Python ≥ 3.11. Match the style of nearby code; the repo prefers terse, comment-light Python with explicit names over abstractions.
- No backwards-compat stubs or shims when removing handlers/slugs; git history is the fallback.
- Always go through scripts.pipeline.spec.duckdb_connect for DuckDB connections so resource limits and storage_compatibility_version=v1.5.0 apply (see AGENTS.md).
By submitting a PR, you agree that your contribution will be licensed under the Apache License 2.0, the same license that covers the rest of the project.