Drop temporary CPS take-up anchors from H5 outputs #964

Merged
MaxGhenis merged 1 commit into main from codex/drop-temp-cps-source-anchors on May 13, 2026

@MaxGhenis MaxGhenis commented May 12, 2026

Summary

  • delete temporary SNAP/SSI take-up source anchors from persisted CPS H5 outputs after append-style saves
  • reject temporary or retired reported/source variables during dataset upload validation
  • add an eCPS-only data-build path that can skip source-imputed CPS and small enhanced CPS when only enhanced_cps_2024.h5 is needed
  • checkpoint enhanced_cps clone diagnostics alongside the H5 so eCPS-only runs can validate and stage the sidecar
  • record skipped Phase 5 in the Stage 1 contract and exclude stale small enhanced CPS artifacts from eCPS-only upload/promote

Where this broke

The prior production data run was Run Pipeline 25751771946, for commit d4871933 / version 1.112.3. That GitHub job completed because it only launched the Modal pipeline, but the Modal eCPS checkpoint from that run contained the temporary source anchors snap_reported and ssi_reported.

With this PR's validator, that checkpointed H5 fails before upload:

Validation failed for enhanced_cps_2024.h5:
  - Dataset contains temporary or retired variables: snap_reported, ssi_reported
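A check that produces a message of this shape can be sketched roughly as follows. This is a minimal illustration, not the validator in upload_completed_datasets.py: the constant name, the function name, and the choice to pass in a plain list of variable names are all assumptions made for the example. Only the two variable names and the message wording come from the output above.

```python
from typing import Iterable, List

# Hypothetical blocklist. Only these two names appear in the PR description;
# the real validator may cover more variables or match them differently.
TEMPORARY_OR_RETIRED_VARIABLES = frozenset({"snap_reported", "ssi_reported"})


def blocked_variable_errors(variable_names: Iterable[str]) -> List[str]:
    """Return validation error strings for blocked variables in a dataset."""
    blocked = sorted(set(variable_names) & TEMPORARY_OR_RETIRED_VARIABLES)
    if not blocked:
        return []
    return [
        "Dataset contains temporary or retired variables: " + ", ".join(blocked)
    ]
```

Returning a list of error strings rather than raising on the first hit lets a caller report every problem with a file in one pass, which matches the bulleted failure output shown above.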

The root cause is that Dataset.save_dataset appends to or overwrites H5 keys but does not delete keys that were popped from the in-memory dict, so variables removed in memory persist in the saved file.
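The failure mode can be reproduced in miniature. The sketch below is an illustration under stated assumptions, not the project's code: both function names are hypothetical, and a plain dict stands in for the H5 file so the example runs without extra dependencies. An h5py.File opened in append mode supports the same membership test, `del f[name]`, and item assignment, but whether Dataset.save_dataset is implemented this way internally is an assumption.

```python
from typing import Any, Dict, Iterable, List, MutableMapping


def append_style_save(store: MutableMapping[str, Any], data: Dict[str, Any]) -> None:
    """Write every in-memory key to the store; keys absent from data are left alone."""
    for name, values in data.items():
        if name in store:
            del store[name]  # overwrite by replacing the existing entry
        store[name] = values


def drop_persisted_variables(
    store: MutableMapping[str, Any], names: Iterable[str]
) -> List[str]:
    """Delete the named keys from the store and return the ones that were present."""
    removed = []
    for name in names:
        if name in store:
            del store[name]
            removed.append(name)
    return removed


# A plain dict stands in for the H5 file on disk (assumption for the demo).
on_disk: Dict[str, Any] = {}
in_memory = {"snap": [1, 0], "snap_reported": [1, 1]}
append_style_save(on_disk, in_memory)  # first save persists the anchor

in_memory.pop("snap_reported")         # anchor dropped in memory only
append_style_save(on_disk, in_memory)  # second save leaves the stale key behind
stale_after_resave = "snap_reported" in on_disk

# Explicit cleanup after the save removes what the save itself cannot.
removed = drop_persisted_variables(on_disk, ["snap_reported", "ssi_reported"])
```

Running the cleanup after the append-style save, rather than changing the save itself, keeps the fix local to the code path that creates the temporary variables.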

Tests

  • uv run ruff format policyengine_us_data/datasets/cps/cps.py policyengine_us_data/storage/upload_completed_datasets.py policyengine_us_data/stage_contracts/dataset_build.py modal_app/data_build.py tests/integration/test_cps_generation.py tests/unit/test_upload_completed_datasets.py tests/unit/test_modal_data_build.py tests/unit/test_dataset_build_stage_contract.py
  • uv run ruff check policyengine_us_data/datasets/cps/cps.py policyengine_us_data/storage/upload_completed_datasets.py policyengine_us_data/stage_contracts/dataset_build.py modal_app/data_build.py tests/integration/test_cps_generation.py tests/unit/test_upload_completed_datasets.py tests/unit/test_modal_data_build.py tests/unit/test_dataset_build_stage_contract.py
  • uv run pytest tests/integration/test_cps_generation.py::test_add_takeup_removes_temporary_source_anchors_from_saved_h5 tests/integration/test_cps_generation.py::test_drop_persisted_dataset_variables_removes_stale_h5_keys tests/unit/test_upload_completed_datasets.py::test_validate_dataset_rejects_temporary_reported_source_variables -q
  • uv run pytest tests/unit/test_dataset_build_stage_contract.py tests/unit/test_modal_data_build.py tests/unit/test_upload_completed_datasets.py tests/integration/test_cps_generation.py tests/unit/datasets/test_cps_takeup.py -q

@MaxGhenis MaxGhenis force-pushed the codex/drop-temp-cps-source-anchors branch 3 times, most recently from 5bbd373 to d1ce64f Compare May 13, 2026 00:34
@MaxGhenis MaxGhenis force-pushed the codex/drop-temp-cps-source-anchors branch from d1ce64f to 19d9bda Compare May 13, 2026 00:38
@MaxGhenis MaxGhenis merged commit bea9230 into main May 13, 2026
11 of 12 checks passed
@MaxGhenis MaxGhenis deleted the codex/drop-temp-cps-source-anchors branch May 13, 2026 00:49