Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions .github/workflows/any_changes.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,9 @@ jobs:
docs:
name: Test documentation builds
runs-on: ubuntu-latest
permissions:
contents: read
id-token: write
steps:
- name: Checkout repo
uses: actions/checkout@v2
Expand All @@ -23,6 +26,10 @@ jobs:

- name: Install package
run: uv pip install .[dev] --system
- uses: "google-github-actions/auth@v2"
with:
workload_identity_provider: "projects/322898545428/locations/global/workloadIdentityPools/policyengine-research-id-pool/providers/prod-github-provider"
service_account: "policyengine-research@policyengine-research.iam.gserviceaccount.com"

- name: Test documentation builds
run: make documentation
Expand Down
11 changes: 8 additions & 3 deletions .github/workflows/code_changes.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,9 @@ jobs:
args: ". -l 79 --check"
Test:
runs-on: ubuntu-latest
permissions:
contents: read
id-token: write
steps:
- name: Checkout repo
uses: actions/checkout@v2
Expand All @@ -31,11 +34,13 @@ jobs:
uses: actions/setup-python@v2
with:
python-version: '3.11'
- uses: "google-github-actions/auth@v2"
with:
workload_identity_provider: "projects/322898545428/locations/global/workloadIdentityPools/policyengine-research-id-pool/providers/prod-github-provider"
service_account: "policyengine-research@policyengine-research.iam.gserviceaccount.com"

- name: Install package
run: uv pip install .[dev] --system

- name: Run tests
run: make test
env:
HUGGING_FACE_TOKEN: ${{ secrets.HUGGING_FACE_TOKEN }}
run: make test
4 changes: 4 additions & 0 deletions changelog_entry.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
- bump: patch
changes:
fixed:
- Default storage location is GCP.
7 changes: 0 additions & 7 deletions policyengine/constants.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,13 +3,6 @@
from policyengine_core.data import Dataset
from policyengine.utils.data_download import download

# Datasets
ENHANCED_FRS = "hf://policyengine/policyengine-uk-data/enhanced_frs_2022_23.h5"
FRS = "hf://policyengine/policyengine-uk-data/frs_2022_23.h5"
ENHANCED_CPS = "hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"
CPS = "hf://policyengine/policyengine-us-data/cps_2023.h5"
POOLED_CPS = "hf://policyengine/policyengine-us-data/pooled_3_year_cps_2023.h5"


def get_default_dataset(country: str, region: str):
if country == "uk":
Expand Down
25 changes: 13 additions & 12 deletions policyengine/utils/data_download.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,18 +27,6 @@ def download(
)
Copy link
Copy Markdown
Contributor

@anth-volk anth-volk May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This comment should be attached to the DataFile initialization above, GH review limitations disallowed)
question: Is this method structure too limiting and envision ourselves as the only customer?

We've had quite a few issues around prioritization HF/Google Cloud data storage. In order to make it easier for ourselves and to make .py more open to any user, I'd suggest implementing download() with an OOP strategy pattern, passing in two args: a string literal corresponding to one permitted strategy, perhaps providing a default (e.g., "google_cloud_bucket"), then another arg called "options" that was one of a variety of options schemas, e.g., gcs_bucket for Google downloads and the two HF options for Hugging Face.

This would do the following:

  • Enable us to easily override the default behavior wherever we want without being tied to a particular prioritization
  • Enable us to implement certain strategy-agnostic file management structures, like versioning and caching, that we would not need to tie to one vendor
  • Enable us to build new download strategies as needed, e.g., if we ever hosted something on GitHub again or on BitBucket
  • Enable external users to build their own download strategies and plug them in as needed

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion, non-blocking My 2 cents: I agree directionally. But right now I'd simplify. Remove the hugging face entirely. The first customer is simulation service and that's all we need.

Get storage & caching working rock solid and usable from dev desktop, pipeline, and service. Then expand out.


logging.info = print
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion While we're in here, could this use the Python logger please? I'd make this blocking, but the switch to gcp is probably more important.

# NOTE: tests will break on build if you don't default to huggingface.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question, blocking: Is the UK non-public data hosted in a bucket? I thought I remembered it being GCP-only

if data_file.huggingface_repo is not None:
logging.info("Using Hugging Face for download.")
try:
return download_from_hf(
repo=data_file.huggingface_org
+ "/"
+ data_file.huggingface_repo,
repo_filename=data_file.filepath,
)
except:
logging.info("Failed to download from Hugging Face.")

if Path(filepath).exists():
logging.info(f"File {filepath} already exists. Skipping download.")
Expand All @@ -53,6 +41,19 @@ def download(
)
return filepath

# NOTE: tests will break on build if you don't default to huggingface.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This comment would no longer be true

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree.

elif data_file.huggingface_repo is not None:
logging.info("Using Hugging Face for download.")
try:
return download_from_hf(
repo=data_file.huggingface_org
+ "/"
+ data_file.huggingface_repo,
repo_filename=data_file.filepath,
)
except:
logging.info("Failed to download from Hugging Face.")

raise ValueError(
"No valid download method specified. Please provide either a Hugging Face repo or a Google Cloud Storage bucket."
)
Loading