-
Notifications
You must be signed in to change notification settings - Fork 8
Use Google Cloud Storage by default #145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| - bump: patch | ||
| changes: | ||
| fixed: | ||
| - Default storage location is GCP. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -27,18 +27,6 @@ def download( | |
| ) | ||
|
|
||
| logging.info = print | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Suggestion While we're in here, could this use the Python logger please? I'd make this blocking, but the switch to gcp is probably more important. |
||
| # NOTE: tests will break on build if you don't default to huggingface. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. question, blocking: Is the UK non-public data hosted in a bucket? I thought I remembered it being GCP-only |
||
| if data_file.huggingface_repo is not None: | ||
| logging.info("Using Hugging Face for download.") | ||
| try: | ||
| return download_from_hf( | ||
| repo=data_file.huggingface_org | ||
| + "/" | ||
| + data_file.huggingface_repo, | ||
| repo_filename=data_file.filepath, | ||
| ) | ||
| except: | ||
| logging.info("Failed to download from Hugging Face.") | ||
|
|
||
| if Path(filepath).exists(): | ||
| logging.info(f"File {filepath} already exists. Skipping download.") | ||
|
|
@@ -53,6 +41,19 @@ def download( | |
| ) | ||
| return filepath | ||
|
|
||
| # NOTE: tests will break on build if you don't default to huggingface. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: This comment would no longer be true
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agree. |
||
| elif data_file.huggingface_repo is not None: | ||
| logging.info("Using Hugging Face for download.") | ||
| try: | ||
| return download_from_hf( | ||
| repo=data_file.huggingface_org | ||
| + "/" | ||
| + data_file.huggingface_repo, | ||
| repo_filename=data_file.filepath, | ||
| ) | ||
| except: | ||
| logging.info("Failed to download from Hugging Face.") | ||
|
|
||
| raise ValueError( | ||
| "No valid download method specified. Please provide either a Hugging Face repo or a Google Cloud Storage bucket." | ||
| ) | ||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This comment should be attached to the
DataFileinitialization above, GH review limitations disallowed)question: Is this method structure too limiting and envision ourselves as the only customer?
We've had quite a few issues around prioritization HF/Google Cloud data storage. In order to make it easier for ourselves and to make
.pymore open to any user, I'd suggest implementingdownload()with an OOP strategy pattern, passing in two args: a string literal corresponding to one permitted strategy, perhaps providing a default (e.g., "google_cloud_bucket"), then another arg called "options" that was one of a variety of options schemas, e.g.,gcs_bucketfor Google downloads and the two HF options for Hugging Face.This would do the following:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion, non-blocking My 2 cents: I agree directionally. But right now I'd simplify. Remove the hugging face entirely. The first customer is simulation service and that's all we need.
Get storage & caching working rock solid and usable from dev desktop, pipeline, and service. Then expand out.