Use Google Cloud Storage by default#145
Conversation
| @@ -27,18 +27,6 @@ def download( | |||
| ) | |||
There was a problem hiding this comment.
(This comment should be attached to the DataFile initialization above, GH review limitations disallowed)
question: Is this method structure too limiting and envision ourselves as the only customer?
We've had quite a few issues around prioritization HF/Google Cloud data storage. In order to make it easier for ourselves and to make .py more open to any user, I'd suggest implementing download() with an OOP strategy pattern, passing in two args: a string literal corresponding to one permitted strategy, perhaps providing a default (e.g., "google_cloud_bucket"), then another arg called "options" that was one of a variety of options schemas, e.g., gcs_bucket for Google downloads and the two HF options for Hugging Face.
This would do the following:
- Enable us to easily override the default behavior wherever we want without being tied to a particular prioritization
- Enable us to implement certain strategy-agnostic file management structures, like versioning and caching, that we would not need to tie to one vendor
- Enable us to build new download strategies as needed, e.g., if we ever hosted something on GitHub again or on BitBucket
- Enable external users to build their own download strategies and plug them in as needed
There was a problem hiding this comment.
suggestion, non-blocking My 2 cents: I agree directionally. But right now I'd simplify. Remove the hugging face entirely. The first customer is simulation service and that's all we need.
Get storage & caching working rock solid and usable from dev desktop, pipeline, and service. Then expand out.
| ) | ||
|
|
||
| logging.info = print | ||
| # NOTE: tests will break on build if you don't default to huggingface. |
There was a problem hiding this comment.
question, blocking: Is the UK non-public data hosted in a bucket? I thought I remembered it being GCP-only
| ) | ||
| return filepath | ||
|
|
||
| # NOTE: tests will break on build if you don't default to huggingface. |
There was a problem hiding this comment.
nit: This comment would no longer be true
anth-volk
left a comment
There was a problem hiding this comment.
@nikhilwoodruff had some thoughts for you around shifting how we do downloads to prevent future bugginess. Curious to hear your thoughts.
mikesmit
left a comment
There was a problem hiding this comment.
Let's discuss why hugginface at all. otherwise let's go.
| @@ -27,18 +27,6 @@ def download( | |||
| ) | |||
There was a problem hiding this comment.
suggestion, non-blocking My 2 cents: I agree directionally. But right now I'd simplify. Remove the hugging face entirely. The first customer is simulation service and that's all we need.
Get storage & caching working rock solid and usable from dev desktop, pipeline, and service. Then expand out.
| ) | ||
| return filepath | ||
|
|
||
| # NOTE: tests will break on build if you don't default to huggingface. |
| @@ -27,18 +27,6 @@ def download( | |||
| ) | |||
|
|
|||
| logging.info = print | |||
There was a problem hiding this comment.
Suggestion While we're in here, could this use the Python logger please? I'd make this blocking, but the switch to gcp is probably more important.
|
To do: just don't do huggingface in policyengine.py |
|
Closing, as my understanding is that this is superseded by #148. Feel free to reopen if I'm wrong @nikhilwoodruff. |
No description provided.