Use a persistent location for caching data files #152
Conversation
Doing some research into this to better understand needs and options before reviewing
mikesmit left a comment
Unless there is a really strong benefit, I would not bother with this right now.
Actually, as a side note: if we had integrated metrics in policyengine.py, we could easily see how frequently we're downloading and how long it takes.
I had mentioned "instrumentation" as a key requirement for policyengine.py a while back; this is a great example of where I wanted to add it but didn't have a mechanism.
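A minimal sketch of the kind of download instrumentation described above, assuming plain stdlib logging; the logger name and the `download_blob` helper are hypothetical stand-ins, not actual policyengine.py code:

```python
import logging
import time
from functools import wraps

logger = logging.getLogger("policyengine.storage")  # hypothetical logger name


def timed_download(func):
    """Log each download call and how long it takes."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            logger.info("download %s took %.2fs", func.__name__, elapsed)
    return wrapper


@timed_download
def download_blob(bucket: str, key: str) -> bytes:
    """Hypothetical stand-in for the storage client's download call."""
    ...
```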
```diff
  self.client = SimplifiedGoogleStorageClient()
- self.cache = diskcache.Cache()
+ cache_folder = user_cache_dir("policyengine.py")
+ self.cache = diskcache.Cache(directory=cache_folder)

  def _data_key(
```
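For context, a self-contained sketch of the pattern the diff introduces, assuming `user_cache_dir` comes from platformdirs (appdirs exposes the same call); the example key is illustrative:

```python
import diskcache
from platformdirs import user_cache_dir

# Resolve a per-user, per-app cache directory, e.g. ~/.cache/policyengine.py
# on Linux or ~/Library/Caches/policyengine.py on macOS.
cache_folder = user_cache_dir("policyengine.py")

# diskcache keeps an SQLite index plus data files under that directory, so
# cached entries survive process restarts instead of living in a throwaway
# temporary directory (the default when no directory is given).
cache = diskcache.Cache(directory=cache_folder)

cache["example-key"] = b"example payload"
assert cache["example-key"] == b"example payload"
```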
I'm neutral to no on this PR. I don't think it adds much and it introduces complexity.
What does it actually do?
I think this isn't really going to have much of a performance impact.
The only time this will make a difference is if a process spins up, there is already a cache on disk, and that cache already contains the file in question. In that case it will reuse the cache.
I think there are only two ways that happens:
- A process that was running crashes and uvicorn restarts it (this should be infrequent)
- Multiple processes are running on the same container (currently we run 2) and this is the second one to try to get a specific file.
In all other cases we're spinning up a new container, which does not share a disk with any other container.
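To make the scenario concrete, a rough sketch of the get-or-download flow under discussion; the `get_data` helper and its signature are illustrative, not the real `_data_key` / client code. The cached branch only pays off in the two cases above:

```python
import diskcache
from platformdirs import user_cache_dir

cache = diskcache.Cache(directory=user_cache_dir("policyengine.py"))


def get_data(key: str, download) -> bytes:
    """Return bytes from the on-disk cache when an earlier process on this
    machine already downloaded them; otherwise download and persist them."""
    cached = cache.get(key)
    if cached is not None:
        # Reuse case: a restarted process or a second worker on the same
        # container finds the file already on disk.
        return cached
    # Common case on a fresh container: download once, then cache.
    data = download(key)
    cache.set(key, data)
    return data
```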
Why I am not in a rush to do it
I could have used a shared directory originally, and I opted not to because:
- As noted above, the improvement is minimal (we download a file twice instead of once on any individual container)
- The risk is also minimal, but annoying. This should work, but why throw more variables into the "concurrency issues" mix by using a shared cache directory for multiple processes? If there are multi-process contention issues, they'll be hard to find/recognize/address. The way it is now, we just don't have to worry about it.
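For what it's worth, diskcache documents itself as thread- and process-safe (it coordinates through its SQLite index), so the shared-directory scenario being worried about looks roughly like the sketch below; the worker function and key names are made up for illustration:

```python
import multiprocessing

import diskcache
from platformdirs import user_cache_dir

CACHE_DIR = user_cache_dir("policyengine.py")


def worker(worker_id: int) -> None:
    # Each process opens its own handle on the shared directory; diskcache
    # coordinates concurrent readers and writers through SQLite.
    with diskcache.Cache(directory=CACHE_DIR) as cache:
        cache.set(f"worker-{worker_id}", b"payload")


if __name__ == "__main__":
    processes = [multiprocessing.Process(target=worker, args=(i,)) for i in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```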
Re-upping because this is realistically a blocker to using policyengine.py for analysis: nobody wants to have 7 copies of the 100 MB microdata polluting their file system.
Fixes #151