Skip to content

feat: scaffolding, caching, EGFR#65

Open
tristan-f-r wants to merge 60 commits intomainfrom
egfr-and-infrastructure
Open

feat: scaffolding, caching, EGFR#65
tristan-f-r wants to merge 60 commits intomainfrom
egfr-and-infrastructure

Conversation

@tristan-f-r
Copy link
Copy Markdown
Contributor

@tristan-f-r tristan-f-r commented Mar 18, 2026

We bundle EGFR along with the rest of the caching infrastructure. Notes:

  • All motivation for the caching system lives under cache/README.md.
  • We removed pra.yaml for now, as the only PRAs are the synthetic data and the ResponseNet data, and soon the DepMap data.
  • The CONTRIBUTING.md file is in Changes to CONTRIBUTING guide #57.
  • directory.py contains unnecessary files from other datasets that were deemed universal.

@tristan-f-r tristan-f-r added the enhancement New feature or request label Mar 18, 2026
@tristan-f-r tristan-f-r changed the title feat: initial scaffolding, EGFR feat: scaffolding, caching, EGFR Mar 18, 2026
Comment thread web/public/favicon.svg Outdated
Copy link
Copy Markdown
Collaborator

@ntalluri ntalluri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a light review of the PR; did not look to hard at the code itself yet. I mostly was gathering ideas on what was happening from the READMEs.

Comment thread configs/scores.yaml
Comment thread tools/README.md
Comment thread .vscode/extensions.json Outdated
Comment thread .devcontainer/devcontainer.json
Comment thread cache/biomart/README.md Outdated
Comment thread README.md Outdated
Comment thread README.md
Comment thread README.md Outdated
Comment thread configs/dmmm.yaml Outdated
Comment thread configs/dmmm.yaml Outdated
tristan-f-r and others added 2 commits March 18, 2026 16:54
Co-authored-by: Neha Talluri <78840540+ntalluri@users.noreply.github.com>
Comment thread cache/README.md Outdated
Comment thread cache/README.md Outdated
@tristan-f-r
Copy link
Copy Markdown
Contributor Author

tristan-f-r commented Apr 21, 2026

I implemented the approach to versioning above, which I'll document, but effectively:

  • If a person wants to just use 'STRING', they say 'STRING/latest' and it will point to STRING v12
  • If a dataset wants to use a specific version of STRING, they say 'STRING/v12'

This is naively implemented by allowing strings in directory that point to other keys.

@tristan-f-r tristan-f-r requested a review from ntalluri April 23, 2026 19:23
Copy link
Copy Markdown
Collaborator

@ntalluri ntalluri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is another review of this PR.

Main issues:

  1. Why are we still erroring when pinned doesn't = cached? I thought we agreed that pinned != cached will be a warning at the last meeting.

  2. What happens when this error above occurs, how is a user supposed to fix it to allow them to download the data and get rid of the error?

  3. having data in directory.py and the dataset specific snakefiles doesn't make sense (and also having the cached files in two places is also confusing). We need to decide to either put everything in directory.py or in the dataset specific snakefiles and have the contributing guide reflect this.

  • How is a user supposed to know when their data is supposed to be used in multiple other datasets? That's a very large expectation for them to know that.
  1. Why is there no config for the egfr dataset collection?

Comment thread .vscode/settings.json
Comment thread cache/biomart/ensg-ensp.xml
Comment thread cache/biomart/README.md Outdated
Comment thread cache/cli.py Outdated
Comment thread cache/cli.py
Comment thread datasets/egfr/scripts/process_input_nodes.py
Comment thread datasets/egfr/scripts/process_input_nodes.py Outdated
Comment thread datasets/egfr/scripts/process_interactome.py
Comment thread datasets/egfr/scripts/process_input_nodes.py Outdated
Comment thread datasets/egfr/README.md
@ntalluri
Copy link
Copy Markdown
Collaborator

Also not sure why the workflow is breaking.

@tristan-f-r
Copy link
Copy Markdown
Contributor Author

tristan-f-r commented Apr 29, 2026

As for (1) and (2), we did decide this! I don't know if I didn't push it or what, but I did completely intend to make that not an error. I forgot to remove the move/copy flag 👍 this has been removed in the non-styling commit after this comment.

@tristan-f-r tristan-f-r requested a review from ntalluri April 30, 2026 01:04
@tristan-f-r
Copy link
Copy Markdown
Contributor Author

tristan-f-r commented Apr 30, 2026

My two notes:

  • How important is it that the 10 value stays 10? (e.g. does 1000 work just as well?) The methodology described in the paper says that it just needs to be a value greater than every other prize.
  • Unless (*) it is the case that we plan to have different algorithm parameter inputs for different dataset collections, I disagree with the methodology that the paper describes for a config per dataset collection, where I would instead believe it to be straightforward to continue as is. Otherwise, unless (*) is true, we directly contradict one of SPRAS's usability goals of running several datasets on several algorithms.

@ntalluri
Copy link
Copy Markdown
Collaborator

ntalluri commented Apr 30, 2026

Based on meeting:

  1. update to have datasets fetch configs either in directory.py or in a snakemake file that is dataset specific (one or the other not both)
  • this idea will be updated in the contributing guide as well
  1. keeping the value of 10 for the prizes for egfr
  • hard to justify but that's just the plan for now
  1. we will have 4 separate dataset collection configs (we can also keep the test-config too if needed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dataset Mutating datasets in any way. enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants