diff --git a/docs/_static/config/intermediate.yaml b/docs/_static/config/intermediate.yaml index 58d1400d..6d93fc2f 100644 --- a/docs/_static/config/intermediate.yaml +++ b/docs/_static/config/intermediate.yaml @@ -100,6 +100,8 @@ datasets: # TODO update this based on the dataset that I set up # Relative path from the spras directory where these files live data_dir: "input" +# TODO: add the gold standard for egfr tutorial; already in SPRAS + reconstruction_settings: # Set where everything is saved @@ -111,9 +113,10 @@ analysis: ml: # ml analysis per dataset include: false # set to true for step 3 + # TODO: can I remove some of these arguments? # adds ml analysis per algorithm output # only runs for algorithms with multiple parameter combinations chosen - aggregate_per_algorithm: false + aggregate_per_algorithm: false # set to true for step 4??? Look at todo below # specify how many principal components to calculate components: 2 # boolean to show the labels on the pca graph @@ -127,6 +130,16 @@ analysis: # the coordinates of the KDE maximum (kde_peak) are also saved to the PCA coordinates output file. # KDE needs to be run in order to select a parameter combination with PCA because the maximum kernel density is used # to pick the 'best' parameter combination. 
- kde: false + kde: false # set to true for step 4 + # TODO: double check that if I run step 3 without kde, then set it true in step 4 that pca is rerun # removes empty pathways from consideration in ml analysis (pca only) - remove_empty_pathways: false + remove_empty_pathways: false # set to true for step 4 + # TODO: double check why this was needed for pca + + evaluation: + # evaluation per dataset-goldstandard pair + # evaluation will not run unless ml include is set to true + include: false # set to true for step 4 + # adds evaluation per algorithm per dataset-goldstandard pair + # evaluation per algorithm will not run unless ml include and ml aggregate_per_algorithm are set to true + aggregate_per_algorithm: false # set to true for step 4 ??? TODO: decide if it is better to demonstrate what it looks like per algorithm diff --git a/docs/tutorial/advanced.rst b/docs/tutorial/advanced.rst index 368d9cb4..b1227391 100644 --- a/docs/tutorial/advanced.rst +++ b/docs/tutorial/advanced.rst @@ -27,6 +27,9 @@ in the configuration file. When executed, SPRAS automatically runs each algorithm across all parameter combinations and collects the resulting subnetworks. +# TODO maybe add in information about how parameter tuning seems to be +done now # add in more details about two stage parameter tuning + SPRAS will also support parameter refinement using graph topological heuristics. These topological metrics help identify parameter regions that produce biologically plausible outputs networks. Based on these @@ -40,186 +43,8 @@ specific outputs for a given dataset. .. note:: - Some grid search features are still under development and will be - added in future SPRAS releases. - -Parameter selection -=================== - -Parameter selection refers to the process of determining which parameter -combinations should be used for evaluation on a gold standard dataset. - -Parameter selection is handled in the evaluation code, which supports -multiple parameter selection strategies. 
Once the grid space search is -complete for each dataset, the user can enable evaluation (by setting -evaluation ``include: true``) and it will run all of the parameter -selection code. - -PCA-based parameter selection ------------------------------ - -The PCA-based approach identifies a representative parameter setting for -each pathway reconstruction algorithm on a given dataset. It selects the -single parameter combination that best captures the central trend of an -algorithm's reconstruction behavior. - -.. image:: ../_static/images/pca-kde.png - :alt: Principal component analysis visualization across pathway outputs with a kernel density estimate computed on top - :width: 600 - :align: center - -.. raw:: html - -
- -For each algorithm, all reconstructed subnetworks are projected into an -algorithm-specific 2D PCA space based on the set of edges produced by -the respective parameter combinations for that algorithm. This -projection summarizes how the algorithm's outputs vary across different -parameter combinations, allowing patterns in the outputs to be -visualized in a lower-dimensional space. - -Within each PCA space, a kernel density estimate (KDE) is computed over -the projected points to identify regions of high density. The output -closest to the highest KDE peak is selected as the most representative -parameter setting, as it corresponds to the region where the algorithm -most consistently produces similar subnetworks. - -Ensemble network-based parameter selection ------------------------------------------- - -The ensemble-based approach combines results from all parameter settings -for each pathway reconstruction algorithm on a given dataset. Instead of -focusing on a single "best" parameter combination, it summarizes the -algorithm's overall reconstruction behavior across parameters. - -All reconstructed subnetworks are merged into algorithm-specific -ensemble networks, where each edge weight reflects how frequently that -interaction appears across the outputs. Edges that occur more often are -assigned higher weights, highlighting interactions that are most -consistently recovered by the algorithm. - -These consensus networks help identify the core patterns and overall -stability of an algorithm's output's without needing to choose a single -parameter setting (no clear optimal parameter combination could exists). - -Ground truth-based evaluation without parameter selection ---------------------------------------------------------- - -The no parameter selection approach chooses all parameter combinations -for each pathway reconstruction algorithm on a given dataset. 
This -approach can be useful for idenitifying patterns in algorithm -performance without favoring any specific parameter setting. - -************ - Evaluation -************ - -In some cases, users may have a gold standard file that allows them to -evaluate the quality of the reconstructed subnetworks generated by -pathway reconstruction algorithms. - -However, gold standards may not exist for certain types of experimental -data where validated ground truth interactions or molecules are -unavailable or incomplete. For example, in emerging research areas or -poorly characterized biological systems, interactions may not yet be -experimentally verified or fully known, making it difficult to define a -reliable reference network for evaluation. - -Adding gold standard datasets and evaluation post analysis a configuration -========================================================================== - -In the configuration file, users can specify one or more gold standard -datasets to evaluate the subnetworks reconstructed from each dataset. -When gold standards are provided and evaluation is enabled (``include: -true``), SPRAS will automatically compare the reconstructed subnetworks -for a specific dataset against the corresponding gold standards. - -.. code:: yaml - - gold_standards: - - - label: gs1 - node_files: ["gs_nodes0.txt", "gs_nodes1.txt"] - data_dir: "input" - dataset_labels: ["data0"] - - - label: gs2 - edge_files: ["gs_edges0.txt"] - data_dir: "input" - dataset_labels: ["data0", "data1"] - - analysis: - evaluation: - include: true - -A gold standard dataset must include the following types of keys and -files: - -- ``label``: a name that uniquely identifies a gold standard dataset - throughout the SPRAS workflow and outputs. -- ``node_file`` or ``edge_file``: A list of node or edge files. Only - one of these can be defined per gold standard dataset. -- ``data_dir``: The file path of the directory where the input gold - standard dataset files are located. 
-- ``dataset_labels``: a list of dataset labels indicating which - datasets this gold standard dataset should be evaluated against. - -When evaluation is enabled, SPRAS will automatically run its built-in -evaluation analysis on each defined dataset-gold standard pair. This -evaluation computes metrics such as precision, recall, and -precision-recall curves, depending on the parameter selection method -used. - -For each pathway, evaluation can be run independently of any parameter -selection method (the ground truth-based evaluation without parameter -selection idea) to directly inspect precision and recall for each -reconstructed network from a given dataset. - -.. image:: ../_static/images/pr-per-pathway-nodes.png - :alt: Precision and recall computed for each pathway and visualized on a scatter plot - :width: 600 - :align: center - -.. raw:: html - - - -Ensemble-based parameter selection generates precision-recall curves by -thresholding on the frequency of edges across an ensemble of -reconstructed networks for an algorithm for given dataset. - -.. image:: ../_static/images/pr-curve-ensemble-nodes-per-algorithm-nodes.png - :alt: Precision-recall curve computed for a single ensemble file / pathway and visualized as a curve - :width: 600 - :align: center - -.. raw:: html - - - -PCA-based parameter selection computes a precision and recall for a -single reconstructed network selected using PCA from all reconstructed -networks for an algorithm for given dataset. - -.. image:: ../_static/images/pr-pca-chosen-pathway-per-algorithm-nodes.png - :alt: Precision and recall computed for each pathway chosen by the PCA-selection method and visualized on a scatter plot - :width: 600 - :align: center - -.. raw:: html - - - -.. note:: - - Evaluation will only execute if ml has ``include: true``, because the - PCA parameter selection step depends on the PCA ML analysis. - -.. 
note:: - - To see evaluation in action, run SPRAS using the config.yaml or - egfr.yaml configuration files. + Grid search features are still under development and will be added in + future SPRAS releases. ********************** HTCondor integration @@ -255,3 +80,10 @@ user to set which SPRAS supported container framework to use: framework: docker The frameworks include Docker, Apptainer/Singularity, or dsub + +*********************** + Benchmarking Datasets +*********************** + +# add this part in # Should link to the benchmarking repo # We are +working on the vision of the live benchmarking website diff --git a/docs/tutorial/beginner.rst b/docs/tutorial/beginner.rst index a0846666..9c7b4ea6 100644 --- a/docs/tutorial/beginner.rst +++ b/docs/tutorial/beginner.rst @@ -50,6 +50,13 @@ Conda environment and install the SPRAS python package: The last command is a one-time installation of the SPRAS package into the environment. +# The problem was that the participant downloaded the beginner config +file into the wrong directory and then the snakemake command failed # +They put it into the spras directory in the conda environment that was +created after the spras package is installed, may need to watch for that +# add a note about the folder called spras within the larger spras +folder + 0.3 Test the installation ========================= @@ -75,6 +82,9 @@ Launch Docker Desktop and wait until it says "Docker is running". isolated containers. These containers include all the necessary dependencies to run each algorithm or post analysis. +# Confusion about why Docker is followed by conda. Need to explain the +interaction between these pieces. 
# add this as a note + ***************************** Step 1: Configuration files ***************************** diff --git a/docs/tutorial/intermediate.rst b/docs/tutorial/intermediate.rst index 1055b879..9c50ba12 100644 --- a/docs/tutorial/intermediate.rst +++ b/docs/tutorial/intermediate.rst @@ -1124,8 +1124,262 @@ algorithms and their parameter settings. Higher similarity values indicate that pathways share many of the same edges, while lower values suggest distinct reconstructions. -References -========== +************************************** + Step 4: Use Evaluation post-analysis +************************************** + +In some cases, users may have a gold standard file that allows them to +evaluate the quality of the reconstructed subnetworks generated by +pathway reconstruction algorithms. + +However, gold standards may not exist for certain types of experimental +data where validated ground truth interactions or molecules are +unavailable or incomplete. For example, in emerging research areas or +poorly characterized biological systems, interactions may not yet be +experimentally verified or fully known, making it difficult to define a +reliable reference network for evaluation. + +Explain two sentence high level how we do evaluation: Parameter +selection + pr or prcs + +4.1 Adding evaluation post-analysis to the intermediate configuration +===================================================================== + +To enable the evaluation, update the analysis section in your +configuration file by setting evaluation to true. TODO ALSO UPDATE THE +OTHER THINGS TOO + +Your analysis section in the configuration file should look like this: + +.. code:: yaml + + analysis: + ml: + include: true + aggregate_per_algorithm: true + ... 
(other parameters preset) + kde: true + remove_empty_pathways: true + + evaluation: + include: true + aggregate_per_algorithm: true + +EXPLAIN WHY WE do each of these - kde we explain in parameter selection +so skip - remove_empty_pathways we do because we don't want to +cluster/kde and choose empty pathways that are the representative, want +to see something - aggregate_per_algorithm we want to see how well each +individual algorithm does on the evaluation instead of the # 1 best or +all outputs treated the same, we want to see how each algorithm is +performing + +What do gold standard datasets look like in a configuration? +------------------------------------------------------------ + +In the configuration file, users can specify one or more gold standard +datasets to evaluate the subnetworks reconstructed from each dataset. +When gold standards are provided and evaluation is enabled (``include: +true``), SPRAS will automatically compare the reconstructed subnetworks +for a specific dataset against the corresponding gold standards. + +.. code:: yaml + + gold_standards: + - + label: gs1 + node_files: ["gs_nodes0.txt", "gs_nodes1.txt"] + data_dir: "input" + dataset_labels: ["data0"] + - + label: gs2 + edge_files: ["gs_edges0.txt"] + data_dir: "input" + dataset_labels: ["data0", "data1"] + + analysis: + evaluation: + include: true + +A gold standard dataset must include the following types of keys and +files: + +- ``label``: a name that uniquely identifies a gold standard dataset + throughout the SPRAS workflow and outputs. +- ``node_files`` or ``edge_files``: A list of node or edge files. Only + one of these can be defined per gold standard dataset. +- ``data_dir``: The file path of the directory where the input gold + standard dataset files are located. +- ``dataset_labels``: a list of dataset labels indicating which + datasets this gold standard dataset should be evaluated against. 
+ +# add a note that gold standard datasets must be defined as nodes or +edges (double check that if the edges are only added if it will run node +and edge) + +add the code thing of what the gold standard looks like for the one we +will be using config + +When evaluation is enabled, SPRAS will automatically run its built-in +evaluation analysis on each defined dataset-gold standard pair. This +evaluation computes metrics such as precision, recall, and +precision-recall curves, depending on the parameter selection method +used. + +4.2 What is parameter selection? +================================ + +Parameter selection refers to the process of determining which parameter +combinations should be used for evaluation on a gold standard dataset. + +Parameter selection is handled in the evaluation code, which supports +multiple parameter selection strategies. Once the grid space search is +complete for each dataset, the user can enable evaluation (by setting +evaluation ``include: true``) and it will run all of the parameter +selection code. + +.. note:: + + Some parameter selection features are still under development and + will be added in future SPRAS releases. + +PCA-based parameter selection +----------------------------- + +The PCA-based approach identifies a representative parameter setting for +each pathway reconstruction algorithm on a given dataset. It selects the +single parameter combination that best captures the central trend of an +algorithm's reconstruction behavior. + +.. image:: ../_static/images/pca-kde.png + :alt: Principal component analysis visualization across pathway outputs with a kernel density estimate computed on top + :width: 600 + :align: center + +.. raw:: html + + + +For each algorithm, all reconstructed subnetworks are projected into an +algorithm-specific 2D PCA space based on the set of edges produced by +the respective parameter combinations for that algorithm. 
This +projection summarizes how the algorithm's outputs vary across different +parameter combinations, allowing patterns in the outputs to be +visualized in a lower-dimensional space. + +Within each PCA space, a kernel density estimate (KDE) is computed over +the projected points to identify regions of high density. The output +closest to the highest KDE peak is selected as the most representative +parameter setting, as it corresponds to the region where the algorithm +most consistently produces similar subnetworks. + +Ensemble network-based parameter selection +------------------------------------------ + +The ensemble-based approach combines results from all parameter settings +for each pathway reconstruction algorithm on a given dataset. Instead of +focusing on a single "best" parameter combination, it summarizes the +algorithm's overall reconstruction behavior across parameters. + +All reconstructed subnetworks are merged into algorithm-specific +ensemble networks, where each edge weight reflects how frequently that +interaction appears across the outputs. Edges that occur more often are +assigned higher weights, highlighting interactions that are most +consistently recovered by the algorithm. + +These consensus networks help identify the core patterns and overall +stability of an algorithm's outputs without needing to choose a single +parameter setting (a clear optimal parameter combination may not exist). + +Ground truth-based evaluation without parameter selection +--------------------------------------------------------- + +# TODO rename this to what it actually is + +The no parameter selection approach uses all parameter combinations +for each pathway reconstruction algorithm on a given dataset. This +approach can be useful for identifying patterns in algorithm +performance without choosing any specific parameter setting. 
+ +# add more details about this/reword this based on what is in the paper + +4.3 Running evaluation post analysis code +========================================= + +With the updates to the intermediate.yaml config, SPRAS will run the +full evaluation across all outputs for a given dataset and give back +results per algorithm. + +After saving the changes in the configuration file, rerun with: + +.. code:: bash + + snakemake --cores 4 --configfile config/intermediate.yaml + +What happens when you run this command +-------------------------------------- + +What your directory structure should look like after this run: +-------------------------------------------------------------- + +4.4 Reviewing the evaluation outputs +==================================== + +MAKE SURE TO UPDATE IMAGES TO WHAT THEY ARE FOR THE EGFR EXAMPLE - add +how to look up each of these images + +For each pathway, evaluation can be run independently of any parameter +selection method (the ground truth-based evaluation without parameter +selection idea) to directly inspect precision and recall for each +reconstructed network from a given dataset. + +.. image:: ../_static/images/pr-per-pathway-nodes.png + :alt: Precision and recall computed for each pathway and visualized on a scatter plot + :width: 600 + :align: center + +.. raw:: html + + + +Ensemble-based parameter selection generates precision-recall curves by +thresholding on the frequency of edges across an ensemble of +reconstructed networks for an algorithm for a given dataset. + +.. image:: ../_static/images/pr-curve-ensemble-nodes-per-algorithm-nodes.png + :alt: Precision-recall curve computed for a single ensemble file / pathway and visualized as a curve + :width: 600 + :align: center + +.. raw:: html + + + +PCA-based parameter selection computes a precision and recall for a +single reconstructed network selected using PCA from all reconstructed +networks for an algorithm for a given dataset. + +.. 
image:: ../_static/images/pr-pca-chosen-pathway-per-algorithm-nodes.png + :alt: Precision and recall computed for each pathway chosen by the PCA-selection method and visualized on a scatter plot + :width: 600 + :align: center + +.. raw:: html + + + +.. note:: + + Evaluation will only execute if ml has ``include: true``, because the + PCA parameter selection step depends on the PCA ML analysis. + +.. note:: + + To see evaluation in action, run SPRAS using the config.yaml or + egfr.yaml configuration files. + +************ + References +************ .. [1]