19 changes: 16 additions & 3 deletions docs/_static/config/intermediate.yaml
@@ -100,6 +100,8 @@ datasets: # TODO update this based on the dataset that I set up
# Relative path from the spras directory where these files live
data_dir: "input"

# TODO: add the gold standard for egfr tutorial; already in SPRAS

reconstruction_settings:

# Set where everything is saved
@@ -111,9 +113,10 @@ analysis:
ml:
# ml analysis per dataset
include: false # set to true for step 3
# TODO: can I remove some of these arguments?
# adds ml analysis per algorithm output
# only runs for algorithms with multiple parameter combinations chosen
- aggregate_per_algorithm: false
+ aggregate_per_algorithm: false # set to true for step 4 (see TODO below)
# specify how many principal components to calculate
components: 2
# boolean to show the labels on the pca graph
@@ -127,6 +130,16 @@ analysis:
# the coordinates of the KDE maximum (kde_peak) are also saved to the PCA coordinates output file.
# KDE needs to be run in order to select a parameter combination with PCA because the maximum kernel density is used
# to pick the 'best' parameter combination.
- kde: false
+ kde: false # set to true for step 4
# TODO: double check that if step 3 runs without kde and kde is then set to true in step 4, PCA is rerun
# removes empty pathways from consideration in ml analysis (pca only)
- remove_empty_pathways: false
+ remove_empty_pathways: false # set to true for step 4
# TODO: double check why this was needed for pca

evaluation:
# evaluation per dataset-goldstandard pair
# evaluation will not run unless ml include is set to true
include: false # set to true for step 4
# adds evaluation per algorithm per dataset-goldstandard pair
# evaluation per algorithm will not run unless ml include and ml aggregate_per_algorithm are set to true
aggregate_per_algorithm: false # set to true for step 4 (TODO: decide whether it is better to demonstrate what the output looks like per algorithm)
192 changes: 12 additions & 180 deletions docs/tutorial/advanced.rst
@@ -27,6 +27,9 @@ in the configuration file. When executed, SPRAS automatically runs each
algorithm across all parameter combinations and collects the resulting
subnetworks.

# TODO: add information about how parameter tuning is currently done
# TODO: add more details about two-stage parameter tuning

SPRAS will also support parameter refinement using graph topological
heuristics. These topological metrics help identify parameter regions
that produce biologically plausible output networks. Based on these
@@ -40,186 +43,8 @@ specific outputs for a given dataset.

.. note::

Some grid search features are still under development and will be
added in future SPRAS releases.

Parameter selection
===================

Parameter selection refers to the process of determining which parameter
combinations should be used for evaluation on a gold standard dataset.

Parameter selection is handled in the evaluation code, which supports
multiple parameter selection strategies. Once the grid search is
complete for each dataset, the user can enable evaluation (by setting
evaluation ``include: true``) and it will run all of the parameter
selection code.

PCA-based parameter selection
-----------------------------

The PCA-based approach identifies a representative parameter setting for
each pathway reconstruction algorithm on a given dataset. It selects the
single parameter combination that best captures the central trend of an
algorithm's reconstruction behavior.

.. image:: ../_static/images/pca-kde.png
:alt: Principal component analysis visualization across pathway outputs with a kernel density estimate computed on top
:width: 600
:align: center

.. raw:: html

<div style="margin:20px 0;"></div>

For each algorithm, all reconstructed subnetworks are projected into an
algorithm-specific 2D PCA space based on the set of edges produced by
the respective parameter combinations for that algorithm. This
projection summarizes how the algorithm's outputs vary across different
parameter combinations, allowing patterns in the outputs to be
visualized in a lower-dimensional space.

Within each PCA space, a kernel density estimate (KDE) is computed over
the projected points to identify regions of high density. The output
closest to the highest KDE peak is selected as the most representative
parameter setting, as it corresponds to the region where the algorithm
most consistently produces similar subnetworks.
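The selection procedure above can be sketched in a few lines. This is an
illustrative reimplementation rather than SPRAS's actual code, and the
input format (a binary matrix with one row per parameter combination and
one column per edge) is an assumption:

.. code:: python

    import numpy as np
    from scipy.stats import gaussian_kde
    from sklearn.decomposition import PCA

    def select_representative(edge_matrix):
        """Pick the run whose 2D PCA projection is closest to the KDE peak.

        edge_matrix: (n_runs, n_edges) binary matrix; entry [i, j] is 1 if
        edge j appears in the subnetwork from parameter combination i.
        """
        coords = PCA(n_components=2).fit_transform(edge_matrix.astype(float))
        kde = gaussian_kde(coords.T)       # density over the 2D projection
        density = kde(coords.T)            # density at each run's point
        peak = coords[np.argmax(density)]  # approximate KDE maximum
        # index of the run lying closest to the region of highest density
        return int(np.argmin(np.linalg.norm(coords - peak, axis=1)))

Runs whose outputs resemble many other runs sit in a dense region of the
projection, so the returned index tends to come from the largest cluster
of similar subnetworks.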

Ensemble network-based parameter selection
------------------------------------------

The ensemble-based approach combines results from all parameter settings
for each pathway reconstruction algorithm on a given dataset. Instead of
focusing on a single "best" parameter combination, it summarizes the
algorithm's overall reconstruction behavior across parameters.

All reconstructed subnetworks are merged into algorithm-specific
ensemble networks, where each edge weight reflects how frequently that
interaction appears across the outputs. Edges that occur more often are
assigned higher weights, highlighting interactions that are most
consistently recovered by the algorithm.

These consensus networks help identify the core patterns and overall
stability of an algorithm's outputs without needing to choose a single
parameter setting (useful when no clearly optimal parameter combination exists).
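A minimal sketch of how such an ensemble could be assembled (an
illustration, not SPRAS's implementation; the edge-set input format is an
assumption):

.. code:: python

    from collections import Counter

    def build_ensemble(subnetworks):
        """subnetworks: list of edge sets, one per parameter combination.

        Returns a dict mapping each edge to the fraction of runs in which
        it appears -- its frequency-based weight in the ensemble network.
        """
        counts = Counter(edge for edges in subnetworks for edge in edges)
        n_runs = len(subnetworks)
        return {edge: count / n_runs for edge, count in counts.items()}

For example, an edge present in every run receives weight 1.0, while one
recovered in only half of the runs receives 0.5.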

Ground truth-based evaluation without parameter selection
---------------------------------------------------------

The no-parameter-selection approach keeps all parameter combinations
for each pathway reconstruction algorithm on a given dataset. This
approach can be useful for identifying patterns in algorithm
performance without favoring any specific parameter setting.

************
Evaluation
************

In some cases, users may have a gold standard file that allows them to
evaluate the quality of the reconstructed subnetworks generated by
pathway reconstruction algorithms.

However, gold standards may not exist for certain types of experimental
data where validated ground truth interactions or molecules are
unavailable or incomplete. For example, in emerging research areas or
poorly characterized biological systems, interactions may not yet be
experimentally verified or fully known, making it difficult to define a
reliable reference network for evaluation.

Adding gold standard datasets and evaluation post-analysis to a configuration
=============================================================================

In the configuration file, users can specify one or more gold standard
datasets to evaluate the subnetworks reconstructed from each dataset.
When gold standards are provided and evaluation is enabled (``include:
true``), SPRAS will automatically compare the reconstructed subnetworks
for a specific dataset against the corresponding gold standards.

.. code:: yaml

gold_standards:
-
label: gs1
node_files: ["gs_nodes0.txt", "gs_nodes1.txt"]
data_dir: "input"
dataset_labels: ["data0"]
-
label: gs2
edge_files: ["gs_edges0.txt"]
data_dir: "input"
dataset_labels: ["data0", "data1"]

analysis:
evaluation:
include: true

A gold standard dataset must include the following types of keys and
files:

- ``label``: a name that uniquely identifies a gold standard dataset
throughout the SPRAS workflow and outputs.
- ``node_files`` or ``edge_files``: a list of node or edge files. Only
  one of these can be defined per gold standard dataset.
- ``data_dir``: The file path of the directory where the input gold
standard dataset files are located.
- ``dataset_labels``: a list of dataset labels indicating which
datasets this gold standard dataset should be evaluated against.

When evaluation is enabled, SPRAS will automatically run its built-in
evaluation analysis on each defined dataset-gold standard pair. This
evaluation computes metrics such as precision, recall, and
precision-recall curves, depending on the parameter selection method
used.

For each pathway, evaluation can also be run independently of any parameter
selection method (the ground truth-based evaluation without parameter
selection described above) to directly inspect precision and recall for
each reconstructed network from a given dataset.
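For sets of predicted and gold standard nodes (or edges), the per-pathway
metrics reduce to the standard definitions; a hedged sketch:

.. code:: python

    def precision_recall(predicted, gold):
        """predicted, gold: sets of nodes or edges.

        Returns (precision, recall); empty inputs yield 0.0 rather than
        dividing by zero.
        """
        true_positives = len(predicted & gold)
        precision = true_positives / len(predicted) if predicted else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        return precision, recall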

.. image:: ../_static/images/pr-per-pathway-nodes.png
:alt: Precision and recall computed for each pathway and visualized on a scatter plot
:width: 600
:align: center

.. raw:: html

<div style="margin:20px 0;"></div>

Ensemble-based parameter selection generates precision-recall curves by
thresholding on the frequency of edges across an ensemble of
reconstructed networks for an algorithm on a given dataset.
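The thresholding idea can be sketched as follows (illustrative only;
SPRAS's actual evaluation code may differ, and the dictionary input format
is an assumption):

.. code:: python

    def pr_curve(edge_freqs, gold):
        """edge_freqs: {edge: ensemble frequency}; gold: set of true edges.

        Sweeps a threshold over the observed frequencies, highest first,
        recording a (precision, recall) point at each threshold. Assumes
        gold is non-empty.
        """
        points = []
        for threshold in sorted(set(edge_freqs.values()), reverse=True):
            kept = {e for e, f in edge_freqs.items() if f >= threshold}
            tp = len(kept & gold)
            points.append((tp / len(kept), tp / len(gold)))
        return points

Lowering the threshold admits less frequently recovered edges, trading
precision for recall and tracing out the curve.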

.. image:: ../_static/images/pr-curve-ensemble-nodes-per-algorithm-nodes.png
:alt: Precision-recall curve computed for a single ensemble file / pathway and visualized as a curve
:width: 600
:align: center

.. raw:: html

<div style="margin:20px 0;"></div>

PCA-based parameter selection computes a precision and recall for a
single reconstructed network selected using PCA from all reconstructed
networks for an algorithm on a given dataset.

.. image:: ../_static/images/pr-pca-chosen-pathway-per-algorithm-nodes.png
:alt: Precision and recall computed for each pathway chosen by the PCA-selection method and visualized on a scatter plot
:width: 600
:align: center

.. raw:: html

<div style="margin:20px 0;"></div>

.. note::

Evaluation will only execute if ml has ``include: true``, because the
PCA parameter selection step depends on the PCA ML analysis.

.. note::

To see evaluation in action, run SPRAS using the config.yaml or
egfr.yaml configuration files.
Grid search features are still under development and will be added in
future SPRAS releases.

**********************
HTCondor integration
@@ -255,3 +80,10 @@ user to set which SPRAS supported container framework to use:
framework: docker

The supported frameworks are Docker, Apptainer/Singularity, and dsub.

***********************
Benchmarking Datasets
***********************

# TODO: add this section
# TODO: link to the benchmarking repo
# We are working on the vision of a live benchmarking website
10 changes: 10 additions & 0 deletions docs/tutorial/beginner.rst
@@ -50,6 +50,13 @@ Conda environment and install the SPRAS python package:
The last command is a one-time installation of the SPRAS package into
the environment.

# Problem observed: a participant downloaded the beginner config file
# into the wrong directory and the snakemake command failed. They put it
# into the spras directory inside the conda environment created after the
# spras package is installed; watch for that.
# TODO: add a note about the folder called spras within the larger spras
# folder

0.3 Test the installation
=========================

@@ -75,6 +82,9 @@ Launch Docker Desktop and wait until it says "Docker is running".
isolated containers. These containers include all the necessary
dependencies to run each algorithm or post analysis.

# TODO: participants were confused about why Docker is needed alongside
# conda; explain the interaction between these pieces, as a note

*****************************
Step 1: Configuration files
*****************************