19 changes: 16 additions & 3 deletions docs/_static/config/intermediate.yaml
@@ -100,6 +100,8 @@ datasets: # TODO update this based on the dataset that I set up
# Relative path from the spras directory where these files live
data_dir: "input"

# TODO: add the gold standard for egfr tutorial; already in SPRAS

reconstruction_settings:

# Set where everything is saved
@@ -111,9 +113,10 @@ analysis:
ml:
# ml analysis per dataset
include: false # set to true for step 3
# TODO: can I remove some of these arguments?
# adds ml analysis per algorithm output
# only runs for algorithms with multiple parameter combinations chosen
- aggregate_per_algorithm: false
+ aggregate_per_algorithm: false # set to true for step 4 (see TODO below)
# specify how many principal components to calculate
components: 2
# boolean to show the labels on the pca graph
@@ -127,6 +130,16 @@ analysis:
# the coordinates of the KDE maximum (kde_peak) are also saved to the PCA coordinates output file.
# KDE needs to be run in order to select a parameter combination with PCA because the maximum kernel density is used
# to pick the 'best' parameter combination.
- kde: false
+ kde: false # set to true for step 4
# TODO: double check that if step 3 runs without kde and kde is then set to true in step 4, PCA is rerun
# removes empty pathways from consideration in ml analysis (pca only)
- remove_empty_pathways: false
+ remove_empty_pathways: false # set to true for step 4
# TODO: double check why this was needed for pca

evaluation:
# evaluation per dataset-goldstandard pair
# evaluation will not run unless ml include is set to true
include: false # set to true for step 4
# adds evaluation per algorithm per dataset-goldstandard pair
# evaluation per algorithm will not run unless ml include and ml aggregate_per_algorithm are set to true
aggregate_per_algorithm: false # set to true for step 4 (TODO: decide whether it is better to demonstrate what the output looks like per algorithm)
192 changes: 12 additions & 180 deletions docs/tutorial/advanced.rst
@@ -27,6 +27,9 @@ in the configuration file. When executed, SPRAS automatically runs each
algorithm across all parameter combinations and collects the resulting
subnetworks.

# TODO: add information about how parameter tuning is currently done
# TODO: add more details about two-stage parameter tuning

SPRAS will also support parameter refinement using graph topological
heuristics. These topological metrics help identify parameter regions
that produce biologically plausible output networks. Based on these
@@ -40,186 +43,8 @@ specific outputs for a given dataset.

.. note::

Some grid search features are still under development and will be
added in future SPRAS releases.

Parameter selection
===================

Parameter selection refers to the process of determining which parameter
combinations should be used for evaluation on a gold standard dataset.

Parameter selection is handled in the evaluation code, which supports
multiple parameter selection strategies. Once the grid search is
complete for each dataset, the user can enable evaluation (by setting
evaluation ``include: true``) and it will run all of the parameter
selection code.

PCA-based parameter selection
-----------------------------

The PCA-based approach identifies a representative parameter setting for
each pathway reconstruction algorithm on a given dataset. It selects the
single parameter combination that best captures the central trend of an
algorithm's reconstruction behavior.

.. image:: ../_static/images/pca-kde.png
:alt: Principal component analysis visualization across pathway outputs with a kernel density estimate computed on top
:width: 600
:align: center

.. raw:: html

<div style="margin:20px 0;"></div>

For each algorithm, all reconstructed subnetworks are projected into an
algorithm-specific 2D PCA space based on the set of edges produced by
the respective parameter combinations for that algorithm. This
projection summarizes how the algorithm's outputs vary across different
parameter combinations, allowing patterns in the outputs to be
visualized in a lower-dimensional space.

Within each PCA space, a kernel density estimate (KDE) is computed over
the projected points to identify regions of high density. The output
closest to the highest KDE peak is selected as the most representative
parameter setting, as it corresponds to the region where the algorithm
most consistently produces similar subnetworks.
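The selection procedure above can be sketched in a few lines. This is an
illustrative reimplementation rather than SPRAS's actual code, and the
input format (a binary matrix with one row per parameter combination and
one column per edge) is an assumption:

.. code:: python

    import numpy as np
    from scipy.stats import gaussian_kde
    from sklearn.decomposition import PCA

    def select_representative(edge_matrix):
        """Pick the run whose 2D PCA projection is closest to the KDE peak.

        edge_matrix: (n_runs, n_edges) binary matrix; entry [i, j] is 1 if
        edge j appears in the subnetwork from parameter combination i.
        """
        coords = PCA(n_components=2).fit_transform(edge_matrix.astype(float))
        kde = gaussian_kde(coords.T)       # density over the 2D projection
        density = kde(coords.T)            # density at each run's point
        peak = coords[np.argmax(density)]  # approximate KDE maximum
        # index of the run lying closest to the region of highest density
        return int(np.argmin(np.linalg.norm(coords - peak, axis=1)))

Runs whose outputs resemble many other runs sit in a dense region of the
projection, so the returned index tends to come from the largest cluster
of similar subnetworks.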

Ensemble network-based parameter selection
------------------------------------------

The ensemble-based approach combines results from all parameter settings
for each pathway reconstruction algorithm on a given dataset. Instead of
focusing on a single "best" parameter combination, it summarizes the
algorithm's overall reconstruction behavior across parameters.

All reconstructed subnetworks are merged into algorithm-specific
ensemble networks, where each edge weight reflects how frequently that
interaction appears across the outputs. Edges that occur more often are
assigned higher weights, highlighting interactions that are most
consistently recovered by the algorithm.

These consensus networks help identify the core patterns and overall
stability of an algorithm's outputs without needing to choose a single
parameter setting (useful when no clearly optimal parameter combination exists).
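A minimal sketch of how such an ensemble could be assembled (an
illustration, not SPRAS's implementation; the edge-set input format is an
assumption):

.. code:: python

    from collections import Counter

    def build_ensemble(subnetworks):
        """subnetworks: list of edge sets, one per parameter combination.

        Returns a dict mapping each edge to the fraction of runs in which
        it appears -- its frequency-based weight in the ensemble network.
        """
        counts = Counter(edge for edges in subnetworks for edge in edges)
        n_runs = len(subnetworks)
        return {edge: count / n_runs for edge, count in counts.items()}

For example, an edge present in every run receives weight 1.0, while one
recovered in only half of the runs receives 0.5.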

Ground truth-based evaluation without parameter selection
---------------------------------------------------------

The no-parameter-selection approach keeps all parameter combinations
for each pathway reconstruction algorithm on a given dataset. This
approach can be useful for identifying patterns in algorithm
performance without favoring any specific parameter setting.

************
Evaluation
************

In some cases, users may have a gold standard file that allows them to
evaluate the quality of the reconstructed subnetworks generated by
pathway reconstruction algorithms.

However, gold standards may not exist for certain types of experimental
data where validated ground truth interactions or molecules are
unavailable or incomplete. For example, in emerging research areas or
poorly characterized biological systems, interactions may not yet be
experimentally verified or fully known, making it difficult to define a
reliable reference network for evaluation.

Adding gold standard datasets and evaluation post-analysis to a configuration
=============================================================================

In the configuration file, users can specify one or more gold standard
datasets to evaluate the subnetworks reconstructed from each dataset.
When gold standards are provided and evaluation is enabled (``include:
true``), SPRAS will automatically compare the reconstructed subnetworks
for a specific dataset against the corresponding gold standards.

.. code:: yaml

gold_standards:
-
label: gs1
node_files: ["gs_nodes0.txt", "gs_nodes1.txt"]
data_dir: "input"
dataset_labels: ["data0"]
-
label: gs2
edge_files: ["gs_edges0.txt"]
data_dir: "input"
dataset_labels: ["data0", "data1"]

analysis:
evaluation:
include: true

A gold standard dataset must include the following types of keys and
files:

- ``label``: a name that uniquely identifies a gold standard dataset
throughout the SPRAS workflow and outputs.
- ``node_files`` or ``edge_files``: a list of node or edge files. Only
  one of these can be defined per gold standard dataset.
- ``data_dir``: The file path of the directory where the input gold
standard dataset files are located.
- ``dataset_labels``: a list of dataset labels indicating which
datasets this gold standard dataset should be evaluated against.

When evaluation is enabled, SPRAS will automatically run its built-in
evaluation analysis on each defined dataset-gold standard pair. This
evaluation computes metrics such as precision, recall, and
precision-recall curves, depending on the parameter selection method
used.

For each pathway, evaluation can also be run independently of any parameter
selection method (the ground truth-based evaluation without parameter
selection described above) to directly inspect precision and recall for
each reconstructed network from a given dataset.
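For sets of predicted and gold standard nodes (or edges), the per-pathway
metrics reduce to the standard definitions; a hedged sketch:

.. code:: python

    def precision_recall(predicted, gold):
        """predicted, gold: sets of nodes or edges.

        Returns (precision, recall); empty inputs yield 0.0 rather than
        dividing by zero.
        """
        true_positives = len(predicted & gold)
        precision = true_positives / len(predicted) if predicted else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        return precision, recall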

.. image:: ../_static/images/pr-per-pathway-nodes.png
:alt: Precision and recall computed for each pathway and visualized on a scatter plot
:width: 600
:align: center

.. raw:: html

<div style="margin:20px 0;"></div>

Ensemble-based parameter selection generates precision-recall curves by
thresholding on the frequency of edges across an ensemble of
reconstructed networks for an algorithm on a given dataset.
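The thresholding idea can be sketched as follows (illustrative only;
SPRAS's actual evaluation code may differ, and the dictionary input format
is an assumption):

.. code:: python

    def pr_curve(edge_freqs, gold):
        """edge_freqs: {edge: ensemble frequency}; gold: set of true edges.

        Sweeps a threshold over the observed frequencies, highest first,
        recording a (precision, recall) point at each threshold. Assumes
        gold is non-empty.
        """
        points = []
        for threshold in sorted(set(edge_freqs.values()), reverse=True):
            kept = {e for e, f in edge_freqs.items() if f >= threshold}
            tp = len(kept & gold)
            points.append((tp / len(kept), tp / len(gold)))
        return points

Lowering the threshold admits less frequently recovered edges, trading
precision for recall and tracing out the curve.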

.. image:: ../_static/images/pr-curve-ensemble-nodes-per-algorithm-nodes.png
:alt: Precision-recall curve computed for a single ensemble file / pathway and visualized as a curve
:width: 600
:align: center

.. raw:: html

<div style="margin:20px 0;"></div>

PCA-based parameter selection computes a precision and recall for a
single reconstructed network selected using PCA from all reconstructed
networks for an algorithm on a given dataset.

.. image:: ../_static/images/pr-pca-chosen-pathway-per-algorithm-nodes.png
:alt: Precision and recall computed for each pathway chosen by the PCA-selection method and visualized on a scatter plot
:width: 600
:align: center

.. raw:: html

<div style="margin:20px 0;"></div>

.. note::

Evaluation will only execute if ml has ``include: true``, because the
PCA parameter selection step depends on the PCA ML analysis.

.. note::

To see evaluation in action, run SPRAS using the config.yaml or
egfr.yaml configuration files.
Grid search features are still under development and will be added in
future SPRAS releases.

**********************
HTCondor integration
@@ -255,3 +80,10 @@ user to set which SPRAS supported container framework to use:
framework: docker

The supported frameworks are Docker, Apptainer/Singularity, and dsub.

***********************
Benchmarking Datasets
***********************

# TODO: add this section
# TODO: link to the benchmarking repo
# We are working on the vision of a live benchmarking website
10 changes: 10 additions & 0 deletions docs/tutorial/beginner.rst
@@ -50,6 +50,13 @@ Conda environment and install the SPRAS python package:
The last command is a one-time installation of the SPRAS package into
the environment.

# Problem observed: a participant downloaded the beginner config file
# into the wrong directory and the snakemake command failed. They put it
# into the spras directory inside the conda environment created after the
# spras package is installed; watch for that.
# TODO: add a note about the folder called spras within the larger spras
# folder

0.3 Test the installation
=========================

@@ -75,6 +82,9 @@ Launch Docker Desktop and wait until it says "Docker is running".
isolated containers. These containers include all the necessary
dependencies to run each algorithm or post analysis.

# TODO: participants were confused about why Docker is needed alongside
# conda; explain the interaction between these pieces, as a note

*****************************
Step 1: Configuration files
*****************************