Silverman-Lab/mutt

Silverman Lab's Microbial Scale Data Repository

MUTT: Microbiome Utilizing Total Taxa


Authors: Maxwell Konnaris, Justin Silverman MD PhD

Contributors: Manan Saxena

Maintainers: Maxwell Konnaris, Justin Silverman MD PhD

Affiliation: Pennsylvania State University


Download and install

Before downloading, make sure Git LFS is installed.

# HTTPS (you’ll be prompted for a username and your PAT as password)
git clone https://github.com/Silverman-Lab/mutt.git

# Or via SSH (if you’ve added your SSH key to GitHub)
git clone git@github.com:Silverman-Lab/mutt.git

and then in R:

# Check working directory:
getwd()

# Set the working directory to the cloned repository:
setwd("path/to/mutt")

# Either:
# load the package into your session without installing:
devtools::load_all(".")

# Or, to build & install it so it behaves like any other installed package:
devtools::install_local(".", dependencies = TRUE)
# (or: remotes::install_local(".", dependencies = TRUE))

Contributing to the Repository and Description of Content:

Notes

Format to Maintain

Note on file paths

parse.R scripts should specify file paths relative to the root mutt directory.

Parsed Data Structure for parse.R scripts.

Each parse.R should define a single function named parse_[name of directory] (all lowercase) that returns a list object with the elements below. The function must not require any arguments, though it may accept optional ones. Ideally, parse scripts use only base R or tidyverse functions, to minimize dependencies and avoid errors when certain libraries are not installed.

  • counts: integer-valued count matrix that is (N x D), with sampleIDs as rows and sequenceIDs (e.g., taxaIDs) as columns. Should contain a key column of sampleIDs linking to proportions, scale, and metadata.
  • proportions: real-valued proportion matrix that is (N x D), with sampleIDs as row names and sequenceIDs (e.g., taxaIDs) as column names. Should contain a key column of sampleIDs linking to counts, scale, and metadata.
  • scale: positive-valued matrix, usually of dimension N x 1. Other layouts may be needed when a study reports a mean and sd, or measures total scale with multiple techniques. Should contain a key column of sampleIDs linking to counts, proportions, and metadata.
  • metadata (optional but often required): N x Q data.frame. Should contain a key column of sampleIDs linking to counts and proportions.
  • tax (optional): D x ?, character-valued data.frame with sequenceIDs as rownames and each column labeled in a meaningful way. For microbiome data these labels should be limited to c("Kingdom", "Phylum", "Class", "Order", "Genus", "Species", "Strain"). The "Taxa" column holds the lowest identified taxonomic classification, written as a level prefix followed by the classified name. If a sequence is unclassified at the lowest taxonomic resolution, the label is prefixed with uc_ and then the prefix of the level that was classified, e.g. for phylum: uc_p_[taxa classification name]. The "Sequence" column should additionally serve as the tax rownames and link sequenceIDs (ASVs, OTUs, classified taxa) to the actual raw sequence (e.g., the 16S sequence of a particular taxon).
  • phylo (optional): phylogenetic tree stored in a reasonable format (let me know if any repos have phylogenetic trees in them and I will figure out a good standard format).
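The "Taxa" labelling rule in the tax bullet can be sketched roughly as follows. This is an illustration only, not code from the package: the helper name make_taxa_label, the prefixes other than p_, and the "unclassified" fallback are all assumptions.

```r
# Hypothetical helper: build a "Taxa" label from the deepest classified rank.
# Only the "uc_p_" pattern comes from the text above; other prefixes are assumed.
make_taxa_label <- function(tax, prefixes) {
  ranks <- intersect(names(prefixes), colnames(tax))  # ranks present, highest to lowest
  lowest <- ranks[length(ranks)]
  vapply(seq_len(nrow(tax)), function(i) {
    row <- vapply(ranks, function(r) as.character(tax[i, r]), character(1))
    known <- which(!is.na(row) & nzchar(row))
    if (length(known) == 0) return("unclassified")     # assumed fallback
    deepest <- ranks[max(known)]
    label <- paste0(prefixes[[deepest]], row[[deepest]])
    # Mark labels that stop short of the lowest rank as unclassified at that rank.
    if (deepest != lowest) label <- paste0("uc_", label)
    label
  }, character(1))
}

tax <- data.frame(
  Kingdom = c("Bacteria", "Bacteria"),
  Phylum  = c("Firmicutes", "Bacteroidota"),
  Genus   = c(NA, "Bacteroides"),
  row.names = c("ASV1", "ASV2"),
  stringsAsFactors = FALSE
)
prefixes <- c(Kingdom = "k_", Phylum = "p_", Genus = "g_")  # assumed prefix scheme
tax$Taxa <- make_taxa_label(tax, prefixes)
print(tax$Taxa)
```

Here ASV1 is classified only down to phylum, so it receives a uc_p_ label, while ASV2 is classified at the lowest rank supplied and keeps a plain prefixed label.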
# ----- Example Shotgun Metagenomics Study -----
return(list(
    counts = list(
      original = counts_original,
      reprocessed = list(
          mOTU3 = mOTU3_counts,
          MetaPhlAn4 = MetaPhlAn4_counts
      )
    ),
    proportions = list(
      original = proportions_original,
      reprocessed = list(
          mOTU3 = mOTU3_proportions,
          MetaPhlAn4 = MetaPhlAn4_proportions
      )
    ),
    tax = list(
      original = tax_original,
      reprocessed = list(
          mOTU3 = mOTU3_tax,
          MetaPhlAn4 = MetaPhlAn4_tax
      )
    ),
    scale = scale,
    metadata = metadata,
    phylo = NA
))

# ---- Example amplicon study ----
return(list(
    counts = list(
        original = counts_original,
        reprocessed = counts_reprocessed
    ),
    tax = list(
        original = tax_original,
        reprocessed = tax_reprocessed
    ),
    proportions = list(
        original = proportions_original,
        reprocessed = proportions_reprocessed
    ),
    metadata = metadata,
    scale = scale,
    phylo = NA
))
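
As a rough illustration of the return structure above, a minimal parser might look like the sketch below. The directory name 2099_example_journal_topic, the SampleID and total_load column names, and the use of NA for a missing reprocessed branch are assumptions made for the example. A real parse.R would read the study's files instead of generating toy values.

```r
# Hypothetical parser for a made-up directory "2099_example_journal_topic".
# Toy values are generated here so the sketch runs on its own.
parse_2099_example_journal_topic <- function(n_samples = 3) {  # optional argument only
  sample_ids <- paste0("S", seq_len(n_samples))
  seq_ids <- c("ASV1", "ASV2")

  # Integer-valued N x D count matrix: samples in rows, sequences in columns.
  counts <- matrix(
    seq_len(n_samples * length(seq_ids)),
    nrow = n_samples,
    dimnames = list(sample_ids, seq_ids)
  )
  # Row-normalise so each sample's proportions sum to 1.
  proportions <- counts / rowSums(counts)

  # "SampleID" and "total_load" are assumed column names.
  scale <- data.frame(SampleID = sample_ids,
                      total_load = unname(rowSums(counts)) * 1e6,
                      stringsAsFactors = FALSE)
  metadata <- data.frame(SampleID = sample_ids,
                         group = rep_len(c("control", "treated"), n_samples),
                         stringsAsFactors = FALSE)
  tax <- data.frame(Kingdom = c("Bacteria", "Bacteria"),
                    Phylum = c("Firmicutes", "Bacteroidota"),
                    row.names = seq_ids,
                    stringsAsFactors = FALSE)

  list(
    counts = list(original = counts, reprocessed = NA),
    proportions = list(original = proportions, reprocessed = NA),
    tax = list(original = tax, reprocessed = NA),
    scale = scale,
    metadata = metadata,
    phylo = NA
  )
}

parsed <- parse_2099_example_journal_topic()
str(parsed, max.level = 2)
```

The sample IDs appear both as row names of the matrices and as a key column in scale and metadata, which is what lets the elements be linked to one another.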

Helper Scripts:

  • inst/python/obtainpublicationinfo_pmid.py: given a list of PMIDs, obtains the manuscript information from NCBI. This functionality can be integrated into each parser. (The script works, but it is written in Python while each parse script is in R.)
  • zip-push-gitlfs.sh: run from a terminal in the repository directory when you are ready to push. It compresses your files to .zip and uploads them with Git LFS.
  • inst/python/loadRDataintopython.ipynb: example using rpy2 to load the RData object (without .pkl) returned from R/microbialscalerepo.R into Python.

Usage and Functionality (so far):

Main function for mutt package:

  • R/microbialscalerepo.R: function that calls the parse scripts (optionally for a selection of individual studies) and optionally stores the result in an .RData object.
    • By default it tries all parse.R scripts. Not all parse.R scripts are finished yet, so warnings will appear.
# Choose whichever datasets you want and supply them as a named vector like this, or as a plain vector of the repo directories:
study_parsers <- c(
    Vandeputte2021 = "2021_vandeputte_naturecommunications_flow_timeseries",
    CvandeVelde2022 = "2022_cvandevelde_ismecommunications_culturedflowhumanfecal",
    Vandeputte2017 = "2017_vandeputte_nature_flow",
    Pereira2023 = "2023_pereira_nature_nervous",
    Krawczyk2022 = "2022_krawczyk_microbiome_tickgeographicaldistributionqpcr",
    Liao2021 = "2021_liao_scientificdata_longitudinalmicrobiomeqpcr_allohct",
    Stammler2016 = "2016_stammler_microbiome_micehuman",
    Dreier2022 = "2022_dreier_bmcmicrobiology_cheeseqpcr",
    GALAXY = "2024_nishijima_cell_galaxy",
    MetaCardis = "2022_fromentin_naturemedicine_metacardissubset",
    Marotz2021 = "2021_marotz_mSystems_oral_mouthwash",
    Vieira_Silva2019 = "2019_vieirasilva_naturemicrobiology_pscibd",
    Contijoch2019 = "2019_contijoch_elife_multispeciesqPCRshotgunandamplicon",
    Tunsakul2024 = "2024_tunsakul_peerj_aerobicvsanaerobicinhealthyvsobesity",
    Alessandri2024 = "2024_alessandri_microbbiotechnology_pcosvaginalmicrobiota",
    Maghini2023 = "2023_maghini_naturebiotechnology_samplemesurement",
    Garcia_Martinez2024 = "2024_garciamartinez_bmcmicrobiology_ckdanddysbiosiswithserum",
    Sternes2024 = "2024_sternes_frontmicrobiol_IBDppiqPCR",
    Rao2021 = "2021_rao_nature_mkspikeseqmetagenomicmultiplescalequantification",
    Tettamanti_Boshier2020 = "2020_tettamantiboshier_msystems_vaginaltimeseries",
    Kruger2024 = "2024_kruger_scientificreports_ddpcrhealthysubjects",
    Liu2017 = "2017_liu_mbio_penilehivqPCR",
    Fu2023 = "2023_fu_imeta_wasterwater_pathogens",
    Jin2022 = "2022_jin_natureComm_technicalReplicates",
    Zaramela2022 = "2022_zaramela_msystems_synDNA",
    Feng2023 = "2023_feng_imetawiley_chickensegment",
    Reese2022 = "2021_reese_cell_chimpanzee",
    Barlow2020 = "2020_barlow_naturecommunications_miceGI",
    Morton2019 = "2019_morton_naturecommunications_songbird_oral",
    Prochazkova2024 = "2024_prochazkova_naturemicrobiology_longitudinalhealthyflowfecal",
    Zemb2020 = "2020_zemb_microOpen_spike",
    Jin2024 = "2024_jin_pnas_semen",
    Galazzo2020 = "2020_galazzo_frontiersincellularandinfectionmicrobiology_flowqPCRddPCRhealthy",
    Lin2019 = "2019_lin_applenvironmicrobiol_16s18smarineecologyflowandspikein",
    Suriano2022 = "2022_suriano_aps_micefecal",
    Thiruppathy2025 = "2025_thiruppathy_microbiome_relicDNAflow",
    Wagner2025 = "2025_wagner_frontiersinmiccrobiology_flowpiglets",
    Kallastu2023 = "2023_kallastu_research_foodscience_food"
)

# Run the repo function
repo <- mutt(
  studies = study_parsers, # If not supplied, defaults to all studies
  base_directory = ".", # Default; change to the location of the mutt directory, as a path relative to getwd() or an absolute path
  rawdata = FALSE, # Leave FALSE in most cases; if TRUE, returns the original, uncleaned, un-reformatted data
  align_samples = TRUE, # If TRUE, aligns the matrices to the scale data frame so all sample data line up (where possible)
  save_to = "datasetsfromrepo.RData", # OPTIONAL: save an RData object of all the studies you chose
  verbose = TRUE # If TRUE, display the structure of the returned datasets
)
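
To give a feel for what aligning samples to the scale data frame involves, here is a minimal sketch. It is not the package's implementation of align_samples. The helper name align_to_scale and the SampleID key column are assumptions.

```r
# Hypothetical helper: keep only samples present in both objects, in the
# order they appear in the scale data frame.
align_to_scale <- function(mat, scale, id_col = "SampleID") {
  shared <- intersect(scale[[id_col]], rownames(mat))
  list(
    mat = mat[shared, , drop = FALSE],
    scale = scale[match(shared, scale[[id_col]]), , drop = FALSE]
  )
}

counts <- matrix(1:6, nrow = 3,
                 dimnames = list(c("S3", "S1", "S2"), c("ASV1", "ASV2")))
scale <- data.frame(SampleID = c("S1", "S2", "S4"),
                    total_load = c(1e9, 2e9, 3e9),
                    stringsAsFactors = FALSE)

aligned <- align_to_scale(counts, scale)
rownames(aligned$mat)    # "S1" "S2"
aligned$scale$SampleID   # "S1" "S2"
```

Samples missing from either side (S3 and S4 here) are dropped, which is one plausible reading of "if it can be" in the comment above.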

This will give you three saved files:

  • *.RData object containing all of the parsed, formatted data from each study
  • *_validation.RData containing the validated structure of the parsed data from each study in the .RData object
  • *_validation_summary.txt containing a skeleton summary of the structure of the .RData object; if verbose = TRUE, this is also printed to the console. See the example below:
Study: Vandeputte2017
Structure:
  counts: LIST={original = LIST={rdp19 = DATAFRAME(135 x 193), rdp16 = DATAFRAME(135 x 140)}, nishijima2024 = DATAFRAME(95 x 166), reprocessed = LIST={rdp19 = DATAFRAME(135 x 334), rdp16 = DATAFRAME(135 x 222)}}
  proportions: LIST={original = LIST={rdp19 = DATAFRAME(135 x 193), rdp16 = DATAFRAME(135 x 140)}, nishijima2024 = DATAFRAME(95 x 166), reprocessed = LIST={rdp19 = DATAFRAME(135 x 334), rdp16 = DATAFRAME(135 x 222)}}
  tax: LIST={original = LIST={rdp19 = DATAFRAME(2041 x 14), rdp16 = DATAFRAME(2041 x 7)}, nishijima2024 = DATAFRAME(166 x 1), reprocessed = LIST={rdp19 = DATAFRAME(5173 x 7), rdp16 = DATAFRAME(5173 x 7)}}
  scale: DATAFRAME(321 x 11)
  metadata: DATAFRAME(321 x 56)
  studydemographics: LIST={sequencingtype = VECTOR(length=1), loadtype = VECTOR(length=1), organismtype = VECTOR(length=1), sampletype = VECTOR(length=1), covariates = VECTOR(length=1), ID = VECTOR(length=1), PMID = VECTOR(length=1)}
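
A skeleton summary in this style could be produced by a small recursive function such as the one sketched below. The function name describe and the mock study object are hypothetical; the package's own validation code may work differently.

```r
# Hypothetical sketch: summarise a nested study object in the style shown above.
describe <- function(x) {
  if (is.data.frame(x)) {
    # Check data frames first, because a data.frame is also a list.
    sprintf("DATAFRAME(%d x %d)", nrow(x), ncol(x))
  } else if (is.list(x)) {
    parts <- vapply(names(x),
                    function(nm) paste(nm, "=", describe(x[[nm]])),
                    character(1))
    sprintf("LIST={%s}", paste(parts, collapse = ", "))
  } else {
    sprintf("VECTOR(length=%d)", length(x))
  }
}

# Mock study object with made-up contents.
study <- list(
  counts = list(original = data.frame(a = 1:3, b = 4:6)),
  scale = data.frame(load = c(1.5, 2, 3)),
  studydemographics = list(ID = "example")
)

for (nm in names(study)) {
  cat("  ", nm, ": ", describe(study[[nm]]), "\n", sep = "")
}
#   counts: LIST={original = DATAFRAME(3 x 2)}
#   scale: DATAFRAME(3 x 1)
#   studydemographics: LIST={ID = VECTOR(length=1)}
```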

Additionally, there are many callable helper functions:

Helper functions for mutt package:

  • convert_repo_to_pkl(): converts the returned parsed repo object and saves it to a .pkl file
  • annotate_studies(): modular function that appends data to the repo object. The data is returned as a list from an external script, for example data scraped from PubMed by PMID
  • filter_studies(): filters each study down to the data frames of interest. You can choose a branch, and the paired data frames are pruned to match
# WORK IN PROGRESS - Works for inst/python/obtainpublicationinfo_pmid.py

# library(reticulate) is required to run Python scripts
annotate_studies <- function(repo,
                             script_path, # Supply any script (R or Python) that works off the structure of the repo, such as obtainpublicationinfo_pmid.py
                             email       = NA_character_,
                             api_key     = NA_character_,
                             overwrite   = TRUE,
                             csv_copy_to = "publication_data.csv")
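
The idea of choosing a branch and pruning the paired data frames, as described for filter_studies(), can be illustrated with the sketch below. It does not use or reproduce the real filter_studies() interface, which is not documented here. The helper keep_branch and the mock study are hypothetical.

```r
# Hypothetical helper: for each paired element, keep only the chosen branch.
keep_branch <- function(study, branch,
                        elements = c("counts", "proportions", "tax")) {
  for (el in intersect(elements, names(study))) {
    node <- study[[el]]
    # Only descend into plain lists that actually contain the branch.
    if (is.list(node) && !is.data.frame(node) && branch %in% names(node)) {
      study[[el]] <- node[[branch]]
    }
  }
  study
}

# Mock study with made-up contents.
study <- list(
  counts = list(original = data.frame(ASV1 = 1:2),
                reprocessed = data.frame(ASV1 = 3:4)),
  proportions = list(original = data.frame(ASV1 = c(1, 1)),
                     reprocessed = data.frame(ASV1 = c(1, 1))),
  scale = data.frame(total_load = c(1e9, 2e9))
)

pruned <- keep_branch(study, "reprocessed")
str(pruned, max.level = 1)
```

Elements without branches, such as scale, pass through unchanged, so counts and proportions stay paired with the same scale and metadata.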

Code of Conduct

Please note that the mutt project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.


License

MIT (LICENSE.md). A second file, LICENSE, has a license type that was not recognized.
