ClinTrialDataFlow is a teaching companion repository for the manuscript Take Me Home, Pharma Road: A Simulation-Based Teaching Framework for Biostatistics Students Entering Pharmaceutical Practice.
The repository provides a lightweight, reproducible clinical trial data workflow that helps students see how data move from electronic data capture to SDTM-style domains, ADaM-style analysis datasets, and statistical tables, figures, and listings. It is designed for education, classroom demonstration, and method prototyping, not for production clinical trial operations.
Figure: Conceptual illustration from the manuscript, framing the repository as a bridge from academic statistical training to pharmaceutical evidence generation and real-world clinical impact.
Figure: Teaching workflow used throughout this repository. Students trace synthetic oncology records from EDC/raw capture through SDTM-style domains, ADaM-style derivations, and final TFL outputs.
Many biostatistics students first encounter statistical methods through clean, analysis-ready datasets. In pharmaceutical practice, however, statistical work is embedded in a broader data lifecycle involving protocols, electronic case report forms, data management, statistical programming, derived endpoints, interim decisions, and regulatory-facing interpretation.
ClinTrialDataFlow was created to make that lifecycle visible in the classroom. It supports instructors who want to teach not only how to analyze clinical trial data, but also where those data come from, why they look the way they do, and how statistical decisions interact with operational and regulatory constraints.
The central workflow is:
EDC / Raw -> SDTM-style domains -> ADaM-style datasets -> TFL outputs
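As a rough illustration of that flow, the sketch below follows one tumor measurement through the stages using plain Python dictionaries. The raw field names (`subject`, `visit`, `sum_diam_mm`), the study ID `DEMO01`, and both helper functions are hypothetical. The SDTM-style and ADaM-style variable names follow common CDISC naming, but the repository's scripts may use different columns and derivation rules.

```python
# Hypothetical raw (EDC-like) records for one subject; field names are illustrative only.
raw_records = [
    {"subject": "001", "visit": "BASELINE", "sum_diam_mm": 80.0},
    {"subject": "001", "visit": "WEEK 8", "sum_diam_mm": 52.0},
]


def to_sdtm_style(rec, studyid="DEMO01"):
    """Map one raw record to an SDTM-style tumor results (TR) row."""
    return {
        "STUDYID": studyid,
        "DOMAIN": "TR",
        "USUBJID": f"{studyid}-{rec['subject']}",
        "TRTESTCD": "SUMDIAM",
        "TRSTRESN": rec["sum_diam_mm"],
        "TRSTRESU": "mm",
        "VISIT": rec["visit"],
    }


def to_adam_style(tr_rows):
    """Derive ADaM-style baseline, change, and percent change from TR rows."""
    base = next(r["TRSTRESN"] for r in tr_rows if r["VISIT"] == "BASELINE")
    out = []
    for r in tr_rows:
        chg = r["TRSTRESN"] - base
        out.append({
            "USUBJID": r["USUBJID"],
            "PARAMCD": "SUMDIAM",
            "AVISIT": r["VISIT"],
            "AVAL": r["TRSTRESN"],
            "BASE": base,
            "CHG": chg,
            "PCHG": 100.0 * chg / base,
        })
    return out


sdtm_rows = [to_sdtm_style(r) for r in raw_records]
adam_rows = to_adam_style(sdtm_rows)
for row in adam_rows:
    print(row["AVISIT"], row["AVAL"], row["PCHG"])
```

In this toy example the Week 8 row carries a percent change of -35.0 from baseline, which is the kind of value a waterfall plot would display at the TFL stage.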
This repository includes:
- Simulated EDC/raw clinical trial data
- SDTM-style domain generation
- ADaM-style dataset derivation
- Statistical tables and figures for classroom discussion
- Oncology-oriented examples, including RECIST-like response, tumor burden, PFS, OS, TEAEs, and laboratory abnormalities
- Configurable missingness, dropout, death events, and query-like data imperfections
- A Streamlit interface for interactive classroom demonstration
All examples are synthetic or publicly discussable teaching examples. No confidential patient-level data are included.
ClinTrialDataFlow is intended to help students:
- Understand the lifecycle of a clinical trial data point
- Connect eCRF-style capture, raw datasets, SDTM-style domains, ADaM-style datasets, and statistical outputs
- Recognize common sources of disconnect between analysis-ready data and trial operations
- Practice translating statistical ideas into pharmaceutical development workflows
- Build intuition for response endpoints, time-to-event endpoints, missing data, and data cleaning decisions
- Prepare for industry-facing roles in biostatistics, statistical programming, data science, and clinical development
Instructors can use the repository as a modular teaching resource.
- Short classroom demonstration: Run the Streamlit app or selected scripts to show how one simulated trial flows from raw capture to analysis output.
- Lab assignment: Ask students to inspect raw data issues, generate SDTM-style domains, derive ADaM-style variables, and interpret the resulting tables and figures.
- Case-study discussion: Use the simulated oncology workflow to discuss endpoint definition, best overall response, progression-free survival, overall survival, missingness, and operational data quality.
- Capstone or mini-project: Have students modify cfg.json, rerun the workflow, and explain how design or data-quality assumptions affect downstream statistical reporting.
ClinTrialDataFlow/
├── ClinTrialDataFlow/
│ ├── app.py # Streamlit web app
│ ├── cfg.json # Simulation configuration
│ ├── run_all.sh # Convenience script for the full workflow
│ ├── Codes/
│ │ ├── EDCSimu.py # EDC / raw data simulation
│ │ ├── SDTMSimu.py # Raw -> SDTM-style domains
│ │ ├── ADaMSimu.py # SDTM -> ADaM-style datasets
│ │ └── TFLSimu.py # ADaM -> TFL outputs
│ └── Data/
│ ├── raw_out/
│ ├── sdtm_out/
│ ├── adam_out/
│ └── tfl_out/
├── LICENSE
└── README.md
pip install pandas numpy matplotlib streamlit
Optional:
pip install scipy
scipy is used for exact binomial confidence intervals in the objective response rate output when available.
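For students who want to see what an exact (Clopper-Pearson) interval computes, the following is a minimal standard-library sketch. It is not the repository's implementation, which relies on scipy. The function names and the bisection approach are assumptions chosen for teaching clarity.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_increasing(g, iters=100):
    """Bisection on (0, 1) for a function g that rises from negative to positive."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_binomial_ci(responders, n, alpha=0.05):
    """Two-sided Clopper-Pearson interval for a response rate."""
    x = responders
    # Lower bound: the p at which P(X >= x | p) equals alpha / 2.
    lower = 0.0 if x == 0 else _root_increasing(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2
    )
    # Upper bound: the p at which P(X <= x | p) equals alpha / 2.
    upper = 1.0 if x == n else _root_increasing(
        lambda p: alpha / 2 - binom_cdf(x, n, p)
    )
    return lower, upper


# Example: 12 responders among 40 evaluable subjects (made-up numbers).
low, high = exact_binomial_ci(12, 40)
print(f"ORR = {12 / 40:.3f}, 95% exact CI = ({low:.3f}, {high:.3f})")
```

A useful classroom check is the zero-responder case: with no responders among n subjects, the upper limit has the closed form 1 - (alpha/2)^(1/n), which the bisection result should match.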
From the repository root:
cd ClinTrialDataFlow
streamlit run app.py
The web interface allows users to edit and save cfg.json, run the full workflow, preview generated datasets, and download outputs from each stage.
From the ClinTrialDataFlow/ project directory:
python Codes/EDCSimu.py --cfg cfg.json --out Data/raw_out
python Codes/SDTMSimu.py --indir Data/raw_out --out Data/sdtm_out
python Codes/ADaMSimu.py --insdtm Data/sdtm_out --out Data/adam_out
python Codes/TFLSimu.py --inadam Data/adam_out --out Data/tfl_out
Simulation behavior is controlled through cfg.json, including:
- Sample size
- Random seed
- Visit and treatment assumptions
- Dropout mechanisms
- Form-level and item-level missingness
- Output directory settings
Students can modify these settings to study how upstream trial assumptions propagate into downstream analysis datasets and statistical outputs.
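One way to organize such an exercise is sketched below: copy the configuration, override a setting, and build the four stage commands so the variant run writes to its own output folders. This is an assumption-laden sketch. The key name `seed` is hypothetical, the override assumes settings are top-level keys in cfg.json, and the helper names are not part of the repository. Only the script paths and command-line flags are taken from the commands shown earlier.

```python
import json
from pathlib import Path


def write_variant_cfg(base_cfg, overrides, out_path):
    """Copy a config file, apply top-level overrides, and save it under a new name."""
    cfg = json.loads(Path(base_cfg).read_text())
    cfg.update(overrides)  # assumes the overridden settings are top-level keys
    Path(out_path).write_text(json.dumps(cfg, indent=2))
    return cfg


def pipeline_commands(cfg_path, data_dir="Data"):
    """Build the four stage commands in workflow order."""
    raw, sdtm = f"{data_dir}/raw_out", f"{data_dir}/sdtm_out"
    adam, tfl = f"{data_dir}/adam_out", f"{data_dir}/tfl_out"
    return [
        ["python", "Codes/EDCSimu.py", "--cfg", str(cfg_path), "--out", raw],
        ["python", "Codes/SDTMSimu.py", "--indir", raw, "--out", sdtm],
        ["python", "Codes/ADaMSimu.py", "--insdtm", sdtm, "--out", adam],
        ["python", "Codes/TFLSimu.py", "--inadam", adam, "--out", tfl],
    ]


# Usage from the ClinTrialDataFlow/ project directory (not executed here):
#   import subprocess
#   write_variant_cfg("cfg.json", {"seed": 2024}, "cfg_variant.json")  # "seed" is a hypothetical key
#   for cmd in pipeline_commands("cfg_variant.json", data_dir="Data_variant"):
#       subprocess.run(cmd, check=True)
```

Writing the variant run to a separate data directory keeps the baseline outputs intact, so students can compare tables and figures from the two runs side by side.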
The workflow generates four broad categories of teaching outputs:
- EDC/raw-like data: source-style tables that resemble clinical data capture
- SDTM-style domains: standardized domains used to introduce clinical data tabulation concepts
- ADaM-style datasets: analysis-ready datasets for subject-level summaries, response endpoints, time-to-event endpoints, adverse events, and laboratory results
- TFL outputs: baseline characteristics, ORR, BOR/DCR distribution, PFS/OS summaries, AE summaries, laboratory shift summaries, Kaplan-Meier figures, waterfall plots, and spider plots
The SDTM and ADaM outputs are intentionally described as "style" datasets because this repository is built for teaching and demonstration. It is not intended to claim regulatory validation or full CDISC compliance.
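To connect the PFS/OS summaries and Kaplan-Meier figures listed above to the underlying calculation, here is a minimal standard-library sketch of the product-limit estimator. It is an illustration only and does not reproduce the code in TFLSimu.py. The function names and the toy data are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up time for each subject
    events : 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all subjects who share the same time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            surv *= 1 - n_events / n_at_risk
            curve.append((t, surv))
        # Events and censored subjects both leave the risk set after time t.
        n_at_risk -= n_leaving
    return curve


def median_survival(curve):
    """First time at which the estimated survival drops to 0.5 or below, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Toy data in months (made-up values): 1 = event observed, 0 = censored.
km_curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
for t, s in km_curve:
    print(f"t = {t}: S(t) = {s:.2f}")
print("Median:", median_survival(km_curve))
```

With these five subjects, the estimate steps down to 0.80, 0.60, and 0.30 at months 2, 3, and 5, so the estimated median is 5 months. The subject censored at month 3 still counts in the risk set at that time, which is a point worth discussing when students compare the curve with a naive proportion.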
This repository accompanies the manuscript:
Take Me Home, Pharma Road: A Simulation-Based Teaching Framework for Biostatistics Students Entering Pharmaceutical Practice
The manuscript introduces the pedagogical motivation and teaching framework. This repository provides supporting materials that can help instructors implement or adapt the framework in a course, seminar, workshop, or professional training setting.
This repository is for educational purposes only. It does not provide regulatory, medical, clinical, or operational guidance for real clinical trials. The materials are simplified for teaching and should not be used as validated tools for clinical development, regulatory submission, or patient-level decision-making.
Any examples inspired by pharmaceutical development are used for instructional discussion only.
If you use this repository in teaching or research, please cite the accompanying manuscript:
Cui, E. Take Me Home, Pharma Road: A Simulation-Based Teaching Framework for Biostatistics Students Entering Pharmaceutical Practice.
A full citation will be added after publication.
Eliuvish Cui
