ElvisCuiHan/ClinTrialDataFlow

ClinTrialDataFlow (CTDF)

A teaching companion repository for clinical trial data flow education

Overview

ClinTrialDataFlow is a teaching companion repository for the manuscript Take Me Home, Pharma Road: A Simulation-Based Teaching Framework for Biostatistics Students Entering Pharmaceutical Practice.

The repository provides a lightweight, reproducible clinical trial data workflow that helps students see how data move from electronic data capture (EDC) to SDTM-style domains, ADaM-style analysis datasets, and statistical tables, figures, and listings (TFLs). It is designed for education, classroom demonstration, and method prototyping, not for production clinical trial operations.

Manuscript Illustration

From academia to pharma industry: path to real-world impact

This conceptual illustration from the manuscript frames the repository as a bridge from academic statistical training to pharmaceutical evidence generation and real-world clinical impact.

Visual Workflow

Animated ClinTrialDataFlow oncology data-flow illustration

The illustration above summarizes the teaching workflow used throughout this repository: students trace synthetic oncology records from EDC/raw capture through SDTM-style domains, ADaM-style derivations, and final TFL outputs.

Purpose

Many biostatistics students first encounter statistical methods through clean, analysis-ready datasets. In pharmaceutical practice, however, statistical work is embedded in a broader data lifecycle involving protocols, electronic case report forms, data management, statistical programming, derived endpoints, interim decisions, and regulatory-facing interpretation.

ClinTrialDataFlow was created to make that lifecycle visible in the classroom. It supports instructors who want to teach not only how to analyze clinical trial data, but also where those data come from, why they look the way they do, and how statistical decisions interact with operational and regulatory constraints.

The central workflow is:

EDC / Raw -> SDTM-style domains -> ADaM-style datasets -> TFL outputs
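
To make the arrows concrete, the sketch below traces one hypothetical tumor measurement through the three transformations. The raw field names, the study identifier `CTDF-DEMO`, and all three helper functions are illustrative assumptions, not the columns or code used by the repository's scripts; the SDTM-style and ADaM-style variable names only loosely follow common CDISC naming.

```python
# Illustrative only: field names and helpers are assumptions, not the repo's schema.

def raw_to_sdtm_style(raw, study_id="CTDF-DEMO"):
    """Map one raw tumor-measurement record to a TR-like tabulation row."""
    return {
        "STUDYID": study_id,
        "DOMAIN": "TR",
        "USUBJID": f"{study_id}-{raw['subject']}",
        "VISIT": raw["visit"],
        "TRTESTCD": "SUMDIAM",
        "TRSTRESN": float(raw["sum_diam_mm"]),  # raw capture stores text
    }


def sdtm_to_adam_style(rows):
    """Add baseline, change, and percent change for one subject's rows."""
    base = next(r["TRSTRESN"] for r in rows if r["VISIT"] == "BASELINE")
    out = []
    for r in rows:
        chg = r["TRSTRESN"] - base
        out.append({
            "USUBJID": r["USUBJID"],
            "AVISIT": r["VISIT"],
            "PARAMCD": r["TRTESTCD"],
            "AVAL": r["TRSTRESN"],
            "BASE": base,
            "CHG": chg,
            "PCHG": 100.0 * chg / base,
        })
    return out


def best_percent_change(adam_rows):
    """TFL-style summary: largest post-baseline shrinkage (a waterfall-plot bar)."""
    return min(r["PCHG"] for r in adam_rows if r["AVISIT"] != "BASELINE")


raw_records = [
    {"subject": "001", "visit": "BASELINE", "sum_diam_mm": "80"},
    {"subject": "001", "visit": "WEEK 8", "sum_diam_mm": "60"},
    {"subject": "001", "visit": "WEEK 16", "sum_diam_mm": "52"},
]
sdtm_rows = [raw_to_sdtm_style(r) for r in raw_records]
adam_rows = sdtm_to_adam_style(sdtm_rows)
print(best_percent_change(adam_rows))  # -35.0
```

Each stage adds something the previous one lacked: the tabulation row adds standardized identifiers and numeric typing, the analysis row adds derived variables, and the final summary reduces a subject to a single reportable number.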

What This Repository Contains

This repository includes:

  • Simulated EDC/raw clinical trial data
  • SDTM-style domain generation
  • ADaM-style dataset derivation
  • Statistical tables and figures for classroom discussion
  • Oncology-oriented examples, including RECIST-like response, tumor burden, progression-free survival (PFS), overall survival (OS), treatment-emergent adverse events (TEAEs), and laboratory abnormalities
  • Configurable missingness, dropout, death events, and query-like data imperfections
  • A Streamlit interface for interactive classroom demonstration

All examples are synthetic or publicly discussable teaching examples. No confidential patient-level data are included.
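
As a rough picture of what "configurable missingness" and "dropout" can mean in a simulation, the sketch below blanks individual items at a fixed rate and truncates a subject's records at a dropout visit. It uses only the standard library. The function names, record fields, and the completely-at-random mechanism are assumptions made for illustration; the repository's scripts may implement these features differently.

```python
import random


def inject_item_missingness(records, fields, rate, seed=0):
    """Return copies of records with each listed field blanked with probability `rate`."""
    rng = random.Random(seed)  # seeded so a classroom run is reproducible
    out = []
    for rec in records:
        new = dict(rec)  # copy, so the clean data stay available for comparison
        for field in fields:
            if rng.random() < rate:
                new[field] = None
        out.append(new)
    return out


def apply_dropout(records, dropout_visit):
    """Keep only records before each subject's dropout visit number, if any."""
    kept = []
    for rec in records:
        last = dropout_visit.get(rec["subject"])
        if last is None or rec["visit_num"] < last:
            kept.append(rec)
    return kept


# Hypothetical raw-style records: two subjects, four visits each.
records = [
    {"subject": s, "visit_num": v, "sum_diam_mm": 80 - 5 * v, "alt_u_l": 30 + v}
    for s in ("001", "002")
    for v in range(4)
]
after_dropout = apply_dropout(records, {"002": 2})  # subject 002 leaves before visit 2
messy = inject_item_missingness(after_dropout, ["sum_diam_mm", "alt_u_l"], rate=0.25, seed=1)
n_missing = sum(value is None for rec in messy for value in rec.values())
print(len(records), len(after_dropout), n_missing)
```

Keeping the clean and the degraded records side by side lets students quantify how much a downstream summary moves when only the data quality changes.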

Educational Goals

ClinTrialDataFlow is intended to help students:

  • Understand the lifecycle of a clinical trial data point
  • Connect eCRF-style capture, raw datasets, SDTM-style domains, ADaM-style datasets, and statistical outputs
  • Recognize common sources of disconnect between analysis-ready data and trial operations
  • Practice translating statistical ideas into pharmaceutical development workflows
  • Build intuition for response endpoints, time-to-event endpoints, missing data, and data cleaning decisions
  • Prepare for industry-facing roles in biostatistics, statistical programming, data science, and clinical development

Suggested Classroom Use

Instructors can use the repository as a modular teaching resource.

  1. Short classroom demonstration: Run the Streamlit app or selected scripts to show how one simulated trial flows from raw capture to analysis output.

  2. Lab assignment: Ask students to inspect raw data issues, generate SDTM-style domains, derive ADaM-style variables, and interpret the resulting tables and figures.

  3. Case-study discussion: Use the simulated oncology workflow to discuss endpoint definition, best overall response, progression-free survival, overall survival, missingness, and operational data quality.

  4. Capstone or mini-project: Have students modify cfg.json, rerun the workflow, and explain how design or data-quality assumptions affect downstream statistical reporting.
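
For the endpoint-definition discussion, a deliberately simplified derivation of best overall response can serve as a starting point. The sketch below assumes target lesions only and uses the commonly cited RECIST 1.1 thresholds (at least a 30% decrease from baseline for partial response; at least a 20% and 5 mm increase from the nadir for progression). It ignores new lesions, non-target lesions, response confirmation, and minimum stable-disease duration. The function names are hypothetical, and this is not the repository's derivation.

```python
# Simplified, target-lesion-only sketch; not the repository's algorithm.
RANK = {"CR": 0, "PR": 1, "SD": 2, "PD": 3}  # lower rank = better response


def visit_response(baseline, nadir, current):
    """Classify one post-baseline sum of diameters (mm)."""
    if current == 0:
        return "CR"
    # Progression is judged against the smallest sum seen so far (the nadir).
    if current - nadir >= 5 and current >= 1.2 * nadir:
        return "PD"
    if current <= 0.7 * baseline:
        return "PR"
    return "SD"


def best_overall_response(sums):
    """sums[0] is baseline; later entries are post-baseline visits in time order."""
    baseline = nadir = sums[0]
    responses = []
    for current in sums[1:]:
        resp = visit_response(baseline, nadir, current)
        responses.append(resp)
        if resp == "PD":
            break  # assessments after progression do not count
        nadir = min(nadir, current)
    if not responses:
        return "NE"  # not evaluable: no post-baseline assessment
    return min(responses, key=RANK.get)


print(best_overall_response([100, 65, 50, 70]))  # PR (then PD relative to nadir 50)
print(best_overall_response([100, 130]))         # PD
print(best_overall_response([100, 90, 95]))      # SD
```

The first example is useful in class: the final assessment of 70 mm is still 30% below baseline, yet it counts as progression because the comparison is against the nadir, not the baseline.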

Project Structure

ClinTrialDataFlow/
├── ClinTrialDataFlow/
│   ├── app.py                 # Streamlit web app
│   ├── cfg.json               # Simulation configuration
│   ├── run_all.sh             # Convenience script for the full workflow
│   ├── Codes/
│   │   ├── EDCSimu.py         # EDC / raw data simulation
│   │   ├── SDTMSimu.py        # Raw -> SDTM-style domains
│   │   ├── ADaMSimu.py        # SDTM -> ADaM-style datasets
│   │   └── TFLSimu.py         # ADaM -> TFL outputs
│   └── Data/
│       ├── raw_out/
│       ├── sdtm_out/
│       ├── adam_out/
│       └── tfl_out/
├── LICENSE
└── README.md

Quick Start

Requirements

pip install pandas numpy matplotlib streamlit

Optional:

pip install scipy

When scipy is installed, it is used to compute exact binomial confidence intervals for the objective response rate (ORR) output.

Option A: Interactive Web App

From the repository root:

cd ClinTrialDataFlow
streamlit run app.py

The web interface allows users to edit and save cfg.json, run the full workflow, preview generated datasets, and download outputs from each stage.

Option B: Command-Line Pipeline

From the inner ClinTrialDataFlow/ project directory (the one that contains cfg.json and Codes/):

python Codes/EDCSimu.py  --cfg cfg.json --out Data/raw_out
python Codes/SDTMSimu.py --indir Data/raw_out --out Data/sdtm_out
python Codes/ADaMSimu.py --insdtm Data/sdtm_out --out Data/adam_out
python Codes/TFLSimu.py  --inadam Data/adam_out --out Data/tfl_out
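
The same four stages can also be driven from Python, which is convenient when a lab exercise needs to rerun the workflow in a loop. The sketch below reuses the script paths and flags from the commands above; the `STAGES` list, the function names, and the `dry_run` switch are assumptions introduced here, and this is not the contents of run_all.sh.

```python
import subprocess
import sys

# (script, arguments) pairs copied from the command-line pipeline above.
STAGES = [
    ("Codes/EDCSimu.py", ["--cfg", "cfg.json", "--out", "Data/raw_out"]),
    ("Codes/SDTMSimu.py", ["--indir", "Data/raw_out", "--out", "Data/sdtm_out"]),
    ("Codes/ADaMSimu.py", ["--insdtm", "Data/sdtm_out", "--out", "Data/adam_out"]),
    ("Codes/TFLSimu.py", ["--inadam", "Data/adam_out", "--out", "Data/tfl_out"]),
]


def build_commands(python=sys.executable):
    """Return one argument list per stage, in execution order."""
    return [[python, script, *args] for script, args in STAGES]


def run_pipeline(dry_run=True):
    """Print each command; execute it too unless dry_run is set."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing stage,
            # so later stages never read stale or partial outputs.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)  # switch to False inside the project directory
```

Each stage reads the directory the previous stage wrote, so stopping on the first failure matters more than it would for independent scripts.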

Configuration

Simulation behavior is controlled through cfg.json, including:

  • Sample size
  • Random seed
  • Visit and treatment assumptions
  • Dropout mechanisms
  • Form-level and item-level missingness
  • Output directory settings

Students can modify these settings to study how upstream trial assumptions propagate into downstream analysis datasets and statistical outputs.
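
One way to organize such an exercise is to keep the base configuration untouched and write named variants next to it. The sketch below does that with the standard library. The keys `n_subjects`, `seed`, and `dropout_rate` are hypothetical placeholders, since the actual key names in cfg.json are not listed here; check the real file before adapting this. The helper `write_variant` is likewise an assumption, and it handles top-level keys only.

```python
import json
import tempfile
from pathlib import Path


def write_variant(cfg_path, out_path, overrides):
    """Copy a JSON config, replacing top-level keys that already exist in it."""
    cfg = json.loads(Path(cfg_path).read_text())
    unknown = [key for key in overrides if key not in cfg]
    if unknown:
        # Refusing unknown keys catches typos that a simulator might silently ignore.
        raise KeyError(f"keys not present in base config: {unknown}")
    cfg.update(overrides)
    Path(out_path).write_text(json.dumps(cfg, indent=2))
    return cfg


# Demonstration on a throwaway file with HYPOTHETICAL keys.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "cfg.json"
    base.write_text(json.dumps({"n_subjects": 100, "seed": 42, "dropout_rate": 0.1}))
    variant = write_variant(base, Path(tmp) / "cfg_high_dropout.json", {"dropout_rate": 0.3})
    print(variant)
```

A variant file written this way can be passed to the first pipeline stage through its `--cfg` argument, leaving the original cfg.json as the reference scenario.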

Outputs

The workflow generates four broad categories of teaching outputs:

  • EDC/raw-like data: source-style tables that resemble clinical data capture
  • SDTM-style domains: standardized domains used to introduce clinical data tabulation concepts
  • ADaM-style datasets: analysis-ready datasets for subject-level summaries, response endpoints, time-to-event endpoints, adverse events, and laboratory results
  • TFL outputs: baseline characteristics, objective response rate (ORR), best overall response (BOR) and disease control rate (DCR) distributions, PFS and OS summaries, adverse event (AE) summaries, laboratory shift summaries, Kaplan-Meier figures, waterfall plots, and spider plots

The SDTM and ADaM outputs are intentionally described as "style" datasets because this repository is built for teaching and demonstration. It is not intended to claim regulatory validation or full CDISC compliance.
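
To connect the time-to-event outputs to their definition, the following sketch computes a Kaplan-Meier curve and its median from (time, event) pairs using only the standard library. It is a textbook product-limit estimator written for illustration, with hypothetical function names, and it is not taken from TFLSimu.py.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    events[i] is 1 if the event (e.g., progression or death) was observed
    at times[i], and 0 if the subject was censored at times[i].
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group ties: subjects censored at t still count as at risk at t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            surv *= 1 - n_events / n_at_risk
            curve.append((t, surv))
        n_at_risk -= n_leaving
    return curve


def median_survival(curve):
    """First time at which S(t) <= 0.5, or None if the curve never gets there."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


months = [2, 3, 3, 5, 8]
observed = [1, 1, 0, 1, 0]
curve = kaplan_meier(months, observed)
print([(t, round(s, 3)) for t, s in curve], median_survival(curve))
```

A median of `None` corresponds to the "not reached" entry that students will see in PFS and OS tables when fewer than half of the subjects have had the event.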

Relationship to the Manuscript

This repository accompanies the manuscript:

Take Me Home, Pharma Road: A Simulation-Based Teaching Framework for Biostatistics Students Entering Pharmaceutical Practice

The manuscript introduces the pedagogical motivation and teaching framework. This repository provides supporting materials that can help instructors implement or adapt the framework in a course, seminar, workshop, or professional training setting.

Disclaimer

This repository is for educational purposes only. It does not provide regulatory, medical, clinical, or operational guidance for real clinical trials. The materials are simplified for teaching and should not be used as validated tools for clinical development, regulatory submission, or patient-level decision-making.

Any examples inspired by pharmaceutical development are used for instructional discussion only.

Citation

If you use this repository in teaching or research, please cite the accompanying manuscript:

Cui, E. Take Me Home, Pharma Road: A Simulation-Based Teaching Framework for Biostatistics Students Entering Pharmaceutical Practice.

A full citation will be added after publication.

Contact

Eliuvish Cui

elviscuihan@g.ucla.edu

About

End-to-end clinical trial data simulation: EDC → SDTM → ADaM → TFL, with realistic imperfections and interactive workflows.
