GenAI Tools for Coding & Research Workflows — A Practical 8-Step Process

This repository contains a practical textbook on integrating Large Language Models (LLMs) into coding and research workflows. It is written for biostatisticians, bioinformaticians, and data scientists, and provides hands-on steps and examples for improving productivity while maintaining rigorous standards when using GenAI tools.

Repository Overview

How to Use This Repository

  1. Follow the 8-Step Process

    • Read the chapters in order to learn how to plan, prompt, refine, and document your AI-assisted code effectively.

  2. Leverage the Appendices

    • Check out the recommended LLM tools, curated prompts, further reading, and code templates to accelerate your workflow.
  3. Explore the Templates

    • Use the R templates, refactoring prompts, and documentation guidelines in docs/templates/ to standardize your code and streamline collaboration.
  4. Apply Version Control

    • As described in Chapter 4, store each LLM-generated or refined code iteration in Git (e.g., GitHub) to maintain a clear history, review diffs, and revert if needed.
  5. Stay Tuned for References
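
The version-control step above can be sketched with plain Git commands. This is a minimal illustration rather than the textbook's prescribed workflow: the helper name `commit_llm_iteration` and the `llm-iteration:` message prefix are hypothetical conventions chosen for this example.

```shell
# Hypothetical helper: record one LLM-generated or refined iteration of a file.
# Stages the file, then commits whatever is currently staged with a tagged message.
commit_llm_iteration() {
  file="$1"
  note="$2"
  git add -- "$file" &&
    git commit -q -m "llm-iteration: $note"
}

# Typical follow-up commands (shown as comments, run them inside your repository):
#   git log --oneline -- analysis.R        # history of the iterations
#   git diff HEAD~1 HEAD -- analysis.R     # what the last iteration changed
#   git checkout HEAD~1 -- analysis.R      # restore the previous version of the file
```

A consistent message prefix makes AI-assisted commits easy to find later, for example with `git log --grep="llm-iteration"`.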

Data Simulation Script

If you'd like to test your workflows or the 8-step process on example data, the repository provides a script that simulates electronic health record (EHR) and demographic records with realistic data quality issues:

  • Script: scripts/simulate_ehr_data.R
  • Purpose: Generates BMI, height, weight, and demographic data for a specified number of individuals, optionally introducing missing or implausible values.

Example Usage

Rscript scripts/simulate_ehr_data.R \
  --output_ehr "./data/raw/ehr_bmi_simulated_data.tsv" \
  --output_ehr_dict "./data/raw/data_dictionary.txt" \
  --output_demo "./data/raw/demographics_simulated_data.tsv" \
  --output_demo_dict "./data/raw/demographics_data_dictionary.txt" \
  --seed 123 \
  --n_individuals 1000

Explanation of Arguments:

  • --output_ehr: Where to save the simulated EHR (BMI) data (TSV).
  • --output_ehr_dict: Where to write the EHR data dictionary (TXT).
  • --output_demo: Where to save the demographics data (TSV).
  • --output_demo_dict: Where to write the demographics data dictionary (TXT).
  • --seed: Random seed for reproducibility.
  • --n_individuals: Number of unique individuals to simulate (default=1000).
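
Because `--seed` controls reproducibility, it can be handy to generate several independent datasets by varying only the seed. The sketch below prints one command per seed as a dry run and executes nothing. It uses only the arguments listed above; the per-seed directory layout `./data/raw/seed_<seed>/` is a hypothetical convention, not something the repository prescribes.

```shell
# Sketch: print one simulation command per seed (dry run, nothing is executed).
# The per-seed directory ./data/raw/seed_<seed>/ is a hypothetical layout.
build_sim_cmd() {
  seed="$1"
  n="$2"
  out_dir="./data/raw/seed_${seed}"
  printf 'Rscript scripts/simulate_ehr_data.R'
  printf ' --output_ehr "%s/ehr_bmi_simulated_data.tsv"' "$out_dir"
  printf ' --output_ehr_dict "%s/data_dictionary.txt"' "$out_dir"
  printf ' --output_demo "%s/demographics_simulated_data.tsv"' "$out_dir"
  printf ' --output_demo_dict "%s/demographics_data_dictionary.txt"' "$out_dir"
  printf ' --seed %s --n_individuals %s\n' "$seed" "$n"
}

for seed in 123 456 789; do
  build_sim_cmd "$seed" 1000
done
```

After reviewing the printed commands, you could create the output directories with `mkdir -p` (whether the script creates missing directories is not stated here) and then run each command.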

By running this script, you can quickly produce synthetic data for testing code-refinement prompts, checking your data-cleaning pipelines, or practicing the entire 8-step process from ingestion to documentation.
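
As a quick sanity check on the generated files, the following sketch counts missing cells per column in a tab-separated file. It assumes that missing values appear as empty fields or the literal string `NA`; the script's actual encoding of missing values is not documented here, so adjust the condition if it differs. The function name `count_missing` is hypothetical.

```shell
# Sketch: count missing cells per column of a TSV file with a header row.
# Assumption: a cell is "missing" if it is empty or the literal string NA.
count_missing() {
  awk -F'\t' '
    NR == 1 { n = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
    { for (i = 1; i <= n; i++) if ($i == "" || $i == "NA") miss[i]++ }
    END { for (i = 1; i <= n; i++) printf "%s\t%d\n", name[i], miss[i] + 0 }
  ' "$1"
}

# Run against the simulated EHR file if it has already been generated.
if [ -f ./data/raw/ehr_bmi_simulated_data.tsv ]; then
  count_missing ./data/raw/ehr_bmi_simulated_data.tsv
fi
```

The output is one line per column, with the column name and its missing-cell count, which gives a baseline to compare against after running a data-cleaning pipeline.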

Contributing

  • We welcome pull requests for refinements, corrections, or extensions.
  • Please open issues for any questions, or join discussions to keep this textbook accurate and helpful.

License

This repository is licensed under the GNU General Public License v3.0.


Happy learning and coding with GenAI tools!