zhenchenwang/CausalSuperCollider

Derived Variable Detection Toolkit

This toolkit detects derived or "collider" variables in a pandas DataFrame: variables that are deterministic or probabilistic functions of other variables. Identifying them matters for data preprocessing, feature selection, and avoiding common pitfalls in causal inference, such as conditioning on a collider.

The toolkit provides methods to detect both linear and complex, nonlinear relationships, using cross-validated R-squared scores to provide a reliable measure of dependency.
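
Why a collider matters can be seen in a small simulation. The sketch below is not part of the toolkit; the variable names, sample size, noise scale, and the selection threshold of 1.0 are all assumptions chosen for illustration. Two independent variables become correlated once the data is restricted on a variable that both of them feed into.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 20_000

# x and y are generated independently of each other.
x = rng.normal(size=n)
y = rng.normal(size=n)

# The collider is caused by both x and y (plus noise).
collider = x + y + rng.normal(scale=0.5, size=n)

# Correlation over the full sample: close to zero.
overall = np.corrcoef(x, y)[0, 1]

# Correlation after selecting on the collider: clearly negative.
selected = collider > 1.0
conditioned = np.corrcoef(x[selected], y[selected])[0, 1]

print(f"corr(x, y) overall:            {overall:+.3f}")
print(f"corr(x, y) where collider > 1: {conditioned:+.3f}")
```

Selecting or adjusting on such a variable induces an association between its causes, which is why flagging it before modelling is useful.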

Features

  • Modular Design: Separate, clean functions for different detection methods.
  • Linear Detection: Uses LinearRegression to find strong linear relationships.
  • Nonlinear Detection: Uses RandomForestRegressor to capture complex, nonlinear patterns.
  • Robust Scoring: Employs cross-validation so that R-squared scores reflect out-of-sample fit rather than an overfit in-sample fit.
  • Example-Driven: Comes with clear examples and test datasets.
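
The scoring idea behind these features can be sketched as follows. This is a minimal illustration and not the toolkit's actual implementation: the helper name `cv_r2_scores`, the output column names `r2_linear` and `r2_nonlinear`, the 5-fold split, and the 50-tree forest are all assumptions made for the example. Each column is treated in turn as a target and predicted from all remaining columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def cv_r2_scores(df: pd.DataFrame, cv: int = 5, random_state: int = 0) -> pd.DataFrame:
    """Hypothetical helper: cross-validated R^2 of each column given the others."""
    rows = []
    for target in df.columns:
        X = df.drop(columns=[target])
        y = df[target]
        # Mean out-of-fold R^2 for a linear model.
        linear = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2").mean()
        # Mean out-of-fold R^2 for a random forest (captures nonlinear patterns).
        forest = cross_val_score(
            RandomForestRegressor(n_estimators=50, random_state=random_state),
            X, y, cv=cv, scoring="r2",
        ).mean()
        rows.append({"variable": target, "r2_linear": linear, "r2_nonlinear": forest})
    report = pd.DataFrame(rows)
    return report.sort_values("r2_nonlinear", ascending=False).reset_index(drop=True)


# Small demo frame (assumed data): c is a noisy linear function of a and b,
# e is unrelated to everything else.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
demo = pd.DataFrame({
    "a": a,
    "b": b,
    "c": 2 * a + 3 * b + rng.normal(scale=0.1, size=300),
    "e": rng.normal(size=300),
})
scores = cv_r2_scores(demo)
print(scores)
```

Note that in this demo `a` and `b` also score highly, because each can be recovered from `c` and the other parent. A high R-squared therefore marks a variable as part of a dependency, not necessarily as the derived end of it.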

Installation

To use the toolkit, you first need to install its dependencies:

pip install pandas scikit-learn

How to Use

The main entry point is the detect_derived function, which can be imported directly from the derived_detector package.

import pandas as pd
from derived_detector import detect_derived

# Load your dataset
df = pd.read_csv('path/to/your/data.csv')

# Run the detection function
# By default, it runs both 'linear' and 'nonlinear' methods
report = detect_derived(df)

# Print the report
print(report)

The function returns a pandas DataFrame, sorted by the nonlinear R-squared score, making it easy to spot the most predictable variables.

Included Datasets

This toolkit comes with two pre-generated datasets for testing and demonstration, located in the data/ directory.

1. synthetic_dataset.csv

A simple 5-column dataset designed to test the core logic.

  • A, B: Independent base variables (normally distributed).
  • C_linear_dependent: A highly linear function of A and B.
    • Formula: C = 2*A + 3*B + noise
  • D_nonlinear_dependent: A nonlinear function of A and B.
    • Formula: D = A * B + noise
  • E_independent: An independent control variable.
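
A dataset with this structure could be generated roughly as shown below. This is a sketch under stated assumptions, not the script that produced the shipped file: the seed, the row count of 1000, and the noise scale of 0.1 are not given in this README and are chosen here purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # assumed seed
n_rows = 1000                    # assumed row count

# Independent, normally distributed base variables.
A = rng.normal(size=n_rows)
B = rng.normal(size=n_rows)

synthetic = pd.DataFrame({
    "A": A,
    "B": B,
    # C = 2*A + 3*B + noise
    "C_linear_dependent": 2 * A + 3 * B + rng.normal(scale=0.1, size=n_rows),
    # D = A * B + noise
    "D_nonlinear_dependent": A * B + rng.normal(scale=0.1, size=n_rows),
    # Control variable unrelated to the others.
    "E_independent": rng.normal(size=n_rows),
})

# Pairwise Pearson correlations: D barely correlates with A or B on its own.
print(synthetic.corr().round(2))
```

Because A and B are independent with zero mean, the product A * B is nearly uncorrelated with either factor, so a linear model scores D poorly while a nonlinear model can still predict it.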

2. clinical_dataset.csv

A more complex and realistic 30-column, 5000-row dataset simulating clinical patient data. It contains a mix of independent and derived variables.

Key Derived Relationships:

  • weight_kg: Linearly dependent on height_cm.
  • bmi: A deterministic, nonlinear function of weight_kg and height_cm.
    • Formula: bmi = weight_kg / (height_cm / 100)**2
  • systolic_bp / diastolic_bp: Linearly dependent on age and bmi.
  • treatment_assigned: A collider whose probability depends on systolic_bp and cholesterol_base.
  • disease_outcome: A binary outcome based on a nonlinear (logistic) combination of age, bmi, and systolic_bp.
  • Other markers like hdl_cholesterol, ldl_cholesterol, creatinine_level, and pulse_rate are also derived from other base variables with added noise.
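
As a worked instance of a deterministic relationship, the bmi formula above can be written out directly. The function name and the sample inputs are illustrative only and are not taken from the dataset.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kg divided by squared height in metres."""
    return weight_kg / (height_cm / 100) ** 2


# Example inputs (assumed): 70 kg at 175 cm.
print(round(bmi(70.0, 175.0), 2))  # 22.86
```

Since the formula is a ratio, it is nonlinear in the raw columns, although it becomes linear after taking logarithms: log(bmi) = log(weight_kg) - 2*log(height_cm) + 2*log(100).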

Running Examples

The examples/ directory contains scripts to demonstrate the toolkit. To run them, navigate to the project root directory and execute:

# Run on the simple synthetic dataset
python examples/usage_example.py

# Run on the complex clinical dataset
python examples/clinical_example.py

Running Tests

The toolkit includes a suite of unit tests. To run them, use Python's built-in unittest module from the project root directory:

python -m unittest discover tests

The tests run on synthetic_dataset.csv and verify that the detection modules correctly identify the known relationships.

About

Collider detection for safety surveillance datasets
