This toolkit is designed to detect derived or "collider" variables within a pandas DataFrame. It helps identify variables that are deterministic or probabilistic functions of other variables, which is crucial for data preprocessing, feature selection, and avoiding common pitfalls in causal inference.
The toolkit provides methods to detect both linear and complex, nonlinear relationships, using cross-validated R-squared scores to provide a reliable measure of dependency.
- Modular Design: Separate, clean functions for different detection methods.
- Linear Detection: Uses `LinearRegression` to find strong linear relationships.
- Nonlinear Detection: Uses `RandomForestRegressor` to capture complex, nonlinear patterns.
- Robust Scoring: Employs cross-validation to prevent overfitting and provide reliable R-squared scores.
- Example-Driven: Comes with clear examples and test datasets.
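The scoring idea behind these features can be sketched as follows. This is a minimal illustration under stated assumptions, not the toolkit's actual implementation: the helper name `cv_r2_per_column`, the output column names, the fold count, and the forest settings are all hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def cv_r2_per_column(df: pd.DataFrame, cv: int = 5) -> pd.DataFrame:
    """Predict each column from all the others; report mean cross-validated R^2.

    Hypothetical helper for illustration; not part of derived_detector.
    """
    rows = []
    for target in df.columns:
        X = df.drop(columns=target)  # every other column is a predictor
        y = df[target]
        linear = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2").mean()
        forest = cross_val_score(
            RandomForestRegressor(n_estimators=50, random_state=0),
            X, y, cv=cv, scoring="r2",
        ).mean()
        rows.append({"variable": target, "r2_linear": linear, "r2_nonlinear": forest})
    report = pd.DataFrame(rows).set_index("variable")
    return report.sort_values("r2_nonlinear", ascending=False)


# Small made-up frame: y depends on x and z, w is unrelated noise.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
z = rng.normal(size=300)
demo = pd.DataFrame({
    "x": x,
    "z": z,
    "y": 2 * x + 3 * z + rng.normal(scale=0.1, size=300),
    "w": rng.normal(size=300),
})
scores = cv_r2_per_column(demo)
print(scores)
```

In this sketch `x` and `z` also score highly, because a linear relationship can be solved for any of its variables. A high score therefore flags membership in a dependent group of columns rather than the direction of the dependency.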
To use the toolkit, you first need to install its dependencies:
pip install pandas scikit-learn

The main entry point is the `detect_derived` function, which can be imported directly from the `derived_detector` package.
import pandas as pd
from derived_detector import detect_derived
# Load your dataset
df = pd.read_csv('path/to/your/data.csv')
# Run the detection function
# By default, it runs both 'linear' and 'nonlinear' methods
report = detect_derived(df)
# Print the report
print(report)

The function returns a pandas DataFrame, sorted by the nonlinear R-squared score, making it easy to spot the most predictable variables.
This toolkit comes with two pre-generated datasets for testing and demonstration, located in the data/ directory.
A simple 5-column dataset designed to test the core logic.
- `A`, `B`: Independent base variables (normally distributed).
- `C_linear_dependent`: A highly linear function of `A` and `B`.
  - Formula: `C = 2*A + 3*B + noise`
- `D_nonlinear_dependent`: A nonlinear function of `A` and `B`.
  - Formula: `D = A * B + noise`
- `E_independent`: An independent control variable.
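A frame with the same structure can be regenerated from the formulas above, which is useful for seeing why the two detectors disagree on `D_nonlinear_dependent`. The row count, random seed, noise scale, and the helper `mean_r2` are assumptions for this sketch; the shipped CSV may have been generated with different settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 1000  # assumed row count

A = rng.normal(size=n)
B = rng.normal(size=n)
synthetic = pd.DataFrame({
    "A": A,
    "B": B,
    "C_linear_dependent": 2 * A + 3 * B + rng.normal(scale=0.1, size=n),
    "D_nonlinear_dependent": A * B + rng.normal(scale=0.1, size=n),
    "E_independent": rng.normal(size=n),
})


def mean_r2(model, target: str) -> float:
    """Mean 5-fold cross-validated R^2 for predicting `target` from A and B."""
    X = synthetic[["A", "B"]]
    return cross_val_score(model, X, synthetic[target], cv=5, scoring="r2").mean()


c_linear = mean_r2(LinearRegression(), "C_linear_dependent")
d_linear = mean_r2(LinearRegression(), "D_nonlinear_dependent")
d_forest = mean_r2(RandomForestRegressor(n_estimators=100, random_state=0),
                   "D_nonlinear_dependent")
e_linear = mean_r2(LinearRegression(), "E_independent")

print(f"C linear R^2: {c_linear:.3f}")
print(f"D linear R^2: {d_linear:.3f}")
print(f"D forest R^2: {d_forest:.3f}")
print(f"E linear R^2: {e_linear:.3f}")
```

The product `A * B` has essentially no linear correlation with either factor, so the linear score for `D` stays near zero while the random forest recovers most of the relationship.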
A more complex and realistic 30-column, 5000-row dataset simulating clinical patient data. It contains a mix of independent and derived variables.
Key Derived Relationships:
- `weight_kg`: Linearly dependent on `height_cm`.
- `bmi`: A deterministic, nonlinear function of `weight_kg` and `height_cm`.
  - Formula: `bmi = weight_kg / (height_cm / 100)**2`
- `systolic_bp` / `diastolic_bp`: Linearly dependent on `age` and `bmi`.
- `treatment_assigned`: A collider whose probability depends on `systolic_bp` and `cholesterol_base`.
- `disease_outcome`: A binary outcome based on a nonlinear (logistic) combination of `age`, `bmi`, and `systolic_bp`.
- Other markers like `hdl_cholesterol`, `ldl_cholesterol`, `creatinine_level`, and `pulse_rate` are also derived from other base variables with added noise.
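As a worked instance of the `bmi` formula listed above, the snippet below applies it to three made-up patients. The values are illustrative only and are not taken from the clinical dataset.

```python
import pandas as pd

# Illustrative values only; not rows from the clinical dataset.
patients = pd.DataFrame({
    "height_cm": [160.0, 175.0, 190.0],
    "weight_kg": [80.0, 70.0, 90.25],
})

# bmi = weight_kg / (height_cm / 100)**2, i.e. weight divided by height in metres squared
patients["bmi"] = patients["weight_kg"] / (patients["height_cm"] / 100) ** 2

print(patients.round(2))  # bmi column, rounded: 31.25, 22.86, 25.0
```

Because no noise term is involved, `bmi` is fully determined by the other two columns, which is what makes it a clean test case for the detector.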
The examples/ directory contains scripts to demonstrate the toolkit. To run them, navigate to the project root directory and execute:
# Run on the simple synthetic dataset
python examples/usage_example.py
# Run on the complex clinical dataset
python examples/clinical_example.py

The toolkit includes a full suite of unit tests to ensure its correctness. To run the tests, use Python's built-in `unittest` module from the root directory:
python -m unittest discover tests

The tests run on `synthetic_dataset.csv` and verify that the detection modules correctly identify the known relationships.