Derived Variable Detection Toolkit

This toolkit detects derived variables, including colliders, in a pandas DataFrame: variables that are deterministic or probabilistic functions of other variables. Flagging them matters for data preprocessing, feature selection, and causal inference, where conditioning on a collider can introduce bias.

The toolkit provides methods to detect both linear and complex, nonlinear relationships, using cross-validated R-squared scores to provide a reliable measure of dependency.

Features

- Modular Design: Separate, clean functions for each detection method.
- Linear Detection: Uses LinearRegression to find strong linear relationships.
- Nonlinear Detection: Uses RandomForestRegressor to capture complex, nonlinear patterns.
- Robust Scoring: Uses cross-validation so that R-squared scores reflect out-of-sample fit rather than overfitting.
- Example-Driven: Comes with example scripts and test datasets.
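
The scoring idea behind these features can be sketched with scikit-learn directly. This is a minimal illustration and not the toolkit's actual implementation: the helper name `cv_r2_per_column`, the output column names, the 5 folds, and the 50 trees are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def cv_r2_per_column(df, cv=5, random_state=0):
    """Hypothetical helper: score how well each column is predicted by the others."""
    rows = []
    for target in df.columns:
        X = df.drop(columns=[target])
        y = df[target]
        # Mean out-of-fold R-squared for a linear model
        linear = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2").mean()
        # Mean out-of-fold R-squared for a random forest (captures nonlinear patterns)
        forest = cross_val_score(
            RandomForestRegressor(n_estimators=50, random_state=random_state),
            X, y, cv=cv, scoring="r2",
        ).mean()
        rows.append({"variable": target, "linear_r2": linear, "nonlinear_r2": forest})
    report = pd.DataFrame(rows)
    return report.sort_values("nonlinear_r2", ascending=False, ignore_index=True)


# Small demo frame: "total" is built from "a" and "b"; "unrelated" is pure noise.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
demo = pd.DataFrame({
    "a": a,
    "b": b,
    "total": 2 * a + 3 * b + rng.normal(scale=0.1, size=300),
    "unrelated": rng.normal(size=300),
})
report = cv_r2_per_column(demo)
print(report)
```

In this demo, `a` and `b` also score high, because each can be recovered from `total` and the other. A high score therefore marks a variable as part of a dependency; it does not say which variable was derived from which.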

Installation

To use the toolkit, you first need to install its dependencies:

pip install pandas scikit-learn

How to Use

The main entry point is the detect_derived function, which can be imported directly from the derived_detector package.

import pandas as pd
from derived_detector import detect_derived

# Load your dataset
df = pd.read_csv('path/to/your/data.csv')

# Run the detection function
# By default, it runs both 'linear' and 'nonlinear' methods
report = detect_derived(df)

# Print the report
print(report)

The function returns a pandas DataFrame, sorted by the nonlinear R-squared score, making it easy to spot the most predictable variables.

Included Datasets

This toolkit comes with two pre-generated datasets for testing and demonstration, located in the data/ directory.

1. `synthetic_dataset.csv`

A simple 5-column dataset designed to test the core logic.

- A, B: Independent base variables (normally distributed).
- C_linear_dependent: A linear function of A and B plus noise.
  - Formula: C = 2*A + 3*B + noise
- D_nonlinear_dependent: A nonlinear function of A and B.
  - Formula: D = A * B + noise
- E_independent: An independent control variable.
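
A frame with the same structure can be generated in a few lines, which helps when experimenting without the CSV file. The column names and formulas come from the list above; the seed, the row count, and the noise scale are assumptions, so the values will not match the shipped file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed is an assumption
n = 1000                         # row count is an assumption

A = rng.normal(size=n)
B = rng.normal(size=n)
synthetic = pd.DataFrame({
    "A": A,
    "B": B,
    # C = 2*A + 3*B + noise (noise scale 0.1 is an assumption)
    "C_linear_dependent": 2 * A + 3 * B + rng.normal(scale=0.1, size=n),
    # D = A * B + noise
    "D_nonlinear_dependent": A * B + rng.normal(scale=0.1, size=n),
    "E_independent": rng.normal(size=n),
})

# Pearson correlations of the base variables with the two derived columns
corr = synthetic.corr()
print(corr.loc[["A", "B"], ["C_linear_dependent", "D_nonlinear_dependent"]].round(2))
```

For independent, zero-mean A and B, the product A * B is uncorrelated with either factor, so D looks unrelated to A and B under a purely linear measure even though it is almost fully determined by them. This is the case the nonlinear detection method is meant to catch.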

2. `clinical_dataset.csv`

A more complex and realistic 30-column, 5000-row dataset simulating clinical patient data. It contains a mix of independent and derived variables.

Key Derived Relationships:

- weight_kg: Linearly dependent on height_cm.
- bmi: A deterministic, nonlinear function of weight_kg and height_cm.
  - Formula: bmi = weight_kg / (height_cm / 100)**2
- systolic_bp / diastolic_bp: Linearly dependent on age and bmi.
- treatment_assigned: A collider whose probability depends on systolic_bp and cholesterol_base.
- disease_outcome: A binary outcome based on a nonlinear (logistic) combination of age, bmi, and systolic_bp.
- Other markers like hdl_cholesterol, ldl_cholesterol, creatinine_level, and pulse_rate are also derived from other base variables with added noise.
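
The first two relationships can be sketched as follows. Only the column names, the BMI formula, and the 5000-row count come from the description above; the height distribution and the weight coefficients are assumptions chosen to give plausible values, not the ones used to build the shipped dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000  # row count stated for the clinical dataset

# Assumed distribution for the base variable
height_cm = rng.normal(loc=170, scale=10, size=n)
# weight_kg depends linearly on height_cm plus noise (coefficients are assumptions)
weight_kg = 0.9 * height_cm - 85 + rng.normal(scale=8, size=n)
# bmi is a deterministic, nonlinear function of the two (formula from the text)
bmi = weight_kg / (height_cm / 100) ** 2

patients = pd.DataFrame({"height_cm": height_cm, "weight_kg": weight_kg, "bmi": bmi})
print(patients.describe().round(1))

# Worked example of the formula: 70 kg at 175 cm
example_bmi = 70 / (175 / 100) ** 2
print(round(example_bmi, 2))
```

Because bmi carries no noise of its own, a model flexible enough to learn the formula can predict it from weight_kg and height_cm almost exactly, which is what marks it as derived.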

Running Examples

The examples/ directory contains scripts to demonstrate the toolkit. To run them, navigate to the project root directory and execute:

# Run on the simple synthetic dataset
python examples/usage_example.py

# Run on the complex clinical dataset
python examples/clinical_example.py

Running Tests

The toolkit includes unit tests. To run them, use Python's built-in unittest module from the project root directory:

python -m unittest discover tests

The tests run on synthetic_dataset.csv and verify that the detection modules correctly identify its known relationships.
