yashnayan8795/Flight-Risk-Analysis

โœˆ๏ธ Interpretable Risk Scoring and Incident Classification for Commercial Flights

📋 Project Overview

This project implements an interpretable pre-flight machine learning system that predicts severity risk and incident categories for commercial flights using NASA ASRS (Aviation Safety Reporting System) data. The system provides calibrated probabilities, SHAP explanations, and a comprehensive Streamlit dashboard for operational use.

🎯 Key Features

  • Binary Risk Model: Predicts flight severity risk with calibrated probabilities
  • Multiclass Classification: Categorizes incidents into Human Factors, Maintenance, Weather, Communication, and Other
  • Explainable AI: SHAP-based explanations for both global and local interpretability
  • Interactive Dashboard: Streamlit application for single-flight prediction, fleet analytics, and batch scoring
  • Strict Leakage Prevention: Uses only pre-flight features to ensure operational validity
  • Calibrated Predictions: Isotonic regression and Platt scaling for reliable probability estimates

๐Ÿ—๏ธ Project Structure

Flight-predictive-Analysis/
โ”œโ”€โ”€ ๐Ÿ“ config/
โ”‚   โ””โ”€โ”€ config.yaml             # Project configuration
โ”œโ”€โ”€ ๐Ÿ“ data/
โ”‚   โ”œโ”€โ”€ processed/              # Processed features and metadata
โ”‚   โ”‚   โ”œโ”€โ”€ feature_mapping.yaml
โ”‚   โ”‚   โ”œโ”€โ”€ feature_names.yaml
โ”‚   โ”‚   โ””โ”€โ”€ split_info.yaml
โ”‚   โ””โ”€โ”€ raw/                    # Raw ASRS dataset (created during execution)
โ”œโ”€โ”€ ๐Ÿ“ docs/                    # Project documentation and reports
โ”œโ”€โ”€ ๐Ÿ“ evaluation_results/      # Model evaluation configurations
โ”œโ”€โ”€ ๐Ÿ“ models/
โ”‚   โ”œโ”€โ”€ binary_model_metadata.yaml
โ”‚   โ”œโ”€โ”€ feature_info.yaml
โ”‚   โ”œโ”€โ”€ multiclass_model_metadata.yaml
โ”‚   โ””โ”€โ”€ trained/
โ”‚       โ””โ”€โ”€ binary_severity_model.pkl
โ”œโ”€โ”€ ๐Ÿ“ notebooks/               # Jupyter notebooks for analysis
โ”‚   โ”œโ”€โ”€ 01_data_acquisition_and_eda.ipynb
โ”‚   โ”œโ”€โ”€ 02_preprocessing_and_feature_engineering.ipynb
โ”‚   โ”œโ”€โ”€ 03_binary_severity_model.ipynb
โ”‚   โ”œโ”€โ”€ 04_multiclass_incident_classification.ipynb
โ”‚   โ”œโ”€โ”€ 05_shap_explanations.ipynb
โ”‚   โ”œโ”€โ”€ 06_dashboard_plots_and_visualizations.ipynb
โ”‚   โ””โ”€โ”€ 06_dashboard_plots_and_visualizations_executed.ipynb
โ”œโ”€โ”€ ๐Ÿ“ results/                 # Model results and outputs
โ”‚   โ”œโ”€โ”€ binary/
โ”‚   โ”œโ”€โ”€ dashboard/
โ”‚   โ”œโ”€โ”€ multiclass/
โ”‚   โ””โ”€โ”€ shap/
โ”œโ”€โ”€ ๐Ÿ“ streamlit_app/
โ”‚   โ”œโ”€โ”€ app.py                  # Main Streamlit application
โ”‚   โ””โ”€โ”€ pages/
โ”‚       โ”œโ”€โ”€ 02_Models_Evaluation.py
โ”‚       โ”œโ”€โ”€ 03_Dashboard.py
โ”‚       โ”œโ”€โ”€ 04_Comparison_Novelty.py
โ”‚       โ””โ”€โ”€ 05_Future_Scope_Conclusion.py
โ”œโ”€โ”€ requirements.txt            # Python dependencies
โ””โ”€โ”€ README.md                   # This file

🚀 Quick Start

Prerequisites

  • Python 3.8 or higher
  • Git
  • 4GB+ RAM (for model training)
  • Internet connection (for dataset download)

⚡ Quick Setup (TL;DR)

# Clone and setup
git clone <repository-url>
cd Flight-predictive-Analysis
python -m venv venv
venv\Scripts\activate  # Windows
# source venv/bin/activate  # macOS/Linux
pip install -r requirements.txt

# Run analysis notebooks
jupyter lab
# Execute notebooks 01-06 in order in the notebooks/ folder

# Run dashboard
cd streamlit_app
streamlit run app.py
# Open http://localhost:8501

Installation

  1. Clone the repository

    git clone <repository-url>
    cd Flight-predictive-Analysis
  2. Create virtual environment

    python -m venv venv
    
    # Windows
    venv\Scripts\activate
    
    # macOS/Linux
    source venv/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Verify project structure: The repository ships with the required directory structure. Confirm that all folders shown in the Project Structure section above are present (data/raw/ is created during execution).

📊 Running the Analysis

Option 1: Using Jupyter Notebooks (Recommended)

  1. Start Jupyter Lab/Notebook

    # Start Jupyter Lab (recommended)
    jupyter lab
    
    # Or start Jupyter Notebook
    jupyter notebook
  2. Execute notebooks in order: Navigate to the notebooks/ folder and run the following notebooks sequentially:

    • 01_data_acquisition_and_eda.ipynb: Downloads ASRS dataset and performs exploratory data analysis
    • 02_preprocessing_and_feature_engineering.ipynb: Cleans data and creates features
    • 03_binary_severity_model.ipynb: Trains binary severity classification model
    • 04_multiclass_incident_classification.ipynb: Trains multiclass incident categorization model
    • 05_shap_explanations.ipynb: Generates SHAP explanations for model interpretability
    • 06_dashboard_plots_and_visualizations.ipynb: Creates visualizations for the dashboard

    Important: Run notebooks in the specified order as each depends on outputs from previous notebooks.

Option 2: Running Individual Notebooks

You can run specific notebooks individually if you need to:

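# Execute a notebook from the command line; the executed copy is written to a
# separate file (01_data_acquisition_and_eda.nbconvert.ipynb)
jupyter nbconvert --to notebook --execute notebooks/01_data_acquisition_and_eda.ipynb

# Or execute and overwrite the original notebook with its outputs
jupyter nbconvert --to notebook --execute --inplace notebooks/01_data_acquisition_and_eda.ipynb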

Option 3: Convert to Python Scripts

If you prefer running Python scripts:

# Convert notebook to Python script
jupyter nbconvert --to script notebooks/01_data_acquisition_and_eda.ipynb

# Run the generated Python script
python notebooks/01_data_acquisition_and_eda.py

๐Ÿ–ฅ๏ธ Running the Streamlit Dashboard

Prerequisites: Ensure you have completed the notebook execution steps above to generate the required models and data.

# Navigate to the streamlit app directory
cd streamlit_app

# Run the Streamlit application
streamlit run app.py

The dashboard will be available at http://localhost:8501

Dashboard Features:

  • Home: Project overview and navigation
  • Models Evaluation: Model performance metrics and analysis
  • Dashboard: Interactive flight risk assessment tool
  • Comparison Novelty: Model comparison and novelty detection
  • Future Scope: Project conclusions and future enhancements

Note: If you encounter any missing file errors, ensure all notebooks have been executed successfully and the required model files exist in the models/ and results/ directories.

📈 Model Performance

Binary Severity Model

  • ROC-AUC: >0.85 (target)
  • PR-AUC: Reported with confidence intervals
  • Calibration: ECE <0.05 for reliable probabilities
  • Features: Pre-flight only (aircraft, crew, weather, maintenance)

Multiclass Incident Classification

  • Macro F1-Score: Primary evaluation metric
  • Per-class Metrics: Precision, Recall, F1 for each category
  • Class Handling: Weighted loss for imbalanced classes
  • Categories: Human Factors, Maintenance, Weather, Communication, Other
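
One common way to weight a loss for imbalanced classes is the "balanced" heuristic used by scikit-learn, where each class gets the weight n_samples / (n_classes * class_count). The sketch below shows that formula in plain Python. The function name and the label counts are made up for illustration and are not project results.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative labels only; these counts are invented for the example.
labels = ["Human Factors"] * 6 + ["Weather"] * 3 + ["Maintenance"] * 1
weights = balanced_class_weights(labels)
for cls, weight in sorted(weights.items()):
    print(f"{cls}: {weight:.3f}")
```

Rare classes receive proportionally larger weights, so misclassifying them costs more during training.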

🔍 Key Features & Capabilities

๐Ÿ›ก๏ธ Data Leakage Prevention

  • LeakagePreventionValidator: Automatically identifies and rejects post-incident features
  • Whitelist Approach: Only pre-flight features used for predictions
  • Temporal Validation: Ensures no future information leakage
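
As a rough illustration of keyword-based leakage screening, the sketch below partitions column names by substring match against the leakage_keywords listed in config/config.yaml. It is not the project's LeakagePreventionValidator. The function name and the column names are hypothetical.

```python
def split_leaky_columns(columns, leakage_keywords):
    """Partition column names into (allowed, rejected) by keyword match."""
    allowed, rejected = [], []
    for name in columns:
        lowered = name.lower()
        # Reject any column whose name contains a leakage keyword.
        if any(keyword in lowered for keyword in leakage_keywords):
            rejected.append(name)
        else:
            allowed.append(name)
    return allowed, rejected


# Keywords as in the sample config; the column names are hypothetical.
LEAKAGE_KEYWORDS = ["outcome", "result", "damage", "injury"]
columns = ["aircraft_make", "weather_condition", "Damage_Level", "event_outcome"]
allowed, rejected = split_leaky_columns(columns, LEAKAGE_KEYWORDS)
print("allowed:", allowed)
print("rejected:", rejected)
```

A keyword screen of this kind only catches columns whose names reveal their post-incident nature, which is why it is usually paired with an explicit whitelist.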

📊 Explainable AI

  • SHAP Integration: TreeExplainer for gradient boosting models
  • Global Explanations: Feature importance across entire dataset
  • Local Explanations: Per-prediction waterfall plots
  • Feature Interactions: SHAP interaction values and heatmaps

🎯 Calibrated Predictions

  • Isotonic Regression: Non-parametric calibration method
  • Platt Scaling: Sigmoid-based calibration alternative
  • Reliability Curves: Visual calibration assessment
  • Expected Calibration Error: Quantitative calibration metric
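
Expected Calibration Error can be defined in several ways. The sketch below uses a common equal-width binning variant: predictions are grouped by probability, and the gap between mean predicted probability and observed positive rate is averaged with bin-size weights. The bin count and the sample values are assumptions for illustration, and the project notebooks may compute ECE differently.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |observed rate - mean probability|."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        # Clamp the index so that prob == 1.0 falls into the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((label, prob))

    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # Empty bins contribute nothing.
        observed = sum(label for label, _ in members) / len(members)
        predicted = sum(prob for _, prob in members) / len(members)
        ece += (len(members) / total) * abs(observed - predicted)
    return ece


# Invented labels and probabilities, for illustration only.
y_true = [0, 0, 1, 1]
y_prob = [0.15, 0.35, 0.65, 0.85]
ece = expected_calibration_error(y_true, y_prob)
print(f"ECE: {ece:.3f}")
```

A lower value means predicted probabilities track observed frequencies more closely.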

📱 Interactive Dashboard

Home Page

  • System overview and navigation
  • Key performance metrics
  • Model status indicators

Single Flight Prediction

  • Interactive input form
  • Real-time risk assessment
  • SHAP explanations
  • Actionable recommendations

Fleet Dashboard

  • Risk distribution analysis
  • Incident category trends
  • Temporal risk patterns
  • Alert system for high-risk periods

Model Analytics

  • ROC and PR curves
  • Calibration plots
  • Feature importance analysis
  • Performance monitoring

Batch Scoring

  • CSV file upload
  • Bulk prediction processing
  • Results visualization
  • Export functionality

📋 Usage Examples

Single Flight Risk Assessment

import joblib
import pandas as pd

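# Load models (paths are relative to the repository root)
# The project tree lists the binary model at models/trained/binary_severity_model.pkl
binary_model = joblib.load('models/trained/binary_severity_model.pkl')
# The preprocessor artifact is not listed in the project tree; adjust this path to your saved preprocessor
preprocessor = joblib.load('models/preprocessor.joblib')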

# Prepare flight data
flight_data = pd.DataFrame({
    'aircraft_make': ['Boeing'],
    'aircraft_model': ['737-800'],
    'phase_of_flight': ['Cruise'],
    'weather_condition': ['VMC'],
    'crew_experience_level': ['Experienced'],
    # ... other features
})

# Generate prediction
X_processed = preprocessor.transform(flight_data)
risk_probability = binary_model.predict_proba(X_processed)[0, 1]

print(f"Flight Risk Probability: {risk_probability:.3f}")

Batch Processing

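# Load batch data (assumes pandas is imported as pd, and that binary_model and
# preprocessor are loaded as in the single-flight example above)
batch_data = pd.read_csv('flight_batch.csv')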

# Process predictions
X_batch = preprocessor.transform(batch_data)
risk_scores = binary_model.predict_proba(X_batch)[:, 1]

# Add results to dataframe
batch_data['risk_score'] = risk_scores
batch_data['high_risk_flag'] = (risk_scores > 0.7).astype(int)

# Export results
batch_data.to_csv('batch_predictions.csv', index=False)

โš™๏ธ Configuration

The project uses configuration and metadata files in several directories:

  • config/config.yaml: Main project configuration
  • data/processed/: Feature mappings and processing metadata
  • models/: Model metadata and configuration
  • evaluation_results/: Dashboard and analysis configurations

Main configuration file:

# config/config.yaml - Main project settings
data:
  dataset_name: "elihoole/asrs-aviation-reports"
  target_columns:
    binary: "severe_flag"
    multiclass: "incident_category"

models:
  binary:
    algorithms: ["logistic_regression", "xgboost"]
    calibration_methods: ["isotonic", "platt"]
  multiclass:
    algorithms: ["random_forest", "xgboost"]
    class_weight: "balanced"

features:
  leakage_keywords: ["outcome", "result", "damage", "injury"]
  max_cardinality: 50

evaluation:
  test_size: 0.2
  validation_size: 0.2
  random_state: 42

📊 Evaluation Metrics

Binary Model Metrics

  • ROC-AUC: Area under ROC curve
  • PR-AUC: Area under Precision-Recall curve
  • Brier Score: Mean squared error between predicted probabilities and observed outcomes (lower is better)
  • Expected Calibration Error (ECE): Reliability assessment
  • Confusion Matrix: At optimal threshold
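
The Brier score in the list above is the mean squared difference between each predicted probability and the 0/1 outcome. A minimal sketch in plain Python follows. The sample values are invented for illustration and are not project results.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    return sum((prob - label) ** 2 for label, prob in zip(y_true, y_prob)) / len(y_true)


# Invented labels and probabilities, for illustration only.
y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.3]
score = brier_score(y_true, y_prob)
print(f"Brier score: {score:.4f}")
```

A score of 0 means every prediction was both certain and correct, while a constant prediction of 0.5 always scores 0.25.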

Multiclass Model Metrics

  • Accuracy: Overall classification accuracy
  • Macro F1-Score: Unweighted average F1 across classes
  • Weighted F1-Score: Sample-weighted average F1
  • Per-class Precision/Recall: Individual class performance
  • Confusion Matrix: Multi-class classification matrix
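
The difference between macro and weighted F1 matters when classes are imbalanced: macro F1 treats every class equally, while weighted F1 scales each class by its number of true samples. The sketch below computes both from scratch. The function names and the tiny label set are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {class: F1} for every class that appears in y_true or y_pred."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred):
    """Return (macro F1, support-weighted F1)."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[cls] * support[cls] for cls in scores) / len(y_true)
    return macro, weighted


# Invented labels: four Weather reports and two Maintenance reports.
y_true = ["Weather"] * 4 + ["Maintenance"] * 2
y_pred = ["Weather"] * 5 + ["Maintenance"] * 1
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro F1: {macro:.4f}, weighted F1: {weighted:.4f}")
```

In this example the minority class is predicted less well, so macro F1 comes out lower than weighted F1.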

🔧 Troubleshooting

Common Issues

  1. Jupyter Notebook Issues

    • Kernel not starting: Run pip install ipykernel and restart Jupyter
    • Module not found: Ensure virtual environment is activated and all dependencies installed
    • Notebook won't execute: Check if previous notebooks completed successfully
    • Out of memory: Restart kernel and run notebooks individually
  2. Memory Errors

    • Reduce batch size in processing
    • Use data chunking for large datasets
    • Increase system RAM or use cloud computing
    • Restart Jupyter kernel between notebooks
  3. Model Loading Errors

    • Ensure all notebooks have been run in order
    • Check file paths in config/config.yaml
    • Verify model files exist in models/trained/ directory
    • Re-run notebook 03 (binary model) if binary_severity_model.pkl is missing
  4. Dashboard Not Loading

    • Check Streamlit installation: pip install streamlit
    • Verify port 8501 is available
    • Run from streamlit_app/ directory
    • Ensure models and results files exist from notebook execution
  5. Data Download Issues

    • Check internet connection
    • Verify Hugging Face datasets library: pip install datasets
    • Try manual dataset download if automated fails
    • Clear cache: rm -rf ~/.cache/huggingface/ (Linux/Mac) or delete cache folder (Windows)
  6. Missing Files Errors

    • feature_info.yaml missing: Re-run notebook 02 (preprocessing)
    • binary_model_metadata.yaml or multiclass_model_metadata.yaml missing: Re-run notebook 03 or 04, respectively (model training)
    • results files missing: Re-run notebooks 03-06 in sequence

Performance Optimization

  • Model Training: Use GPU acceleration for XGBoost
  • SHAP Calculations: Use TreeExplainer for faster computation
  • Dashboard: Enable caching with @st.cache_data
  • Batch Processing: Process in chunks for large datasets
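
Chunked batch processing can be sketched with a small generator that yields fixed-size groups of rows, so that only one chunk is scored at a time. The helper names and the dummy scoring function are hypothetical stand-ins for the preprocessing and predict_proba steps. With pandas, passing chunksize to pd.read_csv gives a comparable iterator over DataFrames.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def score_in_chunks(rows, score_fn, chunk_size=1000):
    """Apply score_fn to each chunk and concatenate the results."""
    scores = []
    for chunk in iter_chunks(rows, chunk_size):
        scores.extend(score_fn(chunk))
    return scores


def dummy_score(chunk):
    """Hypothetical stand-in for preprocessing plus predict_proba."""
    return [0.5 for _ in chunk]


scores = score_in_chunks(range(2500), dummy_score, chunk_size=1000)
chunk_sizes = [len(chunk) for chunk in iter_chunks(range(2500), 1000)]
print(len(scores), chunk_sizes)
```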

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow PEP 8 style guidelines
  • Add docstrings to all functions
  • Include unit tests for new features
  • Update documentation for API changes
  • Ensure backward compatibility

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • NASA ASRS: Aviation Safety Reporting System for providing the dataset
  • Hugging Face: For hosting and maintaining the dataset
  • SHAP: For explainable AI capabilities
  • Streamlit: For the interactive dashboard framework
  • scikit-learn: For machine learning algorithms and utilities

🔮 Future Enhancements

  • Real-time Integration: API endpoints for live flight data
  • Advanced Models: Deep learning and ensemble methods
  • Additional Data Sources: Weather APIs, maintenance records
  • Mobile Dashboard: Responsive design for mobile devices
  • Automated Retraining: MLOps pipeline for model updates
  • A/B Testing: Framework for model comparison
