This project implements an interpretable pre-flight machine learning system that predicts severity risk and incident categories for commercial flights using NASA ASRS (Aviation Safety Reporting System) data. The system provides calibrated probabilities, SHAP explanations, and a comprehensive Streamlit dashboard for operational use.
- Binary Risk Model: Predicts flight severity risk with calibrated probabilities
- Multiclass Classification: Categorizes incidents into Human Factors, Maintenance, Weather, Communication, and Other
- Explainable AI: SHAP-based explanations for both global and local interpretability
- Interactive Dashboard: Streamlit application for single-flight prediction, fleet analytics, and batch scoring
- Strict Leakage Prevention: Uses only pre-flight features to ensure operational validity
- Calibrated Predictions: Isotonic regression and Platt scaling for reliable probability estimates
Flight-predictive-Analysis/
├── config/
│   └── config.yaml # Project configuration
├── data/
│   ├── processed/ # Processed features and metadata
│   │   ├── feature_mapping.yaml
│   │   ├── feature_names.yaml
│   │   └── split_info.yaml
│   └── raw/ # Raw ASRS dataset (created during execution)
├── docs/ # Project documentation and reports
├── evaluation_results/ # Model evaluation configurations
├── models/
│   ├── binary_model_metadata.yaml
│   ├── feature_info.yaml
│   ├── multiclass_model_metadata.yaml
│   └── trained/
│       └── binary_severity_model.pkl
├── notebooks/ # Jupyter notebooks for analysis
│   ├── 01_data_acquisition_and_eda.ipynb
│   ├── 02_preprocessing_and_feature_engineering.ipynb
│   ├── 03_binary_severity_model.ipynb
│   ├── 04_multiclass_incident_classification.ipynb
│   ├── 05_shap_explanations.ipynb
│   ├── 06_dashboard_plots_and_visualizations.ipynb
│   └── 06_dashboard_plots_and_visualizations_executed.ipynb
├── results/ # Model results and outputs
│   ├── binary/
│   ├── dashboard/
│   ├── multiclass/
│   └── shap/
├── streamlit_app/
│   ├── app.py # Main Streamlit application
│   └── pages/
│       ├── 02_Models_Evaluation.py
│       ├── 03_Dashboard.py
│       ├── 04_Comparison_Novelty.py
│       └── 05_Future_Scope_Conclusion.py
├── requirements.txt # Python dependencies
└── README.md # This file
- Python 3.8 or higher
- Git
- 4GB+ RAM (for model training)
- Internet connection (for dataset download)
# Clone and setup
git clone <repository-url>
cd Flight-predictive-Analysis
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # macOS/Linux
pip install -r requirements.txt
# Run analysis notebooks
jupyter lab
# Execute notebooks 01-06 in order in the notebooks/ folder
# Run dashboard
cd streamlit_app
streamlit run app.py
# Open http://localhost:8501
- Clone the repository
git clone <repository-url>
cd Flight-predictive-Analysis
- Create virtual environment
python -m venv venv
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
- Install dependencies
pip install -r requirements.txt
- Verify project structure: The project comes with the necessary directory structure. Ensure all folders are present as shown in the Project Structure section above.
- Start Jupyter Lab/Notebook
# Start Jupyter Lab (recommended)
jupyter lab
# Or start Jupyter Notebook
jupyter notebook
- Execute notebooks in order: Navigate to the notebooks/ folder and run the following notebooks sequentially:
- 01_data_acquisition_and_eda.ipynb: Downloads ASRS dataset and performs exploratory data analysis
- 02_preprocessing_and_feature_engineering.ipynb: Cleans data and creates features
- 03_binary_severity_model.ipynb: Trains binary severity classification model
- 04_multiclass_incident_classification.ipynb: Trains multiclass incident categorization model
- 05_shap_explanations.ipynb: Generates SHAP explanations for model interpretability
- 06_dashboard_plots_and_visualizations.ipynb: Creates visualizations for the dashboard
Important: Run notebooks in the specified order as each depends on outputs from previous notebooks.
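The ordering requirement can also be enforced from a script. The following is a minimal sketch, assuming `jupyter nbconvert` is available in the active environment; the helper names `build_commands` and `run_all` are hypothetical and not part of the project.

```python
import subprocess
from pathlib import Path

# Notebook order as listed above; later notebooks read outputs of earlier ones.
NOTEBOOKS = [
    "01_data_acquisition_and_eda.ipynb",
    "02_preprocessing_and_feature_engineering.ipynb",
    "03_binary_severity_model.ipynb",
    "04_multiclass_incident_classification.ipynb",
    "05_shap_explanations.ipynb",
    "06_dashboard_plots_and_visualizations.ipynb",
]


def build_commands(notebook_dir="notebooks"):
    """Return one nbconvert command per notebook, in execution order."""
    base = ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace"]
    return [base + [(Path(notebook_dir) / name).as_posix()] for name in NOTEBOOKS]


def run_all(notebook_dir="notebooks"):
    """Execute every notebook in order, stopping at the first failure."""
    for cmd in build_commands(notebook_dir):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


# Print the commands without running them (a dry run).
commands = build_commands()
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_all()` from the repository root would then execute the six notebooks in sequence; because of `check=True`, a failing notebook stops the run before any dependent notebook starts.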
You can run specific notebooks individually if you need to:
# Run a specific notebook from command line
jupyter nbconvert --to notebook --execute notebooks/01_data_acquisition_and_eda.ipynb
# Or run and save output
jupyter nbconvert --to notebook --execute --inplace notebooks/01_data_acquisition_and_eda.ipynb
If you prefer running Python scripts:
# Convert notebook to Python script
jupyter nbconvert --to script notebooks/01_data_acquisition_and_eda.ipynb
# Run the generated Python script
python notebooks/01_data_acquisition_and_eda.py
Prerequisites: Ensure you have completed the notebook execution steps above to generate the required models and data.
# Navigate to the streamlit app directory
cd streamlit_app
# Run the Streamlit application
streamlit run app.py
The dashboard will be available at http://localhost:8501
Dashboard Features:
- Home: Project overview and navigation
- Models Evaluation: Model performance metrics and analysis
- Dashboard: Interactive flight risk assessment tool
- Comparison Novelty: Model comparison and novelty detection
- Future Scope: Project conclusions and future enhancements
Note: If you encounter any missing file errors, ensure all notebooks have been executed successfully and the required model files exist in the models/ and results/ directories.
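A quick way to see which files are missing before launching the dashboard is sketched below. The file list is taken from the project structure shown earlier; the helper `missing_artifacts` is a hypothetical name, and the actual set of files the dashboard needs may be larger.

```python
import tempfile
from pathlib import Path

# Paths taken from the project structure, relative to the repository root.
REQUIRED_ARTIFACTS = [
    "models/trained/binary_severity_model.pkl",
    "models/binary_model_metadata.yaml",
    "models/multiclass_model_metadata.yaml",
    "models/feature_info.yaml",
]


def missing_artifacts(root="."):
    """Return the required files that do not exist under root."""
    root = Path(root)
    return [rel for rel in REQUIRED_ARTIFACTS if not (root / rel).is_file()]


# Demonstration on a temporary directory that contains only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "models" / "feature_info.yaml"
    target.parent.mkdir(parents=True)
    target.touch()
    still_missing = missing_artifacts(tmp)

print(still_missing)
```

When the check is run from inside streamlit_app/, the repository root is one level up, so the call would be `missing_artifacts("..")`.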
- ROC-AUC: >0.85 (target)
- PR-AUC: Reported with confidence intervals
- Calibration: ECE <0.05 for reliable probabilities
- Features: Pre-flight only (aircraft, crew, weather, maintenance)
- Macro F1-Score: Primary evaluation metric
- Per-class Metrics: Precision, Recall, F1 for each category
- Class Handling: Weighted loss for imbalanced classes
- Categories: Human Factors, Maintenance, Weather, Communication, Other
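To make the class weighting concrete, here is a small sketch of the "balanced" heuristic used by scikit-learn, where each class weight is n_samples / (n_classes * class_count). The label counts are made up for illustration and do not reflect the ASRS data; whether the project applies exactly this formula to its loss is an assumption.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative label distribution (not real data): one frequent class, two rare ones.
labels = ["Human Factors"] * 6 + ["Maintenance"] * 2 + ["Weather"] * 2
weights = balanced_class_weights(labels)

for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

Rare classes receive larger weights, so misclassifying them costs more during training than misclassifying the dominant class.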
- LeakagePreventionValidator: Automatically identifies and rejects post-incident features
- Whitelist Approach: Only pre-flight features used for predictions
- Temporal Validation: Ensures no future information leakage
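As an illustration of keyword-based leakage screening, the sketch below partitions column names using the leakage_keywords values from config/config.yaml. It is not the project's LeakagePreventionValidator; the function name `split_by_leakage` and the example columns `injury_count` and `Event_Outcome` are hypothetical.

```python
# Keywords as listed under features.leakage_keywords in config/config.yaml.
LEAKAGE_KEYWORDS = ["outcome", "result", "damage", "injury"]


def split_by_leakage(columns, keywords=LEAKAGE_KEYWORDS):
    """Partition column names into (kept, rejected) by case-insensitive substring match."""
    kept, rejected = [], []
    for col in columns:
        name = col.lower()
        if any(keyword in name for keyword in keywords):
            rejected.append(col)  # looks like post-incident information
        else:
            kept.append(col)
    return kept, rejected


# Hypothetical column names for illustration.
columns = [
    "aircraft_make",
    "phase_of_flight",
    "weather_condition",
    "injury_count",
    "Event_Outcome",
]
kept, rejected = split_by_leakage(columns)
print("kept:", kept)
print("rejected:", rejected)
```

Substring matching is deliberately conservative and can reject harmless columns, which is one reason to pair it with an explicit whitelist of pre-flight features.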
- SHAP Integration: TreeExplainer for gradient boosting models
- Global Explanations: Feature importance across entire dataset
- Local Explanations: Per-prediction waterfall plots
- Feature Interactions: SHAP interaction values and heatmaps
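Local explanations such as waterfall plots rest on SHAP's additivity property: the base value plus the per-feature SHAP values equals the model output for that prediction. The sketch below shows this on a linear model, where (treating features as independent) the SHAP value of feature i is w_i * (x_i - mean_i). This is a simplified illustration, not TreeExplainer, and the weights and data are made up.

```python
def predict(weights, bias, x):
    """Linear model f(x) = bias + sum(w_i * x_i)."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def linear_shap(weights, background, x):
    """SHAP values for a linear model, assuming independent features."""
    n = len(background)
    means = [sum(row[i] for row in background) / n for i in range(len(weights))]
    return [w * (xi - m) for w, xi, m in zip(weights, x, means)]


# Made-up model and background data for illustration.
weights, bias = [0.5, -0.2, 0.1], 0.3
background = [[1.0, 2.0, 0.0], [3.0, 0.0, 4.0], [2.0, 4.0, 2.0]]
x = [4.0, 1.0, 2.0]

phi = linear_shap(weights, background, x)
# Base value: the average model output over the background data.
base_value = sum(predict(weights, bias, row) for row in background) / len(background)

print("SHAP values:", phi)
print("base + sum(phi):", base_value + sum(phi))
print("model output:   ", predict(weights, bias, x))
```

The last two printed numbers agree up to floating-point rounding, which is the same relationship a waterfall plot visualizes for a tree model.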
- Isotonic Regression: Non-parametric calibration method
- Platt Scaling: Sigmoid-based calibration alternative
- Reliability Curves: Visual calibration assessment
- Expected Calibration Error: Quantitative calibration metric
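For reference, Expected Calibration Error can be computed as the bin-weighted average gap between the mean predicted probability and the observed positive rate. The sketch below uses equal-width bins; the bin count of 10 and the binning scheme are assumptions, since the project's exact ECE settings are not stated here.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width bins over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))

    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)   # positive rate
        predicted = sum(p for _, p in members) / len(members)  # mean confidence
        ece += len(members) / total * abs(observed - predicted)
    return ece


# Slightly under-confident predictions: each bin is off by 0.1.
ece_off = expected_calibration_error([0, 0, 1, 1], [0.1, 0.1, 0.9, 0.9])
# Perfectly calibrated within its bin: half of the 0.5 predictions are positive.
ece_ok = expected_calibration_error([0, 1], [0.5, 0.5])
print(round(ece_off, 3), round(ece_ok, 3))
```

A model meets the ECE < 0.05 target when this gap, averaged over bins by sample count, stays below five percentage points.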
- System overview and navigation
- Key performance metrics
- Model status indicators
- Interactive input form
- Real-time risk assessment
- SHAP explanations
- Actionable recommendations
- Risk distribution analysis
- Incident category trends
- Temporal risk patterns
- Alert system for high-risk periods
- ROC and PR curves
- Calibration plots
- Feature importance analysis
- Performance monitoring
- CSV file upload
- Bulk prediction processing
- Results visualization
- Export functionality
import joblib
import pandas as pd

# Load the trained model and the fitted preprocessor
binary_model = joblib.load('models/trained/binary_severity_model.pkl')
preprocessor = joblib.load('models/preprocessor.joblib')

# Prepare flight data
flight_data = pd.DataFrame({
    'aircraft_make': ['Boeing'],
    'aircraft_model': ['737-800'],
    'phase_of_flight': ['Cruise'],
    'weather_condition': ['VMC'],
    'crew_experience_level': ['Experienced'],
    # ... other features
})

# Generate prediction (column 1 is the probability of the positive class)
X_processed = preprocessor.transform(flight_data)
risk_probability = binary_model.predict_proba(X_processed)[0, 1]
print(f"Flight Risk Probability: {risk_probability:.3f}")

# Load batch data
batch_data = pd.read_csv('flight_batch.csv')

# Process predictions
X_batch = preprocessor.transform(batch_data)
risk_scores = binary_model.predict_proba(X_batch)[:, 1]

# Add results to dataframe
batch_data['risk_score'] = risk_scores
batch_data['high_risk_flag'] = (risk_scores > 0.7).astype(int)

# Export results
batch_data.to_csv('batch_predictions.csv', index=False)

The project keeps its configuration and metadata files in the following locations:
- config/config.yaml: Main project configuration
- data/processed/: Feature mappings and processing metadata
- models/: Model metadata and configuration
- evaluation_results/: Dashboard and analysis configurations
Key configuration files:
# config/config.yaml - Main project settings
data:
  dataset_name: "elihoole/asrs-aviation-reports"
  target_columns:
    binary: "severe_flag"
    multiclass: "incident_category"
models:
  binary:
    algorithms: ["logistic_regression", "xgboost"]
    calibration_methods: ["isotonic", "platt"]
  multiclass:
    algorithms: ["random_forest", "xgboost"]
    class_weight: "balanced"
features:
  leakage_keywords: ["outcome", "result", "damage", "injury"]
  max_cardinality: 50
evaluation:
  test_size: 0.2
  validation_size: 0.2
  random_state: 42

- ROC-AUC: Area under ROC curve
- PR-AUC: Area under Precision-Recall curve
- Brier Score: Calibration quality measure
- Expected Calibration Error (ECE): Reliability assessment
- Confusion Matrix: At optimal threshold
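Two of these metrics are simple enough to sketch directly. The Brier score is the mean squared difference between predicted probability and outcome, and the confusion matrix depends on the chosen threshold. The example below uses made-up predictions and the 0.7 cut-off from the batch scoring example; how the project selects its "optimal" threshold is not specified here, so the threshold is left as a parameter.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def confusion_at_threshold(y_true, y_prob, threshold):
    """Count TN, FP, FN, TP when predicting positive for p > threshold."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p > threshold else 0
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Made-up labels and probabilities for illustration.
y_true = [0, 0, 1, 1, 1]
y_prob = [0.2, 0.8, 0.6, 0.9, 0.75]

brier = brier_score(y_true, y_prob)
matrix = confusion_at_threshold(y_true, y_prob, threshold=0.7)
print(f"Brier score: {brier:.4f}")
print(matrix)
```

Lower Brier scores are better; a constant prediction of 0.5 scores 0.25 regardless of the labels, which gives a useful reference point.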
- Accuracy: Overall classification accuracy
- Macro F1-Score: Unweighted average F1 across classes
- Weighted F1-Score: Sample-weighted average F1
- Per-class Precision/Recall: Individual class performance
- Confusion Matrix: Multi-class classification matrix
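The difference between macro and weighted F1 matters when classes are imbalanced. The sketch below computes both from scratch on made-up labels; the function names are hypothetical, and in practice scikit-learn's `f1_score` with `average="macro"` or `average="weighted"` would normally be used instead.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for every class seen in labels or predictions."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred):
    """Return (unweighted mean F1, support-weighted mean F1)."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[cls] * support[cls] for cls in scores) / len(y_true)
    return macro, weighted


# Made-up example: the rare class (Maintenance) is predicted worse.
y_true = ["Weather"] * 4 + ["Maintenance"] * 2
y_pred = ["Weather"] * 5 + ["Maintenance"]

macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro F1: {macro:.4f}, weighted F1: {weighted:.4f}")
```

Macro F1 comes out lower here because it gives the poorly predicted rare class the same influence as the frequent one, which is why it serves as the primary metric for imbalanced incident categories.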
Jupyter Notebook Issues
- Kernel not starting: Run pip install ipykernel and restart Jupyter
- Module not found: Ensure the virtual environment is activated and all dependencies are installed
- Notebook won't execute: Check that the previous notebooks completed successfully
- Out of memory: Restart the kernel and run notebooks individually

Memory Errors
- Reduce batch size in processing
- Use data chunking for large datasets
- Increase system RAM or use cloud computing
- Restart Jupyter kernel between notebooks

Model Loading Errors
- Ensure all notebooks have been run in order
- Check file paths in config/config.yaml
- Verify model files exist in models/trained/ directory
- Re-run notebook 03 (binary model) if binary_severity_model.pkl is missing

Dashboard Not Loading
- Check Streamlit installation: pip install streamlit
- Verify port 8501 is available
- Run from streamlit_app/ directory
- Ensure models and results files exist from notebook execution

Data Download Issues
- Check internet connection
- Verify Hugging Face datasets library: pip install datasets
- Try manual dataset download if automated fails
- Clear cache: rm -rf ~/.cache/huggingface/ (Linux/Mac) or delete cache folder (Windows)

Missing Files Errors
- feature_info.yaml missing: Re-run notebook 02 (preprocessing)
- model_metadata.yaml missing: Re-run notebooks 03 and 04 (model training)
- results files missing: Re-run notebooks 03-06 in sequence

- Model Training: Use GPU acceleration for XGBoost
- SHAP Calculations: Use TreeExplainer for faster computation
- Dashboard: Enable caching with @st.cache_data
- Batch Processing: Process in chunks for large datasets
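Chunked batch processing can be sketched without any project code. In the example below, `dummy_score` is a hypothetical stand-in for the preprocessor and model, and the chunk size of 1000 is an arbitrary choice for illustration.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def score_in_chunks(rows, score_fn, chunk_size=1000):
    """Apply score_fn to one chunk at a time and collect all scores."""
    scores = []
    for chunk in iter_chunks(rows, chunk_size):
        scores.extend(score_fn(chunk))
    return scores


def dummy_score(chunk):
    """Hypothetical placeholder: a real scorer would transform and predict."""
    return [0.5 for _ in chunk]


rows = [{"flight_id": i} for i in range(2500)]
chunk_sizes = [len(chunk) for chunk in iter_chunks(rows, 1000)]
scores = score_in_chunks(rows, dummy_score, chunk_size=1000)
print(chunk_sizes, len(scores))
```

With pandas, the same pattern is available through `pd.read_csv(path, chunksize=...)`, which yields one DataFrame per chunk so that only part of a large CSV is held in memory at a time.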
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Follow PEP 8 style guidelines
- Add docstrings to all functions
- Include unit tests for new features
- Update documentation for API changes
- Ensure backward compatibility
This project is licensed under the MIT License - see the LICENSE file for details.
- NASA ASRS: Aviation Safety Reporting System for providing the dataset
- Hugging Face: For hosting and maintaining the dataset
- SHAP: For explainable AI capabilities
- Streamlit: For the interactive dashboard framework
- scikit-learn: For machine learning algorithms and utilities
- Real-time Integration: API endpoints for live flight data
- Advanced Models: Deep learning and ensemble methods
- Additional Data Sources: Weather APIs, maintenance records
- Mobile Dashboard: Responsive design for mobile devices
- Automated Retraining: MLOps pipeline for model updates
- A/B Testing: Framework for model comparison