This repository showcases an end-to-end data engineering and machine learning pipeline for classifying disaster-related messages in real time. It is designed to ingest raw text streams (e.g., from social media and SMS) and instantly categorize distress signals so they can be routed to the appropriate emergency response agencies. The project demonstrates robust ETL pipeline construction, Natural Language Processing (NLP) over high-dimensional text features, and deployment of a `RandomForestClassifier` served through a scalable Flask web application.
For an extensive dive into the mathematical decisioning and architectural trade-offs behind this project, see TECHNICAL.md.
```mermaid
flowchart LR
    A[Raw Data CSVs] -->|ETL Script| B(Pandas Cleaning)
    B --> C[(SQLite Database)]
    C -->|ML Pipeline| D{TF-IDF Tokenizer}
    D --> E[Random Forest Classifier]
    E --> F((Model.pkl))
    F -->|Flask App| G[Web Deployment]
    G --> H[End User Visualizations]
```
- Deterministic ETL: Extracts textual data, imputes missing values, engineers deterministic categorization features, and normalizes output into SQLite.
- Robust ML Pipeline: Implements `scikit-learn` via a customized pipeline containing an NLTK-based tokenizer, TF-IDF vectorizer, and a computationally optimized `RandomForestClassifier` wrapped within a `MultiOutputClassifier`.
- Full-Stack Presentation: Dynamic inference via a Python/Flask web front-end containing interactive Plotly charts tracking dataset distribution metrics.
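The pipeline composition described above can be sketched as follows. This is a hedged, self-contained illustration, not the project's exact code: the project uses an NLTK-based tokenizer (e.g., with lemmatization), for which a simple regex tokenizer stands in here, and the estimator settings are assumptions.

```python
# Sketch of a TF-IDF -> multi-output Random Forest pipeline.
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline


def tokenize(text):
    """Lowercase and split into word tokens (stand-in for the NLTK tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_pipeline():
    """TF-IDF features feeding one RandomForest per output category."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(tokenizer=tokenize, token_pattern=None)),
        ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=50))),
    ])
```

Wrapping the forest in `MultiOutputClassifier` fits one classifier per category column, which is what allows a single `.predict()` call to emit every disaster label at once.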
Ensure you have an active Python virtual environment (e.g., venv or conda), and run the following commands to initialize the pipeline dependencies.
```bash
git clone https://github.com/stephengardnerd/DataEngineering_MLPipeline.git
cd DataEngineering_MLPipeline
pip install -r requirements.txt
```

This ingestion script merges the raw disaster datasets, applies robust preprocessing, and streams the cleaned output into a relational SQLite database.
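Conceptually, the merge-and-clean core of that ingestion step resembles the sketch below. The column names (`id`, `categories`) and the semicolon-delimited `related-1;request-0;...` category format are assumptions based on the standard layout of this dataset, and the table name `messages` is hypothetical.

```python
# Hedged sketch of the ETL step: merge the two CSVs, expand the category
# string into binary columns, de-duplicate, and write to SQLite.
import sqlite3

import pandas as pd


def clean(messages: pd.DataFrame, categories: pd.DataFrame) -> pd.DataFrame:
    """Merge messages with categories and expand categories into 0/1 columns."""
    df = messages.merge(categories, on="id")
    # "related-1;request-0;..." -> one column per category name
    cats = df["categories"].str.split(";", expand=True)
    cats.columns = [value.split("-")[0] for value in cats.iloc[0]]
    for col in cats.columns:
        cats[col] = cats[col].str[-1].astype(int)  # keep the trailing 0/1 digit
    return pd.concat([df.drop(columns="categories"), cats], axis=1).drop_duplicates()


def save(df: pd.DataFrame, db_path: str) -> None:
    """Persist the cleaned frame into a SQLite table for the ML stage."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql("messages", conn, index=False, if_exists="replace")
```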
```bash
python disaster_response_pipeline_project/data/process_data.py \
    disaster_response_pipeline_project/data/disaster_messages.csv \
    disaster_response_pipeline_project/data/disaster_categories.csv \
    disaster_response_pipeline_project/data/DisasterResponse.db
```

This command loads the normalized data, trains the Random Forest algorithm (leveraging a MultiOutput wrapper), and persists the output as a `.pkl` for dynamic inference.
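The load-and-persist flow inside `train_classifier.py` might look roughly like this sketch. The table name `messages` and the column layout (an `id` column, a `message` column, and the remaining columns as category targets) are assumptions, not verified against the repository.

```python
# Hedged sketch of loading training data from SQLite and pickling the model.
import pickle
import sqlite3

import pandas as pd


def load_data(db_path: str):
    """Read the cleaned table and split it into messages (X) and labels (Y)."""
    with sqlite3.connect(db_path) as conn:
        df = pd.read_sql("SELECT * FROM messages", conn)
    X = df["message"]
    Y = df.drop(columns=["id", "message"])  # remaining columns are the categories
    return X, Y


def save_model(model, pkl_path: str) -> None:
    """Serialize the fitted pipeline so the Flask app can load it for inference."""
    with open(pkl_path, "wb") as f:
        pickle.dump(model, f)
```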
```bash
python disaster_response_pipeline_project/models/train_classifier.py \
    disaster_response_pipeline_project/data/DisasterResponse.db \
    disaster_response_pipeline_project/models/classifier.pkl
```

Initiate the routing application to interface with the predictive model in real time.
```bash
cd disaster_response_pipeline_project/app
python run.py ../data/DisasterResponse.db ../models/classifier.pkl
```

Navigate to http://0.0.0.0:3001/ to view the running implementation.
Author: Stephen D. Gardner