A machine learning project that predicts whether Reddit requests for free pizza will receive a positive or negative response.
This project uses natural language processing and machine learning to analyze and classify pizza requests from the Reddit community r/RandomActsOfPizza. The goal is to predict whether a request will receive a pizza based on various features including text content, user history, and temporal patterns.
This codebase has been modernized with:
- Updated Dependencies: All libraries upgraded to Python 3.11+ compatible versions
- Modern Python Practices: Type hints, pathlib, f-strings, and PEP 8 compliance
- Refactored Code: Object-oriented design with reusable components
- Enhanced Models: Added gradient boosting and modern ensemble methods
- Improved Feature Engineering: Automated feature extraction pipelines
- Better Documentation: Comprehensive docstrings and usage examples
- ✅ `pandas.io.json.json_normalize` → `pd.json_normalize`
- ✅ `sklearn.grid_search` → `sklearn.model_selection`
- ✅ `sklearn.cross_validation` → `sklearn.model_selection`
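For reference, the migrations in code form (a minimal sketch; the sample record is invented):

```python
import pandas as pd

# Before: from pandas.io.json import json_normalize
# After:  json_normalize is exposed at the top level of pandas
records = [{"request": {"title": "Hungry student", "text": "Any help appreciated"}}]
df = pd.json_normalize(records)  # columns: request.title, request.text

# Before: from sklearn.grid_search import GridSearchCV
#         from sklearn.cross_validation import cross_val_score
# After:  both now live in sklearn.model_selection
from sklearn.model_selection import GridSearchCV, cross_val_score
```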
- Added Gradient Boosting classifiers
- Improved feature engineering pipeline
- Better cross-validation strategies
- Modern sentiment analysis with VADER
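The VADER scorer can be used on its own via the standalone `vaderSentiment` package; a minimal sketch (the request text is invented):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# Returns neg/neu/pos proportions plus a compound score in [-1, 1]
scores = analyzer.polarity_scores("Lost my job this week; a pizza would mean a lot!")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```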
- Removed hardcoded paths; now uses `pathlib`
- Added comprehensive type hints
- Refactored repetitive code into reusable classes
- Improved variable naming and code organization
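A small illustration of the `pathlib` + type-hint style; `load_split` is a hypothetical helper, not the actual module API:

```python
from pathlib import Path

import pandas as pd

DATA_DIR = Path(__file__).parent / "data"

def load_split(name: str) -> pd.DataFrame:
    """Load one JSON split (e.g. 'train' or 'test') from the data directory."""
    return pd.read_json(DATA_DIR / f"{name}.json")
```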
```bash
# Clone the repository
git clone https://github.com/yourusername/Classification-Pizza.git
cd Classification-Pizza

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download NLTK data (required for text processing)
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"
```

```python
from pizza_classifier import (
    PizzaDataLoader,
    FeatureEngineering,
    TextPreprocessor,
    PizzaClassifier,
)

# Load data
loader = PizzaDataLoader()
train_df, dev_df, test_df = loader.load_data()

# Feature engineering
fe = FeatureEngineering()
train_df = fe.extract_temporal_features(train_df)
train_df = fe.extract_sentiment_features(train_df)
train_df = fe.extract_user_features(train_df)

# Create text corpus
preprocessor = TextPreprocessor()
train_corpus = preprocessor.create_text_corpus(train_df)

# Train classifier
classifier = PizzaClassifier(model_type='random_forest')
classifier.fit(
    train_corpus,
    fe.create_feature_matrix(train_df),
    train_df['requester_received_pizza'].values,
)

# Prepare the dev set the same way before predicting
dev_df = fe.extract_temporal_features(dev_df)
dev_df = fe.extract_sentiment_features(dev_df)
dev_df = fe.extract_user_features(dev_df)
dev_corpus = preprocessor.create_text_corpus(dev_df)
dev_features = fe.create_feature_matrix(dev_df)

# Make predictions
predictions = classifier.predict(dev_corpus, dev_features)
```

The original Jupyter notebooks are preserved in the repository:
- `Clean_Notebook_Compiled.ipynb` - Main analysis notebook
- `W207_Final_Project_Baseline_v4.ipynb` - Baseline models
- `Learning_Notebook_Compiled.ipynb` - Experimental approaches
Our best models achieve:
- Random Forest: 83% accuracy, F1-score 0.81, AUC 0.71
- Logistic Regression: 74% accuracy, AUC 0.73
- Gradient Boosting: 80%+ accuracy with proper tuning
Key findings:
- Temporal features (hour, day) are strong predictors
- Text features with bigrams perform well
- Sentiment analysis provides marginal improvement
- Simple models (Logistic Regression) are competitive with complex ensembles
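These numbers come from the notebooks; as a reference point, here is a self-contained sketch of how accuracy, F1, and AUC are computed with scikit-learn (synthetic data stands in for the real feature matrix):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder features/labels; in the project these come from
# FeatureEngineering.create_feature_matrix and the target column.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = rng.integers(0, 2, size=500)
X_tr, X_dev, y_tr, y_dev = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_dev)
y_prob = model.predict_proba(X_dev)[:, 1]  # positive-class probability for AUC

print(f"Accuracy: {accuracy_score(y_dev, y_pred):.2f}")
print(f"F1-score: {f1_score(y_dev, y_pred):.2f}")
print(f"AUC:      {roc_auc_score(y_dev, y_prob):.2f}")
```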
- Deprecated Libraries (Fixed ✅)
  - Old sklearn imports
  - Outdated pandas JSON normalization
  - Legacy string formatting
- Code Duplication (Fixed ✅)
  - Repetitive data preprocessing
  - Duplicate model evaluation code
  - Multiple similar notebooks
- Hardcoded Values (Fixed ✅)
  - File paths
  - Magic numbers
  - Configuration scattered throughout
- Missing Best Practices (Fixed ✅)
  - No type hints
  - Limited error handling
  - No virtual environment specification
  - Inconsistent naming conventions
```
Classification-Pizza/
├── data/
│   ├── train.json                        # Training data
│   └── test.json                         # Test data
├── pizza_classifier.py                   # ✨ NEW: Modern implementation
├── requirements.txt                      # ✨ NEW: Dependency management
├── README.md                             # ✨ UPDATED: This file
├── Clean_Notebook_Compiled.ipynb
├── W207_Final_Project_Baseline_v4.ipynb
└── [Other legacy notebooks...]
```
- Python: 3.11+
- ML Frameworks: scikit-learn, XGBoost, LightGBM
- NLP: NLTK, VADER Sentiment
- Data: pandas, numpy
- Visualization: matplotlib, seaborn
- TF-IDF / Count vectorization
- N-gram analysis (unigrams, bigrams, trigrams)
- Sentiment scores (VADER)
- Text preprocessing and cleaning
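A minimal sketch of this vectorization with scikit-learn (the sample requests are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "First time posting here, would really love a pizza",
    "Broke college student, a pizza would make my week",
]

# Unigrams + bigrams; swap in CountVectorizer for raw counts
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X_text = vectorizer.fit_transform(corpus)
print(X_text.shape)
print(vectorizer.get_feature_names_out()[:5])
```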
- Hour of day
- Day of week
- Month/seasonality
- Weekend indicator
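These can all be derived from the request timestamp; a sketch assuming the standard Kaggle field name `unix_timestamp_of_request` (timestamps shown are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"unix_timestamp_of_request": [1317849007, 1332652424]})
ts = pd.to_datetime(df["unix_timestamp_of_request"], unit="s")

df["hour"] = ts.dt.hour                  # hour of day (0-23)
df["day_of_week"] = ts.dt.dayofweek      # Monday = 0
df["month"] = ts.dt.month                # month/seasonality
df["is_weekend"] = ts.dt.dayofweek >= 5  # Saturday/Sunday indicator
```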
- Account age
- Reddit karma (upvotes/downvotes)
- Comment/post ratios
- Subreddit activity
- Previous pizza requests
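A sketch of deriving ratio-style user features, assuming the standard Kaggle field names:

```python
import numpy as np
import pandas as pd

def add_user_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive simple user-history features (field names per the Kaggle JSON)."""
    out = df.copy()
    out["comment_post_ratio"] = df["requester_number_of_comments_at_request"] / (
        df["requester_number_of_posts_at_request"] + 1  # +1 avoids divide-by-zero
    )
    out["log_account_age"] = np.log1p(df["requester_account_age_in_days_at_request"])
    return out
```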
- Logistic Regression (L1/L2 regularization)
- Random Forest
- Naive Bayes
- Support Vector Machines
- Gradient Boosting
- AdaBoost
- Ensemble methods
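For the ensemble methods, one common pattern is scikit-learn's soft-voting combination of the listed models (hyperparameters here are illustrative):

```python
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression

# Soft voting averages the predicted probabilities of each member
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(penalty="l2", C=1.0, max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)
# ensemble.fit(X_train, y_train); ensemble.predict_proba(X_dev)
```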
Data is from the Kaggle competition: Random Acts of Pizza
The dataset contains:
- 4,040 training samples
- 1,630 test samples
- 32 features per sample
- Binary classification target
Data is located in the /data directory.
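The JSON splits load directly with pandas; a minimal sketch:

```python
import pandas as pd

# Each split is a JSON array of request records
train_df = pd.read_json("data/train.json")
test_df = pd.read_json("data/test.json")
print(train_df.shape)  # expect 4,040 rows, per the counts above
```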
This is an educational project. Feel free to fork and experiment!
MIT License - see LICENSE file for details
W207 Final Project Team:
- Erika Lawrence
- Leslie Teo
- Jen Jen Chen
- Geoff Stirling
Code modernization and updates: 2025