maynard242/Classification-Pizza

Random Acts of Pizza Classification

A machine learning project to predict which Reddit requests for free pizza result in a positive or negative response.

📋 Project Overview

This project uses natural language processing and machine learning to analyze and classify pizza requests from the Reddit community r/RandomActsOfPizza. The goal is to predict whether a request will receive a pizza based on various features including text content, user history, and temporal patterns.

🆕 Modern Updates (2025)

This codebase has been modernized with:

  • Updated Dependencies: All libraries upgraded to Python 3.11+ compatible versions
  • Modern Python Practices: Type hints, pathlib, f-strings, and PEP 8 compliance
  • Refactored Code: Object-oriented design with reusable components
  • Enhanced Models: Added gradient boosting and modern ensemble methods
  • Improved Feature Engineering: Automated feature extraction pipelines
  • Better Documentation: Comprehensive docstrings and usage examples

Key Improvements

1. Deprecated Import Updates

  • ✅ pandas.io.json.json_normalize → pd.json_normalize
  • ✅ sklearn.grid_search → sklearn.model_selection
  • ✅ sklearn.cross_validation → sklearn.model_selection
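
As a rough illustration of the replacements listed above, the sketch below exercises the current import paths on toy data. The nested records and the tiny training set are invented for this example and are not taken from the project.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV  # formerly sklearn.grid_search

# Hypothetical nested records, used only to show the flattening behaviour.
records = [
    {"request_id": "t3_a", "requester": {"karma": 10, "age_days": 3.5}},
    {"request_id": "t3_b", "requester": {"karma": 250, "age_days": 400.0}},
]

# pd.json_normalize replaces pandas.io.json.json_normalize.
# Nested keys become dotted column names such as "requester.karma".
flat = pd.json_normalize(records)

# Toy data: one feature, two balanced classes.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Grid search and cross-validation now both live in sklearn.model_selection.
search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0]}, cv=2)
search.fit(X, y)
best_c = search.best_params_["C"]
```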

2. Modern ML Methods

  • Added Gradient Boosting classifiers
  • Improved feature engineering pipeline
  • Better cross-validation strategies
  • Modern sentiment analysis with VADER

3. Code Quality

  • Removed hardcoded paths - now uses pathlib
  • Added comprehensive type hints
  • Refactored repetitive code into reusable classes
  • Improved variable naming and code organization
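
A minimal sketch of what replacing hardcoded paths with pathlib and type hints can look like. The function name `resolve_data_files` is hypothetical and not necessarily what `pizza_classifier.py` uses; the file names follow the project structure described in this README.

```python
from pathlib import Path


def resolve_data_files(project_root: Path) -> dict[str, Path]:
    """Build paths to the JSON data files relative to the project root."""
    data_dir = project_root / "data"
    return {
        "train": data_dir / "train.json",
        "test": data_dir / "test.json",
    }


# The root is passed in rather than baked into the code,
# so the same function works from any checkout location.
paths = resolve_data_files(Path("Classification-Pizza"))
```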

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/maynard242/Classification-Pizza.git
cd Classification-Pizza

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download NLTK data (required for text processing)
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"

Basic Usage

Using the Modern Python Module

from pizza_classifier import (
    PizzaDataLoader,
    FeatureEngineering,
    TextPreprocessor,
    PizzaClassifier
)

# Load data
loader = PizzaDataLoader()
train_df, dev_df, test_df = loader.load_data()

# Feature engineering
fe = FeatureEngineering()
train_df = fe.extract_temporal_features(train_df)
train_df = fe.extract_sentiment_features(train_df)
train_df = fe.extract_user_features(train_df)

# Create text corpus
preprocessor = TextPreprocessor()
train_corpus = preprocessor.create_text_corpus(train_df)

# Train classifier
classifier = PizzaClassifier(model_type='random_forest')
classifier.fit(
    train_corpus,
    fe.create_feature_matrix(train_df),
    train_df['requester_received_pizza'].values
)

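# Apply the same feature engineering to the dev set
dev_df = fe.extract_temporal_features(dev_df)
dev_df = fe.extract_sentiment_features(dev_df)
dev_df = fe.extract_user_features(dev_df)
dev_corpus = preprocessor.create_text_corpus(dev_df)
dev_features = fe.create_feature_matrix(dev_df)

# Make predictions
predictions = classifier.predict(dev_corpus, dev_features)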

Using Legacy Notebooks

The original Jupyter notebooks are preserved in the repository:

  • Clean_Notebook_Compiled.ipynb - Main analysis notebook
  • W207_Final_Project_Baseline_v4.ipynb - Baseline models
  • Learning_Notebook_Compiled.ipynb - Experimental approaches

📊 Results

Our best models achieve:

  • Random Forest: 83% accuracy, F1-score 0.81, AUC 0.71
  • Logistic Regression: 74% accuracy, AUC 0.73
  • Gradient Boosting: 80%+ accuracy with proper tuning
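
Accuracy, F1, and AUC measure different things, which is why a model can lead on one and trail on another. The sketch below computes all three from scratch on made-up labels and scores. It is not the project's evaluation code, and the F1 here is for the positive class only, which may differ from the averaging used for the figures above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_positive(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 2 positives among 8 requests.
y_true = [1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.2, 0.4, 0.1, 0.3, 0.6, 0.05, 0.15]
y_pred = [int(s >= 0.5) for s in scores]

acc = accuracy(y_true, y_pred)
f1 = f1_positive(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

AUC depends only on how the scores rank the requests, while accuracy and F1 depend on the chosen threshold (0.5 in this sketch).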

Key findings:

  • Temporal features (hour, day) are strong predictors
  • Text features with bigrams perform well
  • Sentiment analysis provides marginal improvement
  • Simple models (Logistic Regression, AUC 0.73) are competitive with more complex ensembles (Random Forest, AUC 0.71)

🔍 Code Review Findings

Issues in Original Code

  1. Deprecated Libraries (Fixed ✅)

    • Old sklearn imports
    • Outdated pandas JSON normalization
    • Legacy string formatting
  2. Code Duplication (Fixed ✅)

    • Repetitive data preprocessing
    • Duplicate model evaluation code
    • Multiple similar notebooks
  3. Hardcoded Values (Fixed ✅)

    • File paths
    • Magic numbers
    • Configuration scattered throughout
  4. Missing Best Practices (Fixed ✅)

    • No type hints
    • Limited error handling
    • No virtual environment specification
    • Inconsistent naming conventions

📁 Project Structure

Classification-Pizza/
├── data/
│   ├── train.json          # Training data
│   └── test.json           # Test data
├── pizza_classifier.py     # ✨ NEW: Modern implementation
├── requirements.txt        # ✨ NEW: Dependency management
├── README.md               # ✨ UPDATED: This file
├── Clean_Notebook_Compiled.ipynb
├── W207_Final_Project_Baseline_v4.ipynb
└── [Other legacy notebooks...]

🛠️ Technology Stack

  • Python: 3.11+
  • ML Frameworks: scikit-learn, XGBoost, LightGBM
  • NLP: NLTK, VADER Sentiment
  • Data: pandas, numpy
  • Visualization: matplotlib, seaborn

📝 Features

Text Features

  • TF-IDF / Count vectorization
  • N-gram analysis (unigrams, bigrams, trigrams)
  • Sentiment scores (VADER)
  • Text preprocessing and cleaning
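
To make the n-gram idea concrete, here is a small standard-library sketch that counts unigrams, bigrams, and trigrams in one request. The tokenizer and the sample sentence are assumptions for illustration, and the project's own preprocessing may differ.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(text: str, max_n: int = 3) -> Counter:
    """Count every n-gram of length 1..max_n, joined with single spaces."""
    tokens = tokenize(text)
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts


counts = ngram_counts("Will pay it forward, will pay back")
```

In scikit-learn, `CountVectorizer` and `TfidfVectorizer` produce the same kind of features across a whole corpus when given `ngram_range=(1, 3)`.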

Temporal Features

  • Hour of day
  • Day of week
  • Month/seasonality
  • Weekend indicator
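
The temporal features above can all be derived from a single request timestamp. The sketch below assumes the timestamp is in Unix seconds, UTC (as in the Kaggle data's `unix_timestamp_of_request_utc` field, if that is the column used). The function name is hypothetical.

```python
from datetime import datetime, timezone


def temporal_features(unix_ts: float) -> dict[str, int]:
    """Derive hour, weekday, month, and a weekend flag from a UTC timestamp."""
    moment = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    return {
        "hour": moment.hour,
        "day_of_week": moment.weekday(),  # Monday=0 ... Sunday=6
        "month": moment.month,
        "is_weekend": int(moment.weekday() >= 5),
    }


# 1357387200 is 2013-01-05 12:00:00 UTC, a Saturday.
features = temporal_features(1357387200)
```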

User Features

  • Account age
  • Reddit karma (upvotes/downvotes)
  • Comment/post ratios
  • Subreddit activity
  • Previous pizza requests
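
A hedged sketch of how karma and ratio features might be computed for one request. The field names are assumed to match the Kaggle JSON keys, and the derived feature names are hypothetical. Note the guards against division by zero for brand-new accounts.

```python
def user_features(row: dict) -> dict[str, float]:
    """Compute karma and activity ratios from one request record."""
    net = row["requester_upvotes_minus_downvotes_at_request"]
    total = row["requester_upvotes_plus_downvotes_at_request"]
    comments = row["requester_number_of_comments_at_request"]
    posts = row["requester_number_of_posts_at_request"]
    return {
        # Recover the two vote counts from their sum and difference.
        "upvotes": (total + net) / 2,
        "downvotes": (total - net) / 2,
        "upvote_share": (total + net) / (2 * total) if total else 0.0,
        "comments_per_post": comments / posts if posts else 0.0,
    }


# Made-up records: one active user and one brand-new account.
active = user_features({
    "requester_upvotes_minus_downvotes_at_request": 60,
    "requester_upvotes_plus_downvotes_at_request": 100,
    "requester_number_of_comments_at_request": 30,
    "requester_number_of_posts_at_request": 10,
})
new_user = user_features({
    "requester_upvotes_minus_downvotes_at_request": 0,
    "requester_upvotes_plus_downvotes_at_request": 0,
    "requester_number_of_comments_at_request": 0,
    "requester_number_of_posts_at_request": 0,
})
```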

🎯 Models Implemented

  • Logistic Regression (L1/L2 regularization)
  • Random Forest
  • Naive Bayes
  • Support Vector Machines
  • Gradient Boosting
  • AdaBoost
  • Ensemble methods
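
One way a `model_type` string could map to scikit-learn estimators is sketched below. This is an assumption about the design, not the actual contents of `pizza_classifier.py`. Only `'random_forest'` appears in the usage example; the other keys and the `make_model` name are hypothetical.

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def make_model(model_type: str, random_state: int = 0):
    """Return an unfitted scikit-learn estimator for a short model name."""
    factories = {
        "logistic_regression": lambda: LogisticRegression(max_iter=1000),
        "random_forest": lambda: RandomForestClassifier(random_state=random_state),
        "naive_bayes": lambda: MultinomialNB(),
        "svm": lambda: LinearSVC(),
        "gradient_boosting": lambda: GradientBoostingClassifier(random_state=random_state),
        "adaboost": lambda: AdaBoostClassifier(random_state=random_state),
    }
    if model_type not in factories:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    # Build the estimator lazily so only the requested one is constructed.
    return factories[model_type]()


model = make_model("random_forest")
```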

📚 Data

Data is from the Kaggle competition: Random Acts of Pizza

The dataset contains:

  • 4,040 training samples
  • 1,631 test samples
  • 32 features per sample
  • Binary classification target

Data is located in the data/ directory.
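
For a quick look at the data without the full pipeline, a sketch like the following may help. It assumes each data file is a JSON array of request objects and uses the `requester_received_pizza` target from the usage example. The helper names are hypothetical, and the demo writes a throwaway file rather than reading the real data.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_requests(path: Path) -> list[dict]:
    """Read a JSON array of request records from disk."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def class_balance(requests: list[dict],
                  target: str = "requester_received_pizza") -> Counter:
    """Count how many requests did and did not receive a pizza."""
    return Counter(bool(record[target]) for record in requests)


# Made-up sample standing in for data/train.json.
sample = [
    {"request_id": "a", "requester_received_pizza": True},
    {"request_id": "b", "requester_received_pizza": False},
    {"request_id": "c", "requester_received_pizza": False},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "train.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    balance = class_balance(load_requests(sample_path))
```

Checking the class balance first is useful context for reading accuracy figures, since a skewed target makes a high accuracy easier to reach.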

🀝 Contributing

This is an educational project. Feel free to fork and experiment!

📄 License

MIT License - see LICENSE file for details

👥 Original Authors

W207 Final Project Team:

  • Erika Lawrence
  • Leslie Teo
  • Jen Jen Chen
  • Geoff Stirling

🔄 Modernization

Code modernization and updates: 2025

About

Code to predict free pizza deliveries
