A content-based filtering recommendation engine built on a 183K-item Amazon apparel dataset, using TF-IDF vectorization and cosine similarity to surface relevant product matches.
This project scrapes and preprocesses a large-scale Amazon apparel dataset, then builds a recommendation pipeline that matches products based on textual and categorical attributes — without relying on user history or collaborative signals.
- Data pipeline: Handles missing values, duplicates, and inconsistent categories across 183K+ items
- TF-IDF vectorization: Encodes 10+ product attributes (title, category, brand, description, etc.) into feature vectors
- Cosine similarity matching: Computes pairwise similarity to retrieve the most relevant recommendations
- Threshold tuning: Iteratively calibrated similarity thresholds and feature weights to improve precision
- Baseline comparison: Evaluated against a popularity-based baseline across 5+ apparel categories
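
The cleaning step in the data pipeline might look roughly like the sketch below. The column names (`title`, `brand`, `category`), the deduplication key, and the `clean_catalog` helper are assumptions for illustration; the project's actual schema and rules are not shown here.

```python
import pandas as pd


def clean_catalog(df: pd.DataFrame) -> pd.DataFrame:
    """Drop unusable rows, normalize categories, and remove duplicates.

    Hypothetical helper: column names and rules are illustrative only.
    """
    out = df.copy()
    # A product without a title has no text to match on, so drop it.
    out = out.dropna(subset=["title"])
    # Optional text fields become empty strings instead of NaN.
    for col in ["brand", "category"]:
        out[col] = out[col].fillna("")
    # Normalize inconsistent category spellings: trim and lowercase.
    out["category"] = out["category"].str.strip().str.lower()
    # Treat rows with the same title and brand as duplicates (assumed key).
    out = out.drop_duplicates(subset=["title", "brand"])
    return out.reset_index(drop=True)


# Toy rows, not taken from the real dataset.
raw = pd.DataFrame({
    "title": ["Blue Cotton Shirt", "Blue Cotton Shirt", None, "Red Summer Dress"],
    "brand": ["Acme", "Acme", "Acme", None],
    "category": [" Shirts", "shirts ", "Shirts", "Dresses"],
})
cleaned = clean_catalog(raw)
print(cleaned)
```

On the toy input, the untitled row is dropped, the two identical shirts collapse into one, and the categories end up lowercase and trimmed.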
| Metric | Score |
|---|---|
| Recommendation precision | ~85% |
| Reduction in irrelevant results | ~30% over baseline |
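
One common way to compute a precision figure like the one above is precision@k: the share of the top-k recommendations that appear in a manually labeled relevant set. The sketch below assumes that definition. The `precision_at_k` helper and the product IDs are hypothetical and do not reproduce the project's results.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are labeled relevant."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k)


# Toy example with made-up product IDs (not real evaluation data).
recommended = ["B01", "B02", "B03", "B04"]
relevant = {"B01", "B03", "B04"}
score = precision_at_k(recommended, relevant, k=4)
print(score)  # 3 of the 4 recommendations are relevant -> 0.75
```

Averaging this score over many query products in each category gives one aggregate number per category.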
- Python
- Scikit-learn — TF-IDF, cosine similarity
- Pandas — data preprocessing and feature engineering
- NumPy — numerical operations
- Raw dataset is scraped and cleaned (null handling, deduplication, category normalization)
- Product attributes are combined and vectorized using `TfidfVectorizer`
- Cosine similarity matrix is computed across all items
- Given a query product, the top-N most similar items are returned
- Precision is evaluated against manually labeled relevant items per category
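
The vectorize-and-retrieve steps above can be sketched with scikit-learn as follows. The toy products, the three attribute columns, and the `top_n_similar` helper are assumptions for illustration; the real pipeline combines more attributes and may weight them differently.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalog (made-up rows, not from the real dataset).
items = pd.DataFrame({
    "title": ["blue cotton shirt", "navy cotton shirt",
              "red summer dress", "leather running shoes"],
    "brand": ["acme", "acme", "bella", "stride"],
    "category": ["shirts", "shirts", "dresses", "footwear"],
})

# Combine the attributes into one text field per product.
text = items["title"] + " " + items["brand"] + " " + items["category"]

# Fit TF-IDF on the combined text; the result is a sparse matrix
# with one row per product.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(text)


def top_n_similar(query_idx, n=2):
    """Return (row index, similarity) pairs for the n closest products."""
    # Compare the query row against every row; shape (1, n_items) -> 1-D.
    scores = cosine_similarity(tfidf[query_idx], tfidf).ravel()
    # Exclude the query product itself from the ranking.
    scores[query_idx] = -1.0
    order = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in order]


matches = top_n_similar(0, n=2)
print(matches)
```

This sketch scores one query row at a time rather than materializing the full item-by-item matrix. At roughly 183,000 items, a dense all-pairs matrix would hold about 33.5 billion entries, so per-query or batched computation is one way to keep memory manageable.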
Amazon apparel dataset — ~183,000 items across multiple categories (shirts, dresses, footwear, etc.)