Multilingual topic modeling pipeline for student feedback analysis. Handles Cebuano, Tagalog, English, and code-switching using LaBSE embeddings and BERTopic.
- Preprocesses raw student feedback (cleans noise, drops gibberish)
- Embeds text using LaBSE (multilingual sentence embeddings)
- Discovers topics using BERTopic (UMAP + HDBSCAN + c-TF-IDF)
- Evaluates with NPMI coherence, topic diversity, silhouette score
- Reports results to Discord for async review
- Python 3.11
- LaBSE via sentence-transformers (768-dim multilingual embeddings)
- BERTopic for topic modeling
- CUDA GPU support (RTX 2060, 6GB VRAM)
- Discord notifications via bot token
```bash
cd /home/yander/Documents/codes/faculytics/topic-modeling.faculytics
uv venv .venv --python 3.11
source .venv/bin/activate
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
uv pip install sentence-transformers bertopic umap-learn hdbscan gensim scikit-learn
uv pip install pandas numpy tqdm requests python-dotenv
```

Create `.env` in the repo root:

```
DISCORD_BOT_TOKEN=your_bot_token_here
```

Place in `data/`:

- `uc_dataset_20krows1.csv` — real UC student feedback
- `feedback_augmented_v1.json` — curated augmented dataset
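The embedding cache in `embed.py` likely works along these lines. This is a hedged sketch: `embed_with_cache` is a hypothetical helper, and keying the cache on a content hash is an assumed design; with LaBSE, `embed_fn` would be `model.encode` from sentence-transformers.

```python
from pathlib import Path
import hashlib

import numpy as np

def embed_with_cache(texts, embed_fn, dataset_name, cache_dir="data", force=False):
    """Embed texts, caching to data/embeddings_cache_<name>_<hash>.npy.
    Sketch only; the real embed.py may key and store the cache differently."""
    # Key the cache on the dataset name plus a hash of the texts,
    # so a changed dataset invalidates the cache automatically.
    digest = hashlib.sha1("\n".join(texts).encode("utf-8")).hexdigest()[:8]
    cache_path = Path(cache_dir) / f"embeddings_cache_{dataset_name}_{digest}.npy"
    if cache_path.exists() and not force:
        return np.load(cache_path)
    embeddings = np.asarray(embed_fn(texts), dtype=np.float32)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    np.save(cache_path, embeddings)
    return embeddings
```

The `force` flag corresponds to the `--force-embed` behavior: skip the cache lookup and recompute.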
```bash
# Baseline run on augmented dataset
python scripts/run_eval.py --dataset augmented --run-id 001

# Run on real dataset with custom params
python scripts/run_eval.py --dataset real --run-id 002 \
    --min-topic-size 10 \
    --umap-n-neighbors 20

# Force re-embed (ignore cache)
python scripts/run_eval.py --dataset augmented --run-id 003 --force-embed

# Skip Discord notification
python scripts/run_eval.py --dataset augmented --run-id test --no-notify
```

For async runs (Y.A.L.A. triggers, Yander sleeps):
- Y.A.L.A. runs `python scripts/run_eval.py --dataset augmented --run-id 001`
- Results auto-post to the Discord channel `#topic-modeling-results`
- Yander reviews in the morning and provides feedback
- Y.A.L.A. adjusts params and runs the next iteration
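Posting results via a bot token might look like this minimal sketch. `format_run_summary` and `post_to_discord` are hypothetical names (the real `notify.py` may differ); the endpoint and `Bot` authorization header are the standard Discord REST API for creating a channel message.

```python
import os

import requests

DISCORD_API = "https://discord.com/api/v10"

def format_run_summary(run_id: str, metrics: dict) -> str:
    """Format a run's metrics as a Discord-friendly message."""
    lines = [f"**Run {run_id}** finished"]
    lines += [f"- {name}: {value:.3f}" for name, value in metrics.items()]
    return "\n".join(lines)

def post_to_discord(channel_id: str, content: str) -> None:
    """Post a message using the bot token loaded from .env into the environment."""
    resp = requests.post(
        f"{DISCORD_API}/channels/{channel_id}/messages",
        headers={"Authorization": f"Bot {os.environ['DISCORD_BOT_TOKEN']}"},
        json={"content": content},
        timeout=30,
    )
    resp.raise_for_status()
```

Keeping message formatting separate from the HTTP call makes the summary easy to inspect without hitting the network.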
```
topic-modeling.faculytics/
├── src/
│   ├── config.py                # Constants, paths, device selection
│   ├── preprocess.py            # Text cleaning for multilingual feedback
│   ├── embed.py                 # LaBSE embedding with caching
│   ├── topic_model.py           # BERTopic wrapper
│   ├── evaluate.py              # Metrics: NPMI, diversity, silhouette
│   └── notify.py                # Discord posting
├── scripts/
│   └── run_eval.py              # Main evaluation entry point
├── data/
│   ├── uc_dataset_20krows1.csv
│   ├── feedback_augmented_v1.json
│   └── embeddings_cache_*.npy   # Auto-generated
├── experiments/
│   └── run_001/                 # Per-run artifacts
│       ├── params.json
│       ├── metrics.json
│       └── topics.md
├── .env                         # Discord bot token (gitignored)
├── TUNING_LOG.md                # Y.A.L.A.'s persistent memory
└── README.md
```
| Metric | Target | Description |
|---|---|---|
| NPMI Coherence | > 0.1 | Topic word co-occurrence quality |
| Topic Diversity | > 0.7 | Uniqueness of keywords across topics |
| Outlier Ratio | < 20% | Documents not assigned to any topic |
| Silhouette Score | > 0.0 | Cluster separation quality |
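Two of the table's metrics are simple enough to define inline. This sketch assumes the standard definitions (topic diversity as the fraction of unique words across each topic's top-k keywords; outlier ratio as the share of documents BERTopic assigns to topic `-1`); the function names are illustrative, not necessarily those in `evaluate.py`.

```python
def topic_diversity(topic_words: list[list[str]]) -> float:
    """Fraction of unique words across all topics' top-k keyword lists.
    1.0 means no keyword is shared between topics."""
    all_words = [w for topic in topic_words for w in topic]
    return len(set(all_words)) / len(all_words)

def outlier_ratio(topic_assignments: list[int]) -> float:
    """Fraction of documents left in BERTopic's outlier topic (-1)."""
    return sum(t == -1 for t in topic_assignments) / len(topic_assignments)
```

NPMI coherence and silhouette score need the corpus co-occurrence counts and the embedding space respectively, so they are typically delegated to gensim and scikit-learn rather than hand-rolled.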
- LaBSE handles code-switching naturally (trained on 109 languages)
- First embed run downloads ~2GB model, cached thereafter
- Embeddings are cached per dataset — delete `data/embeddings_cache_*.npy` to regenerate
- Experiment artifacts are saved to `experiments/run_{id}/`