This repository contains a full pipeline for the detection and classification of Taglish (Tagalog-English) microaggressions. It includes data generation scripts, preprocessing tools, and a transformer-based classification model.
git clone https://github.com/kndlcero/nlp.git
cd nlp
It is highly recommended to use a virtual environment to manage your Python dependencies.
Create the environment:
python -m venv venv
Activate the environment:
- Windows (PowerShell):
.\venv\Scripts\Activate.ps1
- Windows (CMD):
.\venv\Scripts\activate
- Mac/Linux:
source venv/bin/activate
VS Code Setup:
- Open the Command Palette (Ctrl+Shift+P)
- Type "Python: Select Interpreter"
- Choose the one pointing to your local ./venv
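To confirm that the interpreter in use really is the one inside ./venv, a quick check along these lines may help. This is a minimal sketch using only the standard library; `in_virtualenv` is a hypothetical helper name, not part of this repository.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("virtual environment active:", in_virtualenv())
```

If the second line reports False, the environment has most likely not been activated in the current terminal.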
pip install -r requirements.txt
The pipeline scripts are designed to be run in numbered order.
Local Data Preparation:
Run scripts 1 through 5 locally to generate and clean the dataset:
python pipeline/1_synthetic_generator.py
python pipeline/2_real_world_loader.py
python pipeline/3_manual_loader.py
python pipeline/4_concatenator.py
python pipeline/5_enhancement.py
Cloud Training (Recommended):
Due to local GPU/CUDA limitations, it is advised to run 6_training_pipeline.py on Google Colab.
- Upload the generated taglish_microaggression_enhanced_v2.csv to Colab
- Execute the training cell
- Expected Metric: F1 score between 60% and 70%
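The local preparation scripts (1 through 5) that produce the CSV uploaded to Colab can also be chained from a single Python script that stops at the first failure. The following is a rough sketch, not part of the repository: `run_in_order` and `PIPELINE_SCRIPTS` are hypothetical names, and it assumes the scripts are launched from the repository root.

```python
import subprocess
import sys
from pathlib import Path

# Script names as listed in this README, relative to the repository root.
PIPELINE_SCRIPTS = [
    "pipeline/1_synthetic_generator.py",
    "pipeline/2_real_world_loader.py",
    "pipeline/3_manual_loader.py",
    "pipeline/4_concatenator.py",
    "pipeline/5_enhancement.py",
]


def run_in_order(scripts, cwd="."):
    """Run each script with the current interpreter, stopping at the first failure.

    Returns the list of scripts that completed successfully.
    """
    completed = []
    for script in scripts:
        if not (Path(cwd) / script).is_file():
            print(f"missing: {script}")
            break
        # sys.executable keeps the child process inside the active venv.
        result = subprocess.run([sys.executable, script], cwd=cwd)
        if result.returncode != 0:
            print(f"failed: {script} (exit code {result.returncode})")
            break
        completed.append(script)
    return completed
```

Calling `run_in_order(PIPELINE_SCRIPTS)` from the repository root runs the five steps in sequence. Comparing the returned list against `PIPELINE_SCRIPTS` shows whether every step finished.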
If you wish to use the model immediately:
- Download Assets: Download the model files from this Google Drive Folder
- Place Assets: Ensure your folder structure matches the diagram below
- Run Inference:
python microaggression_classifier.py
For the classifier to run correctly, ensure your local directory is organized as follows:
Microaggression/
├── pipeline/
│ ├── 1_synthetic_generator.py
│ ├── 2_real_world_loader.py
│ ├── 3_manual_loader.py
│ ├── 4_concatenator.py
│ ├── 5_enhancement.py
│ ├── 6_training_pipeline.py
│ ├── all_samples.txt
│ ├── other .csv files
│ └── taglish_microaggression_enhanced_v2.csv
├── taglish_tokenizer/
│ ├── sentencepiece.bpe.model
│ ├── special_tokens_map.json
│ └── tokenizer_config.json
├── venv/
├── .gitignore
├── best_microaggression_model.pt
├── label_mappings.json
├── microaggression_classifier.py
├── model_config.json
└── requirements.txt
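Before running inference, it can be useful to check that the downloaded assets sit where the layout above places them. The sketch below assumes that the model, configuration, and tokenizer files shown in the diagram are the ones the classifier needs. The README does not state the exact set the script loads, and `missing_assets` is a hypothetical helper.

```python
from pathlib import Path

# Asset paths taken from the directory diagram, relative to the project root.
# Assumption: these are the files microaggression_classifier.py expects.
REQUIRED_FILES = [
    "best_microaggression_model.pt",
    "label_mappings.json",
    "model_config.json",
    "microaggression_classifier.py",
    "taglish_tokenizer/sentencepiece.bpe.model",
    "taglish_tokenizer/special_tokens_map.json",
    "taglish_tokenizer/tokenizer_config.json",
]


def missing_assets(root="."):
    """Return the required paths that are not present as files under root."""
    root = Path(root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

An empty list from `missing_assets()` suggests the folder matches the diagram. Any paths it returns are the files still to be downloaded or moved.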