kndlcero/Microaggression

Taglish Microaggression Classifier

This repository contains a full pipeline for the detection and classification of Taglish (Tagalog-English) microaggressions. It includes data generation scripts, preprocessing tools, and a transformer-based classification model.


Getting Started

1. Clone the Repository

git clone https://github.com/kndlcero/nlp.git
cd nlp

2. Set Up Virtual Environment

It is highly recommended to use a virtual environment to manage your Python dependencies.

Create the environment:

python -m venv venv

Activate the environment:

  • Windows (PowerShell): .\venv\Scripts\Activate.ps1
  • Windows (CMD): .\venv\Scripts\activate
  • Mac/Linux: source venv/bin/activate

VS Code Setup:

  1. Open the Command Palette (Ctrl+Shift+P)
  2. Type "Python: Select Interpreter"
  3. Choose the one pointing to your local ./venv
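To confirm that the active interpreter is the one inside the virtual environment, a short check like the following can help. It is a minimal sketch and not part of the repository; the helper name `in_virtualenv` is hypothetical. It relies on the standard behavior that `sys.prefix` and `sys.base_prefix` differ inside an environment created with `python -m venv`.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv (hypothetical helper)."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
print("Interpreter path:", sys.executable)
```

If the first line prints `False`, the environment is probably not activated in the current shell, or VS Code is still using a different interpreter.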

3. Install Requirements

pip install -r requirements.txt

Usage Guide

Option A: Running the Full Pipeline (Training)

The pipeline scripts are numbered and are designed to be run in that order.

Local Data Preparation:

Run scripts 1 through 5 locally to generate and clean the dataset:

python pipeline/1_synthetic_generator.py
python pipeline/2_real_world_loader.py
python pipeline/3_manual_loader.py
python pipeline/4_concatenator.py
python pipeline/5_enhancement.py
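The five commands above can also be driven from a single script that stops at the first failure. This is a sketch under assumptions: the script names come from this README, but `run_pipeline` and its parameters are hypothetical and not part of the repository. It assumes it is run from the repository root and that each script signals failure with a non-zero exit code.

```python
import subprocess
import sys
from pathlib import Path

# Script names as listed in this README; the order matters.
PIPELINE_SCRIPTS = [
    "1_synthetic_generator.py",
    "2_real_world_loader.py",
    "3_manual_loader.py",
    "4_concatenator.py",
    "5_enhancement.py",
]


def run_pipeline(pipeline_dir="pipeline", scripts=PIPELINE_SCRIPTS, run=subprocess.run):
    """Run each script in order and return the names that completed (hypothetical helper).

    Execution stops at the first script that exits with a non-zero code,
    so later steps never run on incomplete data.
    """
    completed = []
    for name in scripts:
        script = Path(pipeline_dir) / name
        print(f"Running {script} ...")
        # sys.executable keeps the run inside the currently active environment.
        result = run([sys.executable, str(script)])
        if result.returncode != 0:
            print(f"{script} exited with code {result.returncode}; stopping.")
            break
        completed.append(name)
    return completed
```

Calling `run_pipeline()` from the repository root runs scripts 1 through 5. Comparing the length of the returned list with `len(PIPELINE_SCRIPTS)` shows whether every step finished.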

Cloud Training (Recommended):

Because local GPU/CUDA support may be limited, running pipeline/6_training_pipeline.py on Google Colab is recommended.

  1. Upload the generated taglish_microaggression_enhanced_v2.csv to Colab
  2. Execute the training cell
  3. Expected metric: F1 score between 60% and 70%
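For reference, the F1 score is the harmonic mean of precision and recall, computed per class as 2·TP / (2·TP + FP + FN). The sketch below shows that calculation in plain Python. This README does not say which averaging the training script reports, so the macro average used here is an assumption, and the function names and toy labels are hypothetical.

```python
def f1_per_class(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A label that is never true and never predicted gets 0.0 by convention.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging method)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Toy labels for illustration only; these are not the repository's classes.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
print(f1_per_class(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 4))
```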

Option B: Running the Pretrained Model (Inference)

If you wish to use the model immediately:

  1. Download Assets: Download the model files from this Google Drive Folder
  2. Place Assets: Ensure your folder structure matches the diagram below
  3. Run Inference:
python microaggression_classifier.py

Project Structure

For the classifier to run correctly, ensure your local directory is organized as follows:

Microaggression/
├── pipeline/
│   ├── 1_synthetic_generator.py
│   ├── 2_real_world_loader.py
│   ├── 3_manual_loader.py
│   ├── 4_concatenator.py
│   ├── 5_enhancement.py
│   ├── 6_training_pipeline.py
│   ├── all_samples.txt
│   ├── other .csv files
│   └── taglish_microaggression_enhanced_v2.csv
├── taglish_tokenizer/
│   ├── sentencepiece.bpe.model
│   ├── special_tokens_map.json
│   └── tokenizer_config.json
├── venv/
├── .gitignore
├── best_microaggression_model.pt
├── label_mappings.json
├── microaggression_classifier.py
├── model_config.json
└── requirements.txt
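Before running inference, it may help to check that the downloaded assets sit where the tree above expects them. The following is a minimal sketch, not part of the repository: the file list is copied from the diagram, while `REQUIRED_ASSETS` and `missing_assets` are hypothetical names. Whether the classifier needs every one of these files at runtime is an assumption.

```python
from pathlib import Path

# Paths relative to the repository root, copied from the structure diagram.
REQUIRED_ASSETS = [
    "best_microaggression_model.pt",
    "label_mappings.json",
    "model_config.json",
    "microaggression_classifier.py",
    "taglish_tokenizer/sentencepiece.bpe.model",
    "taglish_tokenizer/special_tokens_map.json",
    "taglish_tokenizer/tokenizer_config.json",
]


def missing_assets(root="."):
    """Return the required paths that do not exist as files under root."""
    root = Path(root)
    return [rel for rel in REQUIRED_ASSETS if not (root / rel).is_file()]


missing = missing_assets()
if missing:
    print("Missing files:")
    for rel in missing:
        print("  -", rel)
else:
    print("All expected assets are in place.")
```

Run it from the repository root; an empty result means the layout matches the diagram for the files listed.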
