This repository provides a complete pipeline for preprocessing the CodeQA dataset and fine-tuning code language models with either full fine-tuning or DoRA parameter-efficient fine-tuning (PEFT). The goal of the project is to evaluate how well PEFT works on CodeQA, a setting in which it has so far been underexplored.
- Efficient Fine-Tuning: DoRA (Weight-Decomposed Low-Rank Adaptation) decomposes each weight into a magnitude and a direction and trains only small low-rank update matrices plus the magnitude vectors, instead of updating every model weight (a configuration sketch follows this list).
- Effective Preprocessing: The pipeline parses the raw CodeQA dataset and grammar-corrects its natural-language questions and answers.
- Reproducibility: All scripts expose their arguments with sensible defaults so runs can be reproduced.
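
A minimal sketch of how DoRA can be enabled with the Hugging Face PEFT library. The checkpoint name, rank, scaling factor, and target modules below are illustrative assumptions, not the repository's actual settings:

```python
# Minimal DoRA sketch using Hugging Face PEFT (illustrative settings, not the
# repository's actual configuration).
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5p-220m")  # assumed checkpoint

dora_config = LoraConfig(
    r=16,                       # rank of the low-rank update matrices (assumed)
    lora_alpha=32,              # scaling factor (assumed)
    target_modules=["q", "v"],  # attention projections to adapt (assumed)
    use_dora=True,              # weight-decomposed LoRA (DoRA)
    task_type="SEQ_2_SEQ_LM",
)

model = get_peft_model(base_model, dora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```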
```
.
├── preprocessing/
│   ├── preprocess.py
│   ├── grammar_correction.py
│   └── data_formatting.py
├── scripts/
│   └── run_preprocessing.sh
├── finetuning/
│   ├── full_ft.py
│   └── lora_ft.py
├── evaluation/
│   └── eval.py
└── README.md
```
- `preprocessing/preprocess.py`: Parses and formats the raw CodeQA dataset.
- `preprocessing/grammar_correction.py`: Applies grammar correction to the natural-language questions and answers in the dataset.
- `preprocessing/data_formatting.py`: Formats the dataset into a structure suitable for training and evaluation (see the record sketch after this list).
- `scripts/run_preprocessing.sh`: Shell script that runs the preprocessing pipeline in the correct order with appropriate arguments.
- `finetuning/full_ft.py`: Script for full-model fine-tuning of CodeT5+ or CodeBERT on the preprocessed CodeQA dataset (see the training sketch after this list).
- `finetuning/lora_ft.py`: Script for DoRA parameter-efficient fine-tuning on the same dataset.
- `evaluation/eval.py`: Evaluates model checkpoints on the CodeQA dataset and reports the relevant success metrics (see the metrics sketch after this list).
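
The exact record layout produced by the preprocessing step is not fixed here; the snippet below is a hypothetical sketch of how `data_formatting.py` might turn a raw CodeQA entry into a prompt/target pair for seq2seq training (field names and prompt template are assumptions):

```python
# Hypothetical formatting sketch: field names and prompt template are assumptions,
# not the repository's actual schema.
import json

def format_example(raw: dict) -> dict:
    """Turn one raw CodeQA entry into a prompt/target pair for seq2seq training."""
    prompt = f"question: {raw['question']} code: {raw['code']}"
    return {"input": prompt, "target": raw["answer"]}

if __name__ == "__main__":
    raw = {
        "question": "What does this function return?",
        "code": "def add(a, b): return a + b",
        "answer": "The sum of a and b.",
    }
    print(json.dumps(format_example(raw), indent=2))
```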
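For orientation, here is a minimal full fine-tuning sketch using the Hugging Face `Seq2SeqTrainer`; the checkpoint name, data paths, field names, and hyperparameters are assumptions, and `full_ft.py` may use different settings:

```python
# Minimal full fine-tuning sketch (assumed checkpoint, paths, and hyperparameters;
# full_ft.py may differ).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

checkpoint = "Salesforce/codet5p-220m"  # assumed model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Assumed location and field names of the preprocessed data.
data = load_dataset("json", data_files={"train": "data/train.jsonl"})

def tokenize(batch):
    return tokenizer(batch["input"], text_target=batch["target"],
                     truncation=True, max_length=512)

tokenized = data.map(tokenize, batched=True,
                     remove_columns=data["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="checkpoints/full_ft",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3,
                                  learning_rate=5e-5),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```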
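Likewise, a hedged sketch of the kind of metric computation `eval.py` might perform, using the `evaluate` library; the metrics actually reported by the script may differ:

```python
# Hedged sketch of metric computation; the metrics eval.py actually reports may differ.
import evaluate

bleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")

predictions = ["The sum of a and b."]               # model outputs (example values)
references = [["It returns the sum of a and b."]]  # gold answers (example values)

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=[r[0] for r in references]))
```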