
CodeBridge

CodeBridge is a project dedicated to optimizing large language models (LLMs) for low-resource programming languages (LRPLs) such as Cangjie. At its core is a three-stage transfer-learning approach of the same name, which improves code completion accuracy by transferring knowledge from high-resource programming languages (HRPLs) such as Java and Rust. Retrieval-Augmented Generation (RAG) is additionally applied at inference time to further improve performance.

Project Structure

  • .env: Environment variables for API keys.
  • dataset: Contains datasets for training and evaluation.
  • LLaMA-Factory: Framework used for fine-tuning LLMs.
  • src: Core source code directory, containing:
    • metric/: Scripts for evaluating model performance.
    • rag/: Implementations of Retrieval-Augmented Generation (RAG).
    • tree_sitter_cj/: Cangjie code parsing utilities.
    • data_cleaning.py: Scripts for preprocessing datasets.
    • inference.py: Hugging Face transformers-based inference scripts.
  • inference.ipynb: Jupyter notebook for inference using vLLM.
  • train.sh: Shell script for training the model.
  • llm.py: Interface for interacting with the model.
  • requirements.txt: Dependencies for the project.

CodeBridge: Three-Stage Fine-Tuning Process

CangjieLLM adopts CodeBridge, a novel three-stage training strategy that improves code completion for low-resource programming languages (LRPLs) through transfer learning from high-resource languages (HRPLs).

Training Strategy

  1. Teaching Phase:

    • Dataset: Cangjie corpus (~8M tokens)
    • Epochs: 4
    • Learning Rate: 2e-5
    • Goal: Rapidly expose the model to Cangjie's syntax and semantics.
  2. Practice Phase:

    • Dataset: Java/Rust corpus (~24M tokens)
    • Epochs: 1
    • Learning Rate: 7e-6
    • Goal: Enhance structural and semantic understanding by leveraging high-resource programming languages.
  3. Correction Phase:

    • Dataset: Cangjie corpus (same as step 1)
    • Epochs: 4
    • Learning Rate: 5e-6
    • Goal: Fine-tune the model to rectify transfer-induced biases.
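
The schedule above amounts to three sequential fine-tuning runs, each resuming from the previous stage's checkpoint. The sketch below is a minimal illustration of that loop; `finetune` is a hypothetical stand-in for a single LLaMA-Factory run (the project drives the real runs through train.sh), not the project's actual API.

```python
# Hypothetical driver for the three-stage CodeBridge schedule.
# Datasets, epoch counts, and learning rates mirror the phases above.

STAGES = [
    # (name,        dataset,            epochs, learning_rate)
    ("teaching",    "cangjie_corpus",   4,      2e-5),
    ("practice",    "java_rust_corpus", 1,      7e-6),
    ("correction",  "cangjie_corpus",   4,      5e-6),
]

def finetune(checkpoint, dataset, epochs, learning_rate, output_dir):
    """Stub standing in for one fine-tuning run (hypothetical interface)."""
    print(f"{output_dir}: start from {checkpoint}, data={dataset}, "
          f"epochs={epochs}, lr={learning_rate}")
    return output_dir  # path to the checkpoint this stage produced

def run_codebridge(base_model: str) -> str:
    checkpoint = base_model
    for name, dataset, epochs, lr in STAGES:
        # Each stage resumes from the checkpoint of the stage before it.
        checkpoint = finetune(checkpoint, dataset, epochs, lr, f"ckpt_{name}")
    return checkpoint
```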

Dataset Preparation

  • Data Sources:
    • Cangjie dataset from Huawei repositories (Cangjie-SIG, Cangjie-TPC, HW-PLLab).
    • Java/Rust dataset from StarCoder preprocessed corpus.
  • Data Cleaning:
    • File filtering based on size, encoding, and character composition, plus comment removal.
    • Deduplication using a 90% similarity threshold (a minimal sketch follows this list).
  • Data Splitting:
    • 20 projects used for evaluation (held-out test set).
    • Remaining data used for training.
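
A minimal sketch of the deduplication step, assuming a pairwise comparison with Python's standard difflib; the real pipeline lives in src/data_cleaning.py and may use a faster corpus-scale method:

```python
import difflib

SIMILARITY_THRESHOLD = 0.9  # the 90% threshold described above

def deduplicate(texts: list[str]) -> list[str]:
    """Keep a file only if it is < 90% similar to every file kept so far.

    Pairwise comparison is O(n^2) and is meant purely as an illustration.
    """
    kept: list[str] = []
    for text in texts:
        if all(
            difflib.SequenceMatcher(None, text, prev).ratio() < SIMILARITY_THRESHOLD
            for prev in kept
        ):
            kept.append(text)
    return kept
```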

Inference with RAG and Prefix Matching

For inference, Retrieval-Augmented Generation (RAG) is combined with a prefix-matching strategy to improve code completion accuracy.
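
The RAG step can be pictured as retrieving the corpus snippets most similar to the unfinished code and prepending them to the prompt. The sketch below uses a simple token-overlap (Jaccard) retriever purely for illustration; the project's actual retriever lives in src/rag/ and may rank candidates differently.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Token-overlap similarity between two snippets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def build_rag_prompt(context: str, corpus: list[str], top_k: int = 3) -> str:
    """Prepend the top-k most similar corpus snippets to the completion context."""
    query = set(context.split())
    ranked = sorted(corpus, key=lambda s: jaccard(query, set(s.split())), reverse=True)
    retrieved = "\n\n".join(ranked[:top_k])
    return f"// Retrieved examples:\n{retrieved}\n\n// Code to complete:\n{context}"
```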

Prefix-Matching Decoding Strategy

  • If the input ends with a space → Extract the preceding non-space segment as prefix.
  • If the input ends with a symbol → Use context-based matching to determine the appropriate completion.

Matching the completion against the user's current prefix keeps the generated output consistent with what has already been typed, so suggestions align more closely with user expectations.
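
One plausible post-hoc reading of these rules is sketched below: extract the user's current prefix, then trim any re-generated copy of the input's tail from the model output so the completion continues, rather than repeats, the typed text. The actual decoding-time logic in inference.py may constrain generation more directly.

```python
import re

def extract_prefix(text: str) -> str:
    """Mirror the two rules above (one plausible reading)."""
    if text.endswith(" "):
        # Trailing space: take the preceding non-space segment as the prefix.
        parts = text.split()
        return parts[-1] if parts else ""
    # Trailing symbol: fall back to trailing identifier characters, if any.
    match = re.search(r"[A-Za-z0-9_]+$", text)
    return match.group(0) if match else ""

def merge_completion(text: str, generated: str) -> str:
    """Drop the longest suffix of the input that the model re-generated
    at the start of its output, then append the remainder."""
    for k in range(min(len(text), len(generated)), 0, -1):
        if generated.startswith(text[-k:]):
            return text + generated[k:]
    return text + generated
```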

Experimental Setup

  • Hardware: 4x A100 GPUs (80GB)
  • Batch Size: 4
  • Training Time:
    • Teaching Phase: 36 hours
    • Practice Phase: 10 hours
    • Correction Phase: 36 hours

Metrics

The metric module evaluates performance using:

  • Exact Match Rate (EM): Measures the percentage of perfect matches.
  • Edit Similarity (ES): Computes edit distance similarity.
  • Line Accuracy: Percentage of correctly generated lines within a block.
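
A minimal reference implementation of the first two metrics, assuming ES is the usual normalized Levenshtein similarity; the authoritative definitions live in src/metric/.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute
            ))
        prev = curr
    return prev[-1]

def exact_match_rate(preds: list[str], refs: list[str]) -> float:
    """Percentage of predictions that match their reference exactly."""
    return 100 * sum(p == r for p, r in zip(preds, refs)) / len(refs)

def edit_similarity(pred: str, ref: str) -> float:
    """1 - edit distance normalized by the longer string's length."""
    longest = max(len(pred), len(ref)) or 1
    return 1 - levenshtein(pred, ref) / longest
```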

Results

1. Effectiveness of CodeBridge (RQ1)

| Setting | Line-Level Exact Match Rate | Line-Level Edit Similarity | Function-Level Line Accuracy |
| --- | --- | --- | --- |
| Baseline (Untrained Model) | 35.49% | 0.6699 | 25.15% |
| Teaching Only | 44.07% | 0.7397 | 31.94% |
| No Transfer Learning (High LR) | 49.44% | 0.7645 | 30.51% |
| No Transfer Learning (Low LR) | 46.53% | 0.7568 | 30.70% |
| Transfer Learning First | 47.43% | 0.7563 | 31.91% |
| Full Three-Step Strategy | 52.35% | 0.7692 | 33.27% |

2. Impact of Training Configurations (RQ2)

This section explores how different training settings, such as transfer data volume, final-stage learning rate, and number of epochs, influence both line-level and function-level performance.

| Final Stage LR | Transfer Data Volume | Epochs | Line-Level Exact Match Rate | Line-Level Edit Similarity | Function-Level Line Accuracy |
| --- | --- | --- | --- | --- | --- |
| Cosine LR | - | 8 | 45.64% | 0.7432 | 31.78% |
| 5e-6 | 1:3 (24M tokens) | 4+1+4 | 52.35% | 0.7692 | 33.27% |
| 5e-6 | 1:1 (8M tokens) | 4+1+4 | 40.49% | 0.7165 | 32.00% |
| 5e-6 | 1:5 (40M tokens) | 4+1+4 | 48.32% | 0.7659 | 32.10% |
| 1e-5 | 1:3 (24M tokens) | 4+1+4 | 47.65% | 0.7660 | 32.70% |
| 3e-6 | 1:3 (24M tokens) | 4+1+4 | 48.77% | 0.7616 | 32.57% |
| 5e-6 | 1:3 (24M tokens) | 2+1+2 | 48.55% | 0.7497 | 32.16% |
| 5e-6 | 1:3 (24M tokens) | 3+1+3 | 46.98% | 0.7536 | 32.87% |

3. Generalizability of CodeBridge (RQ3)

To assess CodeBridge's generalizability, we evaluate its effectiveness across different LLM architectures and model sizes, analyzing line-level and function-level performance.

| Model | Training Step | Line-Level Exact Match Rate | Line-Level Edit Similarity | Function-Level Line Accuracy |
| --- | --- | --- | --- | --- |
| CodeLlama-13B-Instruct | Origin | 32.21% | 0.6695 | 27.24% |
| | Step 1 | 56.60% | 0.8082 | 33.68% |
| | Step 2 | 50.56% | 0.7900 | 32.10% |
| | Step 3 | 57.94% | 0.8135 | 34.13% |
| | No Transfer | 56.82% | 0.8142 | 33.90% |
| Qwen2.5-14B-Instruct | Origin | 29.69% | 0.6342 | 20.60% |
| | Step 1 | 46.98% | 0.7473 | 25.75% |
| | Step 2 | 32.89% | 0.7034 | 21.89% |
| | Step 3 | 50.56% | 0.7717 | 27.12% |
| | No Transfer | 48.10% | 0.7446 | 26.64% |
| DeepSeek-Coder-1.3B-Instruct | Origin | 27.46% | 0.5518 | 20.80% |
| | Step 1 | 43.53% | 0.7017 | 26.39% |
| | Step 2 | 39.96% | 0.7150 | 22.83% |
| | Step 3 | 44.20% | 0.7179 | 27.28% |
| | No Transfer | 43.97% | 0.7059 | 26.69% |
| DeepSeek-Coder-6.7B-Instruct | Origin | 32.37% | 0.6147 | 23.58% |
| | Step 1 | 51.79% | 0.7683 | 30.85% |
| | Step 2 | 39.53% | 0.6546 | 24.60% |
| | Step 3 | 54.02% | 0.7807 | 31.45% |
| | No Transfer | 52.90% | 0.7764 | 30.94% |

Setup

  1. Clone the repository and install dependencies:
    pip install -r requirements.txt
    cd LLaMA-Factory
    pip install -e ".[torch,metrics]"
  2. Prepare the dataset and store dataset metadata in dataset_info.json.
  3. Modify train.sh to specify the training configuration.
  4. Run training:
    bash train.sh
  5. For inference, use inference.ipynb.

Contribution

We welcome contributions! Feel free to open issues or submit pull requests.