Merged
32 changes: 32 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,32 @@
name: CI

on:
  push:
    branches: ["**"]
  pull_request:

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements-dev.txt

      - name: Ruff check
        run: ruff check src

      - name: Mypy check
        run: mypy src

      - name: Pytest
        run: pytest
4 changes: 3 additions & 1 deletion .gitignore
@@ -30,7 +30,9 @@ build/
# Test results
*.cover
coverage.xml
*.pytest_cache/
.pytest_cache/
.ruff_cache/
.coverage

# Jupyter Notebook
.ipynb_checkpoints
172 changes: 79 additions & 93 deletions README.md
@@ -1,126 +1,112 @@
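# DGF: A Prompt-Based Fuzz Driver Self-Generation System
# DGF: A Prompt-Based Automatic Fuzz Driver Generation System

> This project is a full reproduction of the PromptFuzz paper "Prompt Fuzzing for Fuzz Driver Generation" (CCS 2024), extended with call chain analysis to improve the LLM's ability to generate plausible API call sequences.
This project implements the complete pipeline: API extraction from header files, prompt construction, LLM code generation, compilation-based validation, fuzz execution, and coverage-feedback iteration.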

---
## 1. Module Pipeline

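## I. Project Overview

DGF automatically generates high-quality fuzz drivers for fuzz testing of application libraries, refines them iteratively using coverage feedback and program validation, and reproduces the core technical ideas of the PromptFuzz paper.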

---

## II. System Modules

```
Header files --> Header Parser --> API signatures
|
v
Call Chain Analysis
|
v
Prompt Generator --> LLM-generated code
|
v
Validator
|
v
Coverage Collector
|
v
Prompt Mutation <--- Feedback Controller
```text
Header Parser -> Prompt Generator -> LLM Code -> Validator -> Fuzzer -> Coverage -> Feedback
```

---

## III. Directory Structure

```
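DGF-main/
│
├── src/
│   ├── main.py                 # Main entry point
│   ├── config/                 # Configuration files
│   ├── dgf_header_parser/      # Header parsing and API extraction
│   ├── dgf_prompt_generator/   # Prompt generation and LLM invocation
│   ├── dgf_validator/          # Program validation
│   ├── dgf_feedback/           # Feedback control and coverage collection
│   └── dgf_pipeline/           # End-to-end pipeline execution control
│
└── README.md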
Main entry point: `src/main.py`

## 2. Directory Structure

```text
.
├── src/
│ ├── main.py
│ ├── config/experiment.yaml
│ ├── dgf_header_parser/
│ ├── dgf_prompt_generator/
│ ├── dgf_validator/
│ ├── dgf_feedback/
│ ├── dgf_pipeline/
│ └── dgf_common/
├── tests/
├── requirements.txt
├── requirements-dev.txt
└── pyproject.toml
```

---
## 3. Requirements

## IV. Quick Start
- Python 3.9+
- clang/llvm (14+ recommended)
- A build environment with libFuzzer support

### 1. Environment Setup
Environment variables:

- Python 3.8+
- clang, llvm, lcov, cmake
- A build environment with libFuzzer support
- Install the Python dependencies:
- `OPENAI_API_KEY` (required, unless a local `src/dgf_prompt_generator/config.py` is used)
- `OPENAI_BASE_URL` (optional)
- `OPENAI_MODEL` (default: `gpt-4.1-mini`)
- `OPENAI_TEMPERATURE` (default: `0.2`)
- `LIBCLANG_PATH` (optional, e.g. `/usr/lib/llvm-14/lib/libclang.so.1`)
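
As a rough illustration of how these variables and their documented defaults might be read, here is a minimal sketch. The `LLMSettings` dataclass and the `load_llm_settings` function are hypothetical names used only for this example and are not taken from the repository; only the variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMSettings:
    """Hypothetical container for the LLM-related environment variables."""
    api_key: str
    base_url: Optional[str]
    model: str
    temperature: float
    libclang_path: Optional[str]


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> LLMSettings:
    """Read settings from `env` (os.environ by default), applying the documented defaults."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        # The README marks this variable as required when no local config.py is used.
        raise RuntimeError("OPENAI_API_KEY is not set")
    return LLMSettings(
        api_key=api_key,
        base_url=env.get("OPENAI_BASE_URL") or None,
        model=env.get("OPENAI_MODEL", "gpt-4.1-mini"),
        temperature=float(env.get("OPENAI_TEMPERATURE", "0.2")),
        libclang_path=env.get("LIBCLANG_PATH") or None,
    )
```

Passing a plain mapping instead of `os.environ` keeps the function easy to exercise in tests without touching the real environment.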

## 4. Installing Dependencies

```bash
python -m venv venv
source venv/bin/activate
pip install -r src/dgf_prompt_generator/requirements.txt
pip install -r src/dgf_validator/requirements.txt
pip install -r src/dgf_feedback/requirements.txt
python -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
```

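### 2. Preparing the Target Library
## 5. Preparing the Target Library (cJSON as an Example)

Place the source code and header files of the library under test in the designated path, for example:
Place the target library at: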

```
testdata/cJSON/
```text
testdata/cJSON
```

### 3. Running the Full Pipeline
Make sure clang can include and link against it (`src/config/experiment.yaml` provides a default path template).

```bash
cd src/
python main.py --config config/experiment.yaml
```

### 4. Results
## 6. Usage

- Generates seed programs
- Generates fuzz drivers and runs libFuzzer tests
- Produces coverage and bug reports
### 6.1 Running the Full Pipeline

---
```bash
PYTHONPATH=src python src/main.py --config src/config/experiment.yaml
```

## V. Configuration File
### 6.2 Running the Feedback Pipeline Standalone

The main configuration file is `config/experiment.yaml`. It includes:
```bash
PYTHONPATH=src python src/dgf_pipeline/run_pipeline.py \
--api_json data/extracted_api.json \
--output_dir data/feedback_results \
--samples 5 \
--clang_path clang \
--include_dirs testdata/cJSON \
--lib_dir testdata/cJSON/build \
--libs cjson cjson_utils
```
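
For readers who want to see how the flags in this command could be consumed, the following is a minimal `argparse` sketch. It is not the actual `run_pipeline.py` parser: `build_parser` is a hypothetical name, and the defaults and required/optional choices are assumptions; only the flag names are taken from the command above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that accepts the flags used in the command above."""
    parser = argparse.ArgumentParser(description="Run the feedback pipeline")
    parser.add_argument("--api_json", required=True, help="extracted API JSON file")
    parser.add_argument("--output_dir", required=True, help="directory for results")
    parser.add_argument("--samples", type=int, default=5, help="samples per round")
    parser.add_argument("--clang_path", default="clang", help="clang executable")
    parser.add_argument("--include_dirs", nargs="*", default=[], help="include directories")
    parser.add_argument("--lib_dir", default=None, help="library search directory")
    parser.add_argument("--libs", nargs="*", default=[], help="library names to link")
    return parser


# Parse the same arguments as the shell command, passed as a list.
args = build_parser().parse_args([
    "--api_json", "data/extracted_api.json",
    "--output_dir", "data/feedback_results",
    "--samples", "5",
    "--include_dirs", "testdata/cJSON",
    "--lib_dir", "testdata/cJSON/build",
    "--libs", "cjson", "cjson_utils",
])
print(args.samples, args.libs)
```

Using `nargs="*"` for `--include_dirs` and `--libs` matches the way the command passes several values after a single flag.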

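- `library_path`: path to the library source code
- `header_path`: path to the header files
- `clang_bin`: path to the clang compiler
- `llm_provider`: LLM endpoint and API key settings
- `mutation_params`: parameters of the prompt mutation strategy
## 7. Configuration

---
Main configuration file: `src/config/experiment.yaml`

## VI. Features
- `api_extraction`: header scan path, include paths, and output location of the extracted-API JSON
- `prompt_generation`: number of seed drivers, number of APIs per driver, include template, and API prefix filter
- `feedback_iteration`: samples per round and fuzz timeout
- `validator`: clang path, include paths, library directory, and library names

- Full reproduction of the core PromptFuzz design
- Coverage-based prompt mutation and energy scheduling
- Multi-stage validation (compilation + sanitizer + fuzzing)
- Integrated AFLFast-style API energy scheduling
- Enhanced **call chain analysis** (extension)
- Support for reproducible experiments
## 8. Development and Quality Checks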

---
```bash
ruff check src
mypy src
pytest
```

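## VII. References
The repository includes a GitHub Actions workflow (`.github/workflows/ci.yml`) that runs the checks above automatically.

- PromptFuzz: Prompt Fuzzing for Fuzz Driver Generation
- CCS 2024, Yunlong Lyu et al.
- This implementation extends that work with a static program analysis branch to make the generated code more plausible
## 9. Local LLM Configuration (Optional)

---
If you prefer not to rely on environment variables, copy: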

```text
src/dgf_prompt_generator/config.example.py -> src/dgf_prompt_generator/config.py
```

Then fill in the API settings. By default, `config.py` is not committed to the repository.
23 changes: 23 additions & 0 deletions pyproject.toml
@@ -0,0 +1,23 @@
[tool.pytest.ini_options]
pythonpath = ["src"]
testpaths = ["src", "tests"]
addopts = "-q"

[tool.ruff]
line-length = 100
target-version = "py39"
src = ["src"]

[tool.ruff.lint]
select = ["E", "F", "I", "W"]
ignore = ["E501"]

[tool.mypy]
python_version = "3.9"
mypy_path = "src"
namespace_packages = true
explicit_package_bases = true
ignore_missing_imports = true
check_untyped_defs = false
warn_unused_ignores = true
no_implicit_optional = false
5 changes: 5 additions & 0 deletions requirements-dev.txt
@@ -0,0 +1,5 @@
-r requirements.txt
pytest
ruff
mypy
types-PyYAML
4 changes: 4 additions & 0 deletions requirements.txt
@@ -0,0 +1,4 @@
openai
tqdm
PyYAML
clang
27 changes: 19 additions & 8 deletions src/config/experiment.yaml
@@ -1,23 +1,34 @@
api_extraction:
  header_dir: /home/lanjiachen/DGF/testdata/cJSON
  header_dir: testdata/cJSON
  include_dirs:
    - /home/lanjiachen/DGF/testdata/cJSON
  extracted_api_json: /home/lanjiachen/DGF/src/data/extracted_api.json
    - testdata/cJSON
  extracted_api_json: data/extracted_api.json

prompt_generation:
  output_dir: /home/lanjiachen/DGF/src/data/seed_prompts
  output_dir: data/seed_prompts
  samples: 2
  num_funcs: 5
  system_includes:
    - stdint.h
    - stddef.h
    - stdio.h
    - stdlib.h
    - string.h
    - cJSON.h
    - cJSON_Utils.h
  api_prefixes:
    - cJSON

feedback_iteration:
  output_dir: /home/lanjiachen/DGF/src/data/feedback_results
  output_dir: data/feedback_results
  samples_per_round: 2
  fuzz_timeout_sec: 20

validator:
  clang_path: clang-14
  clang_path: clang
  include_dirs:
    - /home/lanjiachen/DGF/testdata/cJSON
  lib_dir: /home/lanjiachen/DGF/testdata/cJSON/build
    - testdata/cJSON
  lib_dir: testdata/cJSON/build
  libs:
    - cjson
    - cjson_utils
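
To make the role of the `validator` section concrete, the sketch below shows one way such settings could be turned into a clang command line for a libFuzzer driver. This is an illustration under stated assumptions: `build_compile_command` is a hypothetical helper, and `-fsanitize=fuzzer,address` is the common libFuzzer plus AddressSanitizer choice, not necessarily what DGF's validator actually passes.

```python
from typing import Dict, List


def build_compile_command(validator_cfg: Dict, source: str, output: str) -> List[str]:
    """Build a clang argv list from a dict shaped like the `validator` section."""
    cmd = [validator_cfg.get("clang_path", "clang")]
    # -fsanitize=fuzzer links libFuzzer; `address` enables AddressSanitizer.
    cmd += ["-g", "-fsanitize=fuzzer,address"]
    for inc in validator_cfg.get("include_dirs", []):
        cmd += ["-I", inc]
    cmd.append(source)
    lib_dir = validator_cfg.get("lib_dir")
    if lib_dir:
        cmd += ["-L", lib_dir]
    for lib in validator_cfg.get("libs", []):
        cmd.append(f"-l{lib}")
    cmd += ["-o", output]
    return cmd


# Values copied from the new (relative-path) version of the config.
validator_cfg = {
    "clang_path": "clang",
    "include_dirs": ["testdata/cJSON"],
    "lib_dir": "testdata/cJSON/build",
    "libs": ["cjson", "cjson_utils"],
}
print(" ".join(build_compile_command(validator_cfg, "driver.c", "driver_fuzz")))
```

Because the paths are now relative, a command built this way only works when run from the directory the paths are relative to, which is presumably the repository root.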
1 change: 1 addition & 0 deletions src/dgf_common/__init__.py
@@ -0,0 +1 @@
# Common shared helpers for DGF modules.
17 changes: 17 additions & 0 deletions src/dgf_common/code_utils.py
@@ -0,0 +1,17 @@
import re

_FENCED_CODE_PATTERN = re.compile(r"```(?:c|C|cpp|c\+\+)?\s*(.*?)```", re.DOTALL)
Regex alternation order breaks cpp/c++ code extraction

Medium Severity

The regex `(?:c|C|cpp|c\+\+)?` tries alternatives left to right. When an LLM returns a fenced block tagged `cpp`, the `c` alternative matches first (consuming only the `c`), leaving `pp\n...` to be captured by `(.*?)`. The extracted code is then prefixed with `pp\n`, producing invalid C that fails to compile. The longer alternatives `cpp` and `c\+\+` need to appear before `c` in the alternation.




def extract_c_code_block(raw_text):
    """
    Extract C/C++ code from markdown fenced block.
    If no fenced block is present, return stripped raw text.
    """
    if raw_text is None:
        return ""

    match = _FENCED_CODE_PATTERN.search(raw_text)
    if match:
        return match.group(1).strip()
    return raw_text.strip()
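
The alternation-order issue raised in the review comment on this file can be reproduced in a few lines. The sketch below compares the pattern as written in the diff with a reordered variant; `REORDERED` and the `extract` helper are illustrative names, and the reordering is one possible fix, not necessarily the one eventually merged.

```python
import re

# Build the fence marker programmatically so this example contains no literal fence.
FENCE = "`" * 3

# Pattern as written in the diff: the short alternative `c` is tried first.
ORIGINAL = re.compile(FENCE + r"(?:c|C|cpp|c\+\+)?\s*(.*?)" + FENCE, re.DOTALL)
# Reordered so the longer tags are tried before the single-letter ones.
REORDERED = re.compile(FENCE + r"(?:cpp|c\+\+|c|C)?\s*(.*?)" + FENCE, re.DOTALL)


def extract(pattern, raw_text):
    """Return the first fenced block's body, or the stripped text if none is found."""
    match = pattern.search(raw_text)
    return match.group(1).strip() if match else raw_text.strip()


reply = FENCE + "cpp\nint main() { return 0; }\n" + FENCE
print(repr(extract(ORIGINAL, reply)))   # 'pp\nint main() { return 0; }'
print(repr(extract(REORDERED, reply)))  # 'int main() { return 0; }'
```

Neither pattern strips other language tags: a block tagged `python`, for example, would still come back with the tag as its first line. A more general tag expression would be needed if non-C tags are expected.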
12 changes: 12 additions & 0 deletions src/dgf_common/logging_utils.py
@@ -0,0 +1,12 @@
import logging
import os


def configure_logging(default_level="INFO"):
    level_name = os.getenv("DGF_LOG_LEVEL", default_level).upper()
    level = getattr(logging, level_name, logging.INFO)

    logging.basicConfig(
        level=level,
        format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    )
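
One edge case in the level lookup may be worth noting: `getattr(logging, level_name, logging.INFO)` returns whatever attribute of the `logging` module has that name, which is not always a numeric level (for example, `logging.BASIC_FORMAT` is a string). The following is a sketch of a slightly stricter lookup; `resolve_level` is a hypothetical helper and is not part of this PR.

```python
import logging
import os


def resolve_level(default_level="INFO", env=None):
    """Map DGF_LOG_LEVEL (or the default) to a numeric logging level."""
    env = os.environ if env is None else env
    level_name = env.get("DGF_LOG_LEVEL", default_level).upper()
    level = getattr(logging, level_name, None)
    # Accept only real numeric levels; anything else falls back to INFO.
    return level if isinstance(level, int) else logging.INFO


print(resolve_level(env={"DGF_LOG_LEVEL": "debug"}) == logging.DEBUG)
print(resolve_level(env={"DGF_LOG_LEVEL": "basic_format"}) == logging.INFO)
```

The result can be passed as the `level` argument of `logging.basicConfig` in the same way as in `configure_logging`.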
12 changes: 11 additions & 1 deletion src/dgf_feedback/api_manager.py
@@ -1,8 +1,11 @@
# src/dgf_feedback/api_manager.py

import logging
import random
from collections import defaultdict

LOGGER = logging.getLogger(__name__)

class APIManager:
    def __init__(self, api_list, exponent=1.0):
        self.api_list = api_list  # list of api function names
@@ -40,4 +43,11 @@ def sample_api_combination(self, num_funcs):

    def print_state(self):
        for api in self.api_list:
            print(f"{api}: cov={self.coverage[api]:.2f}, seed={self.seed_count[api]}, prompt={self.prompt_count[api]}, energy={self.get_energy(api):.4f}")
            LOGGER.info(
                "%s: cov=%.2f, seed=%d, prompt=%d, energy=%.4f",
                api,
                self.coverage[api],
                self.seed_count[api],
                self.prompt_count[api],
                self.get_energy(api),
            )
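
The diff does not show the bodies of `get_energy` or `sample_api_combination`, so the following is only a sketch of what energy-weighted API sampling could look like given the fields logged above. The `EnergySampler` class is a hypothetical stand-in, and the energy formula (lower coverage and fewer prior uses yield more energy) is an assumption, not the repository's confirmed implementation.

```python
import random
from collections import defaultdict


class EnergySampler:
    """Hypothetical stand-in for the parts of APIManager not shown in this diff."""

    def __init__(self, api_list, exponent=1.0):
        self.api_list = list(api_list)
        self.exponent = exponent
        self.coverage = defaultdict(float)    # assumed: covered fraction per API, in [0, 1]
        self.seed_count = defaultdict(int)    # assumed: seeds that already use the API
        self.prompt_count = defaultdict(int)  # assumed: prompts that already used the API

    def get_energy(self, api):
        # Assumed form: less-covered and less-used APIs receive more energy.
        denom = ((1 + self.seed_count[api]) ** self.exponent
                 * (1 + self.prompt_count[api]) ** self.exponent)
        return (1.0 - self.coverage[api]) / denom

    def sample_api_combination(self, num_funcs, rng=random):
        """Draw up to num_funcs distinct APIs, weighted by their current energy."""
        pool = list(self.api_list)
        chosen = []
        while pool and len(chosen) < num_funcs:
            # Keep weights strictly positive so fully covered APIs stay selectable.
            weights = [max(self.get_energy(a), 1e-9) for a in pool]
            pick = rng.choices(pool, weights=weights, k=1)[0]
            chosen.append(pick)
            pool.remove(pick)
        return chosen


sampler = EnergySampler(["cJSON_Parse", "cJSON_Print", "cJSON_Delete"])
sampler.coverage["cJSON_Parse"] = 0.5
sampler.seed_count["cJSON_Parse"] = 1
print(sampler.get_energy("cJSON_Parse"))  # 0.25
print(sampler.sample_api_combination(2, random.Random(0)))
```

Sampling without replacement keeps each combination free of duplicate APIs; whether the real `sample_api_combination` does the same is not visible in this diff.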