MFCS-Bench

MFCS-Bench is a benchmark system for evaluating large language models (LLMs) on function calling tasks, based on the MFCS (Model Function Calling Standard) protocol. It standardizes evaluation of how well different models handle structured function calls, helping build a more robust tool-using LLM ecosystem.

Chinese documentation: README_CN.md

🚀 Features

✅ MFCS Protocol Compatible: Unified interface for evaluating function calls across different LLMs
📊 Comprehensive Metrics: Tool usage rate, semantic match rate, accuracy, and response time
🔄 Streaming Support: Real-time response analysis with streaming output
📈 Detailed Reports: Both summary and detailed markdown reports with test analytics
🔁 Automated Pipeline: Fully automated benchmark workflow
⚡ Batch Execution: Test cases are executed in batches for optimal performance and resource control
📝 Batch Report Generation: Benchmark reports are generated per batch, suitable for large-scale evaluation
📥 Batch Config & Test Case Loading: Configuration and test cases are loaded in batches for smoother operation
🌍 Multi-language Support: Supports both English and Chinese test cases and reports

📦 Installation & Requirements

git clone https://github.com/mfcsorg/mfcs-bench.git
cd mfcs-bench
pip install -e .
pip install -r requirements.txt
# For Python example:
pip install -r apps/mfcs-python/requirements.txt

Python 3.7+
Required: aiofiles, sentence-transformers
For Python example: mfcs, openai==1.93.3

🔧 Quick Start

1. Configure your test cases in the test_cases/ directory.
2. Set up your application config in apps/config.json.
3. Set up your model config in models/config.json.
4. Set up your tool configs in the tools/ directory.
5. Run the benchmark:

# Basic usage
python run_benchmark.py

# Verbose output
python run_benchmark.py --verbose

Or run the Python example directly:

python apps/mfcs-python/mfcs-python.py --model=models/config.json --model_name=<model_id> --tools=tools/agent_tools_en.json --test_cases=test_cases/test_progress_en.json

Results will be saved to the reports/ directory with timestamp-based filenames:

report_YYYYMMDD_HHMMSS_<language>.md: Benchmark report (includes both summary and detailed analysis)
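
The filename pattern above can be reproduced with a standard timestamp format. The following is a minimal sketch, not the project's actual code; the helper name report_path is hypothetical, and the language suffix is assumed to be a short code such as en.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def report_path(reports_dir: str, language: str, now: Optional[datetime] = None) -> Path:
    """Build <reports_dir>/report_YYYYMMDD_HHMMSS_<language>.md (hypothetical helper)."""
    # %Y%m%d_%H%M%S yields e.g. 20250131_090507, matching the documented pattern.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(reports_dir) / f"report_{stamp}_{language}.md"


example = report_path("reports", "en", datetime(2025, 1, 31, 9, 5, 7))
print(example.name)  # report_20250131_090507_en.md
```

Because the timestamp sorts lexicographically, listing the reports/ directory in name order also lists runs in chronological order.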

📁 Project Structure

mfcs-bench/
├── apps/              # Application configs & examples
│   ├── config.json    # Main config
│   ├── mfcs-python/   # Python example
│   └── mfcs-js/       # JS example
├── models/            # Model configs
├── tools/             # Tool configs (English & Chinese)
├── reports/           # Benchmark reports
├── src/               # Core implementation
│   └── mfcs_bench/
│       └── core/
├── test_cases/        # Test cases (English & Chinese)
└── run_benchmark.py   # Main entry

📊 Evaluation Metrics

Metric	Description
Tool Usage Rate	Percentage of correct tool usage in responses
Semantic Match Rate	Accuracy of semantic content matching
Response Time	Average time taken to generate responses
Token Usage	Prompt and completion token consumption
Success Rate	Overall test case success percentage

Note: The current implementation focuses on tool calling accuracy and test case success rate.
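
To make the metrics concrete, here is a rough sketch of how a tool usage rate and an average response time could be computed from per-case results. The CaseResult record and its field called_tool_name are assumptions made for illustration; the benchmark's internal result structure and exact scoring rules may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CaseResult:
    """Hypothetical per-test-case record; not the project's actual type."""
    should_call_api: bool
    expected_tool_name: Optional[str]
    called_tool_name: Optional[str]  # None when the model made no tool call
    response_time: float             # seconds


def tool_usage_correct(result: CaseResult) -> bool:
    # A case without an expected call is correct only if no tool was called.
    if not result.should_call_api:
        return result.called_tool_name is None
    return result.called_tool_name == result.expected_tool_name


def summarize(results: List[CaseResult]) -> Dict[str, float]:
    if not results:
        return {"tool_usage_rate": 0.0, "avg_response_time": 0.0}
    correct = sum(tool_usage_correct(r) for r in results)
    return {
        "tool_usage_rate": 100.0 * correct / len(results),  # percentage
        "avg_response_time": sum(r.response_time for r in results) / len(results),
    }


results = [
    CaseResult(False, None, None, 1.0),              # correctly avoided a call
    CaseResult(True, "news_tool", "news_tool", 2.0),  # correct tool
    CaseResult(True, "news_tool", "film_tool", 3.0),  # wrong tool
    CaseResult(False, None, "film_tool", 2.0),       # unnecessary call
]
print(summarize(results))
```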

📢 Test Case Format

Test cases are defined in JSON format:

[
    {
        "name": "Creative Task - Poetry Writing",
        "question": "Help me write a poem about missing someone",
        "should_call_api": false,
        "expected_tool_name": null,
        "test_type": "Avoid Unnecessary Calls"
    },
    {
        "name": "Real-time News Request - Today's News",
        "question": "What important news is there today?",
        "should_call_api": true,
        "expected_tool_name": "news_access_service_685c9b5c2de60791fbd5c7cc",
        "test_type": "Tool Name Correctness"
    }
]
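
Before running a large batch, it can help to sanity-check the test case file. The sketch below assumes that all five fields shown above are required and that expected_tool_name must be null exactly when should_call_api is false; both rules are inferred from the example, not confirmed by the project. The function name validate_test_cases is hypothetical.

```python
import json
from typing import Any, Dict, List

# Assumption: every key that appears in the documented example is required.
REQUIRED_KEYS = {"name", "question", "should_call_api", "expected_tool_name", "test_type"}


def validate_test_cases(cases: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the file looks consistent."""
    problems: List[str] = []
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append(f"case {index}: missing keys {sorted(missing)}")
            continue
        expects_call = case["should_call_api"]
        tool = case["expected_tool_name"]
        if expects_call and not tool:
            problems.append(f"case {index}: should_call_api is true but expected_tool_name is empty")
        if not expects_call and tool is not None:
            problems.append(f"case {index}: expected_tool_name is set but should_call_api is false")
    return problems


SAMPLE = json.loads("""
[
    {"name": "Creative Task - Poetry Writing",
     "question": "Help me write a poem about missing someone",
     "should_call_api": false,
     "expected_tool_name": null,
     "test_type": "Avoid Unnecessary Calls"},
    {"name": "Real-time News Request - Today's News",
     "question": "What important news is there today?",
     "should_call_api": true,
     "expected_tool_name": "news_access_service_685c9b5c2de60791fbd5c7cc",
     "test_type": "Tool Name Correctness"}
]
""")
print(validate_test_cases(SAMPLE))  # []
```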

📢 Contribute

We welcome contributions!

Add new test cases
Improve evaluation metrics
Enhance report generation
Add support for more LLM implementations

📜 License

MIT License

⚙️ Configuration

apps/config.json: Application and argument configuration
models/config.json: Model list and API info
tools/: Tool definitions (English & Chinese versions)

Example: apps/config.json

{
    "mfcs-python": {
        "command": "python",
        "stream": true,
        "args": [
            "apps/mfcs-python/mfcs-python.py",
            "--model_config=./models/config.json",
            "--tools=./tools/agent_tools_en.json",
            "--test_cases=./test_cases/test_progress_en.json",
            "--report_language=en"
        ]
    }
}
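
Each entry in apps/config.json reads naturally as a process to launch: a command plus its argument list. The following sketch shows one way such an entry could be turned into a subprocess call. It is an illustration only; the helpers build_command and run_app are hypothetical, and the "stream" flag is left uninterpreted because its exact semantics are not described here.

```python
import json
import subprocess
from typing import Any, Dict, List

APPS_CONFIG = json.loads("""
{
    "mfcs-python": {
        "command": "python",
        "stream": true,
        "args": [
            "apps/mfcs-python/mfcs-python.py",
            "--model_config=./models/config.json",
            "--tools=./tools/agent_tools_en.json",
            "--test_cases=./test_cases/test_progress_en.json",
            "--report_language=en"
        ]
    }
}
""")


def build_command(app: Dict[str, Any]) -> List[str]:
    """Join the configured command and its arguments into an argv list."""
    return [app["command"], *app["args"]]


def run_app(app: Dict[str, Any], timeout: float = 600.0) -> subprocess.CompletedProcess:
    """Launch the app and capture its output (hypothetical; not called in this sketch)."""
    return subprocess.run(build_command(app), capture_output=True, text=True, timeout=timeout)


command = build_command(APPS_CONFIG["mfcs-python"])
print(command)
```

Passing the command as a list rather than a shell string avoids quoting problems when paths contain spaces.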

Example: models/config.json

{
    "moonshot-v1-8k": {
        "name": "Kimi-8k",
        "api_base": "https://api.moonshot.cn/v1",
        "api_key": "your-api-key-here"
    }
}

Example: tools/agent_tools_en.json

[
    {
        "parameters": {
            "type": "object",
            "properties": {
                "content": {
                    "type": "string",
                    "description": "The content of the message, supporting various types of content including plain text, multimodal (mixed input of text, images, files), and other formats."
                }
            },
            "required": ["content"]
        },
        "name": "elder_film_service_685c9b642de60791fbd5c7d2",
        "description": "A professional cinema guide with extensive film knowledge, capable of recommending suitable movies for users, answering various film-related questions, and providing comprehensive movie viewing guidance and services using clear, vivid, and easy-to-understand language."
    }
]
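
Since test cases refer to tools by name, a mismatch between the test case file and the tool file is an easy mistake to make. The sketch below cross-checks the two; the function name unknown_tool_references is hypothetical, and whether MFCS-Bench performs a similar check itself is not stated here.

```python
from typing import Any, Dict, List


def unknown_tool_references(
    test_cases: List[Dict[str, Any]], tools: List[Dict[str, Any]]
) -> List[str]:
    """Return the names of test cases whose expected_tool_name is not defined in the tool file."""
    known = {tool["name"] for tool in tools}
    return [
        case["name"]
        for case in test_cases
        if case.get("expected_tool_name") is not None
        and case["expected_tool_name"] not in known
    ]


# Trimmed-down data: only the fields the check needs.
tools = [{"name": "elder_film_service_685c9b642de60791fbd5c7d2"}]
cases = [
    {"name": "Poetry", "expected_tool_name": None},
    {"name": "News", "expected_tool_name": "news_access_service_685c9b5c2de60791fbd5c7cc"},
    {"name": "Film", "expected_tool_name": "elder_film_service_685c9b642de60791fbd5c7d2"},
]
print(unknown_tool_references(cases, tools))  # ['News']
```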

🏃 Command Line Usage

Benchmark Runner

# Basic usage
python run_benchmark.py

# Advanced options
python run_benchmark.py --verbose --config apps/config.json

Batch Execution

The benchmark runs test cases concurrently in fixed-size batches:

Fixed batch size: 10 concurrent test cases per batch
Performance impact: Significantly faster than sequential execution
Resource control: Prevents resource exhaustion from full concurrency
Optimized for: Most production environments

See BATCH_EXECUTION.md for detailed information about the batch execution mechanism.
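
The idea of batched concurrency can be sketched with asyncio: run up to 10 cases at once, wait for the whole batch, then start the next. This is a simplified illustration under the assumption that each test case is an awaitable task; the names run_in_batches and fake_case are hypothetical, and the actual mechanism is the one described in BATCH_EXECUTION.md.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")

BATCH_SIZE = 10  # matches the fixed batch size documented above


async def run_in_batches(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    batch_size: int = BATCH_SIZE,
) -> List[R]:
    """Run worker over items, at most batch_size at a time, preserving input order."""
    results: List[R] = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # gather runs the batch concurrently and returns results in input order.
        results.extend(await asyncio.gather(*(worker(item) for item in batch)))
    return results


async def fake_case(index: int) -> int:
    """Stand-in for a real test case: yields control once, then returns a value."""
    await asyncio.sleep(0)
    return index * 2


if __name__ == "__main__":
    print(asyncio.run(run_in_batches(list(range(25)), fake_case)))
```

With 25 items this runs three batches of 10, 10, and 5. A semaphore-based design would keep all slots busy continuously, whereas this batch approach waits for the slowest case in each batch before moving on.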

Supported key arguments for run_benchmark.py:

--config: Path to configuration file (default: apps/config.json)
--reports-dir: Directory to store reports (default: reports)
--verbose or -v: Enable verbose logging

Example:

python run_benchmark.py --config=apps/config.json --reports-dir=reports -v
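
For readers who want to see how these flags map onto a parser, here is a minimal argparse sketch that mirrors the documented arguments and defaults. It is a reconstruction from the list above, not the contents of run_benchmark.py, and the help strings are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented flags (sketch; the real script may define more)."""
    parser = argparse.ArgumentParser(description="Run the MFCS-Bench benchmark")
    parser.add_argument("--config", default="apps/config.json",
                        help="Path to configuration file")
    parser.add_argument("--reports-dir", default="reports",
                        help="Directory to store reports")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose logging")
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args(["--config=apps/config.json", "--reports-dir=reports", "-v"])
print(args.config, args.reports_dir, args.verbose)  # apps/config.json reports True
```

Note that argparse exposes --reports-dir as the attribute reports_dir, replacing the hyphen with an underscore.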

Python Example

python apps/mfcs-python/mfcs-python.py --model=models/config.json --model_name=<model_id> --tools=tools/agent_tools_en.json --test_cases=test_cases/test_progress_en.json --test_index=<index>

Arguments:

--model: Path to model config
--model_name: Model ID
--tools: Path to tool config
--test_cases: Path to test cases file
--test_index: (optional) Specific test case index to run

📊 Evaluation & Reports

Batch evaluation of all models and test cases

Reports include: model, test case, accuracy, response time, tool usage, etc.
Reports are generated in the specified language (English or Chinese)
Markdown reports saved in reports/ with timestamp and language suffix
Report format: report_YYYYMMDD_HHMMSS_<language>.md
