MFCS-Bench is a benchmark system for evaluating large language models (LLMs) on function calling tasks, based on the MFCS (Model Function Calling Standard) protocol. It standardizes evaluation of how well different models handle structured function calls, helping build a more robust tool-using LLM ecosystem.
- ✅ MFCS Protocol Compatible: Unified interface for evaluating function calls across different LLMs
- 📊 Comprehensive Metrics: Tool usage rate, semantic match rate, accuracy, and response time
- 🔄 Streaming Support: Real-time response analysis with streaming output
- 📈 Detailed Reports: Both summary and detailed markdown reports with test analytics
- 🔁 Automated Pipeline: Fully automated benchmark workflow
- ⚡ Batch Execution: Test cases are executed in batches for optimal performance and resource control
- 📝 Batch Report Generation: Benchmark reports are generated per batch, suitable for large-scale evaluation
- 📥 Batch Config & Test Case Loading: Configuration and test cases are loaded in batches for smoother operation
- 🌍 Multi-language Support: Supports both English and Chinese test cases and reports
git clone https://github.com/mfcsorg/mfcs-bench.git
cd mfcs-bench
pip install -e .
pip install -r requirements.txt
# For Python example:
pip install -r apps/mfcs-python/requirements.txt

- Python 3.7+
- Required: `aiofiles`, `sentence-transformers`
- For Python example: `mfcs`, `openai==1.93.3`
- Configure your test cases in the `test_cases/` directory
- Set up your application config in `apps/config.json`
- Set up your model config in `models/config.json`
- Set up your tool config in the `tools/` directory
- Run the benchmark:
# Basic usage
python run_benchmark.py
# Verbose output
python run_benchmark.py --verbose

Or run the Python example directly:
python apps/mfcs-python/mfcs-python.py --model=models/config.json --model_name=<model_id> --tools=tools/agent_tools_en.json --test_cases=test_cases/test_progress_en.json

Results will be saved to the `reports/` directory with timestamp-based filenames:
- `report_YYYYMMDD_HHMMSS_<language>.md`: Benchmark report (includes both summary and detailed analysis)
mfcs-bench/
├── apps/                  # Application configs & examples
│   ├── config.json        # Main config
│   ├── mfcs-python/       # Python example
│   └── mfcs-js/           # JS example
├── models/                # Model configs
├── tools/                 # Tool configs (English & Chinese)
├── reports/               # Benchmark reports
├── src/                   # Core implementation
│   └── mfcs_bench/
│       └── core/
├── test_cases/            # Test cases (English & Chinese)
└── run_benchmark.py       # Main entry
| Metric | Description |
|---|---|
| Tool Usage Rate | Percentage of correct tool usage in responses |
| Semantic Match Rate | Accuracy of semantic content matching |
| Response Time | Average time taken to generate responses |
| Token Usage | Prompt and completion token consumption |
| Success Rate | Overall test case success percentage |
Note: The current implementation focuses on tool calling accuracy and test case success rate.
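As a rough illustration of how the metrics in the table could be aggregated, the sketch below computes them from a list of per-test results. The `CaseResult` record and its field names are assumptions made for this example; they are not the structures used in `src/mfcs_bench/core`.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CaseResult:
    """Hypothetical per-test record; field names are illustrative only."""
    tool_used_correctly: bool  # tool call decision and tool name matched expectations
    semantic_match: bool       # response content matched the expected meaning
    passed: bool               # overall pass/fail for the test case
    response_time_s: float     # time taken to produce the response, in seconds
    prompt_tokens: int
    completion_tokens: int


def summarize(results: List[CaseResult]) -> Dict[str, float]:
    """Aggregate per-test records into the summary metrics from the table."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        # Rates are percentages of all test cases in the run.
        "tool_usage_rate": 100.0 * sum(r.tool_used_correctly for r in results) / n,
        "semantic_match_rate": 100.0 * sum(r.semantic_match for r in results) / n,
        "success_rate": 100.0 * sum(r.passed for r in results) / n,
        # Response time is averaged; token counts are summed over the run.
        "avg_response_time_s": sum(r.response_time_s for r in results) / n,
        "prompt_tokens": sum(r.prompt_tokens for r in results),
        "completion_tokens": sum(r.completion_tokens for r in results),
    }
```

Reporting token usage as totals rather than averages is a design choice of this sketch; either is easy to derive from the same records.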
Test cases are defined in JSON format:
[
  {
    "name": "Creative Task - Poetry Writing",
    "question": "Help me write a poem about missing someone",
    "should_call_api": false,
    "expected_tool_name": null,
    "test_type": "Avoid Unnecessary Calls"
  },
  {
    "name": "Real-time News Request - Today's News",
    "question": "What important news is there today?",
    "should_call_api": true,
    "expected_tool_name": "news_access_service_685c9b5c2de60791fbd5c7cc",
    "test_type": "Tool Name Correctness"
  }
]

We welcome contributions!
- Add new test cases
- Improve evaluation metrics
- Enhance report generation
- Add support for more LLM implementations
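
For contributors adding test cases, a small check like the one below can catch malformed entries before a run. It is a sketch that assumes only the five fields shown in the test case format above. `validate_test_case` and `tool_call_matches` are hypothetical helpers, and the matching rule is a guess at the intent of `should_call_api` and `expected_tool_name`, not the benchmark's actual scoring logic.

```python
from typing import Any, Dict, List, Optional

REQUIRED_FIELDS = ("name", "question", "should_call_api", "expected_tool_name", "test_type")


def validate_test_case(case: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one test case (empty if none)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in case]
    if problems:
        return problems
    if not isinstance(case["should_call_api"], bool):
        problems.append("should_call_api must be true or false")
    # Assumption: a tool name is given exactly when a call is expected.
    if case["should_call_api"] and not case["expected_tool_name"]:
        problems.append("expected_tool_name is required when should_call_api is true")
    if not case["should_call_api"] and case["expected_tool_name"] is not None:
        problems.append("expected_tool_name should be null when should_call_api is false")
    return problems


def tool_call_matches(case: Dict[str, Any], called_tool: Optional[str]) -> bool:
    """Assumed rule: no call expected means no tool called; otherwise names must match."""
    if not case["should_call_api"]:
        return called_tool is None
    return called_tool == case["expected_tool_name"]
```

Running `validate_test_case` over every entry of a test case file loaded with `json.load` gives a quick pre-flight check before a full benchmark run.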
MIT License
- `apps/config.json`: Application and argument configuration
- `models/config.json`: Model list and API info
- `tools/`: Tool definitions (English & Chinese versions)

Example `apps/config.json`:
{
  "mfcs-python": {
    "command": "python",
    "stream": true,
    "args": [
      "apps/mfcs-python/mfcs-python.py",
      "--model_config=./models/config.json",
      "--tools=./tools/agent_tools_en.json",
      "--test_cases=./test_cases/test_progress_en.json",
      "--report_language=en"
    ]
  }
}

Example `models/config.json`:
{
  "moonshot-v1-8k": {
    "name": "Kimi-8k",
    "api_base": "https://api.moonshot.cn/v1",
    "api_key": "your-api-key-here"
  }
}

Example tool definition in `tools/`:
[
  {
    "parameters": {
      "type": "object",
      "properties": {
        "content": {
          "type": "string",
          "description": "The content of the message, supporting various types of content including plain text, multimodal (mixed input of text, images, files), and other formats."
        }
      },
      "required": ["content"]
    },
    "name": "elder_film_service_685c9b642de60791fbd5c7d2",
    "description": "A professional cinema guide with extensive film knowledge, capable of recommending suitable movies for users, answering various film-related questions, and providing comprehensive movie viewing guidance and services using clear, vivid, and easy-to-understand language."
  }
]

# Basic usage
python run_benchmark.py
# Advanced options
python run_benchmark.py --verbose --config apps/config.json

The benchmark executes test cases in batches:
- Fixed batch size: 10 concurrent test cases per batch
- Performance impact: Significantly faster than sequential execution
- Resource control: Prevents resource exhaustion from full concurrency
- Optimized for: Most production environments
See BATCH_EXECUTION.md for detailed information about the batch execution mechanism.
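
The batching behaviour described above can be sketched with `asyncio`: test cases are split into groups of 10, each group runs concurrently, and the next group starts only after the previous one finishes. This is a minimal sketch under that reading of the list. `run_in_batches` and `run_case` are hypothetical names, with `run_case` standing in for executing one test case; BATCH_EXECUTION.md remains the reference for the actual mechanism.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")

BATCH_SIZE = 10  # fixed batch size stated in the list above


async def run_in_batches(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    batch_size: int = BATCH_SIZE,
) -> List[R]:
    """Run `worker` over `items`, at most `batch_size` at a time, keeping input order."""
    results: List[R] = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # All cases in a batch run concurrently; the next batch waits for this one.
        results.extend(await asyncio.gather(*(worker(item) for item in batch)))
    return results


async def run_case(index: int) -> str:
    """Hypothetical stand-in for running a single test case."""
    await asyncio.sleep(0.01)  # simulate waiting on a model response
    return f"case-{index}: done"


if __name__ == "__main__":
    outcomes = asyncio.run(run_in_batches(list(range(25)), run_case))
    print(len(outcomes), "cases finished")
```

Compared with launching every case at once, this caps the number of in-flight requests at the batch size, at the cost of each batch waiting for its slowest case.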
Supported Key Arguments:
- `--config`: Path to configuration file (default: `apps/config.json`)
- `--reports-dir`: Directory to store reports (default: `reports`)
- `--verbose` or `-v`: Enable verbose logging
Example:
python run_benchmark.py --config=apps/config.json --reports-dir=reports -v

Or run the Python example directly:
python apps/mfcs-python/mfcs-python.py --model=models/config.json --model_name=<model_id> --tools=tools/agent_tools_en.json --test_cases=test_cases/test_progress_en.json --test_index=<index>

Arguments:
- `--model`: Path to model config
- `--model_name`: Model ID
- `--tools`: Path to tool config
- `--test_cases`: Path to test cases file
- `--test_index`: (optional) Specific test case index to run
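
The arguments listed above map naturally onto `argparse`. The parser below is a hypothetical reconstruction for illustration, not the code in `apps/mfcs-python/mfcs-python.py`. In particular, treating `--test_index` as an integer and the other four arguments as required are assumptions.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(description="Run MFCS test cases against one model")
    parser.add_argument("--model", required=True, help="Path to model config")
    parser.add_argument("--model_name", required=True, help="Model ID")
    parser.add_argument("--tools", required=True, help="Path to tool config")
    parser.add_argument("--test_cases", required=True, help="Path to test cases file")
    # Assumption: the index is an integer; leaving it out means "run every case".
    parser.add_argument("--test_index", type=int, default=None,
                        help="Specific test case index to run (optional)")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `argv`, or the process arguments when `argv` is None."""
    return build_parser().parse_args(argv)
```

Accepting an explicit `argv` list keeps the parser easy to exercise from a test without touching `sys.argv`.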

- Batch evaluation of all models and test cases
- Reports include: model, test case, accuracy, response time, tool usage, etc.
- Reports are generated in the specified language (English or Chinese)
- Markdown reports are saved in `reports/` with a timestamp and language suffix
- Report format: `report_YYYYMMDD_HHMMSS_<language>.md`
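
The filename pattern above corresponds to a `strftime` format of `%Y%m%d_%H%M%S`. A minimal sketch of building such a path follows. `report_path` is a hypothetical helper, not a function from this repository, and `language` is whatever suffix the run uses, such as the `en` passed via `--report_language=en` in the example configuration.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def report_path(language: str, reports_dir: str = "reports",
                now: Optional[datetime] = None) -> Path:
    """Return <reports_dir>/report_YYYYMMDD_HHMMSS_<language>.md for the given time."""
    # Use the supplied time if given, otherwise the current local time.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(reports_dir) / f"report_{stamp}_{language}.md"
```

Because the timestamp fields run from year down to second, sorting the filenames alphabetically also sorts the reports chronologically.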