MFCS-Bench

MFCS-Bench is a benchmark system for evaluating large language models (LLMs) on function calling tasks, based on the MFCS (Model Function Calling Standard) protocol. It standardizes evaluation of how well different models handle structured function calls, helping build a more robust tool-using LLM ecosystem.

Chinese documentation


🚀 Features

  • MFCS Protocol Compatible: Unified interface for evaluating function calls across different LLMs
  • 📊 Comprehensive Metrics: Tool usage rate, semantic match rate, accuracy, and response time
  • 🔄 Streaming Support: Real-time response analysis with streaming output
  • 📈 Detailed Reports: Both summary and detailed markdown reports with test analytics
  • 🔁 Automated Pipeline: Fully automated benchmark workflow
  • Batch Execution: Test cases are executed in batches for optimal performance and resource control
  • 📝 Batch Report Generation: Benchmark reports are generated per batch, suitable for large-scale evaluation
  • 📥 Batch Config & Test Case Loading: Configuration and test cases are loaded in batches for smoother operation
  • 🌍 Multi-language Support: Supports both English and Chinese test cases and reports

📦 Installation & Requirements

git clone https://github.com/mfcsorg/mfcs-bench.git
cd mfcs-bench
pip install -e .
pip install -r requirements.txt
# For Python example:
pip install -r apps/mfcs-python/requirements.txt
  • Python 3.7+
  • Required: aiofiles, sentence-transformers
  • For Python example: mfcs, openai==1.93.3

🔧 Quick Start

  1. Configure your test cases in the test_cases/ directory
  2. Set up your application config in apps/config.json
  3. Set up your model config in models/config.json
  4. Set up your tool configs in the tools/ directory
  5. Run the benchmark:
# Basic usage
python run_benchmark.py

# Verbose output
python run_benchmark.py --verbose

Or run the Python example directly:

python apps/mfcs-python/mfcs-python.py --model=models/config.json --model_name=<model_id> --tools=tools/agent_tools_en.json --test_cases=test_cases/test_progress_en.json

Results will be saved to the reports/ directory with timestamp-based filenames:

  • report_YYYYMMDD_HHMMSS_<language>.md: Benchmark report (includes both summary and detailed analysis)
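
The filename pattern above can be reproduced with the standard library. This is a minimal sketch, not the project's own code: the helper name `report_path` and its parameters are hypothetical, and only the `report_YYYYMMDD_HHMMSS_<language>.md` pattern comes from this README.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def report_path(reports_dir: str, language: str, now: Optional[datetime] = None) -> Path:
    """Build <reports_dir>/report_YYYYMMDD_HHMMSS_<language>.md (hypothetical helper)."""
    # Use the supplied time when given (handy for tests), else the current local time.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(reports_dir) / f"report_{stamp}_{language}.md"


# Example with a fixed timestamp so the result is reproducible.
print(report_path("reports", "en", datetime(2025, 1, 31, 9, 5, 7)).as_posix())
# prints: reports/report_20250131_090507_en.md
```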

📁 Project Structure

mfcs-bench/
├── apps/              # Application configs & examples
│   ├── config.json    # Main config
│   ├── mfcs-python/   # Python example
│   └── mfcs-js/       # JS example
├── models/            # Model configs
├── tools/             # Tool configs (English & Chinese)
├── reports/           # Benchmark reports
├── src/               # Core implementation
│   └── mfcs_bench/
│       └── core/
├── test_cases/        # Test cases (English & Chinese)
└── run_benchmark.py   # Main entry

📊 Evaluation Metrics

| Metric              | Description                                   |
|---------------------|-----------------------------------------------|
| Tool Usage Rate     | Percentage of correct tool usage in responses |
| Semantic Match Rate | Accuracy of semantic content matching         |
| Response Time       | Average time taken to generate responses      |
| Token Usage         | Prompt and completion token consumption       |
| Success Rate        | Overall test case success percentage          |

Note: The current implementation focuses on tool calling accuracy and test case success rate.
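
To make the tool calling and success metrics concrete, here is one way they could be computed from per-case results. Everything in this sketch is an assumption for illustration: the result record shape (`called_tool_name`, `response_time`), the pass rule, and the exact definitions of each rate are not taken from the MFCS-Bench source.

```python
from statistics import mean
from typing import Dict, List, Optional


def case_passed(result: dict) -> bool:
    """Assumed pass rule: no call when none is expected, else the expected tool."""
    called: Optional[str] = result["called_tool_name"]
    if not result["should_call_api"]:
        return called is None
    return called == result["expected_tool_name"]


def summarize(results: List[dict]) -> Dict[str, float]:
    """Aggregate assumed per-case records into percentage metrics."""
    if not results:
        return {"tool_usage_rate": 0.0, "success_rate": 0.0, "avg_response_time": 0.0}
    # Only cases that should trigger a tool count toward the tool usage rate.
    tool_cases = [r for r in results if r["should_call_api"]]
    correct_calls = sum(1 for r in tool_cases if case_passed(r))
    passed = sum(1 for r in results if case_passed(r))
    return {
        "tool_usage_rate": 100.0 * correct_calls / len(tool_cases) if tool_cases else 0.0,
        "success_rate": 100.0 * passed / len(results),
        "avg_response_time": mean(r["response_time"] for r in results),
    }


results = [
    {"should_call_api": False, "expected_tool_name": None, "called_tool_name": None, "response_time": 1.0},
    {"should_call_api": True, "expected_tool_name": "a", "called_tool_name": "a", "response_time": 2.0},
    {"should_call_api": True, "expected_tool_name": "a", "called_tool_name": "b", "response_time": 3.0},
    {"should_call_api": False, "expected_tool_name": None, "called_tool_name": "a", "response_time": 2.0},
]
print(summarize(results))
```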


📢 Test Case Format

Test cases are defined in JSON format:

[
    {
        "name": "Creative Task - Poetry Writing",
        "question": "Help me write a poem about missing someone",
        "should_call_api": false,
        "expected_tool_name": null,
        "test_type": "Avoid Unnecessary Calls"
    },
    {
        "name": "Real-time News Request - Today's News",
        "question": "What important news is there today?",
        "should_call_api": true,
        "expected_tool_name": "news_access_service_685c9b5c2de60791fbd5c7cc",
        "test_type": "Tool Name Correctness"
    }
]
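
Before running a benchmark it can help to sanity-check a test case file. The sketch below assumes that all five fields shown above are required and that `expected_tool_name` should be set exactly when `should_call_api` is true. Both rules are inferred from the example, not confirmed by the project, and `validate_test_cases` is a hypothetical helper.

```python
import json
from typing import List

# Field names taken from the example above; treating all as required is an assumption.
REQUIRED_KEYS = {"name", "question", "should_call_api", "expected_tool_name", "test_type"}


def validate_test_cases(cases: List[dict]) -> List[str]:
    """Return human-readable problems; an empty list means the cases look consistent."""
    problems = []
    for i, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {i}: missing keys {sorted(missing)}")
            continue
        if case["should_call_api"] and not case["expected_tool_name"]:
            problems.append(f"case {i}: should_call_api is true but expected_tool_name is empty")
        if not case["should_call_api"] and case["expected_tool_name"] is not None:
            problems.append(f"case {i}: expected_tool_name is set but should_call_api is false")
    return problems


sample = json.loads("""
[
    {"name": "Creative Task - Poetry Writing",
     "question": "Help me write a poem about missing someone",
     "should_call_api": false, "expected_tool_name": null,
     "test_type": "Avoid Unnecessary Calls"},
    {"name": "Real-time News Request - Today's News",
     "question": "What important news is there today?",
     "should_call_api": true,
     "expected_tool_name": "news_access_service_685c9b5c2de60791fbd5c7cc",
     "test_type": "Tool Name Correctness"}
]
""")
print(validate_test_cases(sample))
```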

📢 Contribute

We welcome contributions!

  • Add new test cases
  • Improve evaluation metrics
  • Enhance report generation
  • Add support for more LLM implementations

📜 License

MIT License


⚙️ Configuration

  • apps/config.json: Application and argument configuration
  • models/config.json: Model list and API info
  • tools/: Tool definitions (English & Chinese versions)

Example: apps/config.json

{
    "mfcs-python": {
        "command": "python",
        "stream": true,
        "args": [
            "apps/mfcs-python/mfcs-python.py",
            "--model_config=./models/config.json",
            "--tools=./tools/agent_tools_en.json",
            "--test_cases=./test_cases/test_progress_en.json",
            "--report_language=en"
        ]
    }
}
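
Each entry in apps/config.json pairs a `command` with its `args`, which maps naturally onto a process argument list. The following sketch shows that mapping. Whether run_benchmark.py launches applications this way is an assumption, and `build_command` is a hypothetical helper.

```python
import json
import shlex
from typing import List


def build_command(app_config: dict, app_name: str) -> List[str]:
    """Turn one app entry into an argv list suitable for subprocess.run (not executed here)."""
    entry = app_config[app_name]
    return [entry["command"], *entry["args"]]


config = json.loads("""
{
    "mfcs-python": {
        "command": "python",
        "stream": true,
        "args": [
            "apps/mfcs-python/mfcs-python.py",
            "--model_config=./models/config.json",
            "--tools=./tools/agent_tools_en.json",
            "--test_cases=./test_cases/test_progress_en.json",
            "--report_language=en"
        ]
    }
}
""")

cmd = build_command(config, "mfcs-python")
# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```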

Example: models/config.json

{
    "moonshot-v1-8k": {
        "name": "Kimi-8k",
        "api_base": "https://api.moonshot.cn/v1",
        "api_key": "your-api-key-here"
    }
}

Example: tools/agent_tools_en.json

[
    {
        "parameters": {
            "type": "object",
            "properties": {
                "content": {
                    "type": "string",
                    "description": "The content of the message, supporting various types of content including plain text, multimodal (mixed input of text, images, files), and other formats."
                }
            },
            "required": ["content"]
        },
        "name": "elder_film_service_685c9b642de60791fbd5c7d2",
        "description": "A professional cinema guide with extensive film knowledge, capable of recommending suitable movies for users, answering various film-related questions, and providing comprehensive movie viewing guidance and services using clear, vivid, and easy-to-understand language."
    }
]
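
A tool definition like the one above can be used to check whether a model's function call names a known tool and supplies its required arguments. This is a simplified sketch, not a full JSON Schema validator, and it is not taken from the MFCS-Bench source; `check_tool_call` and the short tool name `film_service` are hypothetical.

```python
from typing import List


def check_tool_call(tools: List[dict], name: str, arguments: dict) -> List[str]:
    """Return problems with a tool call; an empty list means it matches the definition."""
    by_name = {tool["name"]: tool for tool in tools}
    tool = by_name.get(name)
    if tool is None:
        return [f"unknown tool: {name}"]
    schema = tool["parameters"]
    # Every key listed under "required" must be present in the call.
    problems = [f"missing required argument: {key}"
                for key in schema.get("required", []) if key not in arguments]
    # Flag arguments that the definition does not declare.
    problems += [f"unexpected argument: {key}"
                 for key in arguments if key not in schema.get("properties", {})]
    return problems


tools = [{
    "name": "film_service",  # hypothetical short name for illustration
    "description": "Recommends movies and answers film-related questions.",
    "parameters": {
        "type": "object",
        "properties": {"content": {"type": "string"}},
        "required": ["content"],
    },
}]

print(check_tool_call(tools, "film_service", {"content": "Recommend a comedy"}))
print(check_tool_call(tools, "film_service", {"query": "comedy"}))
```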

🏃 Command Line Usage

Benchmark Runner

# Basic usage
python run_benchmark.py

# Advanced options
python run_benchmark.py --verbose --config apps/config.json

Batch Execution

The benchmark runs test cases in fixed-size batches rather than all at once:

  • Fixed batch size: 10 concurrent test cases per batch
  • Performance impact: Significantly faster than sequential execution
  • Resource control: Prevents resource exhaustion from full concurrency
  • Optimized for: Most production environments

See BATCH_EXECUTION.md for detailed information about the batch execution mechanism.
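
The batching idea described above can be sketched with asyncio: run up to 10 cases concurrently, wait for the whole batch, then start the next one. This illustrates the general technique only. The actual mechanism is documented in BATCH_EXECUTION.md, and the names `run_in_batches` and `fake_run_case` are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List

BATCH_SIZE = 10  # fixed batch size stated in this README


async def run_in_batches(
    cases: List[dict],
    run_case: Callable[[dict], Awaitable[dict]],
    batch_size: int = BATCH_SIZE,
) -> List[dict]:
    """Run cases concurrently within each batch, and batches one after another."""
    results: List[dict] = []
    for start in range(0, len(cases), batch_size):
        batch = cases[start:start + batch_size]
        # gather preserves input order, so results line up with the cases.
        results.extend(await asyncio.gather(*(run_case(case) for case in batch)))
    return results


async def fake_run_case(case: dict) -> dict:
    """Stand-in for a real model call (assumption for the demo)."""
    await asyncio.sleep(0.01)
    return {"name": case["name"], "ok": True}


cases = [{"name": f"case-{i}"} for i in range(25)]
results = asyncio.run(run_in_batches(cases, fake_run_case))
print(len(results))  # prints: 25
```

With 25 cases this runs three batches (10, 10, and 5), so at most 10 calls are in flight at any time.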

Supported Key Arguments:

  • --config: Path to configuration file (default: apps/config.json)
  • --reports-dir: Directory to store reports (default: reports)
  • --verbose or -v: Enable verbose logging

Example:

python run_benchmark.py --config=apps/config.json --reports-dir=reports -v
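
For readers wiring these flags into their own scripts, the three documented arguments map onto argparse as follows. This is a sketch that mirrors the list above, not the parser in run_benchmark.py, whose implementation may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Run MFCS-Bench (illustrative parser)")
    parser.add_argument("--config", default="apps/config.json",
                        help="Path to configuration file")
    parser.add_argument("--reports-dir", default="reports",
                        help="Directory to store reports")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose logging")
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args(["--config=apps/config.json", "--reports-dir=reports", "-v"])
print(args.config, args.reports_dir, args.verbose)
# prints: apps/config.json reports True
```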

Python Example

python apps/mfcs-python/mfcs-python.py --model=models/config.json --model_name=<model_id> --tools=tools/agent_tools_en.json --test_cases=test_cases/test_progress_en.json --test_index=<index>

Arguments:

  • --model: Path to model config
  • --model_name: Model ID
  • --tools: Path to tool config
  • --test_cases: Path to test cases file
  • --test_index: (optional) Specific test case index to run

📊 Evaluation & Reports

The benchmark evaluates all configured models and test cases in batches:

  • Reports include: model, test case, accuracy, response time, tool usage, etc.
  • Reports are generated in the specified language (English or Chinese)
  • Markdown reports saved in reports/ with timestamp and language suffix
  • Report format: report_YYYYMMDD_HHMMSS_<language>.md
