How to build an AI agent with function-calling capabilities and LLM-based evaluation, using LangChain, LangGraph, and vLLM.
This tutorial notebook walks through the complete pipeline for building agents that make actual structured function calls. It includes:
- ✅ Structured Tool Calling: Agent executes real functions with validated parameters
- ✅ Smart Routing: LangGraph workflow that decides when to call tools vs. return final answers
- ✅ Pydantic Validation: Type-safe tool schemas that guide LLM behavior
- ✅ Semantic Evaluation: LLM-based assessment that avoids keyword matching
- ✅ Production Patterns: Proper error handling, state management, and tool composition
- LangChain - Tool abstraction and LLM integration framework
- LangGraph - Workflow orchestration with state graphs
- vLLM - High-performance local model serving with function calling support
- Hermes-3-Llama-3.1-8B - Function-calling capable open-source model
- Pydantic - Schema validation and structured outputs
- Python 3.11+ - Core implementation
Single LLM, Dual Roles:
- Agent Mode: `llm_with_tools` - generates structured tool calls
- Evaluator Mode: `llm` (raw) - performs semantic assessment
This hybrid approach fits on a single 24GB GPU by using one model instance for both agent execution and evaluation, avoiding the memory overhead of loading two separate models. Ideally you would use two different models, but the memory constraints here make a single shared instance the practical choice.
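A minimal sketch of this dual-role setup (the port and model name follow the vLLM command in the setup section below; `get_weather` and `calculator` are the tools defined in the notebook):

```python
from langchain_openai import ChatOpenAI

# One model instance served by vLLM, reached through its OpenAI-compatible API.
llm = ChatOpenAI(
    base_url="http://localhost:8082/v1",  # matches the vLLM serve command below
    api_key="EMPTY",                      # vLLM does not check the key by default
    model="NousResearch/Hermes-3-Llama-3.1-8B",
    temperature=0,
)

# Agent mode: the same model with tool schemas attached.
llm_with_tools = llm.bind_tools([get_weather, calculator])  # @tool functions from the notebook

# Evaluator mode: the raw model, used for free-form semantic assessment.
verdict = llm.invoke("Assess whether this answer addresses the user's question ...")
```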
- Why model selection matters - Function calling requires specifically trained models
- Tool binding workflow - How to properly attach tools to LLMs (commonly skipped by coding agents)
- Pydantic schema design - Creating tool definitions that guide LLM behavior
- Docstring importance - How comprehensive docstrings determine agent reliability
- State graph patterns - Building agent workflows with conditional routing
- Setting up vLLM with tool-calling flags for Hermes models
- Defining tools with proper validation and error handling
- Building state graphs with the current LangGraph API (see the sketch after this list)
- Implementing semantic evaluation vs. keyword matching
- Managing conversation state and context across turns
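As a rough sketch of the graph-construction and state-management steps referenced above (node names are illustrative, and `llm_with_tools`, `get_weather`, and `calculator` come from the setup shown earlier; the notebook's exact wiring may differ):

```python
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

class State(TypedDict):
    # Conversation history; add_messages appends new messages instead of overwriting.
    messages: Annotated[list, add_messages]

def agent(state: State) -> dict:
    # The tool-bound model decides whether to emit tool calls or a final answer.
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

def route(state: State) -> str:
    # Route to the tool node only when the last message contains tool calls.
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", None) else END

graph_builder = StateGraph(State)
graph_builder.add_node("agent", agent)
graph_builder.add_node("tools", ToolNode([get_weather, calculator]))
graph_builder.add_edge(START, "agent")       # current API: add_edge, not set_entry_point
graph_builder.add_conditional_edges("agent", route, {"tools": "tools", END: END})
graph_builder.add_edge("tools", "agent")     # loop back so the agent can use tool results
graph = graph_builder.compile()
```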
- Python 3.11 or higher
- CUDA-capable GPU with 24GB VRAM (for local deployment)
- Jupyter Notebook or JupyterLab
```bash
# Install dependencies
pip install langchain langchain-openai langgraph pydantic

# Install vLLM (requires CUDA)
pip install vllm
```

In a separate terminal, start the vLLM server with function-calling support:

```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B \
  --port 8082 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 8192
```

Flag explanations:
- `--enable-auto-tool-choice`: Enables automatic tool calling
- `--tool-call-parser hermes`: Uses the Hermes-specific parser for structured tool calls
- `--max-model-len 8192`: Limits the context window so the model fits in 24GB of GPU memory
Wait for "Application startup complete" message before running the notebook.
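Before running the notebook, you can sanity-check the server from Python (a small sketch assuming the port and model name from the command above):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key can be any non-empty string.
client = OpenAI(base_url="http://localhost:8082/v1", api_key="EMPTY")

# Should print NousResearch/Hermes-3-Llama-3.1-8B once the server has finished loading.
for model in client.models.list():
    print(model.id)
```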
```bash
jupyter notebook simple_tool_agent.ipynb
```

Execute the cells sequentially. The notebook includes:
- Connection setup
- Tool definitions (weather, calculator)
- Agent graph construction
- Combined testing with message flow and evaluation output
Weather tool:
- Returns mock weather data for any location
- Demonstrates a single-parameter tool with string validation
- Shows a proper Pydantic schema with length constraints

Calculator tool:
- Performs basic arithmetic (add, subtract, multiply, divide)
- Demonstrates a multi-parameter tool with type enforcement
- Includes error handling for invalid operations and division by zero
Both tools follow production patterns (see the calculator sketch after this list):
- Complete docstrings with parameter descriptions
- Pydantic validation schemas
- Natural language return values
- Proper error messages
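For reference, the calculator tool might look roughly like this (a sketch following those patterns, not the notebook's exact code; the class and field names are illustrative):

```python
from langchain_core.tools import tool
from pydantic import BaseModel, Field

class CalculatorInput(BaseModel):
    operation: str = Field(description="One of: add, subtract, multiply, divide")
    a: float = Field(description="First operand")
    b: float = Field(description="Second operand")

@tool(args_schema=CalculatorInput)
def calculator(operation: str, a: float, b: float) -> str:
    """Perform basic arithmetic on two numbers.

    Args:
        operation: One of add, subtract, multiply, divide.
        a: The first operand.
        b: The second operand.
    """
    if operation == "divide" and b == 0:
        return "Error: division by zero is not allowed."
    results = {
        "add": a + b,
        "subtract": a - b,
        "multiply": a * b,
        "divide": a / b if b != 0 else None,
    }
    if operation not in results:
        return f"Error: unknown operation '{operation}'."
    # Natural-language return value so the model can quote it directly.
    return f"The result of {operation} on {a} and {b} is {results[operation]}."
```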
The notebook includes a complete evaluation framework (sketched below) that assesses:
- Tool Selection: Did the agent choose appropriate tools?
- Response Quality: Is the final answer clear and complete?
- Overall Success: Does the response address the user's question?
Why LLM-based evaluation?
- Understands semantic variations ("multiply" vs "multiplied" vs "times")
- Avoids keyword matching that breaks on rephrasing
- Provides reasoned assessment with explanations
- Scales to complex tool compositions
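A rough sketch of what such an evaluator can look like, reusing the raw `llm` from the dual-role setup (the prompt wording and verdict format are illustrative, not the notebook's exact code):

```python
def evaluate(question: str, tool_calls: list, final_answer: str) -> str:
    """Ask the raw (un-bound) model to grade the agent's behaviour semantically."""
    prompt = (
        "You are grading an AI agent's answer.\n"
        f"Question: {question}\n"
        f"Tools called: {tool_calls}\n"
        f"Final answer: {final_answer}\n\n"
        "Assess: (1) tool selection, (2) response quality, (3) overall success.\n"
        "Judge meaning rather than keywords, and explain your reasoning briefly.\n"
        "End with a line 'VERDICT: PASS' or 'VERDICT: FAIL'."
    )
    return llm.invoke(prompt).content
```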
This will NOT work with general instruction models. Function calling requires models specifically trained for structured tool use:
✅ Function-calling capable:
- Hermes-3-Llama-3.1-8B (used here)
- GPT-4, GPT-3.5-turbo
- Claude 3+ models
- Mistral-Large, Mixtral-8x7B-Instruct
❌ Lacks function calling:
- Mistral-7B-Instruct
- Base Llama models
- Most general chat models
Wrong model = text descriptions instead of actual function calls.
Warning: Many LLM-based coding assistants (Claude, ChatGPT, etc.) generate LangGraph code using the legacy API:
```python
# Legacy (outdated)
graph_builder.set_entry_point("agent")
lambda x: "tools" if x else "__end__"
```

This notebook uses the current API:

```python
# Current (correct)
graph_builder.add_edge(START, "agent")
lambda x: "tools" if x else END
```

This notebook provides patterns for building agents that interact with:
- 🌐 REST APIs - Web services, external data sources
- 🗄️ Databases - SQL queries, data retrieval
- 📊 Analytics Tools - Data processing, visualization
- 🔧 System Operations - File management, command execution
- 💼 Business Systems - CRM, ERP, ticketing systems
Key extension points:
- Add more tools by following the Pydantic + docstring pattern
- Implement authentication for external APIs
- Add retry logic and rate limiting (see the sketch after this list)
- Extend evaluation criteria for domain-specific requirements
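One way such an extension could look, combining the tool pattern above with simple retry logic around an external REST call (the endpoint, parameters, and retry policy are placeholders, not part of the notebook):

```python
import time
import requests
from langchain_core.tools import tool
from pydantic import BaseModel, Field

class TicketLookupInput(BaseModel):
    ticket_id: str = Field(description="Identifier of the support ticket to fetch")

@tool(args_schema=TicketLookupInput)
def lookup_ticket(ticket_id: str) -> str:
    """Fetch the status of a support ticket from an external REST API.

    Args:
        ticket_id: Identifier of the support ticket to fetch.
    """
    url = f"https://example.com/api/tickets/{ticket_id}"  # placeholder endpoint
    for attempt in range(3):  # simple fixed-attempt retry with exponential backoff
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            data = response.json()
            return f"Ticket {ticket_id} is '{data.get('status', 'unknown')}'."
        except requests.RequestException as exc:
            if attempt == 2:
                return f"Error: could not reach the ticket API ({exc})."
            time.sleep(2 ** attempt)  # back off before retrying
```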
```
.
├── simple_tool_agent.ipynb    # Main tutorial notebook
├── README.md                  # This file
└── LICENSE                    # MIT License
```
Recommended order:
- Read through the complete notebook markdown
- Understand the architecture overview and why each component exists
- Start vLLM server and verify it loads successfully
- Execute notebook cells one by one, reading output carefully
- Experiment by modifying test questions
- Try adding a new tool using the established patterns
MIT License - See LICENSE file for details.
Keywords: AI agents, LangChain, LangGraph, function calling, tool calling, structured outputs, Pydantic validation, LLM evaluation, Hermes-3, vLLM, semantic evaluation, agent workflows, state graphs