How to build an AI agent with function-calling capabilities and LLM-based evaluation, using LangChain, LangGraph, and vLLM.
This tutorial notebook walks through the complete pipeline for building agents that make actual structured function calls. It includes:
- ✅ Structured Tool Calling: Agent executes real functions with validated parameters
- ✅ Smart Routing: LangGraph workflow that decides when to call tools vs. return final answers
- ✅ Pydantic Validation: Type-safe tool schemas that guide LLM behavior
- ✅ Semantic Evaluation: LLM-based assessment that avoids keyword matching
- ✅ Production Patterns: Proper error handling, state management, and tool composition
- LangChain - Tool abstraction and LLM integration framework
- LangGraph - Workflow orchestration with state graphs
- vLLM - High-performance local model serving with function calling support
- Hermes-3-Llama-3.1-8B - Function-calling capable open-source model
- Pydantic - Schema validation and structured outputs
- Python 3.11+ - Core implementation
Single LLM, Dual Roles:
- Agent Mode: `llm_with_tools` - generates structured tool calls
- Evaluator Mode: `llm` (raw) - performs semantic assessment
This hybrid approach fits on a single 24GB GPU by using one model instance for both agent execution and evaluation, avoiding the memory overhead of loading two separate models. Ideally you would use two different models, but the memory constraints here make a single shared instance the practical choice.
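A minimal sketch of this dual-role setup (the port and model name follow the vLLM command in the setup section below; `get_weather` and `calculator` are the tools defined in the notebook):

```python
from langchain_openai import ChatOpenAI

# One model instance served by vLLM, reached through its OpenAI-compatible API.
llm = ChatOpenAI(
    base_url="http://localhost:8082/v1",  # matches the vLLM serve command below
    api_key="EMPTY",                      # vLLM does not check the key by default
    model="NousResearch/Hermes-3-Llama-3.1-8B",
    temperature=0,
)

# Agent mode: the same model with tool schemas attached.
llm_with_tools = llm.bind_tools([get_weather, calculator])  # @tool functions from the notebook

# Evaluator mode: the raw model, used for free-form semantic assessment.
verdict = llm.invoke("Assess whether this answer addresses the user's question ...")
```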
- Why model selection matters - Function calling requires specifically trained models
- Tool binding workflow - How to properly attach tools to LLMs (commonly skipped by coding agents)
- Pydantic schema design - Creating tool definitions that guide LLM behavior
- Docstring importance - How comprehensive docstrings determine agent reliability
- State graph patterns - Building agent workflows with conditional routing
- Setting up vLLM with tool-calling flags for Hermes models
- Defining tools with proper validation and error handling
- Building state graphs with the current LangGraph API (see the sketch after this list)
- Implementing semantic evaluation vs. keyword matching
- Managing conversation state and context across turns
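As a rough sketch of the graph-construction and state-management steps referenced above (node names are illustrative, and `llm_with_tools`, `get_weather`, and `calculator` come from the setup shown earlier; the notebook's exact wiring may differ):

```python
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

class State(TypedDict):
    # Conversation history; add_messages appends new messages instead of overwriting.
    messages: Annotated[list, add_messages]

def agent(state: State) -> dict:
    # The tool-bound model decides whether to emit tool calls or a final answer.
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

def route(state: State) -> str:
    # Route to the tool node only when the last message contains tool calls.
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", None) else END

graph_builder = StateGraph(State)
graph_builder.add_node("agent", agent)
graph_builder.add_node("tools", ToolNode([get_weather, calculator]))
graph_builder.add_edge(START, "agent")       # current API: add_edge, not set_entry_point
graph_builder.add_conditional_edges("agent", route, {"tools": "tools", END: END})
graph_builder.add_edge("tools", "agent")     # loop back so the agent can use tool results
graph = graph_builder.compile()
```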
- Python 3.11 or higher
- CUDA-capable GPU with 24GB VRAM (for local deployment)
- Jupyter Notebook or JupyterLab
```bash
# Install dependencies
pip install langchain langchain-openai langgraph pydantic

# Install vLLM (requires CUDA)
pip install vllm
```

In a separate terminal, start the vLLM server with function-calling support:

```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B \
  --port 8082 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 8192
```

Flag explanations:
- `--enable-auto-tool-choice`: Enables automatic tool calling
- `--tool-call-parser hermes`: Uses the Hermes-specific parser for structured tool calls
- `--max-model-len 8192`: Limits the context window so the model fits in 24GB of GPU memory
Wait for "Application startup complete" message before running the notebook.
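Before running the notebook, you can sanity-check the server from Python (a small sketch assuming the port and model name from the command above):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key can be any non-empty string.
client = OpenAI(base_url="http://localhost:8082/v1", api_key="EMPTY")

# Should print NousResearch/Hermes-3-Llama-3.1-8B once the server has finished loading.
for model in client.models.list():
    print(model.id)
```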
```bash
jupyter notebook simple_tool_agent.ipynb
```

Execute the cells sequentially. The notebook includes:
- Connection setup
- Tool definitions (weather, calculator)
- Agent graph construction
- Combined testing with message flow and evaluation output
Weather tool:
- Returns mock weather data for any location
- Demonstrates a single-parameter tool with string validation
- Shows a proper Pydantic schema with length constraints

Calculator tool:
- Performs basic arithmetic (add, subtract, multiply, divide)
- Demonstrates a multi-parameter tool with type enforcement
- Includes error handling for invalid operations and division by zero
Both tools follow production patterns (see the calculator sketch after this list):
- Complete docstrings with parameter descriptions
- Pydantic validation schemas
- Natural language return values
- Proper error messages
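For reference, the calculator tool might look roughly like this (a sketch following those patterns, not the notebook's exact code; the class and field names are illustrative):

```python
from langchain_core.tools import tool
from pydantic import BaseModel, Field

class CalculatorInput(BaseModel):
    operation: str = Field(description="One of: add, subtract, multiply, divide")
    a: float = Field(description="First operand")
    b: float = Field(description="Second operand")

@tool(args_schema=CalculatorInput)
def calculator(operation: str, a: float, b: float) -> str:
    """Perform basic arithmetic on two numbers.

    Args:
        operation: One of add, subtract, multiply, divide.
        a: The first operand.
        b: The second operand.
    """
    if operation == "divide" and b == 0:
        return "Error: division by zero is not allowed."
    results = {
        "add": a + b,
        "subtract": a - b,
        "multiply": a * b,
        "divide": a / b if b != 0 else None,
    }
    if operation not in results:
        return f"Error: unknown operation '{operation}'."
    # Natural-language return value so the model can quote it directly.
    return f"The result of {operation} on {a} and {b} is {results[operation]}."
```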
The notebook includes a complete evaluation framework (sketched below) that assesses:
- Tool Selection: Did the agent choose appropriate tools?
- Response Quality: Is the final answer clear and complete?
- Overall Success: Does the response address the user's question?
Why LLM-based evaluation?
- Understands semantic variations ("multiply" vs "multiplied" vs "times")
- Avoids keyword matching that breaks on rephrasing
- Provides reasoned assessment with explanations
- Scales to complex tool compositions
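A rough sketch of what such an evaluator can look like, reusing the raw `llm` from the dual-role setup (the prompt wording and verdict format are illustrative, not the notebook's exact code):

```python
def evaluate(question: str, tool_calls: list, final_answer: str) -> str:
    """Ask the raw (un-bound) model to grade the agent's behaviour semantically."""
    prompt = (
        "You are grading an AI agent's answer.\n"
        f"Question: {question}\n"
        f"Tools called: {tool_calls}\n"
        f"Final answer: {final_answer}\n\n"
        "Assess: (1) tool selection, (2) response quality, (3) overall success.\n"
        "Judge meaning rather than keywords, and explain your reasoning briefly.\n"
        "End with a line 'VERDICT: PASS' or 'VERDICT: FAIL'."
    )
    return llm.invoke(prompt).content
```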
This will NOT work with general instruction models. Function calling requires models specifically trained for structured tool use:
✅ Function-calling capable:
- Hermes-3-Llama-3.1-8B (used here)
- GPT-4, GPT-3.5-turbo
- Claude 3+ models
- Mistral-Large, Mixtral-8x7B-Instruct
❌ Lacks function calling:
- Mistral-7B-Instruct
- Base Llama models
- Most general chat models
Wrong model = text descriptions instead of actual function calls.
Warning: Many LLM-based coding assistants (Claude, ChatGPT, etc.) generate LangGraph code using the legacy API:
```python
# Legacy (outdated)
graph_builder.set_entry_point("agent")
lambda x: "tools" if x else "__end__"
```

This notebook uses the current API:

```python
# Current (correct)
graph_builder.add_edge(START, "agent")
lambda x: "tools" if x else END
```

This notebook provides patterns for building agents that interact with:
- 🌐 REST APIs - Web services, external data sources
- 🗄️ Databases - SQL queries, data retrieval
- 📊 Analytics Tools - Data processing, visualization
- 🔧 System Operations - File management, command execution
- 💼 Business Systems - CRM, ERP, ticketing systems
Key extension points:
- Add more tools by following the Pydantic + docstring pattern
- Implement authentication for external APIs
- Add retry logic and rate limiting (see the sketch after this list)
- Extend evaluation criteria for domain-specific requirements
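One way such an extension could look, combining the tool pattern above with simple retry logic around an external REST call (the endpoint, parameters, and retry policy are placeholders, not part of the notebook):

```python
import time
import requests
from langchain_core.tools import tool
from pydantic import BaseModel, Field

class TicketLookupInput(BaseModel):
    ticket_id: str = Field(description="Identifier of the support ticket to fetch")

@tool(args_schema=TicketLookupInput)
def lookup_ticket(ticket_id: str) -> str:
    """Fetch the status of a support ticket from an external REST API.

    Args:
        ticket_id: Identifier of the support ticket to fetch.
    """
    url = f"https://example.com/api/tickets/{ticket_id}"  # placeholder endpoint
    for attempt in range(3):  # simple fixed-attempt retry with exponential backoff
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            data = response.json()
            return f"Ticket {ticket_id} is '{data.get('status', 'unknown')}'."
        except requests.RequestException as exc:
            if attempt == 2:
                return f"Error: could not reach the ticket API ({exc})."
            time.sleep(2 ** attempt)  # back off before retrying
```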
```
.
├── simple_tool_agent.ipynb    # Main tutorial notebook
├── README.md                  # This file
└── LICENSE                    # MIT License
```
Recommended order:
- Read through the complete notebook markdown
- Understand the architecture overview and why each component exists
- Start vLLM server and verify it loads successfully
- Execute notebook cells one by one, reading output carefully
- Experiment by modifying test questions
- Try adding a new tool using the established patterns
MIT License - See LICENSE file for details.
Keywords: AI agents, LangChain, LangGraph, function calling, tool calling, structured outputs, Pydantic validation, LLM evaluation, Hermes-3, vLLM, semantic evaluation, agent workflows, state graphs