Production-grade ReAct agent framework with multi-agent coordination, structured tool calling, hallucination recovery, and automated LLM-as-a-Judge evaluation. Built to understand the full agentic AI stack from prompt engineering to evaluation pipelines.
Forge is a multi-agent orchestration framework that implements the full agentic AI stack — from prompt engineering and tool call architecture through multi-agent coordination to automated quality evaluation.
Three components, one connected system:
- ReAct Agent — Reason + Act loop: agent thinks, selects a tool, observes the result, thinks again. Handles hallucinated tool names and invalid inputs gracefully without crashing.
- Planner Agent — decomposes complex tasks into subtasks using few-shot prompted LLM decomposition, spawns ReAct executor agents for each, and synthesises a coherent final answer.
- Evaluator — automated LLM-as-a-Judge pipeline that scores agent traces on correctness, reasoning quality, and tool use efficiency — enabling rapid iteration on prompts and architecture.
CoT generates reasoning in a single pass — it cannot verify facts or correct course. ReAct agents receive real tool outputs as observations, updating their belief state at each step. A CoT agent that hallucinates has no recovery mechanism. A ReAct agent gets an error observation and reasons about it.
When an agent hallucinates a tool name, the registry catches it before execution and returns an informative error observation — allowing the agent to recover gracefully rather than crashing. This mirrors how production agentic platforms validate tool calls against registered action schemas before execution.
Human evaluation of agent traces is slow and inconsistent. Automated scoring provides fast, reproducible quality metrics across correctness (50%), reasoning quality (30%), and tool efficiency (20%) — enabling rapid iteration without manual review.
- Role definition — explicit agent identity and behavioural constraints in the system prompt
- Few-shot examples — three examples covering normal operation, knowledge lookup, and hallucination recovery
- Format enforcement — exact Thought / Action / Action Input / Observation / Final Answer structure
- Recovery instructions — explicit guidance for what to do when a tool returns an error
- Temperature = 0 — deterministic outputs for agentic tasks where consistency matters; non-determinism at step 2 of an 8-step chain causes compounding errors
forge/
├── agents/
│ ├── react_agent.py # Core ReAct loop — Thought/Action/Observation/Final Answer
│ └── planner_agent.py # Multi-agent task decomposition, execution, and synthesis
├── tools/
│ └── registry.py # Tool registry with JSON schema validation + 4 built-in tools
├── prompts/
│ └── system_prompts.py # Few-shot system prompts for ReAct, Planner, and Evaluator agents
├── evaluation/
│ └── evaluator.py # LLM-as-a-Judge automated scoring pipeline
├── tests/
│ └── test_agentforge.py # 24 tests — all passing
├── main.py # Demo: single agent, multi-agent, evaluation
└── requirements.txt
git clone https://github.com/Atri2-code/Forge
cd forge
pip install -r requirements.txt
export ANTHROPIC_API_KEY=your_key_here
python main.py| Tool | Input | Description |
|---|---|---|
calculator |
expression string | Safe arithmetic evaluator using Python's math module |
lookup |
{"query": "concept"} |
Knowledge base lookup for technical concepts |
summarise |
{"text": "long text"} |
Extractive summarisation of long tool outputs |
get_date |
any string | Returns current UTC date and time |
Custom tools register via ToolRegistry.register(Tool(...)) with a name, description, JSON schema, function, and few-shot example.
Task: What is 15% of 840, and what is that squared?
Thought: I need to calculate 15% of 840 first.
Action: calculator
Action Input: 0.15 * 840
Observation: Result: 126.0
Thought: Now I need to square 126.
Action: calculator
Action Input: 126 ** 2
Observation: Result: 15876
Thought: I have both results.
Final Answer: 15% of 840 is 126. Squared, that is 15,876.
Action: web_search
Action Input: "latest AI news"
Observation: Error: Tool 'web_search' does not exist.
Available tools: calculator, lookup, summarise, get_date.
Please use one of the available tools.
Thought: I don't have a web search tool. I'll use what's available.
Action: get_date
Action Input: "now"
...
The registry catches hallucinated tool names before execution and returns an informative observation — the agent reasons about the error and recovers without crashing.
pytest tests/ -v24 tests covering:
- Tool registry: registration, lookup, schema validation, missing fields, wrong types
- Response parser: thought extraction, action parsing, JSON input handling, final answer detection, empty response
- Hallucination recovery: unknown tool names, invalid inputs, error observation content
- Integration: single-step final answer, multi-step tool use (mocked Anthropic API)
- Why ReAct's observe-reason-act loop outperforms single-pass CoT for multi-step tasks requiring external state
- How structured tool schemas enable hallucination recovery — errors become observations, not crashes
- Why temperature=0 is critical for agentic tasks — non-determinism at any step compounds errors downstream
- How few-shot examples in system prompts shape format adherence more reliably than instructions alone
- Why LLM-as-a-Judge evaluation needs explicit scoring rubrics to produce consistent, reproducible scores
python llm agents agentic-ai react prompt-engineering tool-calling multi-agent orchestration evaluation anthropic claude
MIT