⚡ Forge — Multi-Agent Orchestration Framework

Production-grade ReAct agent framework with multi-agent coordination, structured tool calling, hallucination recovery, and automated LLM-as-a-Judge evaluation. Built to understand the full agentic AI stack from prompt engineering to evaluation pipelines.

🧭 About

Forge is a multi-agent orchestration framework that implements the full agentic AI stack — from prompt engineering and tool call architecture through multi-agent coordination to automated quality evaluation.

Three components, one connected system:

ReAct Agent — Reason + Act loop: agent thinks, selects a tool, observes the result, thinks again. Handles hallucinated tool names and invalid inputs gracefully without crashing.
Planner Agent — decomposes complex tasks into subtasks using few-shot prompted LLM decomposition, spawns ReAct executor agents for each, and synthesises a coherent final answer.
Evaluator — automated LLM-as-a-Judge pipeline that scores agent traces on correctness, reasoning quality, and tool use efficiency — enabling rapid iteration on prompts and architecture.
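The way the Planner hands work to executors can be sketched as below. This is a minimal illustration under stated assumptions, not Forge's actual code: `run_planner`, `fake_plan`, `fake_execute`, and `fake_synthesise` are hypothetical names, and the three stubs stand in for LLM-backed steps so the flow runs offline.

```python
from typing import Callable, List


def run_planner(
    task: str,
    plan: Callable[[str], List[str]],
    execute_subtask: Callable[[str], str],
    synthesise: Callable[[str, List[str]], str],
) -> str:
    """Decompose a task, run one executor per subtask, then merge the results."""
    subtasks = plan(task)                              # decomposition step
    results = [execute_subtask(s) for s in subtasks]   # one executor per subtask
    return synthesise(task, results)                   # final synthesis step


# Hypothetical stubs standing in for the LLM calls.
def fake_plan(task: str) -> List[str]:
    return [part.strip() for part in task.split(" and ")]


def fake_execute(subtask: str) -> str:
    return f"done({subtask})"


def fake_synthesise(task: str, results: List[str]) -> str:
    return "; ".join(results)


answer = run_planner(
    "find the date and compute 2+2", fake_plan, fake_execute, fake_synthesise
)
print(answer)
```

Passing the three steps in as callables keeps the control flow testable without an API key; in a real run each one would wrap a model call.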

✨ Key Design Decisions

Why ReAct over Chain-of-Thought?

CoT generates reasoning in a single pass — it cannot verify facts or correct course. ReAct agents receive real tool outputs as observations, updating their belief state at each step. A CoT agent that hallucinates has no recovery mechanism. A ReAct agent gets an error observation and reasons about it.
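The difference is easiest to see in a loop where each tool result, including an error, is appended to the transcript before the next model call. The sketch below assumes a scripted stand-in for the model (`fake_llm`) and a canned calculator; `react_loop` is a hypothetical name and not Forge's real implementation.

```python
from typing import Callable, Dict, List


def react_loop(
    task: str,
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Alternate model replies and tool observations until a final answer appears."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        if "Action:" not in reply or "Action Input:" not in reply:
            observation = "Error: reply must contain Action and Action Input."
        else:
            name = reply.split("Action:", 1)[1].split("\n", 1)[0].strip()
            arg = reply.split("Action Input:", 1)[1].split("\n", 1)[0].strip()
            if name in tools:
                observation = tools[name](arg)
            else:
                observation = f"Error: Tool '{name}' does not exist."
        # The observation (success or error) becomes input to the next step.
        transcript += f"Observation: {observation}\n"
    return "Stopped: step limit reached."


# Scripted replies standing in for a real model (assumption for illustration).
scripted = iter([
    "Thought: I should search.\nAction: web_search\nAction Input: 15% of 840",
    "Thought: No search tool; use the calculator.\nAction: calculator\nAction Input: 0.15 * 840",
    "Thought: I have the result.\nFinal Answer: 15% of 840 is 126.",
])
seen: List[str] = []


def fake_llm(transcript: str) -> str:
    seen.append(transcript)
    return next(scripted)


tools = {"calculator": lambda expr: "Result: 126.0"}  # canned result, no real evaluation
answer = react_loop("What is 15% of 840?", fake_llm, tools)
print(answer)
```

The second model call sees the error observation in its transcript, which is the recovery path a single-pass CoT prompt does not have.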

Why structured tool schemas?

When an agent hallucinates a tool name, the registry catches it before execution and returns an informative error observation — allowing the agent to recover gracefully rather than crashing. This mirrors how production agentic platforms validate tool calls against registered action schemas before execution.
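The same idea extends from tool names to tool inputs: a failed schema check can be returned as an observation instead of raised. The sketch below is a stdlib-only illustration that checks required fields and basic types; `validate_input` and `TYPE_MAP` are hypothetical names, and Forge's registry may validate differently.

```python
from typing import Any, Dict, Optional

# Maps JSON Schema type names to Python types (a small subset, for illustration).
TYPE_MAP = {"string": str, "number": (int, float), "object": dict}


def validate_input(schema: Dict[str, Any], payload: Dict[str, Any]) -> Optional[str]:
    """Return an error observation string, or None if the payload fits the schema."""
    for field in schema.get("required", []):
        if field not in payload:
            return f"Error: missing required field '{field}'."
    for field, value in payload.items():
        spec = schema.get("properties", {}).get(field)
        if spec is None:
            continue  # unknown fields are ignored in this sketch
        expected = TYPE_MAP.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            return f"Error: field '{field}' must be of type {spec['type']}."
    return None


lookup_schema = {
    "type": "object",
    "properties": {"query": {"type": "string"}},
    "required": ["query"],
}

print(validate_input(lookup_schema, {"query": "ReAct"}))
print(validate_input(lookup_schema, {}))
print(validate_input(lookup_schema, {"query": 42}))
```

Returning a string rather than raising keeps the agent loop alive: the message is fed back as the next observation.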

Why LLM-as-a-Judge evaluation?

Human evaluation of agent traces is slow and inconsistent. Automated scoring provides fast, reproducible quality metrics across correctness (50%), reasoning quality (30%), and tool efficiency (20%) — enabling rapid iteration without manual review.
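The weighting above reduces to a small weighted sum. The sketch below uses the three weights stated in this README; the 0-10 scale, the dictionary keys, and the name `overall_score` are assumptions made for illustration.

```python
from typing import Dict

# Weights from the README; scale and key names are assumptions.
WEIGHTS: Dict[str, float] = {
    "correctness": 0.5,
    "reasoning_quality": 0.3,
    "tool_efficiency": 0.2,
}


def overall_score(scores: Dict[str, float]) -> float:
    """Combine per-criterion judge scores into one weighted total."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"Missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


trace_scores = {"correctness": 8, "reasoning_quality": 6, "tool_efficiency": 10}
print(round(overall_score(trace_scores), 2))
```

For the example trace the total is 0.5 × 8 + 0.3 × 6 + 0.2 × 10 = 7.8 on the assumed 0-10 scale.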

Prompt engineering principles applied

Role definition — explicit agent identity and behavioural constraints in the system prompt
Few-shot examples — three examples covering normal operation, knowledge lookup, and hallucination recovery
Format enforcement — exact Thought / Action / Action Input / Observation / Final Answer structure
Recovery instructions — explicit guidance for what to do when a tool returns an error
Temperature = 0: near-deterministic outputs for agentic tasks where consistency matters; a sampling divergence at step 2 of an 8-step chain propagates through every later step
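Format enforcement pays off on the parsing side: a fixed label per line makes each step easy to extract. The parser below is a minimal sketch, assuming one label per line as in the format listed above; `ParsedStep` and `parse_step` are hypothetical names rather than Forge's actual parser.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedStep:
    thought: Optional[str]
    action: Optional[str]
    action_input: Optional[str]
    final_answer: Optional[str]


def _grab(label: str, text: str) -> Optional[str]:
    """Return the rest of the first line that starts with 'label:', if any."""
    match = re.search(rf"^{re.escape(label)}:[ \t]*(.*)$", text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


def parse_step(text: str) -> ParsedStep:
    return ParsedStep(
        thought=_grab("Thought", text),
        action=_grab("Action", text),  # does not match "Action Input:" lines
        action_input=_grab("Action Input", text),
        final_answer=_grab("Final Answer", text),
    )


step = parse_step(
    "Thought: I need to calculate 15% of 840 first.\n"
    "Action: calculator\n"
    "Action Input: 0.15 * 840"
)
print(step.action, "|", step.action_input)
```

An empty response yields a `ParsedStep` with every field set to `None`, which the caller can turn into an error observation.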

🏗️ Architecture

forge/
├── agents/
│   ├── react_agent.py         # Core ReAct loop — Thought/Action/Observation/Final Answer
│   └── planner_agent.py       # Multi-agent task decomposition, execution, and synthesis
├── tools/
│   └── registry.py            # Tool registry with JSON schema validation + 4 built-in tools
├── prompts/
│   └── system_prompts.py      # Few-shot system prompts for ReAct, Planner, and Evaluator agents
├── evaluation/
│   └── evaluator.py           # LLM-as-a-Judge automated scoring pipeline
├── tests/
│   └── test_agentforge.py     # 24 tests — all passing
├── main.py                    # Demo: single agent, multi-agent, evaluation
└── requirements.txt

🚀 Getting Started

git clone https://github.com/Atri2-code/Forge
cd Forge
pip install -r requirements.txt
export ANTHROPIC_API_KEY=your_key_here
python main.py

🔧 Built-in Tools

| Tool | Input | Description |
|------|-------|-------------|
| `calculator` | expression string | Safe arithmetic evaluator using Python's `math` module |
| `lookup` | `{"query": "concept"}` | Knowledge base lookup for technical concepts |
| `summarise` | `{"text": "long text"}` | Extractive summarisation of long tool outputs |
| `get_date` | any string | Returns current UTC date and time |

Register custom tools via `ToolRegistry.register(Tool(...))`, supplying a name, description, JSON schema, function, and few-shot example.
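A registration call might look like the sketch below. The `Tool` and `ToolRegistry` classes here are simplified stand-ins written for this example: the field names (`schema`, `func`, `example`), the instance-based registry, and the `word_count` tool are all assumptions, so check `tools/registry.py` for the real signatures.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Tool:
    """Stand-in for Forge's Tool; field names are assumptions."""
    name: str
    description: str
    schema: Dict[str, Any]
    func: Callable[[str], str]
    example: str


class ToolRegistry:
    """Stand-in registry that stores tools by name."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"Tool '{tool.name}' is already registered.")
        self._tools[tool.name] = tool

    def get(self, name: str) -> Tool:
        return self._tools[name]

    def names(self) -> List[str]:
        return sorted(self._tools)


registry = ToolRegistry()
registry.register(
    Tool(
        name="word_count",  # hypothetical custom tool
        description="Counts whitespace-separated words in a text.",
        schema={
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
        func=lambda text: str(len(text.split())),
        example="Action: word_count\nAction Input: the quick brown fox",
    )
)
print(registry.names())
```

Keeping the few-shot example on the tool itself means the system prompt can be assembled from the registry, so a new tool brings its own usage demonstration.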

📊 ReAct Loop Example

Task: What is 15% of 840, and what is that squared?

Thought: I need to calculate 15% of 840 first.
Action: calculator
Action Input: 0.15 * 840

Observation: Result: 126.0

Thought: Now I need to square 126.
Action: calculator
Action Input: 126 ** 2

Observation: Result: 15876

Thought: I have both results.
Final Answer: 15% of 840 is 126. Squared, that is 15,876.
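The two calculator calls in this trace can be reproduced with a restricted evaluator. Forge's real `calculator` is not shown in this README; `safe_eval` below is a hypothetical sketch that walks the parsed expression and allows only numbers, basic arithmetic operators, and the constants `pi` and `e` from `math`.

```python
import ast
import math
import operator

# Operators the sketch allows; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}


def safe_eval(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        if isinstance(node, ast.Name) and node.id in ("pi", "e"):
            return getattr(math, node.id)
        raise ValueError(f"Unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval").body)


print(safe_eval("0.15 * 840"))
print(safe_eval("126 ** 2"))
```

Function calls, attribute access, and names other than `pi` and `e` raise `ValueError`. A production version would also want to cap exponent sizes, which this sketch does not do.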

🛡️ Hallucination Recovery

Action: web_search
Action Input: "latest AI news"

Observation: Error: Tool 'web_search' does not exist.
Available tools: calculator, lookup, summarise, get_date.
Please use one of the available tools.

Thought: I don't have a web search tool. I'll use what's available.
Action: get_date
Action Input: "now"
...

The registry catches hallucinated tool names before execution and returns an informative observation — the agent reasons about the error and recovers without crashing.
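A dispatcher that produces this kind of observation can be quite small. The sketch below is an assumption-laden illustration: `dispatch` and the stub tools are hypothetical, and only the error wording is taken from the trace above.

```python
from datetime import datetime, timezone
from typing import Callable, Dict


def dispatch(tools: Dict[str, Callable[[str], str]], name: str, tool_input: str) -> str:
    """Run a tool, turning unknown names and tool failures into observations."""
    if name not in tools:
        available = ", ".join(tools)
        return (
            f"Error: Tool '{name}' does not exist.\n"
            f"Available tools: {available}.\n"
            "Please use one of the available tools."
        )
    try:
        return tools[name](tool_input)
    except Exception as exc:  # a failing tool must not crash the agent loop
        return f"Error: {name} failed: {exc}"


# Stub tools for illustration only; the real built-ins live in Forge's registry.
TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: "Result: 126.0",
    "lookup": lambda query: {"react": "Reason + Act loop"}[query.lower()],
    "summarise": lambda text: text[:20],
    "get_date": lambda _: datetime.now(timezone.utc).isoformat(),
}

print(dispatch(TOOLS, "web_search", "latest AI news"))
```

Both failure modes come back as plain strings, so the agent loop can treat them like any other observation.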

🧪 Tests

pytest tests/ -v

24 tests covering:

Tool registry: registration, lookup, schema validation, missing fields, wrong types
Response parser: thought extraction, action parsing, JSON input handling, final answer detection, empty response
Hallucination recovery: unknown tool names, invalid inputs, error observation content
Integration: single-step final answer, multi-step tool use (mocked Anthropic API)

💡 What I Learned

Why ReAct's observe-reason-act loop outperforms single-pass CoT for multi-step tasks requiring external state
How structured tool schemas enable hallucination recovery — errors become observations, not crashes
Why temperature=0 matters for agentic tasks: it minimises sampling variance, and a divergence at any step compounds errors downstream
How few-shot examples in system prompts shape format adherence more reliably than instructions alone
Why LLM-as-a-Judge evaluation needs explicit scoring rubrics to produce consistent, reproducible scores

📌 Topics

python llm agents agentic-ai react prompt-engineering tool-calling multi-agent orchestration evaluation anthropic claude

📄 License

MIT
