Atri2-code/Forge-Multi-Agent-Orchestration-Framework

⚡ Forge — Multi-Agent Orchestration Framework

Production-grade ReAct agent framework with multi-agent coordination, structured tool calling, hallucination recovery, and automated LLM-as-a-Judge evaluation. Built to understand the full agentic AI stack from prompt engineering to evaluation pipelines.


🧭 About

Forge is a multi-agent orchestration framework that implements the full agentic AI stack — from prompt engineering and tool call architecture through multi-agent coordination to automated quality evaluation.

Three components, one connected system:

  1. ReAct Agent — Reason + Act loop: agent thinks, selects a tool, observes the result, thinks again. Handles hallucinated tool names and invalid inputs gracefully without crashing.
  2. Planner Agent — decomposes complex tasks into subtasks using few-shot prompted LLM decomposition, spawns ReAct executor agents for each, and synthesises a coherent final answer.
  3. Evaluator — automated LLM-as-a-Judge pipeline that scores agent traces on correctness, reasoning quality, and tool use efficiency — enabling rapid iteration on prompts and architecture.
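The planner flow described above (decompose, execute each subtask, synthesise) can be sketched roughly as follows. This is an illustration only, not Forge's actual code: `run_planner` and the three `fake_*` callables are hypothetical names, and the stand-ins replace what would be LLM calls and ReAct executor agents in the real system.

```python
from typing import Callable, List


def run_planner(
    task: str,
    decompose: Callable[[str], List[str]],
    execute: Callable[[str], str],
    synthesise: Callable[[str, List[str]], str],
) -> str:
    """Decompose a task, run one executor per subtask, then merge the results."""
    subtasks = decompose(task)                      # assumed to be an LLM call in practice
    results = [execute(sub) for sub in subtasks]    # one executor agent per subtask
    return synthesise(task, results)                # final answer built from all results


# Stand-in callables (hypothetical) so the sketch runs without an API key.
def fake_decompose(task: str) -> List[str]:
    return [part.strip() for part in task.split(" and ")]


def fake_execute(subtask: str) -> str:
    return f"done: {subtask}"


def fake_synthesise(task: str, results: List[str]) -> str:
    return "; ".join(results)


answer = run_planner(
    "look up ReAct and summarise it",
    fake_decompose,
    fake_execute,
    fake_synthesise,
)
print(answer)
```

Passing the three stages in as callables keeps the control flow testable without a model behind it, which is presumably similar in spirit to how the test suite mocks the Anthropic API.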

✨ Key Design Decisions

Why ReAct over Chain-of-Thought?

CoT generates reasoning in a single pass — it cannot verify facts or correct course. ReAct agents receive real tool outputs as observations, updating their belief state at each step. A CoT agent that hallucinates has no recovery mechanism. A ReAct agent gets an error observation and reasons about it.

Why structured tool schemas?

When an agent hallucinates a tool name, the registry catches it before execution and returns an informative error observation — allowing the agent to recover gracefully rather than crashing. This mirrors how production agentic platforms validate tool calls against registered action schemas before execution.
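A minimal sketch of that check, assuming tools are stored in a plain dictionary keyed by name. The `dispatch` function and the two demo tools are hypothetical; the error wording follows the example in the Hallucination Recovery section.

```python
from typing import Callable, Dict


def dispatch(tools: Dict[str, Callable[[str], str]], name: str, tool_input: str) -> str:
    """Run a registered tool, or return an error observation for an unknown name."""
    if name not in tools:
        available = ", ".join(tools)
        return (
            f"Error: Tool '{name}' does not exist.\n"
            f"Available tools: {available}.\n"
            "Please use one of the available tools."
        )
    try:
        return f"Result: {tools[name](tool_input)}"
    except Exception as exc:  # tool failures also become observations, not crashes
        return f"Error: {exc}"


# Hypothetical demo tools; Forge's real built-ins are listed under Built-in Tools.
tools = {"get_date": lambda _: "2024-01-01", "echo": lambda text: text}

observation = dispatch(tools, "web_search", "latest AI news")
print(observation)
```

Because the function always returns a string, the agent loop can append the result to the transcript whether the call succeeded or not.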

Why LLM-as-a-Judge evaluation?

Human evaluation of agent traces is slow and inconsistent. Automated scoring provides fast, reproducible quality metrics across correctness (50%), reasoning quality (30%), and tool efficiency (20%) — enabling rapid iteration without manual review.
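As a rough illustration of how those weights combine, the sketch below computes an overall score from three per-criterion scores. The 0 to 10 scale, the dictionary keys, and the `overall_score` name are assumptions made for the example; only the 50/30/20 weighting comes from the text above.

```python
# Weights taken from the README; key names are illustrative.
WEIGHTS = {"correctness": 0.5, "reasoning_quality": 0.3, "tool_efficiency": 0.2}


def overall_score(scores: dict) -> float:
    """Weighted sum of per-criterion scores (assumed to share one scale)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


example = {"correctness": 8, "reasoning_quality": 6, "tool_efficiency": 10}
print(round(overall_score(example), 2))
```

For the example trace this works out to 0.5 × 8 + 0.3 × 6 + 0.2 × 10, so a weak reasoning score is partly offset by efficient tool use but correctness still dominates.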

Prompt engineering principles applied

  • Role definition — explicit agent identity and behavioural constraints in the system prompt
  • Few-shot examples — three examples covering normal operation, knowledge lookup, and hallucination recovery
  • Format enforcement — exact Thought / Action / Action Input / Observation / Final Answer structure
  • Recovery instructions — explicit guidance for what to do when a tool returns an error
  • Temperature = 0 — deterministic outputs for agentic tasks where consistency matters; non-determinism at step 2 of an 8-step chain causes compounding errors
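Format enforcement only pays off if the response can be parsed reliably. The following is one possible parser for the Thought / Action / Action Input / Final Answer layout, written as a sketch: `Step`, `_field`, and `parse_step` are hypothetical names, and Forge's own parser may work differently.

```python
import re
from typing import NamedTuple, Optional

# Every label that can start a section in a ReAct-style response.
_LABELS = "Thought|Action|Action Input|Observation|Final Answer"


class Step(NamedTuple):
    thought: Optional[str]
    action: Optional[str]
    action_input: Optional[str]
    final_answer: Optional[str]


def _field(label: str, text: str) -> Optional[str]:
    """Return the text after 'label:' up to the next label line or the end."""
    pattern = rf"^{label}:[ \t]*(.*?)(?=^(?:{_LABELS}):|\Z)"
    match = re.search(pattern, text, flags=re.MULTILINE | re.DOTALL)
    return match.group(1).strip() if match else None


def parse_step(text: str) -> Step:
    """Split one model response into its labelled parts; absent parts are None."""
    return Step(
        thought=_field("Thought", text),
        action=_field("Action", text),
        action_input=_field("Action Input", text),
        final_answer=_field("Final Answer", text),
    )


reply = "Thought: I need to calculate 15% of 840 first.\nAction: calculator\nAction Input: 0.15 * 840"
print(parse_step(reply))
```

Returning `None` for absent fields lets the caller distinguish a tool call (action set, no final answer) from a finished response or an empty one.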

🏗️ Architecture

forge/
├── agents/
│   ├── react_agent.py         # Core ReAct loop — Thought/Action/Observation/Final Answer
│   └── planner_agent.py       # Multi-agent task decomposition, execution, and synthesis
├── tools/
│   └── registry.py            # Tool registry with JSON schema validation + 4 built-in tools
├── prompts/
│   └── system_prompts.py      # Few-shot system prompts for ReAct, Planner, and Evaluator agents
├── evaluation/
│   └── evaluator.py           # LLM-as-a-Judge automated scoring pipeline
├── tests/
│   └── test_agentforge.py     # 24 tests — all passing
├── main.py                    # Demo: single agent, multi-agent, evaluation
└── requirements.txt

🚀 Getting Started

git clone https://github.com/Atri2-code/Forge
cd Forge
pip install -r requirements.txt
export ANTHROPIC_API_KEY=your_key_here
python main.py

🔧 Built-in Tools

| Tool | Input | Description |
|------|-------|-------------|
| calculator | expression string | Safe arithmetic evaluator using Python's math module |
| lookup | {"query": "concept"} | Knowledge base lookup for technical concepts |
| summarise | {"text": "long text"} | Extractive summarisation of long tool outputs |
| get_date | any string | Returns current UTC date and time |
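One common way to build a safe arithmetic evaluator is to parse the expression with Python's `ast` module and allow only a whitelist of node types. The sketch below takes that approach as an assumption; it is not necessarily how Forge's calculator is implemented, and `safe_eval` and the whitelists are hypothetical.

```python
import ast
import math
import operator

_BIN_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.Pow: operator.pow, ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}
_NAMES = {"pi": math.pi, "e": math.e}
_FUNCS = {"sqrt": math.sqrt, "log": math.log, "sin": math.sin, "cos": math.cos}


def safe_eval(expression: str):
    """Evaluate an arithmetic expression without calling eval()."""

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant):
            # Accept plain numbers only; bool is excluded because it subclasses int.
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
        elif isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](visit(node.left), visit(node.right))
        elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](visit(node.operand))
        elif isinstance(node, ast.Name) and node.id in _NAMES:
            return _NAMES[node.id]
        elif (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
              and node.func.id in _FUNCS and not node.keywords):
            return _FUNCS[node.func.id](*[visit(arg) for arg in node.args])
        # Anything not whitelisted above (attribute access, imports, ...) is rejected.
        raise ValueError(f"unsupported expression: {expression!r}")

    return visit(ast.parse(expression, mode="eval"))


print(safe_eval("126 ** 2"))
```

An expression such as `__import__('os')` parses as a call to a name outside the whitelist, so it raises `ValueError` instead of executing. A production version would likely also cap exponent sizes to avoid very expensive computations.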

Register custom tools with ToolRegistry.register(Tool(...)), supplying a name, description, JSON schema, function, and few-shot example.
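To make the shape of a registration concrete, here is a self-contained sketch. The `Tool` and `ToolRegistry` names come from the README, but everything else is an assumption: the field names, the instance-level `register`, the `validate` method, and the simplified schema (a mapping of field name to Python type standing in for a real JSON schema).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Tool:
    name: str
    description: str
    schema: Dict[str, type]          # simplified stand-in for a JSON schema
    function: Callable[..., Any]
    example: str                     # few-shot example shown to the agent


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"tool already registered: {tool.name}")
        self._tools[tool.name] = tool

    def validate(self, name: str, payload: Dict[str, Any]) -> List[str]:
        """Return a list of problems; an empty list means the input is valid."""
        problems = []
        for field, expected in self._tools[name].schema.items():
            if field not in payload:
                problems.append(f"missing field: {field}")
            elif not isinstance(payload[field], expected):
                problems.append(f"{field} should be {expected.__name__}")
        return problems


registry = ToolRegistry()
registry.register(Tool(
    name="word_count",  # hypothetical custom tool
    description="Counts the words in a text",
    schema={"text": str},
    function=lambda text: len(text.split()),
    example='Action: word_count\nAction Input: {"text": "hello world"}',
))
print(registry.validate("word_count", {"text": 42}))
```

Validation returns messages rather than raising, so a missing field or wrong type can be fed back to the agent as an observation in the same way as an unknown tool name.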


📊 ReAct Loop Example

Task: What is 15% of 840, and what is that squared?

Thought: I need to calculate 15% of 840 first.
Action: calculator
Action Input: 0.15 * 840

Observation: Result: 126.0

Thought: Now I need to square 126.
Action: calculator
Action Input: 126 ** 2

Observation: Result: 15876

Thought: I have both results.
Final Answer: 15% of 840 is 126. Squared, that is 15,876.
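The trace above can be reproduced with a small loop and a scripted stand-in for the model. This is a sketch under stated assumptions rather than the code in `react_agent.py`: `react_loop`, `scripted_llm`, the toy `calculator`, and the `max_steps=8` default are all hypothetical.

```python
import operator
from typing import Callable, Dict

_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "**": operator.pow}


def calculator(expr: str) -> str:
    """Toy calculator: handles only '<number> <op> <number>'."""
    left, op, right = expr.split()
    return str(_OPS[op](float(left), float(right)))


def _line_value(reply: str, label: str) -> str:
    """Return the rest of the first line that starts with the given label."""
    for line in reply.splitlines():
        if line.startswith(label):
            return line[len(label):].strip()
    return ""


def react_loop(task: str, llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]], max_steps: int = 8) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = llm(transcript)                      # Thought + Action, or Final Answer
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        name = _line_value(reply, "Action:")
        tool = tools.get(name)
        if tool is None:
            observation = f"Error: Tool '{name}' does not exist."
        else:
            try:
                observation = f"Result: {tool(_line_value(reply, 'Action Input:'))}"
            except Exception as exc:                 # errors are fed back, not raised
                observation = f"Error: {exc}"
        transcript += f"Observation: {observation}\n"
    return "Stopped: step limit reached."


# Scripted replies stand in for the real model so the sketch runs offline.
_SCRIPT = iter([
    "Thought: I need to calculate 15% of 840 first.\nAction: calculator\nAction Input: 0.15 * 840",
    "Thought: Now I need to square 126.\nAction: calculator\nAction Input: 126 ** 2",
    "Thought: I have both results.\nFinal Answer: 15% of 840 is 126. Squared, that is 15,876.",
])


def scripted_llm(transcript: str) -> str:
    return next(_SCRIPT)


answer = react_loop("What is 15% of 840, and what is that squared?",
                    scripted_llm, {"calculator": calculator})
print(answer)
```

The step limit guards against a model that never emits a final answer, and each observation is appended to the transcript so the next model call sees the full history.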

🛡️ Hallucination Recovery

Action: web_search
Action Input: "latest AI news"

Observation: Error: Tool 'web_search' does not exist.
Available tools: calculator, lookup, summarise, get_date.
Please use one of the available tools.

Thought: I don't have a web search tool. I'll use what's available.
Action: get_date
Action Input: "now"
...

The registry catches hallucinated tool names before execution and returns an informative observation — the agent reasons about the error and recovers without crashing.


🧪 Tests

pytest tests/ -v

24 tests covering:

  • Tool registry: registration, lookup, schema validation, missing fields, wrong types
  • Response parser: thought extraction, action parsing, JSON input handling, final answer detection, empty response
  • Hallucination recovery: unknown tool names, invalid inputs, error observation content
  • Integration: single-step final answer, multi-step tool use (mocked Anthropic API)

💡 What I Learned

  • Why ReAct's observe-reason-act loop outperforms single-pass CoT for multi-step tasks requiring external state
  • How structured tool schemas enable hallucination recovery — errors become observations, not crashes
  • Why temperature=0 is critical for agentic tasks — non-determinism at any step compounds errors downstream
  • How few-shot examples in system prompts shape format adherence more reliably than instructions alone
  • Why LLM-as-a-Judge evaluation needs explicit scoring rubrics to produce consistent, reproducible scores

📌 Topics

python llm agents agentic-ai react prompt-engineering tool-calling multi-agent orchestration evaluation anthropic claude


📄 License

MIT
