Atomadic Forge is an architecture compiler: a deterministic, LLM-free tool that scores, enforces, and certifies the structural quality of any Python codebase. It does not compete with Cursor, Aider, Codex, or Devin — it is the certainty layer those tools are missing. AI coding agents generate fast; Forge tells them whether what they generated is structurally correct. The result is a tight loop: agent writes, Forge certifies, agent fixes, Forge re-certifies. Together they ship.
Every AI coding tool faces the same problem after the first draft lands: Is this code actually correct? Not "does it look right" — is it wired correctly, tier-disciplined, importable, and testable in isolation?
LLMs cannot answer that reliably. They hallucinate import paths, invent missing modules, and produce code that looks reasonable but breaks on the second dependency. The gap between "generated" and "shippable" is the 2.4-trillion-dollar global tech-debt problem.
Forge fills that gap with deterministic analysis:
- Recon — classifies every symbol in a repo into one of 5 monadic tiers without running any LLM.
- Wire — finds every upward-import violation in O(n) AST time.
- Certify — returns a single 0-100 score covering documentation, tests, tier layout, and import discipline. Agent-callable go/no-go gate.
- Enforce — plans and applies mechanical fixes for wire violations, rolls back anything that increases the violation count.
The architectural boundary is clean: the calling agent supplies the LLM reasoning. Forge supplies analytical certainty — parsing, tier classification, wire checking, certify scoring. Neither side tries to do the other's job.
This is a build log. Every claim below is grounded in committed code.
Three bug fixes ported from the Forge Deluxe sibling repo, plus a registry usability improvement that touches all 29 tools.
Bug 1: enforce destination path
The absorb pipeline emitted files to <pkg>/aN_*/ instead of
src/<pkg>/aN_*/ when the output directory had an src/ wrapper. Teams
absorbing legacy repos into an existing src/ layout got a broken import
tree. Fix: _build_dest_path in a1_at_functions/enforce_planner.py now
normalizes the destination path, preserving the source's parent path
(including the src/ prefix). Windows backslashes are normalized to forward
slashes so the output is cross-OS portable.
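As a minimal sketch of the repaired logic (function and argument names here are illustrative, not the actual enforce_planner signatures):

```python
from pathlib import PurePosixPath, PureWindowsPath

def build_dest_path(pkg_root: str, tier_dir: str, filename: str) -> str:
    """Illustrative sketch: keep the package root's full parent path
    (including any src/ wrapper) and always emit forward slashes."""
    # Accept Windows-style input: PureWindowsPath splits on both \ and /.
    root_parts = PureWindowsPath(pkg_root).parts
    # Re-join as POSIX so backslashes never leak into the emitted path.
    return str(PurePosixPath(*root_parts) / tier_dir / filename)

print(build_dest_path("src\\mypkg", "a1_at_functions", "enforce_planner.py"))
# src/mypkg/a1_at_functions/enforce_planner.py
```

The point is the invariant, not the helper: whatever parent path the source package had, the destination keeps it verbatim, in POSIX form.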
Bug 2: dedup defensive caps
dedup_engine in a3_og_features/dedup_engine.py walked the corpus twice —
once to collect files, once to count them. On an 821-file cross-package
corpus, this doubled I/O and silently ignored the max_files truncation
marker. Callers got a files_scanned number that did not match what was
actually analyzed. Fix: the inner functions now return the walked count
directly so run_dedup uses it without a second rglob pass.
Caps added: DEDUP_MAX_FILES = 5000, DEDUP_MAX_DUPLICATES_PER_GROUP = 64.
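A hedged sketch of the single-pass shape (names are illustrative; the real dedup_engine internals differ):

```python
from pathlib import Path

DEDUP_MAX_FILES = 5000  # defensive cap, per the commit

def collect_py_files(root: Path, max_files: int = DEDUP_MAX_FILES):
    """One rglob pass returns both the kept files and the true walked
    count, so the caller never needs a second pass just to count."""
    files, walked, truncated = [], 0, False
    for path in root.rglob("*.py"):
        walked += 1
        if len(files) >= max_files:
            truncated = True  # surface the cap; never silently drop the marker
            continue
        files.append(path)
    return files, walked, truncated
```

The caller's files_scanned now comes from the same walk that produced the file list, so the two numbers cannot drift apart.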
Bug 3: pytest parser regression
_parse_pytest_summary in a1_at_functions/test_runner.py used a
single-pass regex that choked on xfailed/xpassed status words between
"passed" and "in Xs". The certify behavioral subscore reported ran: false, pass_ratio: 0.0 on repos using parametrized marks — the score lied.
Fix: independent per-metric regex patterns for each status word.
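The per-metric shape can be sketched like this (a simplified stand-in, not the shipped _parse_pytest_summary):

```python
import re

# One independent pattern per status word: the order in pytest's summary
# line ("2 passed, 1 xfailed, 1 xpassed in 0.41s") no longer matters.
_PATTERNS = {
    "passed": re.compile(r"(\d+) passed\b"),
    "failed": re.compile(r"(\d+) failed\b"),
    "skipped": re.compile(r"(\d+) skipped\b"),
    "xfailed": re.compile(r"(\d+) xfailed\b"),
    "xpassed": re.compile(r"(\d+) xpassed\b"),
    "errors": re.compile(r"(\d+) errors?\b"),  # singular and plural
}

def parse_pytest_summary(line: str) -> dict:
    counts = {}
    for key, pattern in _PATTERNS.items():
        m = pattern.search(line)
        counts[key] = int(m.group(1)) if m else 0
    total = counts["passed"] + counts["failed"]
    return {
        "ran": any(counts.values()),
        "pass_ratio": counts["passed"] / total if total else 0.0,
        **counts,
    }
```

Because each pattern anchors on a digit followed by its own status word, "1 xpassed" can never be miscounted as a "passed" match, and unknown words between fields are simply ignored.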
Registry improvement: 29 "Use this when:" trigger prefixes
Every entry in the TOOLS dict in mcp_protocol.py now starts with a
one-liner that names the correct trigger condition. Before this change, an
agent had to read the full description to know whether auto_plan or
auto_apply was the right call. After: the first sentence of each
description is a decision criterion. Agent tool-selection overhead drops
from "read and reason about N descriptions" to "match the trigger phrase."
Validation at commit: 968 tests passing, certify 100/100, wire 0 violations.
Seven reliability fixes targeting the failure class most visible to
integration teams: MCP error -32000 Connection closed.
Fix 1: broad dispatcher except clause
dispatch_request in mcp_protocol.py previously caught only
(ValueError, OSError, RuntimeError) at the tool-call boundary. A
TypeError from signature drift — a parameter type mismatch in any tool
handler — escaped the catch, crashed the stdio loop, and produced the
dreaded -32000 Connection closed error. The calling agent had no recovery
path. Fix: the handler is now wrapped in except Exception, with a comment
in the source explaining exactly why broad catching is correct at the
transport boundary (callers never see a Python traceback; errors surface as
JSON-RPC -32000 error responses that agents can inspect and retry).
Fix 2: explicit _json_default
The MCP wire previously used json.dumps(..., default=str). That silently
corrupted set/frozenset (rendered as "{1, 2, 3}" instead of [1, 2, 3]),
bytes (rendered as "b'\\xff...'" instead of base64), and dataclasses
(rendered as "Foo(x=5)" — field access lost). No shipped tool emitted these
types at the time, but any future tool that returns structured Python types
would silently break agent consumers. Fix: _json_default in mcp_protocol.py
explicitly handles set/frozenset, bytes/bytearray, Path, and dataclasses
with correct round-trip encodings. It is a forward-compatibility substrate
guarantee.
Fix 3: score_patch returns REFINE on malformed input
Documented in FORGE_DOGFOOD_NOTES.md §G-7: calling score_patch with a
diff fragment that lacked the diff --git header context returned
file_count: 0, needs_human_review: false, verdict: PASS. An agent that
trusts the verdict merges an unscored patch. Fix: when file_count == 0,
patch_scorer.py now sets needs_human_review = true and emits
verdict: REFINE. Silent green on nothing is not a passing grade.
Fix 4: contract tests for JSON round-trip
Every tool handler in the TOOLS registry is now exercised by a contract test
that asserts the return value is json.dumps-safe. If any future tool
returns a non-serializable type, the test suite catches it before it reaches
the MCP wire.
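The contract-test pattern, sketched with a toy registry (the real TOOLS registry holds every shipped handler):

```python
import json

# Hypothetical stand-in for the real TOOLS registry: name -> handler.
TOOLS = {
    "certify": lambda params: {"score": 100, "verdict": "PASS"},
    "wire": lambda params: {"violations": []},
}

def test_all_handlers_are_json_safe():
    """Contract: every handler's return value must survive json.dumps
    with no fallback encoder, so nothing non-serializable reaches the wire."""
    for name, handler in TOOLS.items():
        result = handler({})
        json.dumps(result)  # raises TypeError if any value is not JSON-native
```

A handler that returns, say, a set fails this test at suite time rather than corrupting an agent's response at run time.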
Fixes 5-7: _build_dest_path backslash normalization, dedup_engine
double-walk, _parse_pytest_summary plural/singular forms
These are continuation hardening of the three bugs from commit 42f35bc,
with regression tests added for each.
Validation at commit: 975 tests passing, certify 100/100, wire 0 violations.
Three new MCP tools bridging the CLI transmuter pipeline to agent consumers.
The CLI verbs forge auto, forge cherry, and forge finalize already
shipped and were battle-tested. Round 2 wraps them as agent-callable MCP
tools so Cursor, Claude Code, Aider, and any MCP client can invoke the full
pipeline in one call.
auto — flagship single-shot pipeline
scout → cherry → assimilate → wire → certify against any source repo.
Pass any directory; get back a tier-organized, wire-clean, certify-scored
package. No LLM involved inside Forge at any step. Agent supplies the target;
Forge supplies the analytical verdict.
cherry — surgical manifest builder
Reads or builds the latest scout report, applies a selection filter
(explicit names, pick_all, or only_tier), and writes a cherry.json
manifest. Gives the calling agent fine-grained control over what gets
absorbed before running finalize. Useful when absorbing a large legacy
repo and wanting only specific modules.
finalize — materialize + wire + certify
Takes a cherry-pick manifest and a target directory, assimilates the
selected symbols into a tier-organized package, then runs wire and certify.
Returns atomadic-forge.assimilate/v1 with the full certify score.
All three handlers follow the same a1↔a3 injection pattern already used by
enforce, auto_plan, auto_step, and auto_apply: the a1 dispatcher
holds an unbound stub that returns {wired: false} in test isolation; the
a3 mcp_server module registers the real handler at import time via
register_*_handler.
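The injection pattern itself is small enough to sketch (module layout and names here are illustrative):

```python
# a1 side: the dispatcher owns a stub slot.
_auto_handler = None

def register_auto_handler(handler):
    """Called by the a3 mcp_server module at import time."""
    global _auto_handler
    _auto_handler = handler

def dispatch_auto(params: dict) -> dict:
    # In test isolation nothing has registered a handler: return a
    # structured response instead of raising, so callers can detect
    # the unwired state explicitly.
    if _auto_handler is None:
        return {"wired": False}
    return _auto_handler(params)
```

The a1 tier never imports a3, so tier discipline holds; the dependency flows one way, at import time, through the register call.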
Total MCP tool count after Round 2: 32 (was 29).
Validation at commit: 983 tests passing, certify 100/100, wire 0 violations.
Most architecture linters find what is wrong. emergent_scan finds what
is missing — specifically, pairs and chains of existing functions where
one's output type already feeds the next's input but no code path wires them.
This is not LLM inference. It is deterministic type-flow analysis across the repo's AST. The result is a ranked list of latent compositions with a proposed a3 feature name and a one-call adapter snippet.
Proof of scale: In the atomadic-lang stress test, emergent_scan ran
against an 818-file merged super-package combining five Atomadic repos. It
surfaced 25 latent cross-domain emergent pipelines with a top score of 75/100.
These are compositions that NO single canonical repo expressed individually —
they became visible only because the merge put five packages in one tree.
This is portfolio-scale architecture archaeology. Nothing else does this.
The most expensive AI coding loop is the one that writes code, runs tests,
fails, rewrites, fails again, and then discovers the file was in the wrong
tier. preflight_change surfaces that error before the first write.
Pass an intent string and a list of proposed files. Get back the detected tier of each file, forbidden imports, likely affected tests, sibling files to read first, and whether the write scope is too broad. Most agent mistakes happen before code is written. Preflight makes them visible in advance.
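For illustration, an exchange might look like this (all field names are assumptions based on the prose above, not the actual preflight_change schema):

```python
# Hypothetical request/response shapes for a preflight_change call.
request = {
    "intent": "add a retry helper for flaky network calls",
    "files": [
        "src/mypkg/a1_at_functions/retry.py",
        "src/mypkg/a3_og_features/dedup_engine.py",
    ],
}
response = {
    "tiers": {
        "src/mypkg/a1_at_functions/retry.py": "a1",
        "src/mypkg/a3_og_features/dedup_engine.py": "a3",
    },
    "forbidden_imports": [],        # an a1 file importing a3 would be listed here
    "likely_affected_tests": ["tests/test_dedup_engine.py"],
    "siblings_to_read_first": ["src/mypkg/a3_og_features/mcp_server.py"],
    "scope_too_broad": False,
}
```

The agent reads the tiers and forbidden imports before the first write, instead of discovering them from a failed certify afterwards.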
A single integer from 0 to 100 covering documentation presence, test
coverage and pass rate, tier layout conformance, and import discipline.
With emit_receipt: true, it also emits a Receipt v1 artifact — a
verifiable trust artifact with a hash chain suitable for audit handoff.
The score is deterministic given the same repo state. Two agents, two machines, same code: same score. That is the property LLM-based quality assessment cannot provide.
certify --emit-receipt produces a structured receipt with schema version,
score breakdown, wire status, and a canonical hash of the lineage chain.
The receipt format is documented in docs/RECEIPT.md. Compliance teams,
procurement teams, and CI/CD pipelines can consume it as a structured
assertion of architecture quality at a point in time.
The dispatch_request function is the MCP transport boundary. It catches
exceptions from tool handlers and converts them to JSON-RPC error responses.
Before the fix, the catch tuple was (ValueError, OSError, RuntimeError).
A TypeError from any tool handler — wrong parameter type, unexpected
keyword argument, changed function signature during development — escaped
the boundary, terminated the stdio loop, and produced -32000 Connection closed. The agent got no actionable error. The fix uses except Exception
at this boundary, which is correct: the entire purpose of the boundary is
to prevent per-request errors from propagating to the transport. The
docstring in the fixed code explains why the broad catch is intentional.
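A minimal sketch of the boundary (simplified; the real dispatcher also validates JSON-RPC fields):

```python
def dispatch_request(request: dict, tools: dict) -> dict:
    """Transport-boundary handler (illustrative signatures). The broad
    except Exception is intentional here: a per-request failure must become
    a JSON-RPC error object, never a crash of the stdio loop."""
    try:
        handler = tools[request["method"]]
        result = handler(request.get("params", {}))
        return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}
    except Exception as exc:  # the boundary's whole job is to catch everything
        return {
            "jsonrpc": "2.0",
            "id": request.get("id"),
            "error": {"code": -32000, "message": f"{type(exc).__name__}: {exc}"},
        }
```

A TypeError from signature drift now comes back as an inspectable -32000 error response the agent can retry, not a dead connection.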
Using default=str in json.dumps is a common shortcut that feels safe.
It is not safe at a protocol boundary. When a tool handler returns a set,
the caller gets "{1, 2, 3}" — a string representation of the set, not a
JSON array. The caller cannot iterate it. When a tool returns a dataclass,
the caller gets "Foo(x=5)" — field access is gone. The fix introduces
_json_default with explicit, correct encodings for every non-native Python
type: sorted list for sets, base64 dict for bytes, POSIX path string for
Path, and dataclasses.asdict for dataclasses. Any future tool that returns
a structured type gets correct encoding by construction.
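Sketched below: the function name matches the prose, but the branch details are a plausible reconstruction rather than the committed source:

```python
import base64
import dataclasses
from pathlib import Path

def json_default(obj):
    """Explicit encoder for json.dumps(default=...): every non-native type
    gets a lossless, machine-consumable encoding instead of a lossy repr."""
    if isinstance(obj, (set, frozenset)):
        return sorted(obj)                      # JSON array, iterable again
    if isinstance(obj, (bytes, bytearray)):
        return {"b64": base64.b64encode(bytes(obj)).decode("ascii")}
    if isinstance(obj, Path):
        return obj.as_posix()                   # stable across OSes
    if dataclasses.is_dataclass(obj):
        return dataclasses.asdict(obj)          # field access preserved
    raise TypeError(f"not JSON-serializable: {type(obj).__name__}")
```

Unlike default=str, the final branch raises, so a genuinely unknown type fails loudly in tests instead of shipping a stringified repr.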
score_patch is the PR-reviewer gate. When it returns verdict: PASS,
an agent is entitled to trust that the patch has been scored. Returning
verdict: PASS when the diff was malformed and no files were parsed is a
confidence failure — it is score_patch certifying that it did not score
anything. The fix: file_count == 0 now unconditionally sets
needs_human_review: true and emits verdict: REFINE. Malformed input
cannot produce a passing verdict.
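The repaired gate can be sketched like this (function name and the 0.8 threshold are illustrative, not the patch_scorer internals):

```python
def score_verdict(file_count: int, raw_score: float) -> dict:
    """A diff that parsed to zero files can never pass, whatever the
    raw score says."""
    if file_count == 0:
        return {
            "file_count": 0,
            "needs_human_review": True,
            "verdict": "REFINE",  # silent green on nothing is not a pass
        }
    verdict = "PASS" if raw_score >= 0.8 else "REFINE"  # threshold assumed
    return {"file_count": file_count, "needs_human_review": False,
            "verdict": verdict}
```

The file_count check runs before any scoring logic, so malformed input short-circuits to REFINE unconditionally.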
The transmuter — scout → cherry → assimilate → wire → certify — has
shipped as a CLI pipeline since early in Forge's history. Round 2 makes
it agent-callable.
The key design point: no LLM runs inside Forge at any step of the pipeline. The pipeline is fully deterministic. The calling agent supplies the goal; Forge supplies the execution and the verdict. This keeps the substrate boundary clean and makes the pipeline repeatable across different agents, different LLM providers, and different execution environments.
Before Round 2, the pipeline was available only to developers running Forge
directly. After Round 2, any MCP-capable AI coding tool — Cursor, Claude
Code, Aider, Devin, a custom agent — can invoke auto and get a
tier-organized, wire-clean, certify-scored package in one call.
The three tools (auto, cherry, finalize) are wired using the same
injection pattern already proven for enforce and auto_plan: a1 holds a
typed stub, a3 registers the real handler at import time. Test isolation
works because the stub returns a structured {wired: false} response instead
of failing silently.
The current iterate and evolve CLI verbs hold an LLM client inside Forge.
The agent runs the outer loop but Forge manages the LLM calls internally.
This creates a substrate problem: the LLM provider, model selection, token
budget, and retry logic are coupled to Forge's internal implementation.
An agent using Cursor cannot substitute its own LLM. An agent with a custom
provider cannot plug in.
Round 3 moves the LLM call out of Forge entirely.
The new shape:
- iterate_start(intent, output, package, seed_repo, language) — Forge returns the system prompt, the first code-generation prompt, and a session ID. The calling agent takes these and drives its own LLM.
- iterate_continue(session_id, response) — The calling agent returns the LLM's response. Forge parses the files, runs wire and certify, and returns the next prompt (if more iterations are needed) or a final verdict (if the target score is reached). No LLM is invoked inside Forge.
The same pattern applies to evolve_start and evolve_step for the
recursive variant.
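From the calling agent's side, the loop sketches out like this (the response field names session_id, prompt, done, and verdict are assumptions about the wire shape):

```python
def run_iterate_loop(forge, call_my_llm, intent: str) -> dict:
    """Drive the Round-3 loop from the agent side. `forge` is any client
    exposing the two tools; `call_my_llm` is the agent's own model call.
    Forge never sees the LLM client."""
    state = forge.iterate_start(intent=intent)
    system_prompt = state["system_prompt"]
    while True:
        draft = call_my_llm(system_prompt, state["prompt"])
        state = forge.iterate_continue(state["session_id"], draft)
        if state["done"]:            # target certify score reached
            return state["verdict"]
```

Swapping LLM providers means swapping call_my_llm; nothing inside Forge changes.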
This is the architectural move that locks Forge into the substrate position: Forge becomes the analytical engine that every LLM-based coding tool wants to call between turns. The calling agent's LLM does generation. Forge does verification. The agent retries until Forge says it is done. Neither side needs to know the other's internals.
emergent_scan works within a single repo. Round 4 extends it to portfolios.
Two new tools:
recon_swarm(repos) — runs the scout walker across multiple repos and
merges the symbol manifests into a union inventory. Each symbol is tagged
with its source repo. The merge is the same deterministic AST walk, applied
at scale.
harvest(target, sources) — given a target repo and one or more
source repos, finds capabilities that the target lacks but that exist in the
sources, and proposes a graft plan: which symbols to cherry-pick, which tier
they belong to, and what the wire graph looks like after the graft.
The proof of concept already exists in the data: the atomadic-lang stress test
surfaced 25 cross-domain emergent pipelines in an 818-file merged super-package.
Those compositions were invisible to any per-repo scan. harvest makes
them systematically discoverable without requiring the manual merge step.
This is the feature that turns Forge from a single-repo architecture linter into a portfolio-scale composition engine.
Wire up forge mcp serve as a local MCP server. Every repo you open in
Cursor immediately has access to context_pack (first-call orientation),
preflight_change (pre-edit guardrail), score_patch (PR reviewer), and
certify (go/no-go gate). The agent does not need to read the codebase to
know which tier a file belongs to — Forge tells it.
Call forge certify . after each edit round. Aider's iteration loop gets
a structured 0-100 score it can use as a convergence criterion. When certify
reports a wire violation, Aider can call forge wire --suggest-repairs to
get the exact fix plan without running the LLM again.
The auto MCP tool is the single-call absorb pipeline. Point it at any
legacy repo and get back a tier-organized package ready for further agent
development. The substrate boundary is the key: Codex or Claude Code drives
the generation; Forge certifies the result. The agent loop converges because
the acceptance criterion is deterministic.
auto_plan generates a ranked set of action cards for any refactor goal.
Each card specifies kind, risk, write scope, and a next_command. Devin can
consume the plan, filter cards by risk, and execute the applyable ones via
auto_apply. Forge rolls back any card that increases the violation count.
Three lines to integrate:

```shell
pip install atomadic-forge
forge init
forge mcp serve --project-root .
```

Add the stdio MCP server to your agent config. Call tools/list to see all
32 available tools. Start with context_pack for first-call orientation.
The self-application demo: Forge's own codebase as the target for
harvest, auto_plan, and the Round 3 iterate loop. The agent
generates improvements to Forge; Forge certifies them; the loop closes.
This is the recursive self-improvement demonstration that turns the
architecture-compiler thesis into a live proof.
The substrate invariant holds throughout: the agent drives the LLM, Forge drives the verification. Forge never certifies itself without an independent execution of the same certify tool.
This is the round that turns forge from an architecture-compiler-for-one-repo into a portfolio-level capability harvester. Two new MCP tools land, both bounded by codex-derived invariants:
```shell
mcp__atomadic-forge__recon_swarm   # walk N≤23 repos → unified scout report
mcp__atomadic-forge__harvest       # diff target vs union → graft proposals
```
The genuinely novel capability: forge surfaces capabilities your target repo lacks but your portfolio has — automatically, with no human in the loop until the verdict needs review. Every proposal carries an explicit trust verdict tied to a Lean-4-verified systemic-integrity threshold:
- TAU_TRUST ≈ 0.9983543609... — Lean-4-verified sovereign invariant. Above this confidence, a proposal is trust_gated (auto-applyable). Below, review_required. Not a magic number — a proven threshold.
- D_MAX_DELEGATION = 23 — Lean-4-verified hard cap on swarm size. Asking for 24+ repos returns -32000, never silently truncates. Forge fails loud.
- OMEGA_0 ≡ 0 — Zero Unmapped Noise Floor. Every swarm report computes omega_residue and asserts the invariant holds. Non-zero residue means a per-repo walk corrupted state — surface it, never bury it.
These are imported with provenance in every docstring and run assertion-based self-checks at import time. Drift any anchor and Forge refuses to import — fail-fast, invariant-anchored. The derivation that produces these values is published separately under commercial license; the values themselves are auditable Lean-4 outputs.
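The fail-fast pattern can be sketched like this (TAU_TRUST is truncated here for illustration, and the check logic is a plausible shape, not the shipped module):

```python
# Anchors with provenance in the docstring; values are auditable outputs
# of the separately published derivation.
TAU_TRUST = 0.9983543609          # truncated for illustration
D_MAX_DELEGATION = 23
OMEGA_0 = 0

# Assertion-based self-checks run at import time: drift any anchor and
# the module refuses to import.
assert 0.0 < TAU_TRUST < 1.0, "TAU_TRUST must be a confidence in (0, 1)"
assert D_MAX_DELEGATION == 23, "swarm cap drifted from the verified bound"
assert OMEGA_0 == 0, "noise-floor anchor drifted"

def check_swarm_size(n_repos: int) -> None:
    """Fail loud, never truncate: oversized swarms are rejected outright."""
    if n_repos > D_MAX_DELEGATION:
        raise ValueError(
            f"-32000: swarm size {n_repos} exceeds cap {D_MAX_DELEGATION}"
        )
```

Because the asserts execute at import, a bad anchor breaks the build everywhere at once instead of silently miscounting at run time.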
This is the architectural commitment that makes forge worthy of integration with AAAA-Nexus (trust gate) and atomadic-lang (overfit-bounded tokenizer):
forge + lang + nexus = code that is trust-verified (Nexus), hallucination-free
(codex anchors), fully compliant (forge wire+certify), with 4× fewer tokens
(lang Φ_tri tokens).
| Commit | Round | Headline | TOOLS | Tests pass |
|---|---|---|---|---|
| 42f35bc | R0 | 3 deluxe-ported bug fixes + 29 "Use this when:" descriptions | 29 | 968 |
| 27372ae | R1 | Substrate hardening — broadened except, json default, score_patch REFINE | 29 | 975 |
| d4bc0f3 | R2 | Transmuter exposed — auto/cherry/finalize MCP | 32 | 983 |
| 1abc2b2 | R3 | Agent-driven iterate — substrate boundary fix | 34 | 989 |
| (this) | R4 | Cross-repo cherry-hunter — MHED-trust-gated | 36 | 995 |
Every commit: ruff clean, forge wire 0 violations, forge certify 100/100 PASS.
The substrate boundary held all the way through. No commit constructed
an LLM client inside forge. The Round-3 contract test patches
resolve_default_client and asserts call count == 0 across iterate sessions.
That guarantee is now part of the test suite — anyone who tries to "just
quickly add an LLM call inside forge" breaks the build.
- Architecture-as-a-tool: the 5-tier monadic law, machine-enforced.
- Pre-flight + score_patch: ASK BEFORE WRITING, not assess after.
- emergent_scan: cross-domain composition discovery — patterns developers never noticed, surfaced from recon.
- harvest: portfolio-level capability harvesting with codex-grade trust gates. Find capabilities your target lacks; graft them with formal provenance. No other tool does this.
- iterate_start / iterate_continue: agent-driven generation loop. Forge does NOT hold an LLM. The agent reasons; forge certifies between turns.
- Receipt v1: verifiable trust artifact every operation emits.
```shell
pip install atomadic-forge
forge init
# Option A — deterministic transmute, no LLM:
forge auto --target /path/to/your/legacy/repo --output ./out --apply
# Option B — cross-repo harvest over your portfolio (the CLI-callable
# end of the MCP-side recon_swarm/harvest verbs):
forge cherry --target ./your-repo --pick all
```

Wire the MCP server up:
Now every MCP-aware agent in your stack — Cursor, Claude Code, Aider, Devin, your own custom orchestrator — has 36 forge tools available, including the agent-driven iterate loop and the cross-repo cherry-hunter. The certainty layer is a tool call away.
- Round 3.5 — evolve_start/evolve_step: recursive iterate, D_max-bounded rounds, treating prior output as the seed catalog for the next round. The substrate boundary stays clean: the agent drives the LLM; forge tracks the evolve session state and orchestrates the round transitions deterministically.
- Round 5 — emergent_swarm: feed recon_swarm's union into emergent_scan to surface compositions that span repos. The 25 cross-domain pipelines the lang stress test made vivid become a one-MCP-call routine.
- Round 6 — self_evolve: forge runs forge against its own source with a target_score above current. Recursive self-improvement, trust-gated, codex-anchored.
Forge does not claim to replace human engineering judgment. It claims to make the verification step deterministic, fast, agent-callable, and mathematically rigorous — anchored to Lean-4-verified sovereign invariants — so that human judgment is reserved for what actually requires it.
Spaghetti hell, freezing over. ❄️
```json
{
  "mcpServers": {
    "atomadic-forge": {
      "command": "python",
      "args": ["-m", "atomadic_forge.a3_og_features.mcp_server"]
    }
  }
}
```