Skip to content

Latest commit

 

History

History
553 lines (419 loc) · 24.1 KB

File metadata and controls

553 lines (419 loc) · 24.1 KB

DEMO.md — atomadic-forge, the architecture-compiler substrate


TL;DR

Atomadic Forge is an architecture compiler: a deterministic, LLM-free tool that scores, enforces, and certifies the structural quality of any Python codebase. It does not compete with Cursor, Aider, Codex, or Devin — it is the certainty layer those tools are missing. AI coding agents generate fast; Forge tells them whether what they generated is structurally correct. The result is a tight loop: agent writes, Forge certifies, agent fixes, Forge re-certifies. Together they ship.


The substrate thesis

Every AI coding tool faces the same problem after the first draft lands: Is this code actually correct? Not "does it look right" — is it wired correctly, tier-disciplined, importable, and testable in isolation?

LLMs cannot answer that reliably. They hallucinate import paths, invent missing modules, and produce code that looks reasonable but breaks on the second dependency. The gap between "generated" and "shippable" is the 2.4-trillion-dollar global tech-debt problem.

Forge fills that gap with deterministic analysis:

  • Recon — classifies every symbol in a repo into one of 5 monadic tiers without running any LLM.
  • Wire — finds every upward-import violation in O(n) AST time.
  • Certify — returns a single 0-100 score covering documentation, tests, tier layout, and import discipline. Agent-callable go/no-go gate.
  • Enforce — plans and applies mechanical fixes for wire violations, rolls back anything that increases the violation count.

The architectural boundary is clean: the calling agent supplies the LLM reasoning. Forge supplies analytical certainty — parsing, tier classification, wire checking, certify scoring. Neither side tries to do the other's job.


Anatomy of one session — what changed in four hours

This is a build log. Every claim below is grounded in committed code.

Commit 42f35bc — field-report fixes + MCP registry hardening

Three bugs ported from the Forge Deluxe sibling repo, plus a registry usability improvement that touches all 29 tools.

Bug 1: enforce destination path The absorb pipeline emitted files to <pkg>/aN_*/ instead of src/<pkg>/aN_*/ when the output directory had an src/ wrapper. Teams absorbing legacy repos into an existing src/ layout got a broken import tree. Fix: _build_dest_path in a1_at_functions/enforce_planner.py now normalizes the destination preserving the source's parent path including the src/ prefix. Windows backslashes are normalized to forward slashes so the output is cross-OS portable.

Bug 2: dedup defensive caps dedup_engine in a3_og_features/dedup_engine.py walked the corpus twice — once to collect files, once to count them. On an 821-file cross-package corpus, this doubled I/O and silently ignored the max_files truncation marker. Callers got a files_scanned number that did not match what was actually analyzed. Fix: the inner functions now return the walked count directly so run_dedup uses it without a second rglob pass. Caps added: DEDUP_MAX_FILES = 5000, DEDUP_MAX_DUPLICATES_PER_GROUP = 64.

Bug 3: pytest parser regression _parse_pytest_summary in a1_at_functions/test_runner.py used a single-pass regex that choked on xfailed/xpassed status words between "passed" and "in Xs". The certify behavioral subscore reported ran: false, pass_ratio: 0.0 on repos using parametrized marks — the score lied. Fix: independent per-metric regex patterns for each status word.

Registry improvement: 29 "Use this when:" trigger prefixes Every entry in the TOOLS dict in mcp_protocol.py now starts with a one-liner that names the correct trigger condition. Before this change, an agent had to read the full description to know whether auto_plan or auto_apply was the right call. After: the first sentence of each description is a decision criterion. Agent tool-selection overhead drops from "read and reason about N descriptions" to "match the trigger phrase."

Validation at commit: 968 tests passing, certify 100/100, wire 0 violations.


Commit 27372ae — Round 1 substrate hardening

Seven reliability fixes targeting the failure class most visible to integration teams: MCP error -32000 Connection closed.

Fix 1: broad dispatcher except clause dispatch_request in mcp_protocol.py previously caught only (ValueError, OSError, RuntimeError) at the tool-call boundary. A TypeError from signature drift — a parameter type mismatch in any tool handler — escaped the catch, crashed the stdio loop, and produced the dreaded -32000 Connection closed error. The calling agent had no recovery path. Fix: the handler is now wrapped in except Exception, with a comment in the source explaining exactly why broad catching is correct at the transport boundary (callers never see a Python traceback; errors surface as JSON-RPC -32000 error responses that agents can inspect and retry).

Fix 2: explicit _json_default The MCP wire previously used json.dumps(..., default=str). That silently corrupted set/frozenset (rendered as "{1, 2, 3}" instead of [1, 2, 3]), bytes (rendered as "b'\\xff...'" instead of base64), and dataclasses (rendered as "Foo(x=5)" — field access lost). No shipped tool emitted these types at the time, but any future tool that returns structured Python types would silently break agent consumers. Fix: _json_default in mcp_protocol.py explicitly handles set/frozenset, bytes/bytearray, Path, and dataclasses with correct round-trip encodings. It is a forward-compatibility substrate guarantee.

Fix 3: score_patch returns REFINE on malformed input Documented in FORGE_DOGFOOD_NOTES.md §G-7: calling score_patch with a diff fragment that lacked the diff --git header context returned file_count: 0, needs_human_review: false, verdict: PASS. An agent that trusts the verdict merges an unscored patch. Fix: when file_count == 0, patch_scorer.py now sets needs_human_review = true and emits verdict: REFINE. Silent green on nothing is not a passing grade.

Fix 4: contract tests for JSON round-trip Every tool handler in the TOOLS registry is now exercised by a contract test that asserts the return value is json.dumps-safe. If any future tool returns a non-serializable type, the test suite catches it before it reaches the MCP wire.

Fixes 5-7: _build_dest_path backslash normalization, dedup_engine double-walk, _parse_pytest_summary plural/singular forms These are continuation hardening of the three bugs from commit 42f35bc, with regression tests added for each.

Validation at commit: 975 tests passing, certify 100/100, wire 0 violations.


Commit d4bc0f3 — Round 2: the transmuter exposed

Three new MCP tools bridging the CLI transmuter pipeline to agent consumers. The CLI verbs forge auto, forge cherry, and forge finalize already shipped and were battle-tested. Round 2 wraps them as agent-callable MCP tools so Cursor, Claude Code, Aider, and any MCP client can invoke the full pipeline in one call.

auto — flagship single-shot pipeline scout → cherry → assimilate → wire → certify against any source repo. Pass any directory; get back a tier-organized, wire-clean, certify-scored package. No LLM involved inside Forge at any step. Agent supplies the target; Forge supplies the analytical verdict.

cherry — surgical manifest builder Reads or builds the latest scout report, applies a selection filter (explicit names, pick_all, or only_tier), and writes a cherry.json manifest. Gives the calling agent fine-grained control over what gets absorbed before running finalize. Useful when absorbing a large legacy repo and wanting only specific modules.

finalize — materialize + wire + certify Takes a cherry-pick manifest and a target directory, assimilates the selected symbols into a tier-organized package, then runs wire and certify. Returns atomadic-forge.assimilate/v1 with the full certify score.

All three handlers follow the same a1↔a3 injection pattern already used by enforce, auto_plan, auto_step, and auto_apply: the a1 dispatcher holds an unbound stub that returns {wired: false} in test isolation; the a3 mcp_server module registers the real handler at import time via register_*_handler.

Total MCP tool count after Round 2: 32 (was 29).

Validation at commit: 983 tests passing, certify 100/100, wire 0 violations.


The four killer features

1. emergent_scan — composition discovery no other tool does

Most architecture linters find what is wrong. emergent_scan finds what is missing — specifically, pairs and chains of existing functions where one's output type already feeds the next's input but no code path wires them.

This is not LLM inference. It is deterministic type-flow analysis across the repo's AST. The result is a ranked list of latent compositions with a proposed a3 feature name and a one-call adapter snippet.

Proof of scale: In the atomadic-lang stress test, emergent_scan ran against an 818-file merged super-package combining five Atomadic repos. It surfaced 25 latent cross-domain emergent pipelines with a top score of 75/100. These are compositions that NO single canonical repo expressed individually — they became visible only because the merge put five packages in one tree. This is portfolio-scale architecture archaeology. Nothing else does this.

2. preflight_change — ask before writing

The most expensive AI coding loop is the one that writes code, runs tests, fails, rewrites, fails again, and then discovers the file was in the wrong tier. preflight_change surfaces that error before the first write.

Pass an intent string and a list of proposed files. Get back the detected tier of each file, forbidden imports, likely affected tests, sibling files to read first, and whether the write scope is too broad. Most agent mistakes happen before code is written. Preflight makes them visible in advance.

3. certify — one number, agent-callable go/no-go gate

A single integer from 0 to 100 covering documentation presence, test coverage and pass rate, tier layout conformance, and import discipline. With emit_receipt: true, it also emits a Receipt v1 artifact — a verifiable trust artifact with a hash chain suitable for audit handoff.

The score is deterministic given the same repo state. Two agents, two machines, same code: same score. That is the property LLM-based quality assessment cannot provide.

4. Receipt v1 — verifiable trust artifact

certify --emit-receipt produces a structured receipt with schema version, score breakdown, wire status, and a canonical hash of the lineage chain. The receipt format is documented in docs/RECEIPT.md. Compliance teams, procurement teams, and CI/CD pipelines can consume it as a structured assertion of architecture quality at a point in time.


Round 1 — substrate hardening: concrete failures fixed

The narrow-except bug

The dispatch_request function is the MCP transport boundary. It catches exceptions from tool handlers and converts them to JSON-RPC error responses. Before the fix, the catch tuple was (ValueError, OSError, RuntimeError). A TypeError from any tool handler — wrong parameter type, unexpected keyword argument, changed function signature during development — escaped the boundary, terminated the stdio loop, and produced -32000 Connection closed. The agent got no actionable error. The fix uses except Exception at this boundary, which is correct: the entire purpose of the boundary is to prevent per-request errors from propagating to the transport. The docstring in the fixed code explains why the broad catch is intentional.

The JSON default trap

Using default=str in json.dumps is a common shortcut that feels safe. It is not safe at a protocol boundary. When a tool handler returns a set, the caller gets "{1, 2, 3}" — a string representation of the set, not a JSON array. The caller cannot iterate it. When a tool returns a dataclass, the caller gets "Foo(x=5)" — field access is gone. The fix introduces _json_default with explicit, correct encodings for every non-native Python type: sorted list for sets, base64 dict for bytes, POSIX path string for Path, and dataclasses.asdict for dataclasses. Any future tool that returns a structured type gets correct encoding by construction.

The silent-green score_patch

score_patch is the PR-reviewer gate. When it returns verdict: PASS, an agent is entitled to trust that the patch has been scored. Returning verdict: PASS when the diff was malformed and no files were parsed is a confidence failure — it is score_patch certifying that it did not score anything. The fix: file_count == 0 now unconditionally sets needs_human_review: true and emits verdict: REFINE. Malformed input cannot produce a passing verdict.


Round 2 — the transmuter exposed

The transmuter — scout → cherry → assimilate → wire → certify — has shipped as a CLI pipeline since early in Forge's history. Round 2 makes it agent-callable.

The key design point: no LLM runs inside Forge at any step of the pipeline. The pipeline is fully deterministic. The calling agent supplies the goal; Forge supplies the execution and the verdict. This keeps the substrate boundary clean and makes the pipeline repeatable across different agents, different LLM providers, and different execution environments.

Before Round 2, the pipeline was available only to developers running Forge directly. After Round 2, any MCP-capable AI coding tool — Cursor, Claude Code, Aider, Devin, a custom agent — can invoke auto and get a tier-organized, wire-clean, certify-scored package in one call.

The three tools (auto, cherry, finalize) are wired using the same injection pattern already proven for enforce and auto_plan: a1 holds a typed stub, a3 registers the real handler at import time. Test isolation works because the stub returns a structured {wired: false} response instead of failing silently.


Round 3 (in flight) — agent-driven iterate and evolve as MCP tools

The current iterate and evolve CLI verbs hold an LLM client inside Forge. The agent runs the outer loop but Forge manages the LLM calls internally. This creates a substrate problem: the LLM provider, model selection, token budget, and retry logic are coupled to Forge's internal implementation. An agent using Cursor cannot substitute its own LLM. An agent with a custom provider cannot plug in.

Round 3 moves the LLM call out of Forge entirely.

The new shape:

  • iterate_start(intent, output, package, seed_repo, language) — Forge returns the system prompt, the first code-generation prompt, and a session ID. The calling agent takes these and drives its own LLM.
  • iterate_continue(session_id, response) — The calling agent returns the LLM's response. Forge parses the files, runs wire and certify, and returns the next prompt (if more iterations are needed) or a final verdict (if the target score is reached). No LLM invoked inside Forge.

The same pattern applies to evolve_start and evolve_step for the recursive variant.

This is the architectural move that locks Forge into the substrate position: Forge becomes the analytical engine that every LLM-based coding tool wants to call between turns. The calling agent's LLM does generation. Forge does verification. The agent retries until Forge says it is done. Neither side needs to know the other's internals.


Round 4 (in flight) — the cherry-hunter: cross-repo emergent discovery

emergent_scan works within a single repo. Round 4 extends it to portfolios.

Two new tools:

recon_swarm(repos) — runs the scout walker across multiple repos and merges the symbol manifests into a union inventory. Each symbol is tagged with its source repo. The merge is the same deterministic AST walk, applied at scale.

harvest(target, sources) — given a target repo and one or more source repos, finds capabilities that the target lacks but that exist in the sources, and proposes a graft plan: which symbols to cherry-pick, which tier they belong to, and what the wire graph looks like after the graft.

The proof of concept already exists in the data: the atomadic-lang stress test surfaced 25 cross-domain emergent pipelines in an 818-file merged super-package. Those compositions were invisible to any per-repo scan. harvest makes them systematically discoverable without requiring the manual merge step.

This is the feature that turns Forge from a single-repo architecture linter into a portfolio-scale composition engine.


What this means for AI coding tools

Cursor

Wire up forge mcp serve as a local MCP server. Every repo you open in Cursor immediately has access to context_pack (first-call orientation), preflight_change (pre-edit guardrail), score_patch (PR reviewer), and certify (go/no-go gate). The agent does not need to read the codebase to know which tier a file belongs to — Forge tells it.

Aider

Call forge certify . after each edit round. Aider's iteration loop gets a structured 0-100 score it can use as a convergence criterion. When certify reports a wire violation, Aider can call forge wire --suggest-repairs to get the exact fix plan without running the LLM again.

Codex and Claude Code

The auto MCP tool is the single-call absorb pipeline. Point it at any legacy repo and get back a tier-organized package ready for further agent development. The substrate boundary is the key: Codex or Claude Code drives the generation; Forge certifies the result. The agent loop converges because the acceptance criterion is deterministic.

Devin

auto_plan generates a ranked set of action cards for any refactor goal. Each card specifies kind, risk, write scope, and a next_command. Devin can consume the plan, filter cards by risk, and execute the applyable ones via auto_apply. Forge rolls back any card that increases the violation count.

Any custom MCP client

Three lines to integrate:

pip install atomadic-forge
forge init
forge mcp serve --project-root .

Add the stdio MCP server to your agent config. Call tools/list to see all 32 available tools. Start with context_pack for first-call orientation.


What is next

Round 5 — forge runs on forge

The self-application demo: Forge's own codebase as the target for harvest, auto_plan, and the Round 3 iterate loop. The agent generates improvements to Forge; Forge certifies them; the loop closes. This is the recursive self-improvement demonstration that turns the architecture-compiler thesis into a live proof.

The substrate invariant holds throughout: the agent drives the LLM, Forge drives the verification. Forge never certifies itself without an independent execution of the same certify tool.


Round 4 — the cross-repo cherry-hunter, MHED-trust-gated

This is the round that turns forge from an architecture-compiler-for-one-repo into a portfolio-level capability harvester. Two new MCP tools land, both bounded by codex-derived invariants:

mcp__atomadic-forge__recon_swarm   # walk N≤23 repos → unified scout report
mcp__atomadic-forge__harvest   # diff target vs union → graft proposals

The genuinely novel capability: forge surfaces capabilities your target repo lacks but your portfolio has — automatically, with no human in the loop until the verdict needs review. Every proposal carries an explicit trust verdict tied to a Lean-4-verified systemic-integrity threshold:

  • TAU_TRUST ≈ 0.9983543609... — Lean-4-verified sovereign invariant. Above this confidence, a proposal is trust_gated (auto-applyable). Below, review_required. Not a magic number — a proven threshold.

  • D_MAX_DELEGATION = 23 — Lean-4-verified hard cap on swarm size. Asking for 24+ repos returns -32000, never silently truncates. Forge fails loud.

  • OMEGA_0 ≡ 0 — Zero Unmapped Noise Floor. Every swarm report computes omega_residue and asserts the invariant holds. Non-zero residue means a per-repo walk corrupted state — surface it, never bury it.

These are imported with provenance in every docstring and run assertion-based self-checks at import time. Drift any anchor and Forge refuses to import — fail-fast, invariant-anchored. The derivation that produces these values is published separately under commercial license; the values themselves are auditable Lean-4 outputs.

This is the architectural commitment that makes forge worthy of integration with AAAA-Nexus (trust gate) and atomadic-lang (overfit-bounded tokenizer):

forge + lang + nexus = perfect code that's
  trust-verified (Nexus), hallucination-free (codex anchors), 
  fully compliant (forge wire+certify), 4× less tokens (lang Φ_tri tokens).

Session arc — 4 commits, 36 MCP tools, invariant-anchored

Commit Round Headline TOOLS Tests pass
42f35bc R0 3 deluxe-ported bug fixes + 29 "Use this when:" descriptions 29 968
27372ae R1 Substrate hardening — broadened except, json default, score_patch REFINE 29 975
d4bc0f3 R2 Transmuter exposed — auto/cherry/finalize MCP 32 983
1abc2b2 R3 Agent-driven iterate — substrate boundary fix 34 989
(this) R4 Cross-repo cherry-hunter — MHED-trust-gated 36 995

Every commit: ruff clean, forge wire 0 violations, forge certify 100/100 PASS.

The substrate boundary held all the way through. No commit constructed an LLM client inside forge. The Round-3 contract test patches resolve_default_client and asserts call count == 0 across iterate sessions. That guarantee is now part of the test suite — anyone who tries to "just quickly add an LLM call inside forge" breaks the build.


What just shipped that no other tool offers

  1. Architecture-as-a-tool: the 5-tier monadic law, machine-enforced.
  2. Pre-flight + score_patch: ASK BEFORE WRITING, not assess after.
  3. emergent_scan: cross-domain composition discovery — patterns developers never noticed, surfaced from recon.
  4. harvest: portfolio-level capability harvesting with codex-grade trust gates. Find capabilities your target lacks; graft them with formal provenance. No other tool does this.
  5. iterate_start / iterate_continue: agent-driven generation loop. Forge does NOT hold an LLM. The agent reasons; forge certifies between turns.
  6. Receipt v1: verifiable trust artifact every operation emits.

Try it now

pip install atomadic-forge
forge init
# Option A — deterministic transmute, no LLM:
forge auto --target /path/to/your/legacy/repo --output ./out --apply

# Option B — cross-repo harvest over your portfolio (CLI-callable
# end of the MCP-side recon_swarm/harvest verbs):
forge cherry --target ./your-repo --pick all

Wire the MCP server up:

{
  "mcpServers": {
    "atomadic-forge": {
      "command": "python",
      "args": ["-m", "atomadic_forge.a3_og_features.mcp_server"]
    }
  }
}

Now every MCP-aware agent in your stack — Cursor, Claude Code, Aider, Devin, your own custom orchestrator — has 36 forge tools available, including the agent-driven iterate loop and the cross-repo cherry-hunter. The certainty layer is a tool call away.


Next session

  • Round 3.5evolve_start / evolve_step (recursive iterate, D_max- bounded rounds, treats prior output as the seed catalog for the next round). The substrate boundary stays clean: agent drives the LLM, forge tracks the evolve session state and orchestrates the round transitions deterministically.

  • Round 5emergent_swarm: feed recon_swarm's union into emergent_scan to surface compositions that span repos. The 25 cross-domain pipelines the lang stress test made vivid become a one-MCP-call routine.

  • Round 6self_evolve: forge runs forge against its own source with a target_score above current. Recursive self-improvement, trust-gated, codex-anchored.


Forge does not claim to replace human engineering judgment. It claims to make the verification step deterministic, fast, agent-callable, and mathematically rigorous — anchored to Lean-4-verified sovereign invariants — so that human judgment is reserved for what actually requires it.

Spaghetti hell, freezing over. ❄️