Skip to content

release: dev → prod (P4 native tools + P6 skill layer)#84

Merged
garniergeorges merged 13 commits into
prodfrom
dev
Apr 27, 2026
Merged

release: dev → prod (P4 native tools + P6 skill layer)#84
garniergeorges merged 13 commits into
prodfrom
dev

Conversation

@garniergeorges

Copy link
Copy Markdown
Owner

Summary

Brings prod up to date with dev. Two milestones to release :

  • P4 — Six native tools (Bash, FileWrite, FileRead, FileEdit, Grep, Glob) sandboxed under `/workspace`, runtime tool-loop with `maxTurns` (cap 10), `forge:*` text-structured protocol
  • P6 — Skill layer : SKILL.md format, catalog (built-in + `~/.agent-forge/skills/`), server-side trigger matcher, two-call runner. First built-in : `scaffold-and-run`
  • Mission Control v2 — Compact mode by default, scrollable viewport, auto-focus on new arrivals, Tab/Enter/Esc keyboard nav, full-screen detail view with Markdown / YAML / JSON / agent-run highlighting
  • Docs — Root + sub-package READMEs aligned (EN/FR), P6 done badge, native tools and skills sections, keyboard cheatsheet, updated architecture diagram

What's NOT here

  • P5 (hardened sandbox + persistent agents + artifact extraction) — next milestone, on `feat/p5-hardened-sandbox`
  • P7-P9 — see roadmap in README.

Test plan

  • Pull prod after merge, `bun install`, `bun run --cwd packages/runtime build`, `bun test` green
  • `bun run forge` boots
  • `/skills` lists `scaffold-and-run · built-in`
  • "audite un projet TypeScript dans /workspace : crée src/index.ts avec deux TODO add et multiply, exécute le résultat" → 3 cards (skill DONE, write PROPOSED, run PROPOSED)
  • check-attribution CI passes

Agents can now call two native tools by emitting fenced forge:* blocks
in their reply. The runtime parses the first block per turn, executes
it, feeds the structured result back as a user message, and loops up to
maxTurns (capped at 10).

Tools live in tools-core/src/runtime/ and stay distinct from the
host-side FileWrite : they are sandboxed to /workspace, overwrite by
default (in-sandbox iteration), and surface stdout/stderr/exit/timed-out
to the LLM.

DockerLaunch now bind-mounts a per-run host directory at /workspace so
the agent has somewhere writable, and so artifacts survive the
container (used later in P5 for extraction).

Runtime mode switches automatically based on AGENT.md.maxTurns : single
turn keeps the P3 one-shot path, multi-turn enables the tool loop and
prepends a TOOLS section to the system prompt explaining the protocol.

Tool output is wrapped in [forge:tool] / [/forge:tool] markers on
stdout so the host TUI can route it to its action card instead of
mixing it with prose.

Tests :
- agent-side parser (none / tool / invalid / first-block-only)
- runtime FileWrite path traversal, sandbox escape, overwrite
- runtime Bash stdout/stderr/exit/timeout, cwd respected

Tests use a FORGE_WORKSPACE override so they don't try to touch
/workspace on the host.
Completes the P4 tool catalog. Agents now have the full six : bash,
write, read, edit, grep, glob. All sandboxed to /workspace, all callable
via fenced forge:* blocks, all validated by Zod and capped on output
size to protect the LLM context.

read   — line-based offset/limit, 16 KB clip, fails on missing or
         non-regular files
edit   — exact substring patch, refuses ambiguous matches unless
         replaceAll=true, refuses identical old/new
grep   — pure JS regex over a glob filter, skips binary files (NUL
         bytes), 200 hits cap, line clipped at 400 chars
glob   — hand-rolled matcher for *, **, ?  (no dep), 200 results cap,
         walk bounded at 5000 nodes

Tool dispatcher in the runtime is now a switch over six branches.
System prompt lists all six with their JSON shape.

Tests added for each tool plus four new parser cases (forge:read /
edit / grep / glob, and a refine-rule violation on edit).

resolveSandboxedPath is now exported so tools that don't write but
still need the sandbox root (grep, glob) reuse it instead of
duplicating the FORGE_WORKSPACE override logic.
Mistral Small (and likely most small models) regularly emits an AGENT.md
where the `description` value embeds a colon — typically when listing
steps ("Step 1: ..., Step 2: ...") or quoting another key (`maxTurns: 8`,
`timeout: 60s`). YAML reads that as a nested mapping and rejects the
whole frontmatter.

Two fixes :

1. The builder system prompt now spells out the rule in both EN and FR :
   no colon / no embedded YAML / wrap in double quotes if needed. Comes
   with an example so the LLM has a template to follow.

2. The CLI normalizer now scans the frontmatter and wraps any
   `description` value containing an unquoted colon in double quotes,
   escaping any embedded double quotes in the process. Already-quoted
   values are left alone.

Tests cover both : an unquoted "Step 1: ... Step 2: ..." is fixed up
and accepted ; an already-quoted equivalent is left untouched.
Cards in Mission Control are now keyboard-navigable :
  - Tab           cycle focus forward (lands on the most recent card
                  the first time)
  - Shift+Tab     cycle focus backward
  - Enter         open the focused card in a full-screen detail view
  - Esc / q       close the detail view

The detail view uses the entire terminal, shows the action's full
content (the AGENT.md body for write actions ; prompt + streamed
output for run actions) with line numbers, and supports scrolling
with arrow keys / PgUp / PgDn / g / G.

Tab/Enter are only captured when there are actions, no permission
dialog is up, the detail view is closed, and the prompt input is
empty — so typing in the prompt always wins. The prompt draft is
now lifted into useChat so App can read it for that guard.

Visual cues : the focused card switches to a brighter "double" border
and gains a leading triangle ; the Mission Control header changes its
hint line depending on whether anything is focused.
When a card is focused but the detail view isn't open, pressing Esc
now drops the focus without opening anything. Guarded so it only
fires when the prompt is empty and no permission dialog is up — Esc
keeps its meaning everywhere else.

Header hint updated accordingly.
feat(p4): native tools — six runtime tools inside the agent sandbox
…lder

The builder now has access to a catalog of skills : self-contained
behaviour modules that orchestrate multiple actions in a single turn
to handle recurring intent patterns. First built-in :
scaffold-and-run, which fixes the "user describes both creation and
execution but the builder stops after writing AGENT.md" pattern by
making the LLM emit a forge:write AND a forge:run in the same turn.

Architecture :
  - SKILL.md format with YAML frontmatter (name, description,
    triggers, actions) and a markdown body containing the
    instructions. Mirrors AGENT.md to stay familiar.
  - Catalog loader discovers skills from two sources : built-ins
    shipped in packages/core/src/builder/skills/, plus user skills
    under ~/.agent-forge/skills/. User skills override built-ins on
    name collision.
  - System prompt only carries the catalog metadata (name +
    description + triggers) — bodies stay out of the context until
    the LLM emits a forge:skill block, then the resolved body is
    injected as a system message for the next turn.
  - New ParsedAction kind 'skill' with a forge:skill fenced block
    parser ; tolerant of either `name: <skill>` or a bare line.
  - SkillAction joins WriteAction/RunAction in the action store.
    Skills auto-execute (no permission dialog) and surface as their
    own card in Mission Control with a "loaded into context" hint.
    CardDetail renders the description plus the full body so the
    user can see what the skill actually injects.

Other :
  - /skills slash command lists what's available, source-tagged.
  - useChat resolves the catalog once via useMemo, threads it to
    streamBuilder and to executeAction's resolveSkill.

Tests :
  - SKILL.md schema (kebab-case name, missing frontmatter,
    unknown action tag).
  - Catalog loader discovers the built-in and sorts entries.
  - System prompt injects the SKILLS section when entries are
    provided, FR/EN headers, base prompt unchanged when empty.
  - forge:skill parser (name: prefix, bare line, kebab-case
    rejection) and executeAction(skill) round-trip via resolver.
Symptom : the user typed a message that matched a skill trigger
("audite un projet typescript") but the builder still skipped
forge:skill and went straight to forge:write. The skill catalog was
loaded, the system prompt mentioned it, but Mistral would not act on
it.

Two issues, both fixed here :

1. Position. The SKILLS section was appended AFTER the base prompt's
   "BE DECISIVE — propose the AGENT.md immediately" rule. Mistral read
   the strong push to write first, then a soft "you can also use a
   skill" at the bottom, and ignored the latter.

2. Framing. The original wording said "choose a skill when ...". Too
   permissive — small models read that as optional. Replaced with a
   STEP 0 / ÉTAPE 0 framing : an explicit, mandatory pre-flight check
   that runs BEFORE any other action. If any trigger phrase matches
   (case-insensitive substring), the LLM MUST emit a forge:skill
   block as the only action of that turn ; only then does the rest
   of the protocol apply.

The catalog is now placed at the TOP of the system prompt, before
the "be decisive" rule, so the order of reading mirrors the order
of execution.

Tests updated for the new wording. Triggers are now quoted in the
catalog rendering ("audite", "teste") so the LLM sees them as
literals rather than running prose.
Mistral Small reads the skill catalog and the STEP 0 instruction in
the system prompt, but it does not act on it : it sees "audite a
typescript project" and goes straight to a forge:write that
collapses both the agent definition and the run mission into one
giant AGENT.md body. Adding more rules to the prompt didn't move
the needle.

Plan B, in three pieces :

1. matchSkillForMessage() — case-insensitive substring match against
   the trigger phrases declared in each SKILL.md. Lives in core, no
   LLM involvement.

2. runScaffoldAndRun() — a dedicated runner that drives the skill
   end to end with TWO narrow LLM calls instead of one wide one :
     - call A : "produce ONLY the AGENT.md content" (generic role,
       no session-specific steps in the body)
     - call B : "produce ONLY the prompt to send to the agent"
   Each call has a tightly scoped system instruction so the model
   keeps the two artefacts cleanly separated. Output is parsed
   server-side, AGENT.md name extracted from the frontmatter.

3. useChat.send() — pre-flight before the normal stream : if the
   matcher finds a skill, dispatch to the runner. The skill card
   lands in Mission Control as DONE, then a write card and a run
   card appear as PROPOSED. The user approves them in order via
   the existing permission dialog.

The system prompt no longer carries the STEP 0 / ÉTAPE 0 mandate.
Skills are now an internal mechanism the LLM is informed about but
never asked to operate. The catalog metadata stays in the prompt as
a short tail note so the LLM understands why a skill card might
appear in Mission Control.

Tests :
- matchSkillForMessage : substring match, no-match, multi-skill
  precedence (first wins), empty trigger ignored.
- system prompt : informational note appears when skills are
  passed, FR/EN variants, base prompt comes first.
The detail screens (Esc-Tab-Enter on a Mission Control card) used to
render every action body through highlightPlain — everything came
back as undifferentiated grey, which made long AGENT.md / agent run
outputs hard to scan.

Replaced by per-shape highlighting :

- skill detail : Markdown highlighter (headings, lists, inline code,
  bold, fenced blocks). The skill body is markdown, so this matches.

- write detail : if the file has YAML frontmatter (which AGENT.md
  always does), split frontmatter and body. The frontmatter goes
  through the YAML highlighter (already existing), the body through
  the new Markdown one. Falls back to plain YAML for files without
  frontmatter.

- run detail (and the compact run card in Mission Control) : a new
  highlightAgentRun() that walks the streamed output and recognises :
  · fenced ```forge:* blocks (open line orange, body via the
    matching language highlighter — JSON for forge:bash/write/etc.)
  · [forge:tool] / [/forge:tool] markers wrapping the result of the
    previous tool call (rendered dim grey so it visually recedes)
  · regular prose with inline code spans and bold

New helpers in syntax.ts :
  highlightMarkdown(text)
  highlightAgentRun(text)
  highlightYamlLine / highlightJsonLine kept exported for the run
  highlighter to delegate to.

The compact card in MissionControl now also uses highlightAgentRun
so the streaming output during a long run reads the same as the
detail view, just clipped at maxLines.
When several agents run in a session, the panel was stacking 6+
fully-expanded cards and overflowing the terminal. Two changes :

1. Compact mode by default. Each non-focused card now renders as a
   single line : badge + verb + target. Borders disappear, the
   terminal stays calm. The focused card expands to its full preview
   like before. Cards in 'running' status stay expanded too, so a
   streaming agent run remains visible without having to Tab to it.

2. Bounded viewport. Mission Control now takes a panelHeight prop
   (computed by App from the terminal rows minus a Welcome floor
   and a spacer) and slices the action list to fit. Truncated
   actions show as "↑ N above / ↓ N below" hints in the panel
   header. Welcome stays glued to the bottom with flexShrink=0, so
   the panel is what gives way on small terminals.

useCardFocus extended :
  - scrollTop : action-index offset, advanced via PgUp/PgDn ;
  - auto-focus the last new arrival when nothing is focused (the
    user immediately sees what the builder just produced) ;
  - auto-scroll lower bound : focusing an action above scrollTop
    bumps scrollTop down to keep it visible.

App now routes PgUp/PgDn to the Mission Control scroll when there
are actions and the prompt is empty (or a card is focused). It
keeps falling back to the chat transcript scroll otherwise. Tab,
Shift+Tab, Enter, Esc unchanged.
Status badge moves to P6 done. Roadmap table lifts P4 and P6 to
✅, points to P5 (hardened sandbox + persistent agents +
artifact extraction) as the next milestone.

Root README (EN/FR) gains :
  - Native tools section : six-tool table (Bash, FileWrite,
    FileRead, FileEdit, Grep, Glob) with their tags and limits ;
    a short note explaining the choice of a text-structured
    forge:* protocol over OpenAI tool_calls.
  - Skills section : SKILL.md format, two sources (built-in and
    ~/.agent-forge/skills/), server-side matcher, two-call
    runner, scaffold-and-run as the first built-in.
  - Mission Control keyboard cheatsheet : Tab / Enter / Esc /
    PgUp/PgDn / Ctrl+E.
  - /skills slash command listed.
  - Architecture diagram updated : skill catalog + runner on the
    host side, /workspace mount + tool loop on the container
    side, persistence of the workspace dir after exit.
  - Repo structure shows packages/core/src/builder/skills/,
    runtime/src/tool-protocol.ts, and the runtime/ subdir under
    tools-core.

Sub-package READMEs realigned :
  - packages/cli : compact / expanded card mode, scrollable
    viewport, focus + auto-scroll, detail view, full keyboard
    map, dispatch skill server-side mention.
  - packages/core : skill catalog / matcher / runner files
    listed, scaffold-and-run noted as built-in.
  - packages/runtime : multi-turn tool loop documented, six
    forge:* tags, [forge:tool] markers on stdout, FORGE_MAX_TOKENS
    env var.
  - packages/tools-core : separate "host tools" and "runtime
    tools" sections ; six runtime tools with their constraints ;
    test layout listed.
feat(p6): skill layer — orchestration patterns for the builder
@garniergeorges garniergeorges merged commit a63edd5 into prod Apr 27, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant