release: dev → prod (P4 native tools + P6 skill layer)#84
Merged
Agents can now call two native tools by emitting fenced forge:* blocks
in their reply. The runtime parses the first block per turn, executes
it, feeds the structured result back as a user message, and loops up
to maxTurns (capped at 10).
Tools live in tools-core/src/runtime/ and stay distinct from the
host-side FileWrite : they are sandboxed to /workspace, overwrite by
default (in-sandbox iteration), and surface stdout/stderr/exit/timed-out
to the LLM.
DockerLaunch now bind-mounts a per-run host directory at /workspace so
the agent has somewhere writable, and so artifacts survive the
container (used later in P5 for extraction).
Runtime mode switches automatically based on AGENT.md.maxTurns : single
turn keeps the P3 one-shot path, multi-turn enables the tool loop and
prepends a TOOLS section to the system prompt explaining the protocol.
Tool output is wrapped in [forge:tool] / [/forge:tool] markers on
stdout so the host TUI can route it to its action card instead of
mixing it with prose.
Tests :
- agent-side parser (none / tool / invalid / first-block-only)
- runtime FileWrite path traversal, sandbox escape, overwrite
- runtime Bash stdout/stderr/exit/timeout, cwd respected
Tests use a FORGE_WORKSPACE override so they don't try to touch
/workspace on the host.
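
To make the "first block per turn" rule concrete, here is a minimal sketch of how an agent-side parser could behave. The type `ParsedTurn`, the function name `parseFirstToolBlock`, and the assumption that the block body is JSON are illustrative; the actual parser in the runtime is not shown in this PR and may differ.

```typescript
// Hypothetical result shape, mirroring the none / tool / invalid test cases.
type ParsedTurn =
  | { kind: "none" }
  | { kind: "tool"; tool: string; args: unknown }
  | { kind: "invalid"; tool: string; reason: string };

// Built with repeat() so this example never contains a literal fence.
const FENCE = "`".repeat(3);

// Opening fence tagged forge:<tool>, then the body up to the next closing fence.
// The regex is not global, so exec() only ever returns the first block.
const BLOCK_RE = new RegExp(
  FENCE + "forge:([a-z]+)[ \\t]*\\r?\\n([\\s\\S]*?)\\r?\\n?" + FENCE
);

function parseFirstToolBlock(reply: string): ParsedTurn {
  const match = BLOCK_RE.exec(reply);
  if (match === null) return { kind: "none" };
  const tool = match[1];
  const body = match[2];
  try {
    // Assumption: the body is a JSON object describing the tool call.
    return { kind: "tool", tool, args: JSON.parse(body) };
  } catch (err) {
    return { kind: "invalid", tool, reason: (err as Error).message };
  }
}
```

Any later block in the same reply is ignored by construction, which matches the first-block-only test case listed above.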
Completes the P4 tool catalog. Agents now have the full six : bash,
write, read, edit, grep, glob. All sandboxed to /workspace, all callable
via fenced forge:* blocks, all validated by Zod and capped on output
size to protect the LLM context.
read — line-based offset/limit, 16 KB clip, fails on missing or
non-regular files
edit — exact substring patch, refuses ambiguous matches unless
replaceAll=true, refuses identical old/new
grep — pure JS regex over a glob filter, skips binary files (NUL
bytes), 200 hits cap, line clipped at 400 chars
glob — hand-rolled matcher for *, **, ? (no dep), 200 results cap,
walk bounded at 5000 nodes
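
The edit rules above (exact substring, ambiguity refused unless replaceAll=true, identical old/new refused) can be sketched as a pure function. The name `applyEdit`, the result shape, and the rejection of an empty search string are assumptions for illustration, not the tool's real signature.

```typescript
// Hypothetical result type for the sketch.
type EditResult =
  | { ok: true; content: string; replaced: number }
  | { ok: false; error: string };

function applyEdit(
  content: string,
  oldText: string,
  newText: string,
  replaceAll = false
): EditResult {
  if (oldText === newText) {
    return { ok: false, error: "old and new text are identical" };
  }
  if (oldText === "") {
    // Assumption: an empty search string is rejected rather than matched everywhere.
    return { ok: false, error: "old text is empty" };
  }
  // split() on a literal string counts non-overlapping exact matches.
  const pieces = content.split(oldText);
  const count = pieces.length - 1;
  if (count === 0) return { ok: false, error: "old text not found" };
  if (count > 1 && !replaceAll) {
    return { ok: false, error: `ambiguous: ${count} matches, set replaceAll` };
  }
  return { ok: true, content: pieces.join(newText), replaced: count };
}
```

Using split/join instead of a regex keeps the match literal, so characters such as `.` or `*` in the old text need no escaping.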
Tool dispatcher in the runtime is now a switch over six branches.
System prompt lists all six with their JSON shape.
Tests added for each tool plus four new parser cases (forge:read /
edit / grep / glob, and a refine-rule violation on edit).
resolveSandboxedPath is now exported so tools that don't write but
still need the sandbox root (grep, glob) reuse it instead of
duplicating the FORGE_WORKSPACE override logic.
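
For readers unfamiliar with the sandbox check, the following is a simplified stand-in for what a helper like resolveSandboxedPath has to guarantee. The signature is an assumption (the real export is not shown here): the root is passed in explicitly, where a caller would presumably supply the FORGE_WORKSPACE override or fall back to /workspace. The sketch normalizes POSIX-style paths lexically and does not follow symlinks.

```typescript
// Illustrative re-implementation, not the exported helper itself.
function resolveSandboxedPath(requested: string, root = "/workspace"): string {
  const base = root.replace(/\/+$/, "");
  // Relative paths are interpreted against the sandbox root.
  const joined = requested.startsWith("/") ? requested : base + "/" + requested;

  // Lexical normalization: drop "." and empty segments, resolve "..".
  const parts: string[] = [];
  for (const segment of joined.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      parts.pop();
      continue;
    }
    parts.push(segment);
  }
  const resolved = "/" + parts.join("/");

  // Require the root itself or a path strictly below it, so that a sibling
  // such as /workspace-evil does not pass a naive prefix check.
  if (resolved !== base && !resolved.startsWith(base + "/")) {
    throw new Error(`path escapes sandbox: ${requested}`);
  }
  return resolved;
}
```

Read-only tools such as grep and glob can call the same function to obtain a vetted starting directory before walking the tree.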
Mistral Small (and likely most small models) regularly emits an AGENT.md
where the `description` value embeds a colon — typically when listing
steps ("Step 1: ..., Step 2: ...") or quoting another key (`maxTurns: 8`,
`timeout: 60s`). YAML reads that as a nested mapping and rejects the
whole frontmatter.
Two fixes :
1. The builder system prompt now spells out the rule in both EN and FR :
no colon / no embedded YAML / wrap in double quotes if needed. Comes
with an example so the LLM has a template to follow.
2. The CLI normalizer now scans the frontmatter and wraps any
`description` value containing an unquoted colon in double quotes,
escaping any embedded double quotes in the process. Already-quoted
values are left alone.
Tests cover both : an unquoted "Step 1: ... Step 2: ..." is fixed up
and accepted ; an already-quoted equivalent is left untouched.
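
The second fix can be pictured as a line-level rewrite of the frontmatter. This is a minimal sketch under stated assumptions: the function name `quoteDescription` is hypothetical, `description` is assumed to be a single-line top-level key, and backslashes are escaped as well because they are significant inside YAML double-quoted scalars (the commit message only mentions double quotes).

```typescript
function quoteDescription(frontmatter: string): string {
  return frontmatter
    .split("\n")
    .map((line) => {
      const m = /^description:[ \t]*(.*)$/.exec(line);
      if (m === null) return line; // not the description key
      const value = m[1].trim();
      // Already-quoted values are left alone.
      const alreadyQuoted = /^(".*"|'.*')$/.test(value);
      if (alreadyQuoted || !value.includes(":")) return line;
      // Escape backslashes first, then embedded double quotes.
      const escaped = value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
      return `description: "${escaped}"`;
    })
    .join("\n");
}
```

A value such as `Step 1: plan, Step 2: run` is wrapped in double quotes by this sketch, while other keys and already-quoted descriptions pass through unchanged.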
Cards in Mission Control are now keyboard-navigable :
- Tab : cycle focus forward (lands on the most recent card the
first time)
- Shift+Tab : cycle focus backward
- Enter : open the focused card in a full-screen detail view
- Esc / q : close the detail view
The detail view uses the entire terminal, shows the action's full
content (the AGENT.md body for write actions ; prompt + streamed
output for run actions) with line numbers, and supports scrolling
with arrow keys / PgUp / PgDn / g / G.
Tab/Enter are only captured when there are actions, no permission
dialog is up, the detail view is closed, and the prompt input is
empty — so typing in the prompt always wins. The prompt draft is
now lifted into useChat so App can read it for that guard.
Visual cues : the focused card switches to a brighter "double" border
and gains a leading triangle ; the Mission Control header changes its
hint line depending on whether anything is focused.
When a card is focused but the detail view isn't open, pressing Esc now drops the focus without opening anything. Guarded so it only fires when the prompt is empty and no permission dialog is up — Esc keeps its meaning everywhere else. Header hint updated accordingly.
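
The capture conditions for Tab/Enter and for the focus-dropping Esc read naturally as two predicates over UI state. The `UiState` shape and both function names below are hypothetical, and "empty prompt" is assumed to mean a zero-length draft; the real guards live in App and useChat and may be structured differently.

```typescript
// Hypothetical snapshot of the state the guards need.
interface UiState {
  actionCount: number;
  permissionDialogOpen: boolean;
  detailOpen: boolean;
  promptDraft: string;
  focusedIndex: number | null;
}

// Tab / Enter are captured only when nothing else should receive them.
function capturesTabOrEnter(s: UiState): boolean {
  return (
    s.actionCount > 0 &&
    !s.permissionDialogOpen &&
    !s.detailOpen &&
    s.promptDraft === ""
  );
}

// Esc drops focus only when a card is focused and the detail view is closed.
function escDropsFocus(s: UiState): boolean {
  return (
    s.focusedIndex !== null &&
    !s.detailOpen &&
    !s.permissionDialogOpen &&
    s.promptDraft === ""
  );
}
```

Keeping the guards as pure functions makes the "typing in the prompt always wins" rule easy to unit-test without rendering the TUI.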
feat(p4): native tools — six runtime tools inside the agent sandbox
The builder now has access to a catalog of skills : self-contained
behaviour modules that orchestrate multiple actions in a single turn
to handle recurring intent patterns. First built-in :
scaffold-and-run, which fixes the "user describes both creation and
execution but the builder stops after writing AGENT.md" pattern by
making the LLM emit a forge:write AND a forge:run in the same turn.
Architecture :
- SKILL.md format with YAML frontmatter (name, description,
triggers, actions) and a markdown body containing the
instructions. Mirrors AGENT.md to stay familiar.
- Catalog loader discovers skills from two sources : built-ins
shipped in packages/core/src/builder/skills/, plus user skills
under ~/.agent-forge/skills/. User skills override built-ins on
name collision.
- System prompt only carries the catalog metadata (name +
description + triggers) — bodies stay out of the context until
the LLM emits a forge:skill block, then the resolved body is
injected as a system message for the next turn.
- New ParsedAction kind 'skill' with a forge:skill fenced block
parser ; tolerant of either `name: <skill>` or a bare line.
- SkillAction joins WriteAction/RunAction in the action store.
Skills auto-execute (no permission dialog) and surface as their
own card in Mission Control with a "loaded into context" hint.
CardDetail renders the description plus the full body so the
user can see what the skill actually injects.
Other :
- /skills slash command lists what's available, source-tagged.
- useChat resolves the catalog once via useMemo, threads it to
streamBuilder and to executeAction's resolveSkill.
Tests :
- SKILL.md schema (kebab-case name, missing frontmatter,
unknown action tag).
- Catalog loader discovers the built-in and sorts entries.
- System prompt injects the SKILLS section when entries are
provided, FR/EN headers, base prompt unchanged when empty.
- forge:skill parser (name: prefix, bare line, kebab-case
rejection) and executeAction(skill) round-trip via resolver.
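
The override rule in the catalog loader (user skills win on name collision, entries come back sorted) can be sketched as a small merge step. The `SkillEntry` shape, the `mergeCatalog` name, the kebab-case pattern, and sorting by name are assumptions; the loader's actual types and sort key are not shown in this PR.

```typescript
// Hypothetical catalog entry carrying only the metadata the prompt needs.
interface SkillEntry {
  name: string;
  description: string;
  triggers: string[];
  source: "builtin" | "user";
}

// Assumed kebab-case rule: lowercase alphanumeric words joined by single hyphens.
const KEBAB_CASE = /^[a-z0-9]+(-[a-z0-9]+)*$/;

function mergeCatalog(builtins: SkillEntry[], userSkills: SkillEntry[]): SkillEntry[] {
  const byName = new Map<string, SkillEntry>();
  // Insert built-ins first so a user skill with the same name replaces them.
  for (const skill of [...builtins, ...userSkills]) {
    if (!KEBAB_CASE.test(skill.name)) {
      throw new Error(`invalid skill name: ${skill.name}`);
    }
    byName.set(skill.name, skill);
  }
  return Array.from(byName.values()).sort((a, b) => a.name.localeCompare(b.name));
}
```

Keeping `source` on each entry is what would let a command like /skills tag every line as built-in or user.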
Symptom : the user typed a message that matched a skill trigger
("audite un projet typescript", French for "audit a TypeScript
project") but the builder still skipped forge:skill and went straight
to forge:write. The skill catalog was loaded, the system prompt
mentioned it, but Mistral would not act on it.
Two issues, both fixed here :
1. Position. The SKILLS section was appended AFTER the base prompt's
"BE DECISIVE — propose the AGENT.md immediately" rule. Mistral read
the strong push to write first, then a soft "you can also use a
skill" at the bottom, and ignored the latter.
2. Framing. The original wording said "choose a skill when ...". Too
permissive — small models read that as optional. Replaced with a
STEP 0 / ÉTAPE 0 framing : an explicit, mandatory pre-flight check
that runs BEFORE any other action. If any trigger phrase matches
(case-insensitive substring), the LLM MUST emit a forge:skill
block as the only action of that turn ; only then does the rest
of the protocol apply.
The catalog is now placed at the TOP of the system prompt, before
the "be decisive" rule, so the order of reading mirrors the order
of execution.
Tests updated for the new wording. Triggers are now quoted in the
catalog rendering ("audite", "teste") so the LLM sees them as
literals rather than running prose.
Mistral Small reads the skill catalog and the STEP 0 instruction in
the system prompt, but it does not act on it : it sees "audite a
typescript project" and goes straight to a forge:write that
collapses both the agent definition and the run mission into one
giant AGENT.md body. Adding more rules to the prompt didn't move
the needle.
Plan B, in three pieces :
1. matchSkillForMessage() — case-insensitive substring match against
the trigger phrases declared in each SKILL.md. Lives in core, no
LLM involvement.
2. runScaffoldAndRun() — a dedicated runner that drives the skill
end to end with TWO narrow LLM calls instead of one wide one :
- call A : "produce ONLY the AGENT.md content" (generic role,
no session-specific steps in the body)
- call B : "produce ONLY the prompt to send to the agent"
Each call has a tightly scoped system instruction so the model
keeps the two artefacts cleanly separated. Output is parsed
server-side, AGENT.md name extracted from the frontmatter.
3. useChat.send() — pre-flight before the normal stream : if the
matcher finds a skill, dispatch to the runner. The skill card
lands in Mission Control as DONE, then a write card and a run
card appear as PROPOSED. The user approves them in order via
the existing permission dialog.
The system prompt no longer carries the STEP 0 / ÉTAPE 0 mandate.
Skills are now an internal mechanism the LLM is informed about but
never asked to operate. The catalog metadata stays in the prompt as
a short tail note so the LLM understands why a skill card might
appear in Mission Control.
Tests :
- matchSkillForMessage : substring match, no-match, multi-skill
precedence (first wins), empty trigger ignored.
- system prompt : informational note appears when skills are
passed, FR/EN variants, base prompt comes first.
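
The matcher is simple enough to sketch in full. matchSkillForMessage is the name used in this commit, but the signature, the `SkillMeta` shape, and the trimming of triggers below are assumptions; the behaviour follows the test list above (case-insensitive substring, first match wins, empty triggers ignored).

```typescript
// Hypothetical minimal metadata; the real catalog entries carry more fields.
interface SkillMeta {
  name: string;
  triggers: string[];
}

function matchSkillForMessage(message: string, catalog: SkillMeta[]): SkillMeta | null {
  const haystack = message.toLowerCase();
  for (const skill of catalog) {
    for (const trigger of skill.triggers) {
      const needle = trigger.trim().toLowerCase();
      // An empty trigger would match every message, so it is skipped.
      if (needle !== "" && haystack.includes(needle)) {
        return skill; // catalog order decides precedence
      }
    }
  }
  return null;
}
```

Because no LLM is involved, the pre-flight in useChat.send() can decide deterministically whether to hand the message to the runner or fall through to the normal stream.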
The detail screens (Esc-Tab-Enter on a Mission Control card) used to
render every action body through highlightPlain — everything came
back as undifferentiated grey, which made long AGENT.md / agent run
outputs hard to scan.
Replaced by per-shape highlighting :
- skill detail : Markdown highlighter (headings, lists, inline code,
bold, fenced blocks). The skill body is markdown, so this matches.
- write detail : if the file has YAML frontmatter (which AGENT.md
always does), split frontmatter and body. The frontmatter goes
through the YAML highlighter (already existing), the body through
the new Markdown one. Falls back to plain YAML for files without
frontmatter.
- run detail (and the compact run card in Mission Control) : a new
highlightAgentRun() that walks the streamed output and recognises :
· fenced ```forge:* blocks (open line orange, body via the
matching language highlighter — JSON for forge:bash/write/etc.)
· [forge:tool] / [/forge:tool] markers wrapping the result of the
previous tool call (rendered dim grey so it visually recedes)
· regular prose with inline code spans and bold
New helpers in syntax.ts :
highlightMarkdown(text)
highlightAgentRun(text)
highlightYamlLine / highlightJsonLine kept exported for the run
highlighter to delegate to.
The compact card in MissionControl now also uses highlightAgentRun
so the streaming output during a long run reads the same as the
detail view, just clipped at maxLines.
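
As an illustration of the walk that highlightAgentRun performs, the sketch below classifies each line of streamed output before any colouring is applied. The `Segment` type and `classifyRunLines` name are hypothetical, and the sketch assumes the fence lines and the [forge:tool] markers each sit on their own line; the real highlighter in syntax.ts may tokenize differently.

```typescript
type SegmentKind = "prose" | "fence-open" | "fence-body" | "fence-close" | "tool-result";

interface Segment {
  kind: SegmentKind;
  line: string;
  tag?: string; // e.g. "forge:bash" for lines belonging to a fenced block
}

// Built with repeat() so this example never contains a literal fence.
const FENCE = "`".repeat(3);

function classifyRunLines(text: string): Segment[] {
  const out: Segment[] = [];
  let fenceTag: string | null = null;
  let inToolResult = false;

  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (inToolResult) {
      // Everything up to and including the closing marker is tool output.
      out.push({ kind: "tool-result", line });
      if (trimmed === "[/forge:tool]") inToolResult = false;
    } else if (fenceTag !== null) {
      if (trimmed === FENCE) {
        out.push({ kind: "fence-close", line, tag: fenceTag });
        fenceTag = null;
      } else {
        out.push({ kind: "fence-body", line, tag: fenceTag });
      }
    } else if (trimmed.startsWith(FENCE + "forge:")) {
      fenceTag = trimmed.slice(FENCE.length);
      out.push({ kind: "fence-open", line, tag: fenceTag });
    } else if (trimmed === "[forge:tool]") {
      inToolResult = true;
      out.push({ kind: "tool-result", line });
    } else {
      out.push({ kind: "prose", line });
    }
  }
  return out;
}
```

A renderer could then map each kind to a style (orange for the open line, a JSON highlighter for the body, dim grey for tool results), and the compact card would apply the same mapping to a clipped prefix of the segments.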
When several agents run in a session, the panel was stacking 6+
fully-expanded cards and overflowing the terminal. Two changes :
1. Compact mode by default. Each non-focused card now renders as a
single line : badge + verb + target. Borders disappear, the
terminal stays calm. The focused card expands to its full preview
like before. Cards in 'running' status stay expanded too, so a
streaming agent run remains visible without having to Tab to it.
2. Bounded viewport. Mission Control now takes a panelHeight prop
(computed by App from the terminal rows minus a Welcome floor
and a spacer) and slices the action list to fit. Truncated
actions show as "↑ N above / ↓ N below" hints in the panel
header. Welcome stays glued to the bottom with flexShrink=0, so
the panel is what gives way on small terminals.
useCardFocus extended :
- scrollTop : action-index offset, advanced via PgUp/PgDn ;
- auto-focus the last new arrival when nothing is focused (the
user immediately sees what the builder just produced) ;
- auto-scroll lower bound : focusing an action above scrollTop
bumps scrollTop down to keep it visible.
App now routes PgUp/PgDn to the Mission Control scroll when there
are actions and the prompt is empty (or a card is focused). It
keeps falling back to the chat transcript scroll otherwise. Tab,
Shift+Tab, Enter, Esc unchanged.
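
The bounded viewport boils down to slicing the action list and counting what falls outside. This is a simplified sketch: `sliceViewport` and `scrollForFocus` are hypothetical names, and `maxVisible` is assumed to be a number of cards, whereas the real panel works from a panelHeight in terminal rows with cards of varying height.

```typescript
interface Viewport<T> {
  visible: T[];
  above: number; // count behind the "↑ N above" hint
  below: number; // count behind the "↓ N below" hint
}

function sliceViewport<T>(actions: T[], scrollTop: number, maxVisible: number): Viewport<T> {
  // Clamp so the window never starts before 0 or runs past the end of the list.
  const maxTop = Math.max(0, actions.length - maxVisible);
  const top = Math.max(0, Math.min(scrollTop, maxTop));
  const visible = actions.slice(top, top + maxVisible);
  return { visible, above: top, below: actions.length - top - visible.length };
}

// Lower-bound auto-scroll only, as described above: focusing an action that
// sits before scrollTop pulls scrollTop back to that action.
function scrollForFocus(scrollTop: number, focusedIndex: number): number {
  return focusedIndex < scrollTop ? focusedIndex : scrollTop;
}
```

With six actions, a scrollTop of 2 and room for three cards, the sketch shows actions 3 to 5 with two hidden above and one below.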
Status badge moves to P6 done. Roadmap table lifts P4 and P6 to
✅, points to P5 (hardened sandbox + persistent agents +
artifact extraction) as the next milestone.
Root README (EN/FR) gains :
- Native tools section : six-tool table (Bash, FileWrite,
FileRead, FileEdit, Grep, Glob) with their tags and limits ;
a short note explaining the choice of a text-structured
forge:* protocol over OpenAI tool_calls.
- Skills section : SKILL.md format, two sources (built-in and
~/.agent-forge/skills/), server-side matcher, two-call
runner, scaffold-and-run as the first built-in.
- Mission Control keyboard cheatsheet : Tab / Enter / Esc /
PgUp/PgDn / Ctrl+E.
- /skills slash command listed.
- Architecture diagram updated : skill catalog + runner on the
host side, /workspace mount + tool loop on the container
side, persistence of the workspace dir after exit.
- Repo structure shows packages/core/src/builder/skills/,
runtime/src/tool-protocol.ts, and the runtime/ subdir under
tools-core.
Sub-package READMEs realigned :
- packages/cli : compact / expanded card mode, scrollable
viewport, focus + auto-scroll, detail view, full keyboard
map, mention of server-side skill dispatch.
- packages/core : skill catalog / matcher / runner files
listed, scaffold-and-run noted as built-in.
- packages/runtime : multi-turn tool loop documented, six
forge:* tags, [forge:tool] markers on stdout, FORGE_MAX_TOKENS
env var.
- packages/tools-core : separate "host tools" and "runtime
tools" sections ; six runtime tools with their constraints ;
test layout listed.
feat(p6): skill layer — orchestration patterns for the builder
Summary
Brings prod up to date with dev. Two milestones to release : P4 (native tools) and P6 (skill layer).
What's NOT here
Test plan