diff --git a/.claude/skills/loop-engineering/SKILL.md b/.claude/skills/loop-engineering/SKILL.md new file mode 100644 index 0000000..a1d8b53 --- /dev/null +++ b/.claude/skills/loop-engineering/SKILL.md @@ -0,0 +1,136 @@ +--- +name: loop-engineering +description: Design bounded agentic loops that converge on a checkable goal instead of running forever or giving up too early. Use when wrapping any repetitive "drive this artifact to a desired state" task in a loop, and to compose the four SDD loops (spec-sharpen, build, ship, review-response) into an issue-to-merged pipeline. +when_to_use: + - Before building any loop — to name the six parts and pick the right shape. + - When composing the spec-sharpen, build, ship, and review-response loops into one pipeline. + - When a hand-run sequence (re-run `/spec` until clean, fix-push-wait until CI is green) is begging to be automated. +authoritative_references: + - docs/loop-engineering.md + - .claude/skills/spec-sharpen-loop/SKILL.md + - .claude/skills/sdd-build-loop/SKILL.md + - .claude/skills/pr-quality-gate/SKILL.md + - .claude/skills/pr-review-response/SKILL.md +--- + +# Loop engineering + +> A loop is not "run the agent again." A loop is a control system: observe the world, compare it to a goal, act to close the gap, observe again. The skill is not in the acting — agents are already good at acting. The skill is in defining *what the agent observes* and *when it is allowed to stop*. + +## The six parts of every loop + +If you can name all six for your task, you have a loop. If you cannot, you have a prompt you are running by hand. + +- **Goal / success criteria.** A precise, checkable definition of "done" that a script could answer yes or no to. Not "make the PR good" — "all checks green, coverage ≥ 90%, zero new static-analysis issues". +- **Sensors.** How the agent observes current state. This is the part people skip and the part that matters most; a loop is only as good as what it can measure. CI status, a coverage report, a Sonar API response, the output of a test run repeated twenty times. +- **Action.** What the agent changes when state does not match the goal. Fix the code, write the missing test, bump the dependency, commit, push. +- **Cadence.** How often the loop runs. Event-driven (wait for CI), self-paced (drain a local queue as fast as the work allows), or human-paced (wait on an answer). The wrong cadence either wastes money or misses the signal. +- **Bounded budget.** The escape hatch: maximum iterations, a wall-clock deadline, or a token cap. This is what prevents a loop from grinding forever on a problem it cannot solve. +- **Guardrails.** The rules about what the agent must *not* do. Never merge. Never force-push. Never touch files outside the change. Never disable a failing test to make it pass. + +Most failed loops have a clear goal and good sensors, then run away because nobody defined the **budget** or the **guardrails**. Those last two are non-negotiable. + +## The shape of a loop + +```text + ┌──────────────────────┐ + │ Start: artifact open │ + └──────────┬───────────┘ + ▼ + ┌──────────────────────────┐ + ┌────▶│ OBSERVE (run sensors) │ + │ └────────────┬─────────────┘ + │ ▼ + │ ┌─────────────┐ yes ┌─────────────────────┐ + │ │ goal met? │──────▶│ STOP: done ✅ │ + │ └──────┬──────┘ └─────────────────────┘ + │ no │ + │ ▼ + │ ┌──────────────────┐ + │ │ ACT: smallest │ + │ │ correct change │ + │ └────────┬─────────┘ + │ ▼ + │ ┌─────────────┐ no ┌─────────────────────┐ + │ │ budget left?│──────▶│ STOP: escalate 🚩 │ + │ └──────┬──────┘ └─────────────────────┘ + │ yes │ + │ ▼ + │ ┌──────────────────┐ + └─────────│ WAIT (cadence) │ + └──────────────────┘ +``` + +The escalation branch is not a failure — it is the loop doing its job. A loop that knows when to hand the problem back is more useful than one that pretends it can solve everything. + +## Four loop shapes + +The outer structure (observe → compare → act → check budget → repeat) is the same. What differs is the definition of *progress*. Tune each shape to its own failure mode. + +| Shape | "Done" means | Failure mode to watch | +| --- | --- | --- | +| Converge-to-green | every binary check passes | oscillating: fixing A breaks B | +| Incremental-to-a-number | a metric crosses a threshold | gaming the metric (assertion-free tests) | +| Queue / batch | the work queue is empty | one hard item blocks the rest | +| Repeat-until-confident | statistical confidence over many runs | declaring victory after one run | + +### Converge-to-green + +State is binary per check. Re-run **all** sensors after each change, not just the one you were working on, so you notice when you traded one red for another. + +### Incremental-to-a-number + +Always attack the lowest-scoring item next. Demand *meaningful* change — an agent told to "increase coverage" will write tests that execute code without asserting anything, because that moves the metric. Your success criteria must forbid it. + +### Queue / batch + +Drain a work queue one item at a time; "done" is an empty queue. Give each item its own small iteration cap. When an item exhausts it, **skip it, leave it in the queue, and report it** — do not let one hard item block the easy ones. + +### Repeat-until-confident + +The signal is non-deterministic, so a single observation tells you almost nothing. "Done" is *passed N consecutive runs*, where N grows with how intermittent the original failure was. Never declare victory after one run. + +## The four SDD loops + +These wrap the existing SDD commands, which already supply the sensors: + +1. **Spec-sharpen** (`spec-sharpen-loop`) — issue → a spec that passes `/spec-review` with zero open questions. Human-paced. +2. **Build** (`sdd-build-loop`) — sharp spec → a clean, `/review`-APPROVED feature branch. Self-paced. A queue (drain `/build` tasks) wrapping a converge-to-green (fix until `/validate` and `/review` are clean). +3. **Ship** (`pr-quality-gate`) — open PR → merge-ready. Polled on CI. +4. **Review-response** (`pr-review-response`) — green PR → every actionable reviewer thread addressed. Polled on review activity. + +## Composing them: issue → merged + +```text + issue + │ LOOP: spec-sharpen (human-paced) → escalate product decisions + ▼ sharp spec + │ LOOP: build (self-paced) → auto-commit on a feature branch + ▼ clean branch + │ HUMAN GATE: run the app, read the diff, decide — open the PR? + ▼ pull request + │ LOOP: ship (polled on CI) → all gates green + ▼ green PR, under review + │ LOOP: review-response (polled on review) → no threads remain + ▼ + human merges +``` + +Human judgment lands at exactly two points: answering genuine product questions at the front, and deciding a branch is worth a PR in the middle. Everything else is a loop converging on a checkable goal. Each loop's cadence is dictated by how fast the thing it watches can actually change — match cadence to your sensors and you never pay for a check that cannot tell you anything new. + +## Guardrails for any loop + +- **Bound everything.** Every loop needs a maximum iteration count, ideally a wall-clock deadline too. A loop without a budget is a runaway process. +- **Never let a loop merge.** Loops drive to merge-*ready* and stop. A human owns the merge. +- **Forbid metric-gaming explicitly.** Any loop pointed at a number or a verdict will try the easy way — disabling a flaky test, writing assertion-free tests, editing the spec so `/review` stops complaining. Spell out the forbidden shortcuts; the agent will not infer them. +- **Make every action reviewable and reversible.** One logical change per commit, conventional messages, branch only, never force-push. +- **Escalate on low confidence.** "I tried these three fixes and none worked, here is what I observed" beats a tenth desperate commit. +- **Match cadence to your sensors.** Poll on an interval only when waiting on something slow like CI. When sensors are local, run as fast as the work allows. + +## When to use a loop, and when not to + +Use a loop when three things are true: the goal is **objectively checkable** (a sensor returns yes/no), progress is **incremental** (each iteration gets measurably closer), and the work is **tedious enough** that convergence dominates the thinking. + +A loop is the *wrong* tool when the goal is subjective ("make this API design elegant"), when there is no reliable sensor, or when the hard part is a single decision rather than a sequence of mechanical steps. If you cannot write the "are we done?" check as something that returns a boolean, you do not have a loop yet — you have a conversation, and you should just have the conversation. + +The criteria *is* the work. Once you can state precisely what "done" means and how a sensor would detect it, the loop almost writes itself. diff --git a/.claude/skills/pr-quality-gate/SKILL.md b/.claude/skills/pr-quality-gate/SKILL.md new file mode 100644 index 0000000..f27cfc9 --- /dev/null +++ b/.claude/skills/pr-quality-gate/SKILL.md @@ -0,0 +1,71 @@ +--- +name: pr-quality-gate +description: Drive one GitHub PR to merge-ready by checking every quality gate and fixing whatever is red, then committing, pushing, and waiting for the next CI run. The ship loop. Stops when the PR is fully green or the iteration budget runs out. Never merges. +when_to_use: + - A PR is open and you want it driven to all-green without babysitting each CI round. + - After the build loop hands you a feature branch and you have opened the PR. + - Run as a recurring loop: `/loop 10m /pr-quality-gate ` in Claude Code, or re-invoke this skill each CI cycle on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .claude/skills/loop-engineering/SKILL.md + - .claude/skills/jacoco-coverage-policy/SKILL.md + - .claude/skills/pr-review-response/SKILL.md +--- + +# PR quality gate (the ship loop) + +> Loop shape: **converge-to-green**, with a **coverage climber** nested inside. Cadence: **polled on CI** (the sensor only changes once per CI run). Budget: **10 iterations**, then escalate. + +You are driving pull request **#$1** to merge-ready. Do exactly **one pass** per invocation. The loop runner (`/loop`) supplies the cadence and the budget; this skill supplies the goal, the sensors, the action, and the guardrails. + +## Sensors — gather current state first + +Re-run *all* of these every pass, not just the gate you fixed last time, so you catch a fix that traded one red for another: + +- `gh pr checks $1` — CI check status (unit, integration, architecture tests). +- `gh pr view $1 --json statusCheckRollup,mergeable` — the rollup and mergeability. +- The coverage report from the latest CI run artifacts (or `./.github/scripts/check-new-code-coverage.sh` locally). +- The static-analysis report (Checkstyle / SpotBugs / Sonar) for **new** issues introduced by this PR. + +## Success criteria (the goal) + +The PR is merge-ready ONLY when ALL of these are true: + +- Every CI check is passing. +- Checkstyle / static analysis reports zero new violations on this PR. +- Line coverage on new code is at or above the project threshold (see `jacoco-coverage-policy`). +- Sonar (if configured) reports zero new issues on this PR. + +When all are true, post a comment `✅ All gates green — ready to merge` and **STOP**. Do not merge the PR yourself. + +## Action — fix exactly what is failing + +For each failing gate, make the **smallest correct change that fixes the root cause**, then commit with a conventional-commit message and push to the PR branch. + +- **Tests / Checkstyle / Sonar** (converge-to-green): fix the code, then re-run *all* sensors so a fix for one gate has not broken another. +- **Coverage** (incremental-to-a-number): find the class furthest below the line, write *meaningful* tests for its uncovered branches, re-check the number, repeat with the next-lowest file. Tests must assert behavior, not just execute lines. + +## Guardrails — non-negotiable + +- NEVER merge, approve, or close the PR. +- NEVER force-push or rewrite history. +- NEVER disable, skip, `@Disabled`, or delete a test to make a gate pass. +- NEVER lower a coverage threshold or add a Checkstyle/SpotBugs suppression to silence a finding instead of fixing it. +- NEVER touch files unrelated to this PR's diff. +- If a failure is ambiguous, or you are not confident in the fix, post a comment explaining what you observed and what you tried, then STOP for a human. + +## Cadence and budget (the loop wrapper) + +Poll roughly as often as CI returns — there is no point checking every thirty seconds when the signal only changes once per CI run: + +```text +/loop 10m /pr-quality-gate 1234 +``` + +Stop when the PR posts the "ready to merge" comment, **or** after **10 iterations**, whichever comes first. On the tenth iteration without success, summarize what is still failing and what you tried, then stop and escalate. + +On Copilot or Windsurf (no native `/loop`), drive the cadence yourself: invoke this skill once per CI cycle and stop on the same two conditions. + +## Handoff + +A green PR is the input the `pr-review-response` loop wants. When a human reviewer leaves comments, switch to that loop to work them to zero. diff --git a/.claude/skills/pr-review-response/SKILL.md b/.claude/skills/pr-review-response/SKILL.md new file mode 100644 index 0000000..1ca1ee1 --- /dev/null +++ b/.claude/skills/pr-review-response/SKILL.md @@ -0,0 +1,53 @@ +--- +name: pr-review-response +description: Resolve actionable human review comments on a GitHub PR until none remain and the gates stay green — make the change, push, reply, repeat. The end of the pipeline. Triages each thread, escalates disagreements, and never merges or resolves a reviewer's thread. +when_to_use: + - A green PR is under human review and the comments need to be worked to zero. + - After the ship loop (`pr-quality-gate`) has driven the PR to all-green and a reviewer has left feedback. + - Run as a loop polled on review activity: `/loop 15m /pr-review-response ` in Claude Code, or re-invoke this skill as new comments arrive on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .claude/skills/loop-engineering/SKILL.md + - .claude/skills/pr-quality-gate/SKILL.md + - .claude/skills/spring-code-review-rubric/SKILL.md +--- + +# PR review-response loop (end of the pipeline) + +> Loop shape: **queue** — drain the unresolved review threads one at a time. The mirror image of the ship loop: that one converges the PR on CI, this one converges it on the humans. Cadence: **polled on review activity** (reviewers comment over hours, not seconds). Budget: escalate on disagreement or ambiguity; never argue in a thread. + +You are responding to human review on PR **#$1**. Do exactly **one pass** per invocation. + +## Sensors + +- `gh pr view $1 --json reviews,comments,reviewThreads` — list every **UNRESOLVED** thread. +- `gh pr checks $1` — current gate status. + +## Triage each unresolved thread + +- **CHANGE REQUESTED** ("do X", "rename Y", "handle case Z"): make the smallest correct change, commit, push, and reply in the thread linking the commit. Do NOT resolve the thread — the reviewer does that. +- **QUESTION** ("why did you…?"): reply in the thread with an explanation. Change code only if the honest answer is "you're right, fixing it". +- **DISAGREEMENT or ambiguous intent**: STOP and escalate to your human. Do not argue in the thread, and do not silently comply. + +After any code change, re-run the ship-loop gates (`pr-quality-gate`) so a review fix does not re-break CI. + +## Success criteria (the goal) + +No unresolved actionable threads remain AND all gates are green. Then STOP. Do not merge. + +## Guardrails — non-negotiable + +- NEVER merge, approve, or resolve a reviewer's thread. +- NEVER push back on a reviewer; escalate disagreements to your human. +- One logical change per commit; never force-push. +- If a comment needs a product or architecture decision, escalate instead of deciding it inside the loop. + +## Cadence (the loop wrapper) + +Poll on an interval, because the sensor is human review activity: + +```text +/loop 15m /pr-review-response 1234 +``` + +The hard part is not the editing, it is the **triage**. A loop that treats every comment as "change requested" will rewrite code in response to a reviewer who was only asking a question; a loop that treats every comment as a question will ignore real change requests. Spelling out the difference, and forcing an escalation on disagreement, is what keeps the loop from arguing with your reviewer on your behalf. On Copilot or Windsurf, re-invoke this skill as new comments arrive. diff --git a/.claude/skills/sdd-build-loop/SKILL.md b/.claude/skills/sdd-build-loop/SKILL.md new file mode 100644 index 0000000..4439109 --- /dev/null +++ b/.claude/skills/sdd-build-loop/SKILL.md @@ -0,0 +1,67 @@ +--- +name: sdd-build-loop +description: Drive an SDD feature from its task queue to a clean, reviewed branch — drain every `/build` task, then converge `/validate` and `/review` to green, auto-committing on the feature branch. The build loop. Trust-gated; runs only on a feature branch and never opens a PR or merges. +when_to_use: + - A feature has a sharp spec, `/plan` has produced `.tdd-state.json`, and you want the task queue driven to a reviewed branch. + - Only after you have earned trust in the harness (see the trust checklist below). Otherwise run `/build` task-by-task with your eyes on every diff. + - Run as a self-paced loop: `/loop /sdd-build-loop ` in Claude Code, or re-invoke this skill until the success criteria are met on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .claude/skills/loop-engineering/SKILL.md + - .claude/skills/tdd-red-green-refactor/SKILL.md + - .claude/skills/harness-report-parsing/SKILL.md + - .claude/skills/spring-code-review-rubric/SKILL.md +--- + +# SDD build loop (the build loop) + +> Loop shape: a **queue** (drain `/build` tasks) wrapping a **converge-to-green** (fix until `/validate` and `/review` are clean). Cadence: **self-paced** — no CI to wait on; run the next step the moment the last one finishes. Budget: a cap on review-fix rounds, then escalate. + +You are building the feature in `.specs/$1` to a reviewed, committable state on the **current feature branch**. Confirm you are **NOT on `main`** before doing anything. The framework *is* the sensor layer here: `.tdd-state.json`, the `/validate` output, and the `/review` verdict. + +## Phase 1 — drain the task queue (queue shape) + +1. Read `.specs/$1/.tdd-state.json` for the next task whose phase is `pending`. +2. Run `/build` for that task. Let it move `pending → red → green → refactor → done` and commit on completion (one task per commit). +3. Repeat until no `pending` tasks remain. +4. If one task fails its gates **twice in a row**, STOP and escalate. Do not skip it silently — a missing task is not a done feature. + +## Phase 2 — validate and review (converge-to-green) + +1. Run `/validate`. If anything fails (tests, coverage, lint, traceability), fix the root cause, commit, and run `/validate` again. +2. Run `/review`. If the verdict has ANY must-fix findings, fix each with the smallest correct change, commit, and run `/review` again. +3. Stop Phase 2 when `/validate` is clean AND `/review` returns APPROVE with zero must-fix findings. Leave should-fix and nit findings in the handoff summary for the human — do not gold-plate. + +## Success criteria (the goal) + +All tasks done, `/validate` clean, `/review` APPROVE with zero must-fix. Then STOP and write a handoff summary listing the should-fix and nit items you deliberately left. **DO NOT open a PR or merge** — that is the human's call. + +## Guardrails — non-negotiable + +- Work ONLY on the feature branch. NEVER commit to `main`. +- NEVER merge, and NEVER open the PR. +- NEVER edit the spec, weaken a test, lower a threshold, or add a suppression to make a gate pass. +- One logical change per commit; conventional messages; never force-push. +- After **5 review-fix rounds** without reaching APPROVE, STOP and escalate with what is still failing and what you tried. + +## The trust you have to earn + +This loop does not *remove* human review — it **relocates** it. It replaces the per-task checkpoint with automated sensors (`/validate`, `/review`) on the small stuff and one deliberate human validation round at the end. That trade is only safe once the sensors are trustworthy enough to stand in for you. A feature is allowed through this loop only when: + +- You have shipped enough features with the framework that `/review` rarely surfaces something you disagree with. +- Your skills, agents, and review rubric encode *your* team's standards. +- Your coverage threshold, lint rules, and `/review` checklist fail loudly on the mistakes you care about — a green `/validate` actually *means* something. +- The spec passed `/spec-review` with no open questions. The loop amplifies whatever the spec says; a vague spec produces confident, wrong code faster. +- The feature is a pattern you have done before (a CRUD slice with a clear contract), not a novel architectural decision. + +If those conditions are not met, run the framework the normal way, task by task. The loop is an optimization for the cases you already understand. + +## Cadence and handoff (the loop wrapper) + +Because there is no CI to wait on, this loop self-paces: + +```text +/loop /sdd-build-loop 2026-05-09-create-new-customer +``` + +When it stops you have a feature branch with a drained task queue, a clean `/validate`, an `APPROVE` from `/review`, and a handoff summary — exactly the input the human PR gate wants. The human runs the app, reads the diff, and decides whether to open the PR. Once the PR exists, the `pr-quality-gate` loop takes over. diff --git a/.claude/skills/spec-sharpen-loop/SKILL.md b/.claude/skills/spec-sharpen-loop/SKILL.md new file mode 100644 index 0000000..b279c06 --- /dev/null +++ b/.claude/skills/spec-sharpen-loop/SKILL.md @@ -0,0 +1,53 @@ +--- +name: spec-sharpen-loop +description: Turn a raw GitHub issue into a spec that passes `/spec-review` with zero open questions — resolving what it can from existing material and escalating genuine product decisions to a human. The front of the pipeline. Produces a spec and nothing else. +when_to_use: + - A GitHub issue needs to become a sharp spec before `/plan` and the build loop. + - Whenever `/spec` surfaces open questions and you want them triaged and driven to zero in rounds. + - Run as a human-paced loop: `/loop /spec-sharpen-loop ` in Claude Code, or re-invoke this skill each time you answer the product questions on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .claude/skills/loop-engineering/SKILL.md + - .claude/skills/ears-spec-authoring/SKILL.md + - .claude/skills/issue-tracker-ingestion/SKILL.md +--- + +# Spec-sharpen loop (front of the pipeline) + +> Loop shape: **converge-to-green** on the `/spec-review` verdict, gated by a human triage step. Cadence: **human-paced** — *you* are the sensor it waits on. It advances a round each time you answer the product questions it surfaces, then blocks again. Budget: escalate if the same question reappears after being answered. + +You are turning issue **#$1** into a spec that passes `/spec-review` with zero open questions. Work in rounds. The build loop downstream is only safe if the spec going in is sharp: a vague spec does not produce vague code, it produces confident, wrong code, faster. + +## Sensors + +- The open-question list `/spec` surfaces for this issue. +- The `/spec-review` verdict (PASS/FAIL) and its findings. + +## Each round + +1. Run `/spec` for the issue (or re-run after answers have been added). +2. Triage **every** open question into one of two buckets: + - **RESOLVABLE** from the issue, linked tickets, or existing specs — atomic acceptance criteria, missing non-goals, ambiguous wording, contract or format gaps. Resolve these and update the spec. + - **PRODUCT DECISION** you cannot ground in existing material — pricing, policy, UX intent, scope trade-offs. Collect these. +3. If any PRODUCT DECISION questions remain, **STOP** and post them to the human as a numbered list. Wait for answers; do not invent them. +4. Run `/spec-review`. If FAIL, address the structural findings and start another round. + +## Success criteria (the goal) + +`/spec-review` returns **PASS** with zero open questions remaining. Then STOP — the spec is ready for `/plan` and the build loop. This loop produces a spec and nothing else. + +## Guardrails — non-negotiable + +- NEVER answer a product-decision question yourself to force a PASS. A clean spec built on guessed intent is the most expensive failure in the whole pipeline — every downstream loop trusts this spec, and will build, test, and ship the wrong thing with great efficiency. +- NEVER start design or code. No `/plan`, no `/build`. +- If the same question reappears after being answered, escalate; the issue itself may be underspecified. + +## Cadence (the loop wrapper) + +This is the one loop that pauses inside itself for a human on purpose: + +```text +/loop /spec-sharpen-loop 1 +``` + +It is human-paced by design. You are not waiting on a clock or on CI; the loop is waiting on *your* product decisions. Answer them, and it advances a round; until you do, it blocks. On Copilot or Windsurf, the cadence is the same — re-invoke the skill after you have answered the questions it raised. diff --git a/.github/skills/loop-engineering/SKILL.md b/.github/skills/loop-engineering/SKILL.md new file mode 100644 index 0000000..61548f2 --- /dev/null +++ b/.github/skills/loop-engineering/SKILL.md @@ -0,0 +1,136 @@ +--- +name: loop-engineering +description: Design bounded agentic loops that converge on a checkable goal instead of running forever or giving up too early. Use when wrapping any repetitive "drive this artifact to a desired state" task in a loop, and to compose the four SDD loops (spec-sharpen, build, ship, review-response) into an issue-to-merged pipeline. +when_to_use: + - Before building any loop — to name the six parts and pick the right shape. + - When composing the spec-sharpen, build, ship, and review-response loops into one pipeline. + - When a hand-run sequence (re-run `/spec` until clean, fix-push-wait until CI is green) is begging to be automated. +authoritative_references: + - docs/loop-engineering.md + - .github/skills/spec-sharpen-loop/SKILL.md + - .github/skills/sdd-build-loop/SKILL.md + - .github/skills/pr-quality-gate/SKILL.md + - .github/skills/pr-review-response/SKILL.md +--- + +# Loop engineering + +> A loop is not "run the agent again." A loop is a control system: observe the world, compare it to a goal, act to close the gap, observe again. The skill is not in the acting — agents are already good at acting. The skill is in defining *what the agent observes* and *when it is allowed to stop*. + +## The six parts of every loop + +If you can name all six for your task, you have a loop. If you cannot, you have a prompt you are running by hand. + +- **Goal / success criteria.** A precise, checkable definition of "done" that a script could answer yes or no to. Not "make the PR good" — "all checks green, coverage ≥ 90%, zero new static-analysis issues". +- **Sensors.** How the agent observes current state. This is the part people skip and the part that matters most; a loop is only as good as what it can measure. CI status, a coverage report, a Sonar API response, the output of a test run repeated twenty times. +- **Action.** What the agent changes when state does not match the goal. Fix the code, write the missing test, bump the dependency, commit, push. +- **Cadence.** How often the loop runs. Event-driven (wait for CI), self-paced (drain a local queue as fast as the work allows), or human-paced (wait on an answer). The wrong cadence either wastes money or misses the signal. +- **Bounded budget.** The escape hatch: maximum iterations, a wall-clock deadline, or a token cap. This is what prevents a loop from grinding forever on a problem it cannot solve. +- **Guardrails.** The rules about what the agent must *not* do. Never merge. Never force-push. Never touch files outside the change. Never disable a failing test to make it pass. + +Most failed loops have a clear goal and good sensors, then run away because nobody defined the **budget** or the **guardrails**. Those last two are non-negotiable. + +## The shape of a loop + +```text + ┌──────────────────────┐ + │ Start: artifact open │ + └──────────┬───────────┘ + ▼ + ┌──────────────────────────┐ + ┌────▶│ OBSERVE (run sensors) │ + │ └────────────┬─────────────┘ + │ ▼ + │ ┌─────────────┐ yes ┌─────────────────────┐ + │ │ goal met? │──────▶│ STOP: done ✅ │ + │ └──────┬──────┘ └─────────────────────┘ + │ no │ + │ ▼ + │ ┌──────────────────┐ + │ │ ACT: smallest │ + │ │ correct change │ + │ └────────┬─────────┘ + │ ▼ + │ ┌─────────────┐ no ┌─────────────────────┐ + │ │ budget left?│──────▶│ STOP: escalate 🚩 │ + │ └──────┬──────┘ └─────────────────────┘ + │ yes │ + │ ▼ + │ ┌──────────────────┐ + └─────────│ WAIT (cadence) │ + └──────────────────┘ +``` + +The escalation branch is not a failure — it is the loop doing its job. A loop that knows when to hand the problem back is more useful than one that pretends it can solve everything. + +## Four loop shapes + +The outer structure (observe → compare → act → check budget → repeat) is the same. What differs is the definition of *progress*. Tune each shape to its own failure mode. + +| Shape | "Done" means | Failure mode to watch | +| --- | --- | --- | +| Converge-to-green | every binary check passes | oscillating: fixing A breaks B | +| Incremental-to-a-number | a metric crosses a threshold | gaming the metric (assertion-free tests) | +| Queue / batch | the work queue is empty | one hard item blocks the rest | +| Repeat-until-confident | statistical confidence over many runs | declaring victory after one run | + +### Converge-to-green + +State is binary per check. Re-run **all** sensors after each change, not just the one you were working on, so you notice when you traded one red for another. + +### Incremental-to-a-number + +Always attack the lowest-scoring item next. Demand *meaningful* change — an agent told to "increase coverage" will write tests that execute code without asserting anything, because that moves the metric. Your success criteria must forbid it. + +### Queue / batch + +Drain a work queue one item at a time; "done" is an empty queue. Give each item its own small iteration cap. When an item exhausts it, **skip it, leave it in the queue, and report it** — do not let one hard item block the easy ones. + +### Repeat-until-confident + +The signal is non-deterministic, so a single observation tells you almost nothing. "Done" is *passed N consecutive runs*, where N grows with how intermittent the original failure was. Never declare victory after one run. + +## The four SDD loops + +These wrap the existing SDD commands, which already supply the sensors: + +1. **Spec-sharpen** (`spec-sharpen-loop`) — issue → a spec that passes `/spec-review` with zero open questions. Human-paced. +2. **Build** (`sdd-build-loop`) — sharp spec → a clean, `/review`-APPROVED feature branch. Self-paced. A queue (drain `/build` tasks) wrapping a converge-to-green (fix until `/validate` and `/review` are clean). +3. **Ship** (`pr-quality-gate`) — open PR → merge-ready. Polled on CI. +4. **Review-response** (`pr-review-response`) — green PR → every actionable reviewer thread addressed. Polled on review activity. + +## Composing them: issue → merged + +```text + issue + │ LOOP: spec-sharpen (human-paced) → escalate product decisions + ▼ sharp spec + │ LOOP: build (self-paced) → auto-commit on a feature branch + ▼ clean branch + │ HUMAN GATE: run the app, read the diff, decide — open the PR? + ▼ pull request + │ LOOP: ship (polled on CI) → all gates green + ▼ green PR, under review + │ LOOP: review-response (polled on review) → no threads remain + ▼ + human merges +``` + +Human judgment lands at exactly two points: answering genuine product questions at the front, and deciding a branch is worth a PR in the middle. Everything else is a loop converging on a checkable goal. Each loop's cadence is dictated by how fast the thing it watches can actually change — match cadence to your sensors and you never pay for a check that cannot tell you anything new. + +## Guardrails for any loop + +- **Bound everything.** Every loop needs a maximum iteration count, ideally a wall-clock deadline too. A loop without a budget is a runaway process. +- **Never let a loop merge.** Loops drive to merge-*ready* and stop. A human owns the merge. +- **Forbid metric-gaming explicitly.** Any loop pointed at a number or a verdict will try the easy way — disabling a flaky test, writing assertion-free tests, editing the spec so `/review` stops complaining. Spell out the forbidden shortcuts; the agent will not infer them. +- **Make every action reviewable and reversible.** One logical change per commit, conventional messages, branch only, never force-push. +- **Escalate on low confidence.** "I tried these three fixes and none worked, here is what I observed" beats a tenth desperate commit. +- **Match cadence to your sensors.** Poll on an interval only when waiting on something slow like CI. When sensors are local, run as fast as the work allows. + +## When to use a loop, and when not to + +Use a loop when three things are true: the goal is **objectively checkable** (a sensor returns yes/no), progress is **incremental** (each iteration gets measurably closer), and the work is **tedious enough** that convergence dominates the thinking. + +A loop is the *wrong* tool when the goal is subjective ("make this API design elegant"), when there is no reliable sensor, or when the hard part is a single decision rather than a sequence of mechanical steps. If you cannot write the "are we done?" check as something that returns a boolean, you do not have a loop yet — you have a conversation, and you should just have the conversation. + +The criteria *is* the work. Once you can state precisely what "done" means and how a sensor would detect it, the loop almost writes itself. diff --git a/.github/skills/pr-quality-gate/SKILL.md b/.github/skills/pr-quality-gate/SKILL.md new file mode 100644 index 0000000..50afbd4 --- /dev/null +++ b/.github/skills/pr-quality-gate/SKILL.md @@ -0,0 +1,71 @@ +--- +name: pr-quality-gate +description: Drive one GitHub PR to merge-ready by checking every quality gate and fixing whatever is red, then committing, pushing, and waiting for the next CI run. The ship loop. Stops when the PR is fully green or the iteration budget runs out. Never merges. +when_to_use: + - A PR is open and you want it driven to all-green without babysitting each CI round. + - After the build loop hands you a feature branch and you have opened the PR. + - Run as a recurring loop: `/loop 10m /pr-quality-gate ` in Claude Code, or re-invoke this skill each CI cycle on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .github/skills/loop-engineering/SKILL.md + - .github/skills/jacoco-coverage-policy/SKILL.md + - .github/skills/pr-review-response/SKILL.md +--- + +# PR quality gate (the ship loop) + +> Loop shape: **converge-to-green**, with a **coverage climber** nested inside. Cadence: **polled on CI** (the sensor only changes once per CI run). Budget: **10 iterations**, then escalate. + +You are driving pull request **#$1** to merge-ready. Do exactly **one pass** per invocation. The loop runner (`/loop`) supplies the cadence and the budget; this skill supplies the goal, the sensors, the action, and the guardrails. + +## Sensors — gather current state first + +Re-run *all* of these every pass, not just the gate you fixed last time, so you catch a fix that traded one red for another: + +- `gh pr checks $1` — CI check status (unit, integration, architecture tests). +- `gh pr view $1 --json statusCheckRollup,mergeable` — the rollup and mergeability. +- The coverage report from the latest CI run artifacts (or `./.github/scripts/check-new-code-coverage.sh` locally). +- The static-analysis report (Checkstyle / SpotBugs / Sonar) for **new** issues introduced by this PR. + +## Success criteria (the goal) + +The PR is merge-ready ONLY when ALL of these are true: + +- Every CI check is passing. +- Checkstyle / static analysis reports zero new violations on this PR. +- Line coverage on new code is at or above the project threshold (see `jacoco-coverage-policy`). +- Sonar (if configured) reports zero new issues on this PR. + +When all are true, post a comment `✅ All gates green — ready to merge` and **STOP**. Do not merge the PR yourself. + +## Action — fix exactly what is failing + +For each failing gate, make the **smallest correct change that fixes the root cause**, then commit with a conventional-commit message and push to the PR branch. + +- **Tests / Checkstyle / Sonar** (converge-to-green): fix the code, then re-run *all* sensors so a fix for one gate has not broken another. +- **Coverage** (incremental-to-a-number): find the class furthest below the line, write *meaningful* tests for its uncovered branches, re-check the number, repeat with the next-lowest file. Tests must assert behavior, not just execute lines. + +## Guardrails — non-negotiable + +- NEVER merge, approve, or close the PR. +- NEVER force-push or rewrite history. +- NEVER disable, skip, `@Disabled`, or delete a test to make a gate pass. +- NEVER lower a coverage threshold or add a Checkstyle/SpotBugs suppression to silence a finding instead of fixing it. +- NEVER touch files unrelated to this PR's diff. +- If a failure is ambiguous, or you are not confident in the fix, post a comment explaining what you observed and what you tried, then STOP for a human. + +## Cadence and budget (the loop wrapper) + +Poll roughly as often as CI returns — there is no point checking every thirty seconds when the signal only changes once per CI run: + +```text +/loop 10m /pr-quality-gate 1234 +``` + +Stop when the PR posts the "ready to merge" comment, **or** after **10 iterations**, whichever comes first. On the tenth iteration without success, summarize what is still failing and what you tried, then stop and escalate. + +On Copilot or Windsurf (no native `/loop`), drive the cadence yourself: invoke this skill once per CI cycle and stop on the same two conditions. + +## Handoff + +A green PR is the input the `pr-review-response` loop wants. When a human reviewer leaves comments, switch to that loop to work them to zero. diff --git a/.github/skills/pr-review-response/SKILL.md b/.github/skills/pr-review-response/SKILL.md new file mode 100644 index 0000000..b7ad2f1 --- /dev/null +++ b/.github/skills/pr-review-response/SKILL.md @@ -0,0 +1,53 @@ +--- +name: pr-review-response +description: Resolve actionable human review comments on a GitHub PR until none remain and the gates stay green — make the change, push, reply, repeat. The end of the pipeline. Triages each thread, escalates disagreements, and never merges or resolves a reviewer's thread. +when_to_use: + - A green PR is under human review and the comments need to be worked to zero. + - After the ship loop (`pr-quality-gate`) has driven the PR to all-green and a reviewer has left feedback. + - Run as a loop polled on review activity: `/loop 15m /pr-review-response ` in Claude Code, or re-invoke this skill as new comments arrive on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .github/skills/loop-engineering/SKILL.md + - .github/skills/pr-quality-gate/SKILL.md + - .github/skills/spring-code-review-rubric/SKILL.md +--- + +# PR review-response loop (end of the pipeline) + +> Loop shape: **queue** — drain the unresolved review threads one at a time. The mirror image of the ship loop: that one converges the PR on CI, this one converges it on the humans. Cadence: **polled on review activity** (reviewers comment over hours, not seconds). Budget: escalate on disagreement or ambiguity; never argue in a thread. + +You are responding to human review on PR **#$1**. Do exactly **one pass** per invocation. + +## Sensors + +- `gh pr view $1 --json reviews,comments,reviewThreads` — list every **UNRESOLVED** thread. +- `gh pr checks $1` — current gate status. + +## Triage each unresolved thread + +- **CHANGE REQUESTED** ("do X", "rename Y", "handle case Z"): make the smallest correct change, commit, push, and reply in the thread linking the commit. Do NOT resolve the thread — the reviewer does that. +- **QUESTION** ("why did you…?"): reply in the thread with an explanation. Change code only if the honest answer is "you're right, fixing it". +- **DISAGREEMENT or ambiguous intent**: STOP and escalate to your human. Do not argue in the thread, and do not silently comply. + +After any code change, re-run the ship-loop gates (`pr-quality-gate`) so a review fix does not re-break CI. + +## Success criteria (the goal) + +No unresolved actionable threads remain AND all gates are green. Then STOP. Do not merge. + +## Guardrails — non-negotiable + +- NEVER merge, approve, or resolve a reviewer's thread. +- NEVER push back on a reviewer; escalate disagreements to your human. +- One logical change per commit; never force-push. +- If a comment needs a product or architecture decision, escalate instead of deciding it inside the loop. + +## Cadence (the loop wrapper) + +Poll on an interval, because the sensor is human review activity: + +```text +/loop 15m /pr-review-response 1234 +``` + +The hard part is not the editing, it is the **triage**. A loop that treats every comment as "change requested" will rewrite code in response to a reviewer who was only asking a question; a loop that treats every comment as a question will ignore real change requests. Spelling out the difference, and forcing an escalation on disagreement, is what keeps the loop from arguing with your reviewer on your behalf. On Copilot or Windsurf, re-invoke this skill as new comments arrive. diff --git a/.github/skills/sdd-build-loop/SKILL.md b/.github/skills/sdd-build-loop/SKILL.md new file mode 100644 index 0000000..10456b3 --- /dev/null +++ b/.github/skills/sdd-build-loop/SKILL.md @@ -0,0 +1,67 @@ +--- +name: sdd-build-loop +description: Drive an SDD feature from its task queue to a clean, reviewed branch — drain every `/build` task, then converge `/validate` and `/review` to green, auto-committing on the feature branch. The build loop. Trust-gated; runs only on a feature branch and never opens a PR or merges. +when_to_use: + - A feature has a sharp spec, `/plan` has produced `.tdd-state.json`, and you want the task queue driven to a reviewed branch. + - Only after you have earned trust in the harness (see the trust checklist below). Otherwise run `/build` task-by-task with your eyes on every diff. + - Run as a self-paced loop: `/loop /sdd-build-loop ` in Claude Code, or re-invoke this skill until the success criteria are met on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .github/skills/loop-engineering/SKILL.md + - .github/skills/tdd-red-green-refactor/SKILL.md + - .github/skills/harness-report-parsing/SKILL.md + - .github/skills/spring-code-review-rubric/SKILL.md +--- + +# SDD build loop (the build loop) + +> Loop shape: a **queue** (drain `/build` tasks) wrapping a **converge-to-green** (fix until `/validate` and `/review` are clean). Cadence: **self-paced** — no CI to wait on; run the next step the moment the last one finishes. Budget: a cap on review-fix rounds, then escalate. + +You are building the feature in `.specs/$1` to a reviewed, committable state on the **current feature branch**. Confirm you are **NOT on `main`** before doing anything. The framework *is* the sensor layer here: `.tdd-state.json`, the `/validate` output, and the `/review` verdict. + +## Phase 1 — drain the task queue (queue shape) + +1. Read `.specs/$1/.tdd-state.json` for the next task whose phase is `pending`. +2. Run `/build` for that task. Let it move `pending → red → green → refactor → done` and commit on completion (one task per commit). +3. Repeat until no `pending` tasks remain. +4. If one task fails its gates **twice in a row**, STOP and escalate. Do not skip it silently — a missing task is not a done feature. + +## Phase 2 — validate and review (converge-to-green) + +1. Run `/validate`. If anything fails (tests, coverage, lint, traceability), fix the root cause, commit, and run `/validate` again. +2. Run `/review`. If the verdict has ANY must-fix findings, fix each with the smallest correct change, commit, and run `/review` again. +3. Stop Phase 2 when `/validate` is clean AND `/review` returns APPROVE with zero must-fix findings. Leave should-fix and nit findings in the handoff summary for the human — do not gold-plate. + +## Success criteria (the goal) + +All tasks done, `/validate` clean, `/review` APPROVE with zero must-fix. Then STOP and write a handoff summary listing the should-fix and nit items you deliberately left. **DO NOT open a PR or merge** — that is the human's call. + +## Guardrails — non-negotiable + +- Work ONLY on the feature branch. NEVER commit to `main`. +- NEVER merge, and NEVER open the PR. +- NEVER edit the spec, weaken a test, lower a threshold, or add a suppression to make a gate pass. +- One logical change per commit; conventional messages; never force-push. +- After **5 review-fix rounds** without reaching APPROVE, STOP and escalate with what is still failing and what you tried. + +## The trust you have to earn + +This loop does not *remove* human review — it **relocates** it. It replaces the per-task checkpoint with automated sensors (`/validate`, `/review`) on the small stuff and one deliberate human validation round at the end. That trade is only safe once the sensors are trustworthy enough to stand in for you. A feature is allowed through this loop only when: + +- You have shipped enough features with the framework that `/review` rarely surfaces something you disagree with. +- Your skills, agents, and review rubric encode *your* team's standards. +- Your coverage threshold, lint rules, and `/review` checklist fail loudly on the mistakes you care about — a green `/validate` actually *means* something. +- The spec passed `/spec-review` with no open questions. The loop amplifies whatever the spec says; a vague spec produces confident, wrong code faster. +- The feature is a pattern you have done before (a CRUD slice with a clear contract), not a novel architectural decision. + +If those conditions are not met, run the framework the normal way, task by task. The loop is an optimization for the cases you already understand. + +## Cadence and handoff (the loop wrapper) + +Because there is no CI to wait on, this loop self-paces: + +```text +/loop /sdd-build-loop 2026-05-09-create-new-customer +``` + +When it stops you have a feature branch with a drained task queue, a clean `/validate`, an `APPROVE` from `/review`, and a handoff summary — exactly the input the human PR gate wants. The human runs the app, reads the diff, and decides whether to open the PR. Once the PR exists, the `pr-quality-gate` loop takes over. diff --git a/.github/skills/spec-sharpen-loop/SKILL.md b/.github/skills/spec-sharpen-loop/SKILL.md new file mode 100644 index 0000000..00edc6e --- /dev/null +++ b/.github/skills/spec-sharpen-loop/SKILL.md @@ -0,0 +1,53 @@ +--- +name: spec-sharpen-loop +description: Turn a raw GitHub issue into a spec that passes `/spec-review` with zero open questions — resolving what it can from existing material and escalating genuine product decisions to a human. The front of the pipeline. Produces a spec and nothing else. +when_to_use: + - A GitHub issue needs to become a sharp spec before `/plan` and the build loop. + - Whenever `/spec` surfaces open questions and you want them triaged and driven to zero in rounds. + - Run as a human-paced loop: `/loop /spec-sharpen-loop ` in Claude Code, or re-invoke this skill each time you answer the product questions on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .github/skills/loop-engineering/SKILL.md + - .github/skills/ears-spec-authoring/SKILL.md + - .github/skills/issue-tracker-ingestion/SKILL.md +--- + +# Spec-sharpen loop (front of the pipeline) + +> Loop shape: **converge-to-green** on the `/spec-review` verdict, gated by a human triage step. Cadence: **human-paced** — *you* are the sensor it waits on. It advances a round each time you answer the product questions it surfaces, then blocks again. Budget: escalate if the same question reappears after being answered. + +You are turning issue **#$1** into a spec that passes `/spec-review` with zero open questions. Work in rounds. The build loop downstream is only safe if the spec going in is sharp: a vague spec does not produce vague code, it produces confident, wrong code, faster. + +## Sensors + +- The open-question list `/spec` surfaces for this issue. +- The `/spec-review` verdict (PASS/FAIL) and its findings. + +## Each round + +1. Run `/spec` for the issue (or re-run after answers have been added). +2. Triage **every** open question into one of two buckets: + - **RESOLVABLE** from the issue, linked tickets, or existing specs — atomic acceptance criteria, missing non-goals, ambiguous wording, contract or format gaps. Resolve these and update the spec. + - **PRODUCT DECISION** you cannot ground in existing material — pricing, policy, UX intent, scope trade-offs. Collect these. +3. If any PRODUCT DECISION questions remain, **STOP** and post them to the human as a numbered list. Wait for answers; do not invent them. +4. Run `/spec-review`. If FAIL, address the structural findings and start another round. + +## Success criteria (the goal) + +`/spec-review` returns **PASS** with zero open questions remaining. Then STOP — the spec is ready for `/plan` and the build loop. This loop produces a spec and nothing else. + +## Guardrails — non-negotiable + +- NEVER answer a product-decision question yourself to force a PASS. A clean spec built on guessed intent is the most expensive failure in the whole pipeline — every downstream loop trusts this spec, and will build, test, and ship the wrong thing with great efficiency. +- NEVER start design or code. No `/plan`, no `/build`. +- If the same question reappears after being answered, escalate; the issue itself may be underspecified. + +## Cadence (the loop wrapper) + +This is the one loop that pauses inside itself for a human on purpose: + +```text +/loop /spec-sharpen-loop 1 +``` + +It is human-paced by design. You are not waiting on a clock or on CI; the loop is waiting on *your* product decisions. Answer them, and it advances a round; until you do, it blocks. On Copilot or Windsurf, the cadence is the same — re-invoke the skill after you have answered the questions it raised. diff --git a/.windsurf/skills/loop-engineering/SKILL.md b/.windsurf/skills/loop-engineering/SKILL.md new file mode 100644 index 0000000..2137214 --- /dev/null +++ b/.windsurf/skills/loop-engineering/SKILL.md @@ -0,0 +1,136 @@ +--- +name: loop-engineering +description: Design bounded agentic loops that converge on a checkable goal instead of running forever or giving up too early. Use when wrapping any repetitive "drive this artifact to a desired state" task in a loop, and to compose the four SDD loops (spec-sharpen, build, ship, review-response) into an issue-to-merged pipeline. +when_to_use: + - Before building any loop — to name the six parts and pick the right shape. + - When composing the spec-sharpen, build, ship, and review-response loops into one pipeline. + - When a hand-run sequence (re-run `/spec` until clean, fix-push-wait until CI is green) is begging to be automated. +authoritative_references: + - docs/loop-engineering.md + - .windsurf/skills/spec-sharpen-loop/SKILL.md + - .windsurf/skills/sdd-build-loop/SKILL.md + - .windsurf/skills/pr-quality-gate/SKILL.md + - .windsurf/skills/pr-review-response/SKILL.md +--- + +# Loop engineering + +> A loop is not "run the agent again." A loop is a control system: observe the world, compare it to a goal, act to close the gap, observe again. The skill is not in the acting — agents are already good at acting. The skill is in defining *what the agent observes* and *when it is allowed to stop*. + +## The six parts of every loop + +If you can name all six for your task, you have a loop. If you cannot, you have a prompt you are running by hand. + +- **Goal / success criteria.** A precise, checkable definition of "done" that a script could answer yes or no to. Not "make the PR good" — "all checks green, coverage ≥ 90%, zero new static-analysis issues". +- **Sensors.** How the agent observes current state. This is the part people skip and the part that matters most; a loop is only as good as what it can measure. CI status, a coverage report, a Sonar API response, the output of a test run repeated twenty times. +- **Action.** What the agent changes when state does not match the goal. Fix the code, write the missing test, bump the dependency, commit, push. +- **Cadence.** How often the loop runs. Event-driven (wait for CI), self-paced (drain a local queue as fast as the work allows), or human-paced (wait on an answer). The wrong cadence either wastes money or misses the signal. +- **Bounded budget.** The escape hatch: maximum iterations, a wall-clock deadline, or a token cap. This is what prevents a loop from grinding forever on a problem it cannot solve. +- **Guardrails.** The rules about what the agent must *not* do. Never merge. Never force-push. Never touch files outside the change. Never disable a failing test to make it pass. + +Most failed loops have a clear goal and good sensors, then run away because nobody defined the **budget** or the **guardrails**. Those last two are non-negotiable. + +## The shape of a loop + +```text + ┌──────────────────────┐ + │ Start: artifact open │ + └──────────┬───────────┘ + ▼ + ┌──────────────────────────┐ + ┌────▶│ OBSERVE (run sensors) │ + │ └────────────┬─────────────┘ + │ ▼ + │ ┌─────────────┐ yes ┌─────────────────────┐ + │ │ goal met? │──────▶│ STOP: done ✅ │ + │ └──────┬──────┘ └─────────────────────┘ + │ no │ + │ ▼ + │ ┌──────────────────┐ + │ │ ACT: smallest │ + │ │ correct change │ + │ └────────┬─────────┘ + │ ▼ + │ ┌─────────────┐ no ┌─────────────────────┐ + │ │ budget left?│──────▶│ STOP: escalate 🚩 │ + │ └──────┬──────┘ └─────────────────────┘ + │ yes │ + │ ▼ + │ ┌──────────────────┐ + └─────────│ WAIT (cadence) │ + └──────────────────┘ +``` + +The escalation branch is not a failure — it is the loop doing its job. A loop that knows when to hand the problem back is more useful than one that pretends it can solve everything. + +## Four loop shapes + +The outer structure (observe → compare → act → check budget → repeat) is the same. What differs is the definition of *progress*. Tune each shape to its own failure mode. + +| Shape | "Done" means | Failure mode to watch | +| --- | --- | --- | +| Converge-to-green | every binary check passes | oscillating: fixing A breaks B | +| Incremental-to-a-number | a metric crosses a threshold | gaming the metric (assertion-free tests) | +| Queue / batch | the work queue is empty | one hard item blocks the rest | +| Repeat-until-confident | statistical confidence over many runs | declaring victory after one run | + +### Converge-to-green + +State is binary per check. Re-run **all** sensors after each change, not just the one you were working on, so you notice when you traded one red for another. + +### Incremental-to-a-number + +Always attack the lowest-scoring item next. Demand *meaningful* change — an agent told to "increase coverage" will write tests that execute code without asserting anything, because that moves the metric. Your success criteria must forbid it. + +### Queue / batch + +Drain a work queue one item at a time; "done" is an empty queue. Give each item its own small iteration cap. When an item exhausts it, **skip it, leave it in the queue, and report it** — do not let one hard item block the easy ones. + +### Repeat-until-confident + +The signal is non-deterministic, so a single observation tells you almost nothing. "Done" is *passed N consecutive runs*, where N grows with how intermittent the original failure was. Never declare victory after one run. + +## The four SDD loops + +These wrap the existing SDD commands, which already supply the sensors: + +1. **Spec-sharpen** (`spec-sharpen-loop`) — issue → a spec that passes `/spec-review` with zero open questions. Human-paced. +2. **Build** (`sdd-build-loop`) — sharp spec → a clean, `/review`-APPROVED feature branch. Self-paced. A queue (drain `/build` tasks) wrapping a converge-to-green (fix until `/validate` and `/review` are clean). +3. **Ship** (`pr-quality-gate`) — open PR → merge-ready. Polled on CI. +4. **Review-response** (`pr-review-response`) — green PR → every actionable reviewer thread addressed. Polled on review activity. + +## Composing them: issue → merged + +```text + issue + │ LOOP: spec-sharpen (human-paced) → escalate product decisions + ▼ sharp spec + │ LOOP: build (self-paced) → auto-commit on a feature branch + ▼ clean branch + │ HUMAN GATE: run the app, read the diff, decide — open the PR? + ▼ pull request + │ LOOP: ship (polled on CI) → all gates green + ▼ green PR, under review + │ LOOP: review-response (polled on review) → no threads remain + ▼ + human merges +``` + +Human judgment lands at exactly two points: answering genuine product questions at the front, and deciding a branch is worth a PR in the middle. Everything else is a loop converging on a checkable goal. Each loop's cadence is dictated by how fast the thing it watches can actually change — match cadence to your sensors and you never pay for a check that cannot tell you anything new. + +## Guardrails for any loop + +- **Bound everything.** Every loop needs a maximum iteration count, ideally a wall-clock deadline too. A loop without a budget is a runaway process. +- **Never let a loop merge.** Loops drive to merge-*ready* and stop. A human owns the merge. +- **Forbid metric-gaming explicitly.** Any loop pointed at a number or a verdict will try the easy way — disabling a flaky test, writing assertion-free tests, editing the spec so `/review` stops complaining. Spell out the forbidden shortcuts; the agent will not infer them. +- **Make every action reviewable and reversible.** One logical change per commit, conventional messages, branch only, never force-push. +- **Escalate on low confidence.** "I tried these three fixes and none worked, here is what I observed" beats a tenth desperate commit. +- **Match cadence to your sensors.** Poll on an interval only when waiting on something slow like CI. When sensors are local, run as fast as the work allows. + +## When to use a loop, and when not to + +Use a loop when three things are true: the goal is **objectively checkable** (a sensor returns yes/no), progress is **incremental** (each iteration gets measurably closer), and the work is **tedious enough** that convergence dominates the thinking. + +A loop is the *wrong* tool when the goal is subjective ("make this API design elegant"), when there is no reliable sensor, or when the hard part is a single decision rather than a sequence of mechanical steps. If you cannot write the "are we done?" check as something that returns a boolean, you do not have a loop yet — you have a conversation, and you should just have the conversation. + +The criteria *is* the work. Once you can state precisely what "done" means and how a sensor would detect it, the loop almost writes itself. diff --git a/.windsurf/skills/pr-quality-gate/SKILL.md b/.windsurf/skills/pr-quality-gate/SKILL.md new file mode 100644 index 0000000..a51fc4a --- /dev/null +++ b/.windsurf/skills/pr-quality-gate/SKILL.md @@ -0,0 +1,71 @@ +--- +name: pr-quality-gate +description: Drive one GitHub PR to merge-ready by checking every quality gate and fixing whatever is red, then committing, pushing, and waiting for the next CI run. The ship loop. Stops when the PR is fully green or the iteration budget runs out. Never merges. +when_to_use: + - A PR is open and you want it driven to all-green without babysitting each CI round. + - After the build loop hands you a feature branch and you have opened the PR. + - Run as a recurring loop: `/loop 10m /pr-quality-gate ` in Claude Code, or re-invoke this skill each CI cycle on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .windsurf/skills/loop-engineering/SKILL.md + - .windsurf/skills/jacoco-coverage-policy/SKILL.md + - .windsurf/skills/pr-review-response/SKILL.md +--- + +# PR quality gate (the ship loop) + +> Loop shape: **converge-to-green**, with a **coverage climber** nested inside. Cadence: **polled on CI** (the sensor only changes once per CI run). Budget: **10 iterations**, then escalate. + +You are driving pull request **#$1** to merge-ready. Do exactly **one pass** per invocation. The loop runner (`/loop`) supplies the cadence and the budget; this skill supplies the goal, the sensors, the action, and the guardrails. + +## Sensors — gather current state first + +Re-run *all* of these every pass, not just the gate you fixed last time, so you catch a fix that traded one red for another: + +- `gh pr checks $1` — CI check status (unit, integration, architecture tests). +- `gh pr view $1 --json statusCheckRollup,mergeable` — the rollup and mergeability. +- The coverage report from the latest CI run artifacts (or `./.github/scripts/check-new-code-coverage.sh` locally). +- The static-analysis report (Checkstyle / SpotBugs / Sonar) for **new** issues introduced by this PR. + +## Success criteria (the goal) + +The PR is merge-ready ONLY when ALL of these are true: + +- Every CI check is passing. +- Checkstyle / static analysis reports zero new violations on this PR. +- Line coverage on new code is at or above the project threshold (see `jacoco-coverage-policy`). +- Sonar (if configured) reports zero new issues on this PR. + +When all are true, post a comment `✅ All gates green — ready to merge` and **STOP**. Do not merge the PR yourself. + +## Action — fix exactly what is failing + +For each failing gate, make the **smallest correct change that fixes the root cause**, then commit with a conventional-commit message and push to the PR branch. + +- **Tests / Checkstyle / Sonar** (converge-to-green): fix the code, then re-run *all* sensors so a fix for one gate has not broken another. +- **Coverage** (incremental-to-a-number): find the class furthest below the line, write *meaningful* tests for its uncovered branches, re-check the number, repeat with the next-lowest file. Tests must assert behavior, not just execute lines. + +## Guardrails — non-negotiable + +- NEVER merge, approve, or close the PR. +- NEVER force-push or rewrite history. +- NEVER disable, skip, `@Disabled`, or delete a test to make a gate pass. +- NEVER lower a coverage threshold or add a Checkstyle/SpotBugs suppression to silence a finding instead of fixing it. +- NEVER touch files unrelated to this PR's diff. +- If a failure is ambiguous, or you are not confident in the fix, post a comment explaining what you observed and what you tried, then STOP for a human. + +## Cadence and budget (the loop wrapper) + +Poll roughly as often as CI returns — there is no point checking every thirty seconds when the signal only changes once per CI run: + +```text +/loop 10m /pr-quality-gate 1234 +``` + +Stop when the PR posts the "ready to merge" comment, **or** after **10 iterations**, whichever comes first. On the tenth iteration without success, summarize what is still failing and what you tried, then stop and escalate. + +On Copilot or Windsurf (no native `/loop`), drive the cadence yourself: invoke this skill once per CI cycle and stop on the same two conditions. + +## Handoff + +A green PR is the input the `pr-review-response` loop wants. When a human reviewer leaves comments, switch to that loop to work them to zero. diff --git a/.windsurf/skills/pr-review-response/SKILL.md b/.windsurf/skills/pr-review-response/SKILL.md new file mode 100644 index 0000000..4d04925 --- /dev/null +++ b/.windsurf/skills/pr-review-response/SKILL.md @@ -0,0 +1,53 @@ +--- +name: pr-review-response +description: Resolve actionable human review comments on a GitHub PR until none remain and the gates stay green — make the change, push, reply, repeat. The end of the pipeline. Triages each thread, escalates disagreements, and never merges or resolves a reviewer's thread. +when_to_use: + - A green PR is under human review and the comments need to be worked to zero. + - After the ship loop (`pr-quality-gate`) has driven the PR to all-green and a reviewer has left feedback. + - Run as a loop polled on review activity: `/loop 15m /pr-review-response ` in Claude Code, or re-invoke this skill as new comments arrive on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .windsurf/skills/loop-engineering/SKILL.md + - .windsurf/skills/pr-quality-gate/SKILL.md + - .windsurf/skills/spring-code-review-rubric/SKILL.md +--- + +# PR review-response loop (end of the pipeline) + +> Loop shape: **queue** — drain the unresolved review threads one at a time. The mirror image of the ship loop: that one converges the PR on CI, this one converges it on the humans. Cadence: **polled on review activity** (reviewers comment over hours, not seconds). Budget: escalate on disagreement or ambiguity; never argue in a thread. + +You are responding to human review on PR **#$1**. Do exactly **one pass** per invocation. + +## Sensors + +- `gh pr view $1 --json reviews,comments,reviewThreads` — list every **UNRESOLVED** thread. +- `gh pr checks $1` — current gate status. + +## Triage each unresolved thread + +- **CHANGE REQUESTED** ("do X", "rename Y", "handle case Z"): make the smallest correct change, commit, push, and reply in the thread linking the commit. Do NOT resolve the thread — the reviewer does that. +- **QUESTION** ("why did you…?"): reply in the thread with an explanation. Change code only if the honest answer is "you're right, fixing it". +- **DISAGREEMENT or ambiguous intent**: STOP and escalate to your human. Do not argue in the thread, and do not silently comply. + +After any code change, re-run the ship-loop gates (`pr-quality-gate`) so a review fix does not re-break CI. + +## Success criteria (the goal) + +No unresolved actionable threads remain AND all gates are green. Then STOP. Do not merge. + +## Guardrails — non-negotiable + +- NEVER merge, approve, or resolve a reviewer's thread. +- NEVER push back on a reviewer; escalate disagreements to your human. +- One logical change per commit; never force-push. +- If a comment needs a product or architecture decision, escalate instead of deciding it inside the loop. + +## Cadence (the loop wrapper) + +Poll on an interval, because the sensor is human review activity: + +```text +/loop 15m /pr-review-response 1234 +``` + +The hard part is not the editing, it is the **triage**. A loop that treats every comment as "change requested" will rewrite code in response to a reviewer who was only asking a question; a loop that treats every comment as a question will ignore real change requests. Spelling out the difference, and forcing an escalation on disagreement, is what keeps the loop from arguing with your reviewer on your behalf. On Copilot or Windsurf, re-invoke this skill as new comments arrive. diff --git a/.windsurf/skills/sdd-build-loop/SKILL.md b/.windsurf/skills/sdd-build-loop/SKILL.md new file mode 100644 index 0000000..b58aad8 --- /dev/null +++ b/.windsurf/skills/sdd-build-loop/SKILL.md @@ -0,0 +1,67 @@ +--- +name: sdd-build-loop +description: Drive an SDD feature from its task queue to a clean, reviewed branch — drain every `/build` task, then converge `/validate` and `/review` to green, auto-committing on the feature branch. The build loop. Trust-gated; runs only on a feature branch and never opens a PR or merges. +when_to_use: + - A feature has a sharp spec, `/plan` has produced `.tdd-state.json`, and you want the task queue driven to a reviewed branch. + - Only after you have earned trust in the harness (see the trust checklist below). Otherwise run `/build` task-by-task with your eyes on every diff. + - Run as a self-paced loop: `/loop /sdd-build-loop ` in Claude Code, or re-invoke this skill until the success criteria are met on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .windsurf/skills/loop-engineering/SKILL.md + - .windsurf/skills/tdd-red-green-refactor/SKILL.md + - .windsurf/skills/harness-report-parsing/SKILL.md + - .windsurf/skills/spring-code-review-rubric/SKILL.md +--- + +# SDD build loop (the build loop) + +> Loop shape: a **queue** (drain `/build` tasks) wrapping a **converge-to-green** (fix until `/validate` and `/review` are clean). Cadence: **self-paced** — no CI to wait on; run the next step the moment the last one finishes. Budget: a cap on review-fix rounds, then escalate. + +You are building the feature in `.specs/$1` to a reviewed, committable state on the **current feature branch**. Confirm you are **NOT on `main`** before doing anything. The framework *is* the sensor layer here: `.tdd-state.json`, the `/validate` output, and the `/review` verdict. + +## Phase 1 — drain the task queue (queue shape) + +1. Read `.specs/$1/.tdd-state.json` for the next task whose phase is `pending`. +2. Run `/build` for that task. Let it move `pending → red → green → refactor → done` and commit on completion (one task per commit). +3. Repeat until no `pending` tasks remain. +4. If one task fails its gates **twice in a row**, STOP and escalate. Do not skip it silently — a missing task is not a done feature. + +## Phase 2 — validate and review (converge-to-green) + +1. Run `/validate`. If anything fails (tests, coverage, lint, traceability), fix the root cause, commit, and run `/validate` again. +2. Run `/review`. If the verdict has ANY must-fix findings, fix each with the smallest correct change, commit, and run `/review` again. +3. Stop Phase 2 when `/validate` is clean AND `/review` returns APPROVE with zero must-fix findings. Leave should-fix and nit findings in the handoff summary for the human — do not gold-plate. + +## Success criteria (the goal) + +All tasks done, `/validate` clean, `/review` APPROVE with zero must-fix. Then STOP and write a handoff summary listing the should-fix and nit items you deliberately left. **DO NOT open a PR or merge** — that is the human's call. + +## Guardrails — non-negotiable + +- Work ONLY on the feature branch. NEVER commit to `main`. +- NEVER merge, and NEVER open the PR. +- NEVER edit the spec, weaken a test, lower a threshold, or add a suppression to make a gate pass. +- One logical change per commit; conventional messages; never force-push. +- After **5 review-fix rounds** without reaching APPROVE, STOP and escalate with what is still failing and what you tried. + +## The trust you have to earn + +This loop does not *remove* human review — it **relocates** it. It replaces the per-task checkpoint with automated sensors (`/validate`, `/review`) on the small stuff and one deliberate human validation round at the end. That trade is only safe once the sensors are trustworthy enough to stand in for you. A feature is allowed through this loop only when: + +- You have shipped enough features with the framework that `/review` rarely surfaces something you disagree with. +- Your skills, agents, and review rubric encode *your* team's standards. +- Your coverage threshold, lint rules, and `/review` checklist fail loudly on the mistakes you care about — a green `/validate` actually *means* something. +- The spec passed `/spec-review` with no open questions. The loop amplifies whatever the spec says; a vague spec produces confident, wrong code faster. +- The feature is a pattern you have done before (a CRUD slice with a clear contract), not a novel architectural decision. + +If those conditions are not met, run the framework the normal way, task by task. The loop is an optimization for the cases you already understand. + +## Cadence and handoff (the loop wrapper) + +Because there is no CI to wait on, this loop self-paces: + +```text +/loop /sdd-build-loop 2026-05-09-create-new-customer +``` + +When it stops you have a feature branch with a drained task queue, a clean `/validate`, an `APPROVE` from `/review`, and a handoff summary — exactly the input the human PR gate wants. The human runs the app, reads the diff, and decides whether to open the PR. Once the PR exists, the `pr-quality-gate` loop takes over. diff --git a/.windsurf/skills/spec-sharpen-loop/SKILL.md b/.windsurf/skills/spec-sharpen-loop/SKILL.md new file mode 100644 index 0000000..b7386b3 --- /dev/null +++ b/.windsurf/skills/spec-sharpen-loop/SKILL.md @@ -0,0 +1,53 @@ +--- +name: spec-sharpen-loop +description: Turn a raw GitHub issue into a spec that passes `/spec-review` with zero open questions — resolving what it can from existing material and escalating genuine product decisions to a human. The front of the pipeline. Produces a spec and nothing else. +when_to_use: + - A GitHub issue needs to become a sharp spec before `/plan` and the build loop. + - Whenever `/spec` surfaces open questions and you want them triaged and driven to zero in rounds. + - Run as a human-paced loop: `/loop /spec-sharpen-loop ` in Claude Code, or re-invoke this skill each time you answer the product questions on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .windsurf/skills/loop-engineering/SKILL.md + - .windsurf/skills/ears-spec-authoring/SKILL.md + - .windsurf/skills/issue-tracker-ingestion/SKILL.md +--- + +# Spec-sharpen loop (front of the pipeline) + +> Loop shape: **converge-to-green** on the `/spec-review` verdict, gated by a human triage step. Cadence: **human-paced** — *you* are the sensor it waits on. It advances a round each time you answer the product questions it surfaces, then blocks again. Budget: escalate if the same question reappears after being answered. + +You are turning issue **#$1** into a spec that passes `/spec-review` with zero open questions. Work in rounds. The build loop downstream is only safe if the spec going in is sharp: a vague spec does not produce vague code, it produces confident, wrong code, faster. + +## Sensors + +- The open-question list `/spec` surfaces for this issue. +- The `/spec-review` verdict (PASS/FAIL) and its findings. + +## Each round + +1. Run `/spec` for the issue (or re-run after answers have been added). +2. Triage **every** open question into one of two buckets: + - **RESOLVABLE** from the issue, linked tickets, or existing specs — atomic acceptance criteria, missing non-goals, ambiguous wording, contract or format gaps. Resolve these and update the spec. + - **PRODUCT DECISION** you cannot ground in existing material — pricing, policy, UX intent, scope trade-offs. Collect these. +3. If any PRODUCT DECISION questions remain, **STOP** and post them to the human as a numbered list. Wait for answers; do not invent them. +4. Run `/spec-review`. If FAIL, address the structural findings and start another round. + +## Success criteria (the goal) + +`/spec-review` returns **PASS** with zero open questions remaining. Then STOP — the spec is ready for `/plan` and the build loop. This loop produces a spec and nothing else. + +## Guardrails — non-negotiable + +- NEVER answer a product-decision question yourself to force a PASS. A clean spec built on guessed intent is the most expensive failure in the whole pipeline — every downstream loop trusts this spec, and will build, test, and ship the wrong thing with great efficiency. +- NEVER start design or code. No `/plan`, no `/build`. +- If the same question reappears after being answered, escalate; the issue itself may be underspecified. + +## Cadence (the loop wrapper) + +This is the one loop that pauses inside itself for a human on purpose: + +```text +/loop /spec-sharpen-loop 1 +``` + +It is human-paced by design. You are not waiting on a clock or on CI; the loop is waiting on *your* product decisions. Answer them, and it advances a round; until you do, it blocks. On Copilot or Windsurf, the cadence is the same — re-invoke the skill after you have answered the questions it raised. diff --git a/CHANGELOG.md b/CHANGELOG.md index 14bcc99..afad296 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,8 @@ What counts as a breaking change for this toolkit is defined in [CONTRIBUTING.md ## [Unreleased] ### Added +- **Loop engineering** — the `loop-engineering` skill (the six-part loop pattern, four loop shapes, and the issue-to-merged pipeline) plus four concrete loop skills: `spec-sharpen-loop`, `sdd-build-loop`, `pr-quality-gate`, and `pr-review-response`. Mirrored across `.claude/`, `.github/`, and `.windsurf/`. +- `docs/loop-engineering.md` documenting the loop pattern, shapes, composition, and guardrails, linked from the README. - `LICENSE` (MIT) at repo root. - `.gitignore` covering macOS, IDE, Java/Maven, Node/Angular, and harness artifacts. - `.editorconfig` for consistent line endings and indentation. diff --git a/README.md b/README.md index c93f0ca..c599380 100644 --- a/README.md +++ b/README.md @@ -171,10 +171,57 @@ The same gates the agent runs are reachable from a normal terminal: ./.github/scripts/traceability.sh ``` +## Loop engineering (optional) + +Once you trust the harness, you can wrap the per-feature commands in **bounded agentic loops** that converge on a checkable goal instead of running forever or giving up too early. The toolkit ships the pattern as the `loop-engineering` skill, plus four concrete loops that compose into an issue-to-merged pipeline: + +| Loop | Skill | Drives | Cadence | +| --- | --- | --- | --- | +| Spec-sharpen | `spec-sharpen-loop` | issue → a spec that passes `/spec-review` | human-paced | +| Build | `sdd-build-loop` | sharp spec → a clean, `/review`-approved branch | self-paced | +| Ship | `pr-quality-gate` | open PR → all CI gates green | polled on CI | +| Review-response | `pr-review-response` | green PR → every reviewer thread addressed | polled on review | + +Human judgment lands at exactly two points — answering genuine product questions at the front, and deciding a branch is worth a PR in the middle. Every loop is bounded, never merges on its own, and escalates when it is stuck. + +In Claude Code these run via the native `/loop` command (for example `/loop 10m /pr-quality-gate 1234`). On Copilot and Windsurf you drive the same cadence by re-invoking the skill. See [docs/loop-engineering.md](docs/loop-engineering.md) for the full pattern, the four loop shapes, and the trust you have to earn before running the build loop. + +### Running a loop on each tool + +The skills are identical across platforms; only the way you *drive the cadence* differs. Claude Code has a native `/loop`; on Copilot and Windsurf the skill runs one pass per prompt and you re-invoke it each cycle (or stop when it reports done). + +**Claude Code** — the native `/loop` command supplies the cadence and budget: + +```text +/loop /spec-sharpen-loop 42 # issue #42 → a reviewed spec (human-paced) +/loop /sdd-build-loop 2026-05-09-create-customer # spec → clean reviewed branch (self-paced) +/loop 10m /pr-quality-gate 1234 # PR #1234 → all gates green, polling CI +/loop 15m /pr-review-response 1234 # work reviewer comments to zero +``` + +**GitHub Copilot** — no native `/loop`; prompt the skill in Copilot Chat and re-run it each cycle: + +```text +Run the pr-quality-gate loop on PR 1234: do one pass — check every gate, fix what +is red, commit and push, then stop. I'll re-run this after each CI cycle until it +posts "ready to merge". +``` + +Copilot loads `.github/skills/pr-quality-gate/SKILL.md`, does a single pass, and stops. Repeat the prompt once per CI run. + +**Windsurf** — no native `/loop`; ask Cascade to use the skill, one pass at a time: + +```text +Use the pr-quality-gate skill to drive PR 1234 to merge-ready. Do one pass: read +the gates, fix failures, commit, push, then stop. Re-invoke me on the next CI run. +``` + +Cascade activates `.windsurf/skills/pr-quality-gate/SKILL.md` by model decision. The same pattern works for all four loops — swap in `spec-sharpen-loop`, `sdd-build-loop`, or `pr-review-response` with the matching issue/feature/PR id. + ## Repository layout ```text -docs/ methodology · harness-principles · spec-format · platform-mapping · artifact-contract +docs/ methodology · harness-principles · spec-format · platform-mapping · artifact-contract · loop-engineering .claude/ agents · skills · commands · hooks · templates · checklists · maven · settings.json (Claude Code) .github/ chatmodes · prompts · instructions · skills · templates · checklists · maven · scripts · workflows/ (Copilot + CI) .windsurf/ rules · workflows · skills · templates · checklists · maven (Windsurf) @@ -222,6 +269,7 @@ The routing contract is documented in each command's `## Stack routing` section. - [docs/spec-format.md](docs/spec-format.md) — EARS-lite spec format with examples - [docs/platform-mapping.md](docs/platform-mapping.md) — how Claude/Copilot/Windsurf artifacts map - [docs/artifact-contract.md](docs/artifact-contract.md) — `.specs//` file layout and `.tdd-state.json` schema +- [docs/loop-engineering.md](docs/loop-engineering.md) — bounded agentic loops and the issue-to-merged pipeline - [examples/greenfield/README.md](examples/greenfield/README.md) — full worked feature - [examples/brownfield/README.md](examples/brownfield/README.md) — onboarding-only walkthrough diff --git a/docs/loop-engineering.md b/docs/loop-engineering.md new file mode 100644 index 0000000..1dd9696 --- /dev/null +++ b/docs/loop-engineering.md @@ -0,0 +1,80 @@ +# Loop Engineering — Driving an Issue to Merged + +Prompting an agent to fix a failing test is easy. Designing a system where one loop sharpens the spec, another builds the feature to a clean reviewed branch, a human makes the two calls that matter, and two more loops drive the pull request through CI and human review — that is the engineering. + +**Loop engineering** is designing agentic loops that converge on a goal instead of running forever or giving up too early. This toolkit ships the pattern as the [`loop-engineering`](../.claude/skills/loop-engineering/SKILL.md) skill, plus four concrete loops that wrap the existing SDD commands. + +## What an agentic loop actually is + +A loop is not "run the agent again." A loop is a control system: the agent observes the world, compares it to a goal, acts to close the gap, then observes again. The skill is not in the acting — coding agents are already good at acting. The skill is in defining *what the agent observes* and *when it is allowed to stop*. + +Every loop worth building has six parts. If you can name all six for your task, you have a loop. If you cannot, you have a prompt you are running by hand. + +- **Goal / success criteria** — a precise, checkable definition of "done". +- **Sensors** — how the agent observes current state. +- **Action** — what the agent changes when state does not match the goal. +- **Cadence** — how often the loop runs. +- **Bounded budget** — the escape hatch: max iterations, a deadline, a token cap. +- **Guardrails** — the rules about what the agent must *not* do. + +Most failed loops have a clear goal and good sensors, then run away because nobody defined the budget or the guardrails. + +## Four loop shapes + +The outer structure is always the same; the definition of *progress* differs. + +| Shape | "Done" means | Failure mode to watch | +| --- | --- | --- | +| Converge-to-green | every binary check passes | oscillating: fixing A breaks B | +| Incremental-to-a-number | a metric crosses a threshold | gaming the metric | +| Queue / batch | the work queue is empty | one hard item blocks the rest | +| Repeat-until-confident | confidence over many runs | declaring victory after one run | + +See the [`loop-engineering`](../.claude/skills/loop-engineering/SKILL.md) skill for the detail on each shape and its tuning. + +## The four SDD loops + +Each loop wraps SDD commands that already supply the sensors. They map one-to-one onto Claude Code's `/loop` command and Skills; on Copilot and Windsurf you drive the same cadence by re-invoking the skill. + +| Loop | Skill | Wraps | Shape | Cadence | +| --- | --- | --- | --- | --- | +| 3 — Spec-sharpen | [`spec-sharpen-loop`](../.claude/skills/spec-sharpen-loop/SKILL.md) | `/spec`, `/spec-review` | converge-to-green | human-paced | +| 2 — Build | [`sdd-build-loop`](../.claude/skills/sdd-build-loop/SKILL.md) | `/plan`, `/build`, `/validate`, `/review` | queue + converge | self-paced | +| 1 — Ship | [`pr-quality-gate`](../.claude/skills/pr-quality-gate/SKILL.md) | CI gates on a PR | converge + climber | polled on CI | +| 4 — Review-response | [`pr-review-response`](../.claude/skills/pr-review-response/SKILL.md) | human review threads | queue | polled on review | + +## Composing them: issue → merged + +```text + issue + │ LOOP 3: spec-sharpen (human-paced) → escalate product decisions + ▼ sharp spec + │ LOOP 2: build (self-paced) → auto-commit on a feature branch + ▼ clean branch + │ HUMAN GATE: run the app, read the diff, decide — open the PR? + ▼ pull request + │ LOOP 1: ship (polled on CI) → all gates green + ▼ green PR, under review + │ LOOP 4: review-response (polled on review) → no threads remain + ▼ + human merges +``` + +Human judgment lands at exactly two points: answering genuine product questions at the front, and deciding a branch is worth a PR in the middle. Everything else is a loop converging on a checkable goal. Each loop's cadence is dictated by how fast the thing it watches can actually change — match cadence to your sensors and you never pay for a check that cannot tell you anything new. + +## The trust you have to earn + +The build loop is the one to be careful with, because it writes the feature instead of polishing it. It does not *remove* human review — it **relocates** it, replacing the per-task checkpoint with automated sensors (`/validate`, `/review`) and one deliberate human validation round at the end. That trade is only safe once the sensors are trustworthy enough to stand in for you on the small stuff. You graduate into this loop; you do not start here. The trust checklist lives in the [`sdd-build-loop`](../.claude/skills/sdd-build-loop/SKILL.md) skill. + +## Guardrails for any loop + +- **Bound everything** — a maximum iteration count, ideally a wall-clock deadline. +- **Never let a loop merge** — loops drive to merge-*ready* and stop. +- **Forbid metric-gaming explicitly** — spell out the forbidden shortcuts. +- **Make every action reviewable and reversible** — one change per commit, branch only, never force-push. +- **Escalate on low confidence** — stopping and saying so beats a tenth desperate commit. +- **Match cadence to your sensors** — poll only when waiting on something slow. + +## When to use a loop, and when not to + +Use a loop when the goal is **objectively checkable**, progress is **incremental**, and the work is **tedious enough** that convergence dominates the thinking. A loop is the wrong tool when the goal is subjective, when there is no reliable sensor, or when the hard part is a single decision. If you cannot write the "are we done?" check as something that returns a boolean, you do not have a loop yet — you have a conversation, and you should just have the conversation.