From 26945bf611db0fe23ec176fddec5d55a08c3f060 Mon Sep 17 00:00:00 2001 From: Loiane Date: Sun, 28 Jun 2026 11:55:21 -0400 Subject: [PATCH 1/3] Add loop-engineering pattern and four SDD loop skills Introduce loop engineering: bounded agentic loops that converge on a checkable goal instead of running forever or giving up too early. - loop-engineering skill: the six-part loop pattern, four loop shapes, the issue-to-merged pipeline, and guardrails for any loop. - spec-sharpen-loop (Loop 3): issue -> a spec that passes /spec-review. - sdd-build-loop (Loop 2): sharp spec -> a clean, /review-approved branch. - pr-quality-gate (Loop 1): open PR -> all CI gates green (the ship loop). - pr-review-response (Loop 4): green PR -> every reviewer thread addressed. Mirror all five skills across .claude/, .github/, and .windsurf/ to keep tri-platform parity. Add docs/loop-engineering.md, a README section, and a CHANGELOG entry. --- .claude/skills/loop-engineering/SKILL.md | 188 +++++++++++++++++++ .claude/skills/pr-quality-gate/SKILL.md | 92 +++++++++ .claude/skills/pr-review-response/SKILL.md | 72 +++++++ .claude/skills/sdd-build-loop/SKILL.md | 97 ++++++++++ .claude/skills/spec-sharpen-loop/SKILL.md | 74 ++++++++ .github/skills/loop-engineering/SKILL.md | 188 +++++++++++++++++++ .github/skills/pr-quality-gate/SKILL.md | 92 +++++++++ .github/skills/pr-review-response/SKILL.md | 72 +++++++ .github/skills/sdd-build-loop/SKILL.md | 97 ++++++++++ .github/skills/spec-sharpen-loop/SKILL.md | 74 ++++++++ .windsurf/skills/loop-engineering/SKILL.md | 188 +++++++++++++++++++ .windsurf/skills/pr-quality-gate/SKILL.md | 92 +++++++++ .windsurf/skills/pr-review-response/SKILL.md | 72 +++++++ .windsurf/skills/sdd-build-loop/SKILL.md | 97 ++++++++++ .windsurf/skills/spec-sharpen-loop/SKILL.md | 74 ++++++++ CHANGELOG.md | 6 + README.md | 27 ++- docs/loop-engineering.md | 112 +++++++++++ 18 files changed, 1713 insertions(+), 1 deletion(-) create mode 100644 .claude/skills/loop-engineering/SKILL.md create mode 100644 .claude/skills/pr-quality-gate/SKILL.md create mode 100644 .claude/skills/pr-review-response/SKILL.md create mode 100644 .claude/skills/sdd-build-loop/SKILL.md create mode 100644 .claude/skills/spec-sharpen-loop/SKILL.md create mode 100644 .github/skills/loop-engineering/SKILL.md create mode 100644 .github/skills/pr-quality-gate/SKILL.md create mode 100644 .github/skills/pr-review-response/SKILL.md create mode 100644 .github/skills/sdd-build-loop/SKILL.md create mode 100644 .github/skills/spec-sharpen-loop/SKILL.md create mode 100644 .windsurf/skills/loop-engineering/SKILL.md create mode 100644 .windsurf/skills/pr-quality-gate/SKILL.md create mode 100644 .windsurf/skills/pr-review-response/SKILL.md create mode 100644 .windsurf/skills/sdd-build-loop/SKILL.md create mode 100644 .windsurf/skills/spec-sharpen-loop/SKILL.md create mode 100644 docs/loop-engineering.md diff --git a/.claude/skills/loop-engineering/SKILL.md b/.claude/skills/loop-engineering/SKILL.md new file mode 100644 index 0000000..9b12ec5 --- /dev/null +++ b/.claude/skills/loop-engineering/SKILL.md @@ -0,0 +1,188 @@ +--- +name: loop-engineering +description: Design bounded agentic loops that converge on a checkable goal instead of running forever or giving up too early. Use when wrapping any repetitive "drive this artifact to a desired state" task in a loop, and to compose the four SDD loops (spec-sharpen, build, ship, review-response) into an issue-to-merged pipeline. +when_to_use: + - Before building any loop — to name the six parts and pick the right shape. + - When composing the spec-sharpen, build, ship, and review-response loops into one pipeline. + - When a hand-run sequence (re-run `/spec` until clean, fix-push-wait until CI is green) is begging to be automated. +authoritative_references: + - docs/loop-engineering.md + - .claude/skills/spec-sharpen-loop/SKILL.md + - .claude/skills/sdd-build-loop/SKILL.md + - .claude/skills/pr-quality-gate/SKILL.md + - .claude/skills/pr-review-response/SKILL.md +--- + +# Loop engineering + +> A loop is not "run the agent again." A loop is a control system: observe the +> world, compare it to a goal, act to close the gap, observe again. The skill is +> not in the acting — agents are already good at acting. The skill is in defining +> *what the agent observes* and *when it is allowed to stop*. + +## The six parts of every loop + +If you can name all six for your task, you have a loop. If you cannot, you have a +prompt you are running by hand. + +- **Goal / success criteria.** A precise, checkable definition of "done" that a + script could answer yes or no to. Not "make the PR good" — "all checks green, + coverage ≥ 90%, zero new static-analysis issues". +- **Sensors.** How the agent observes current state. This is the part people skip + and the part that matters most; a loop is only as good as what it can measure. + CI status, a coverage report, a Sonar API response, the output of a test run + repeated twenty times. +- **Action.** What the agent changes when state does not match the goal. Fix the + code, write the missing test, bump the dependency, commit, push. +- **Cadence.** How often the loop runs. Event-driven (wait for CI), self-paced + (drain a local queue as fast as the work allows), or human-paced (wait on an + answer). The wrong cadence either wastes money or misses the signal. +- **Bounded budget.** The escape hatch: maximum iterations, a wall-clock + deadline, or a token cap. This is what prevents a loop from grinding forever on + a problem it cannot solve. +- **Guardrails.** The rules about what the agent must *not* do. Never merge. + Never force-push. Never touch files outside the change. Never disable a failing + test to make it pass. + +Most failed loops have a clear goal and good sensors, then run away because nobody +defined the **budget** or the **guardrails**. Those last two are non-negotiable. + +## The shape of a loop + +```text + ┌──────────────────────┐ + │ Start: artifact open │ + └──────────┬───────────┘ + ▼ + ┌──────────────────────────┐ + ┌────▶│ OBSERVE (run sensors) │ + │ └────────────┬─────────────┘ + │ ▼ + │ ┌─────────────┐ yes ┌─────────────────────┐ + │ │ goal met? │──────▶│ STOP: done ✅ │ + │ └──────┬──────┘ └─────────────────────┘ + │ no │ + │ ▼ + │ ┌──────────────────┐ + │ │ ACT: smallest │ + │ │ correct change │ + │ └────────┬─────────┘ + │ ▼ + │ ┌─────────────┐ no ┌─────────────────────┐ + │ │ budget left?│──────▶│ STOP: escalate 🚩 │ + │ └──────┬──────┘ └─────────────────────┘ + │ yes │ + │ ▼ + │ ┌──────────────────┐ + └─────────│ WAIT (cadence) │ + └──────────────────┘ +``` + +The escalation branch is not a failure — it is the loop doing its job. A loop +that knows when to hand the problem back is more useful than one that pretends it +can solve everything. + +## Four loop shapes + +The outer structure (observe → compare → act → check budget → repeat) is the +same. What differs is the definition of *progress*. Tune each shape to its own +failure mode. + +| Shape | "Done" means | Failure mode to watch | +| --- | --- | --- | +| Converge-to-green | every binary check passes | oscillating: fixing A breaks B | +| Incremental-to-a-number | a metric crosses a threshold | gaming the metric (assertion-free tests) | +| Queue / batch | the work queue is empty | one hard item blocks the rest | +| Repeat-until-confident | statistical confidence over many runs | declaring victory after one run | + +### Converge-to-green + +State is binary per check. Re-run **all** sensors after each change, not just the +one you were working on, so you notice when you traded one red for another. + +### Incremental-to-a-number + +Always attack the lowest-scoring item next. Demand *meaningful* change — an agent +told to "increase coverage" will write tests that execute code without asserting +anything, because that moves the metric. Your success criteria must forbid it. + +### Queue / batch + +Drain a work queue one item at a time; "done" is an empty queue. Give each item +its own small iteration cap. When an item exhausts it, **skip it, leave it in the +queue, and report it** — do not let one hard item block the easy ones. + +### Repeat-until-confident + +The signal is non-deterministic, so a single observation tells you almost +nothing. "Done" is *passed N consecutive runs*, where N grows with how +intermittent the original failure was. Never declare victory after one run. + +## The four SDD loops + +These wrap the existing SDD commands, which already supply the sensors: + +1. **Spec-sharpen** (`spec-sharpen-loop`) — issue → a spec that passes + `/spec-review` with zero open questions. Human-paced. +2. **Build** (`sdd-build-loop`) — sharp spec → a clean, `/review`-APPROVED + feature branch. Self-paced. A queue (drain `/build` tasks) wrapping a + converge-to-green (fix until `/validate` and `/review` are clean). +3. **Ship** (`pr-quality-gate`) — open PR → merge-ready. Polled on CI. +4. **Review-response** (`pr-review-response`) — green PR → every actionable + reviewer thread addressed. Polled on review activity. + +## Composing them: issue → merged + +```text + issue + │ LOOP: spec-sharpen (human-paced) → escalate product decisions + ▼ sharp spec + │ LOOP: build (self-paced) → auto-commit on a feature branch + ▼ clean branch + │ HUMAN GATE: run the app, read the diff, decide — open the PR? + ▼ pull request + │ LOOP: ship (polled on CI) → all gates green + ▼ green PR, under review + │ LOOP: review-response (polled on review) → no threads remain + ▼ + human merges +``` + +Human judgment lands at exactly two points: answering genuine product questions +at the front, and deciding a branch is worth a PR in the middle. Everything else +is a loop converging on a checkable goal. Each loop's cadence is dictated by how +fast the thing it watches can actually change — match cadence to your sensors and +you never pay for a check that cannot tell you anything new. + +## Guardrails for any loop + +- **Bound everything.** Every loop needs a maximum iteration count, ideally a + wall-clock deadline too. A loop without a budget is a runaway process. +- **Never let a loop merge.** Loops drive to merge-*ready* and stop. A human owns + the merge. +- **Forbid metric-gaming explicitly.** Any loop pointed at a number or a verdict + will try the easy way — disabling a flaky test, writing assertion-free tests, + editing the spec so `/review` stops complaining. Spell out the forbidden + shortcuts; the agent will not infer them. +- **Make every action reviewable and reversible.** One logical change per commit, + conventional messages, branch only, never force-push. +- **Escalate on low confidence.** "I tried these three fixes and none worked, + here is what I observed" beats a tenth desperate commit. +- **Match cadence to your sensors.** Poll on an interval only when waiting on + something slow like CI. When sensors are local, run as fast as the work allows. + +## When to use a loop, and when not to + +Use a loop when three things are true: the goal is **objectively checkable** (a +sensor returns yes/no), progress is **incremental** (each iteration gets +measurably closer), and the work is **tedious enough** that convergence dominates +the thinking. + +A loop is the *wrong* tool when the goal is subjective ("make this API design +elegant"), when there is no reliable sensor, or when the hard part is a single +decision rather than a sequence of mechanical steps. If you cannot write the +"are we done?" check as something that returns a boolean, you do not have a loop +yet — you have a conversation, and you should just have the conversation. + +The criteria *is* the work. Once you can state precisely what "done" means and how +a sensor would detect it, the loop almost writes itself. diff --git a/.claude/skills/pr-quality-gate/SKILL.md b/.claude/skills/pr-quality-gate/SKILL.md new file mode 100644 index 0000000..963a025 --- /dev/null +++ b/.claude/skills/pr-quality-gate/SKILL.md @@ -0,0 +1,92 @@ +--- +name: pr-quality-gate +description: Drive one GitHub PR to merge-ready by checking every quality gate and fixing whatever is red, then committing, pushing, and waiting for the next CI run. The ship loop. Stops when the PR is fully green or the iteration budget runs out. Never merges. +when_to_use: + - A PR is open and you want it driven to all-green without babysitting each CI round. + - After the build loop hands you a feature branch and you have opened the PR. + - Run as a recurring loop: `/loop 10m /pr-quality-gate ` in Claude Code, or re-invoke this skill each CI cycle on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .claude/skills/loop-engineering/SKILL.md + - .claude/skills/jacoco-coverage-policy/SKILL.md + - .claude/skills/pr-review-response/SKILL.md +--- + +# PR quality gate (the ship loop) + +> Loop shape: **converge-to-green**, with a **coverage climber** nested inside. +> Cadence: **polled on CI** (the sensor only changes once per CI run). +> Budget: **10 iterations**, then escalate. + +You are driving pull request **#$1** to merge-ready. Do exactly **one pass** per +invocation. The loop runner (`/loop`) supplies the cadence and the budget; this +skill supplies the goal, the sensors, the action, and the guardrails. + +## Sensors — gather current state first + +Re-run *all* of these every pass, not just the gate you fixed last time, so you +catch a fix that traded one red for another: + +- `gh pr checks $1` — CI check status (unit, integration, architecture tests). +- `gh pr view $1 --json statusCheckRollup,mergeable` — the rollup and mergeability. +- The coverage report from the latest CI run artifacts (or + `./.github/scripts/check-new-code-coverage.sh` locally). +- The static-analysis report (Checkstyle / SpotBugs / Sonar) for **new** issues + introduced by this PR. + +## Success criteria (the goal) + +The PR is merge-ready ONLY when ALL of these are true: + +- Every CI check is passing. +- Checkstyle / static analysis reports zero new violations on this PR. +- Line coverage on new code is at or above the project threshold (see + `jacoco-coverage-policy`). +- Sonar (if configured) reports zero new issues on this PR. + +When all are true, post a comment `✅ All gates green — ready to merge` and +**STOP**. Do not merge the PR yourself. + +## Action — fix exactly what is failing + +For each failing gate, make the **smallest correct change that fixes the root +cause**, then commit with a conventional-commit message and push to the PR branch. + +- **Tests / Checkstyle / Sonar** (converge-to-green): fix the code, then re-run + *all* sensors so a fix for one gate has not broken another. +- **Coverage** (incremental-to-a-number): find the class furthest below the line, + write *meaningful* tests for its uncovered branches, re-check the number, + repeat with the next-lowest file. Tests must assert behavior, not just execute + lines. + +## Guardrails — non-negotiable + +- NEVER merge, approve, or close the PR. +- NEVER force-push or rewrite history. +- NEVER disable, skip, `@Disabled`, or delete a test to make a gate pass. +- NEVER lower a coverage threshold or add a Checkstyle/SpotBugs suppression to + silence a finding instead of fixing it. +- NEVER touch files unrelated to this PR's diff. +- If a failure is ambiguous, or you are not confident in the fix, post a comment + explaining what you observed and what you tried, then STOP for a human. + +## Cadence and budget (the loop wrapper) + +Poll roughly as often as CI returns — there is no point checking every thirty +seconds when the signal only changes once per CI run: + +```text +/loop 10m /pr-quality-gate 1234 +``` + +Stop when the PR posts the "ready to merge" comment, **or** after **10 +iterations**, whichever comes first. On the tenth iteration without success, +summarize what is still failing and what you tried, then stop and escalate. + +On Copilot or Windsurf (no native `/loop`), drive the cadence yourself: invoke +this skill once per CI cycle and stop on the same two conditions. + +## Handoff + +A green PR is the input the `pr-review-response` loop wants. When a human reviewer +leaves comments, switch to that loop to work them to zero. diff --git a/.claude/skills/pr-review-response/SKILL.md b/.claude/skills/pr-review-response/SKILL.md new file mode 100644 index 0000000..29c0471 --- /dev/null +++ b/.claude/skills/pr-review-response/SKILL.md @@ -0,0 +1,72 @@ +--- +name: pr-review-response +description: Resolve actionable human review comments on a GitHub PR until none remain and the gates stay green — make the change, push, reply, repeat. The end of the pipeline. Triages each thread, escalates disagreements, and never merges or resolves a reviewer's thread. +when_to_use: + - A green PR is under human review and the comments need to be worked to zero. + - After the ship loop (`pr-quality-gate`) has driven the PR to all-green and a reviewer has left feedback. + - Run as a loop polled on review activity: `/loop 15m /pr-review-response ` in Claude Code, or re-invoke this skill as new comments arrive on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .claude/skills/loop-engineering/SKILL.md + - .claude/skills/pr-quality-gate/SKILL.md + - .claude/skills/spring-code-review-rubric/SKILL.md +--- + +# PR review-response loop (end of the pipeline) + +> Loop shape: **queue** — drain the unresolved review threads one at a time. +> The mirror image of the ship loop: that one converges the PR on CI, this one +> converges it on the humans. +> Cadence: **polled on review activity** (reviewers comment over hours, not +> seconds). +> Budget: escalate on disagreement or ambiguity; never argue in a thread. + +You are responding to human review on PR **#$1**. Do exactly **one pass** per +invocation. + +## Sensors + +- `gh pr view $1 --json reviews,comments,reviewThreads` — list every + **UNRESOLVED** thread. +- `gh pr checks $1` — current gate status. + +## Triage each unresolved thread + +- **CHANGE REQUESTED** ("do X", "rename Y", "handle case Z"): make the smallest + correct change, commit, push, and reply in the thread linking the commit. Do + NOT resolve the thread — the reviewer does that. +- **QUESTION** ("why did you…?"): reply in the thread with an explanation. Change + code only if the honest answer is "you're right, fixing it". +- **DISAGREEMENT or ambiguous intent**: STOP and escalate to your human. Do not + argue in the thread, and do not silently comply. + +After any code change, re-run the ship-loop gates (`pr-quality-gate`) so a review +fix does not re-break CI. + +## Success criteria (the goal) + +No unresolved actionable threads remain AND all gates are green. Then STOP. Do not +merge. + +## Guardrails — non-negotiable + +- NEVER merge, approve, or resolve a reviewer's thread. +- NEVER push back on a reviewer; escalate disagreements to your human. +- One logical change per commit; never force-push. +- If a comment needs a product or architecture decision, escalate instead of + deciding it inside the loop. + +## Cadence (the loop wrapper) + +Poll on an interval, because the sensor is human review activity: + +```text +/loop 15m /pr-review-response 1234 +``` + +The hard part is not the editing, it is the **triage**. A loop that treats every +comment as "change requested" will rewrite code in response to a reviewer who was +only asking a question; a loop that treats every comment as a question will ignore +real change requests. Spelling out the difference, and forcing an escalation on +disagreement, is what keeps the loop from arguing with your reviewer on your +behalf. On Copilot or Windsurf, re-invoke this skill as new comments arrive. diff --git a/.claude/skills/sdd-build-loop/SKILL.md b/.claude/skills/sdd-build-loop/SKILL.md new file mode 100644 index 0000000..b4d2416 --- /dev/null +++ b/.claude/skills/sdd-build-loop/SKILL.md @@ -0,0 +1,97 @@ +--- +name: sdd-build-loop +description: Drive an SDD feature from its task queue to a clean, reviewed branch — drain every `/build` task, then converge `/validate` and `/review` to green, auto-committing on the feature branch. The build loop. Trust-gated; runs only on a feature branch and never opens a PR or merges. +when_to_use: + - A feature has a sharp spec, `/plan` has produced `.tdd-state.json`, and you want the task queue driven to a reviewed branch. + - Only after you have earned trust in the harness (see the trust checklist below). Otherwise run `/build` task-by-task with your eyes on every diff. + - Run as a self-paced loop: `/loop /sdd-build-loop ` in Claude Code, or re-invoke this skill until the success criteria are met on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .claude/skills/loop-engineering/SKILL.md + - .claude/skills/tdd-red-green-refactor/SKILL.md + - .claude/skills/harness-report-parsing/SKILL.md + - .claude/skills/spring-code-review-rubric/SKILL.md +--- + +# SDD build loop (the build loop) + +> Loop shape: a **queue** (drain `/build` tasks) wrapping a **converge-to-green** +> (fix until `/validate` and `/review` are clean). +> Cadence: **self-paced** — no CI to wait on; run the next step the moment the +> last one finishes. +> Budget: a cap on review-fix rounds, then escalate. + +You are building the feature in `.specs/$1` to a reviewed, committable state on +the **current feature branch**. Confirm you are **NOT on `main`** before doing +anything. The framework *is* the sensor layer here: `.tdd-state.json`, the +`/validate` output, and the `/review` verdict. + +## Phase 1 — drain the task queue (queue shape) + +1. Read `.specs/$1/.tdd-state.json` for the next task whose phase is `pending`. +2. Run `/build` for that task. Let it move `pending → red → green → refactor → + done` and commit on completion (one task per commit). +3. Repeat until no `pending` tasks remain. +4. If one task fails its gates **twice in a row**, STOP and escalate. Do not skip + it silently — a missing task is not a done feature. + +## Phase 2 — validate and review (converge-to-green) + +1. Run `/validate`. If anything fails (tests, coverage, lint, traceability), fix + the root cause, commit, and run `/validate` again. +2. Run `/review`. If the verdict has ANY must-fix findings, fix each with the + smallest correct change, commit, and run `/review` again. +3. Stop Phase 2 when `/validate` is clean AND `/review` returns APPROVE with zero + must-fix findings. Leave should-fix and nit findings in the handoff summary for + the human — do not gold-plate. + +## Success criteria (the goal) + +All tasks done, `/validate` clean, `/review` APPROVE with zero must-fix. Then STOP +and write a handoff summary listing the should-fix and nit items you deliberately +left. **DO NOT open a PR or merge** — that is the human's call. + +## Guardrails — non-negotiable + +- Work ONLY on the feature branch. NEVER commit to `main`. +- NEVER merge, and NEVER open the PR. +- NEVER edit the spec, weaken a test, lower a threshold, or add a suppression to + make a gate pass. +- One logical change per commit; conventional messages; never force-push. +- After **5 review-fix rounds** without reaching APPROVE, STOP and escalate with + what is still failing and what you tried. + +## The trust you have to earn + +This loop does not *remove* human review — it **relocates** it. It replaces the +per-task checkpoint with automated sensors (`/validate`, `/review`) on the small +stuff and one deliberate human validation round at the end. That trade is only +safe once the sensors are trustworthy enough to stand in for you. A feature is +allowed through this loop only when: + +- You have shipped enough features with the framework that `/review` rarely + surfaces something you disagree with. +- Your skills, agents, and review rubric encode *your* team's standards. +- Your coverage threshold, lint rules, and `/review` checklist fail loudly on the + mistakes you care about — a green `/validate` actually *means* something. +- The spec passed `/spec-review` with no open questions. The loop amplifies + whatever the spec says; a vague spec produces confident, wrong code faster. +- The feature is a pattern you have done before (a CRUD slice with a clear + contract), not a novel architectural decision. + +If those conditions are not met, run the framework the normal way, task by task. +The loop is an optimization for the cases you already understand. + +## Cadence and handoff (the loop wrapper) + +Because there is no CI to wait on, this loop self-paces: + +```text +/loop /sdd-build-loop 2026-05-09-create-new-customer +``` + +When it stops you have a feature branch with a drained task queue, a clean +`/validate`, an `APPROVE` from `/review`, and a handoff summary — exactly the +input the human PR gate wants. The human runs the app, reads the diff, and +decides whether to open the PR. Once the PR exists, the `pr-quality-gate` loop +takes over. diff --git a/.claude/skills/spec-sharpen-loop/SKILL.md b/.claude/skills/spec-sharpen-loop/SKILL.md new file mode 100644 index 0000000..81733c2 --- /dev/null +++ b/.claude/skills/spec-sharpen-loop/SKILL.md @@ -0,0 +1,74 @@ +--- +name: spec-sharpen-loop +description: Turn a raw GitHub issue into a spec that passes `/spec-review` with zero open questions — resolving what it can from existing material and escalating genuine product decisions to a human. The front of the pipeline. Produces a spec and nothing else. +when_to_use: + - A GitHub issue needs to become a sharp spec before `/plan` and the build loop. + - Whenever `/spec` surfaces open questions and you want them triaged and driven to zero in rounds. + - Run as a human-paced loop: `/loop /spec-sharpen-loop ` in Claude Code, or re-invoke this skill each time you answer the product questions on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .claude/skills/loop-engineering/SKILL.md + - .claude/skills/ears-spec-authoring/SKILL.md + - .claude/skills/issue-tracker-ingestion/SKILL.md +--- + +# Spec-sharpen loop (front of the pipeline) + +> Loop shape: **converge-to-green** on the `/spec-review` verdict, gated by a +> human triage step. +> Cadence: **human-paced** — *you* are the sensor it waits on. It advances a +> round each time you answer the product questions it surfaces, then blocks again. +> Budget: escalate if the same question reappears after being answered. + +You are turning issue **#$1** into a spec that passes `/spec-review` with zero +open questions. Work in rounds. The build loop downstream is only safe if the spec +going in is sharp: a vague spec does not produce vague code, it produces +confident, wrong code, faster. + +## Sensors + +- The open-question list `/spec` surfaces for this issue. +- The `/spec-review` verdict (PASS/FAIL) and its findings. + +## Each round + +1. Run `/spec` for the issue (or re-run after answers have been added). +2. Triage **every** open question into one of two buckets: + - **RESOLVABLE** from the issue, linked tickets, or existing specs — atomic + acceptance criteria, missing non-goals, ambiguous wording, contract or + format gaps. Resolve these and update the spec. + - **PRODUCT DECISION** you cannot ground in existing material — pricing, + policy, UX intent, scope trade-offs. Collect these. +3. If any PRODUCT DECISION questions remain, **STOP** and post them to the human + as a numbered list. Wait for answers; do not invent them. +4. Run `/spec-review`. If FAIL, address the structural findings and start another + round. + +## Success criteria (the goal) + +`/spec-review` returns **PASS** with zero open questions remaining. Then STOP — +the spec is ready for `/plan` and the build loop. This loop produces a spec and +nothing else. + +## Guardrails — non-negotiable + +- NEVER answer a product-decision question yourself to force a PASS. A clean spec + built on guessed intent is the most expensive failure in the whole pipeline — + every downstream loop trusts this spec, and will build, test, and ship the wrong + thing with great efficiency. +- NEVER start design or code. No `/plan`, no `/build`. +- If the same question reappears after being answered, escalate; the issue itself + may be underspecified. + +## Cadence (the loop wrapper) + +This is the one loop that pauses inside itself for a human on purpose: + +```text +/loop /spec-sharpen-loop 1 +``` + +It is human-paced by design. You are not waiting on a clock or on CI; the loop is +waiting on *your* product decisions. Answer them, and it advances a round; until +you do, it blocks. On Copilot or Windsurf, the cadence is the same — re-invoke the +skill after you have answered the questions it raised. diff --git a/.github/skills/loop-engineering/SKILL.md b/.github/skills/loop-engineering/SKILL.md new file mode 100644 index 0000000..442a87f --- /dev/null +++ b/.github/skills/loop-engineering/SKILL.md @@ -0,0 +1,188 @@ +--- +name: loop-engineering +description: Design bounded agentic loops that converge on a checkable goal instead of running forever or giving up too early. Use when wrapping any repetitive "drive this artifact to a desired state" task in a loop, and to compose the four SDD loops (spec-sharpen, build, ship, review-response) into an issue-to-merged pipeline. +when_to_use: + - Before building any loop — to name the six parts and pick the right shape. + - When composing the spec-sharpen, build, ship, and review-response loops into one pipeline. + - When a hand-run sequence (re-run `/spec` until clean, fix-push-wait until CI is green) is begging to be automated. +authoritative_references: + - docs/loop-engineering.md + - .github/skills/spec-sharpen-loop/SKILL.md + - .github/skills/sdd-build-loop/SKILL.md + - .github/skills/pr-quality-gate/SKILL.md + - .github/skills/pr-review-response/SKILL.md +--- + +# Loop engineering + +> A loop is not "run the agent again." A loop is a control system: observe the +> world, compare it to a goal, act to close the gap, observe again. The skill is +> not in the acting — agents are already good at acting. The skill is in defining +> *what the agent observes* and *when it is allowed to stop*. + +## The six parts of every loop + +If you can name all six for your task, you have a loop. If you cannot, you have a +prompt you are running by hand. + +- **Goal / success criteria.** A precise, checkable definition of "done" that a + script could answer yes or no to. Not "make the PR good" — "all checks green, + coverage ≥ 90%, zero new static-analysis issues". +- **Sensors.** How the agent observes current state. This is the part people skip + and the part that matters most; a loop is only as good as what it can measure. + CI status, a coverage report, a Sonar API response, the output of a test run + repeated twenty times. +- **Action.** What the agent changes when state does not match the goal. Fix the + code, write the missing test, bump the dependency, commit, push. +- **Cadence.** How often the loop runs. Event-driven (wait for CI), self-paced + (drain a local queue as fast as the work allows), or human-paced (wait on an + answer). The wrong cadence either wastes money or misses the signal. +- **Bounded budget.** The escape hatch: maximum iterations, a wall-clock + deadline, or a token cap. This is what prevents a loop from grinding forever on + a problem it cannot solve. +- **Guardrails.** The rules about what the agent must *not* do. Never merge. + Never force-push. Never touch files outside the change. Never disable a failing + test to make it pass. + +Most failed loops have a clear goal and good sensors, then run away because nobody +defined the **budget** or the **guardrails**. Those last two are non-negotiable. + +## The shape of a loop + +```text + ┌──────────────────────┐ + │ Start: artifact open │ + └──────────┬───────────┘ + ▼ + ┌──────────────────────────┐ + ┌────▶│ OBSERVE (run sensors) │ + │ └────────────┬─────────────┘ + │ ▼ + │ ┌─────────────┐ yes ┌─────────────────────┐ + │ │ goal met? │──────▶│ STOP: done ✅ │ + │ └──────┬──────┘ └─────────────────────┘ + │ no │ + │ ▼ + │ ┌──────────────────┐ + │ │ ACT: smallest │ + │ │ correct change │ + │ └────────┬─────────┘ + │ ▼ + │ ┌─────────────┐ no ┌─────────────────────┐ + │ │ budget left?│──────▶│ STOP: escalate 🚩 │ + │ └──────┬──────┘ └─────────────────────┘ + │ yes │ + │ ▼ + │ ┌──────────────────┐ + └─────────│ WAIT (cadence) │ + └──────────────────┘ +``` + +The escalation branch is not a failure — it is the loop doing its job. A loop +that knows when to hand the problem back is more useful than one that pretends it +can solve everything. + +## Four loop shapes + +The outer structure (observe → compare → act → check budget → repeat) is the +same. What differs is the definition of *progress*. Tune each shape to its own +failure mode. + +| Shape | "Done" means | Failure mode to watch | +| --- | --- | --- | +| Converge-to-green | every binary check passes | oscillating: fixing A breaks B | +| Incremental-to-a-number | a metric crosses a threshold | gaming the metric (assertion-free tests) | +| Queue / batch | the work queue is empty | one hard item blocks the rest | +| Repeat-until-confident | statistical confidence over many runs | declaring victory after one run | + +### Converge-to-green + +State is binary per check. Re-run **all** sensors after each change, not just the +one you were working on, so you notice when you traded one red for another. + +### Incremental-to-a-number + +Always attack the lowest-scoring item next. Demand *meaningful* change — an agent +told to "increase coverage" will write tests that execute code without asserting +anything, because that moves the metric. Your success criteria must forbid it. + +### Queue / batch + +Drain a work queue one item at a time; "done" is an empty queue. Give each item +its own small iteration cap. When an item exhausts it, **skip it, leave it in the +queue, and report it** — do not let one hard item block the easy ones. + +### Repeat-until-confident + +The signal is non-deterministic, so a single observation tells you almost +nothing. "Done" is *passed N consecutive runs*, where N grows with how +intermittent the original failure was. Never declare victory after one run. + +## The four SDD loops + +These wrap the existing SDD commands, which already supply the sensors: + +1. **Spec-sharpen** (`spec-sharpen-loop`) — issue → a spec that passes + `/spec-review` with zero open questions. Human-paced. +2. **Build** (`sdd-build-loop`) — sharp spec → a clean, `/review`-APPROVED + feature branch. Self-paced. A queue (drain `/build` tasks) wrapping a + converge-to-green (fix until `/validate` and `/review` are clean). +3. **Ship** (`pr-quality-gate`) — open PR → merge-ready. Polled on CI. +4. **Review-response** (`pr-review-response`) — green PR → every actionable + reviewer thread addressed. Polled on review activity. + +## Composing them: issue → merged + +```text + issue + │ LOOP: spec-sharpen (human-paced) → escalate product decisions + ▼ sharp spec + │ LOOP: build (self-paced) → auto-commit on a feature branch + ▼ clean branch + │ HUMAN GATE: run the app, read the diff, decide — open the PR? + ▼ pull request + │ LOOP: ship (polled on CI) → all gates green + ▼ green PR, under review + │ LOOP: review-response (polled on review) → no threads remain + ▼ + human merges +``` + +Human judgment lands at exactly two points: answering genuine product questions +at the front, and deciding a branch is worth a PR in the middle. Everything else +is a loop converging on a checkable goal. Each loop's cadence is dictated by how +fast the thing it watches can actually change — match cadence to your sensors and +you never pay for a check that cannot tell you anything new. + +## Guardrails for any loop + +- **Bound everything.** Every loop needs a maximum iteration count, ideally a + wall-clock deadline too. A loop without a budget is a runaway process. +- **Never let a loop merge.** Loops drive to merge-*ready* and stop. A human owns + the merge. +- **Forbid metric-gaming explicitly.** Any loop pointed at a number or a verdict + will try the easy way — disabling a flaky test, writing assertion-free tests, + editing the spec so `/review` stops complaining. Spell out the forbidden + shortcuts; the agent will not infer them. +- **Make every action reviewable and reversible.** One logical change per commit, + conventional messages, branch only, never force-push. +- **Escalate on low confidence.** "I tried these three fixes and none worked, + here is what I observed" beats a tenth desperate commit. +- **Match cadence to your sensors.** Poll on an interval only when waiting on + something slow like CI. When sensors are local, run as fast as the work allows. + +## When to use a loop, and when not to + +Use a loop when three things are true: the goal is **objectively checkable** (a +sensor returns yes/no), progress is **incremental** (each iteration gets +measurably closer), and the work is **tedious enough** that convergence dominates +the thinking. + +A loop is the *wrong* tool when the goal is subjective ("make this API design +elegant"), when there is no reliable sensor, or when the hard part is a single +decision rather than a sequence of mechanical steps. If you cannot write the +"are we done?" check as something that returns a boolean, you do not have a loop +yet — you have a conversation, and you should just have the conversation. + +The criteria *is* the work. Once you can state precisely what "done" means and how +a sensor would detect it, the loop almost writes itself. diff --git a/.github/skills/pr-quality-gate/SKILL.md b/.github/skills/pr-quality-gate/SKILL.md new file mode 100644 index 0000000..22a42f1 --- /dev/null +++ b/.github/skills/pr-quality-gate/SKILL.md @@ -0,0 +1,92 @@ +--- +name: pr-quality-gate +description: Drive one GitHub PR to merge-ready by checking every quality gate and fixing whatever is red, then committing, pushing, and waiting for the next CI run. The ship loop. Stops when the PR is fully green or the iteration budget runs out. Never merges. +when_to_use: + - A PR is open and you want it driven to all-green without babysitting each CI round. + - After the build loop hands you a feature branch and you have opened the PR. + - Run as a recurring loop: `/loop 10m /pr-quality-gate ` in Claude Code, or re-invoke this skill each CI cycle on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .github/skills/loop-engineering/SKILL.md + - .github/skills/jacoco-coverage-policy/SKILL.md + - .github/skills/pr-review-response/SKILL.md +--- + +# PR quality gate (the ship loop) + +> Loop shape: **converge-to-green**, with a **coverage climber** nested inside. +> Cadence: **polled on CI** (the sensor only changes once per CI run). +> Budget: **10 iterations**, then escalate. + +You are driving pull request **#$1** to merge-ready. Do exactly **one pass** per +invocation. The loop runner (`/loop`) supplies the cadence and the budget; this +skill supplies the goal, the sensors, the action, and the guardrails. + +## Sensors — gather current state first + +Re-run *all* of these every pass, not just the gate you fixed last time, so you +catch a fix that traded one red for another: + +- `gh pr checks $1` — CI check status (unit, integration, architecture tests). +- `gh pr view $1 --json statusCheckRollup,mergeable` — the rollup and mergeability. +- The coverage report from the latest CI run artifacts (or + `./.github/scripts/check-new-code-coverage.sh` locally). +- The static-analysis report (Checkstyle / SpotBugs / Sonar) for **new** issues + introduced by this PR. + +## Success criteria (the goal) + +The PR is merge-ready ONLY when ALL of these are true: + +- Every CI check is passing. +- Checkstyle / static analysis reports zero new violations on this PR. +- Line coverage on new code is at or above the project threshold (see + `jacoco-coverage-policy`). +- Sonar (if configured) reports zero new issues on this PR. + +When all are true, post a comment `✅ All gates green — ready to merge` and +**STOP**. Do not merge the PR yourself. + +## Action — fix exactly what is failing + +For each failing gate, make the **smallest correct change that fixes the root +cause**, then commit with a conventional-commit message and push to the PR branch. + +- **Tests / Checkstyle / Sonar** (converge-to-green): fix the code, then re-run + *all* sensors so a fix for one gate has not broken another. +- **Coverage** (incremental-to-a-number): find the class furthest below the line, + write *meaningful* tests for its uncovered branches, re-check the number, + repeat with the next-lowest file. Tests must assert behavior, not just execute + lines. + +## Guardrails — non-negotiable + +- NEVER merge, approve, or close the PR. +- NEVER force-push or rewrite history. +- NEVER disable, skip, `@Disabled`, or delete a test to make a gate pass. +- NEVER lower a coverage threshold or add a Checkstyle/SpotBugs suppression to + silence a finding instead of fixing it. +- NEVER touch files unrelated to this PR's diff. +- If a failure is ambiguous, or you are not confident in the fix, post a comment + explaining what you observed and what you tried, then STOP for a human. + +## Cadence and budget (the loop wrapper) + +Poll roughly as often as CI returns — there is no point checking every thirty +seconds when the signal only changes once per CI run: + +```text +/loop 10m /pr-quality-gate 1234 +``` + +Stop when the PR posts the "ready to merge" comment, **or** after **10 +iterations**, whichever comes first. On the tenth iteration without success, +summarize what is still failing and what you tried, then stop and escalate. + +On Copilot or Windsurf (no native `/loop`), drive the cadence yourself: invoke +this skill once per CI cycle and stop on the same two conditions. + +## Handoff + +A green PR is the input the `pr-review-response` loop wants. When a human reviewer +leaves comments, switch to that loop to work them to zero. diff --git a/.github/skills/pr-review-response/SKILL.md b/.github/skills/pr-review-response/SKILL.md new file mode 100644 index 0000000..0738a3a --- /dev/null +++ b/.github/skills/pr-review-response/SKILL.md @@ -0,0 +1,72 @@ +--- +name: pr-review-response +description: Resolve actionable human review comments on a GitHub PR until none remain and the gates stay green — make the change, push, reply, repeat. The end of the pipeline. Triages each thread, escalates disagreements, and never merges or resolves a reviewer's thread. +when_to_use: + - A green PR is under human review and the comments need to be worked to zero. + - After the ship loop (`pr-quality-gate`) has driven the PR to all-green and a reviewer has left feedback. + - Run as a loop polled on review activity: `/loop 15m /pr-review-response ` in Claude Code, or re-invoke this skill as new comments arrive on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .github/skills/loop-engineering/SKILL.md + - .github/skills/pr-quality-gate/SKILL.md + - .github/skills/spring-code-review-rubric/SKILL.md +--- + +# PR review-response loop (end of the pipeline) + +> Loop shape: **queue** — drain the unresolved review threads one at a time. +> The mirror image of the ship loop: that one converges the PR on CI, this one +> converges it on the humans. +> Cadence: **polled on review activity** (reviewers comment over hours, not +> seconds). +> Budget: escalate on disagreement or ambiguity; never argue in a thread. + +You are responding to human review on PR **#$1**. Do exactly **one pass** per +invocation. + +## Sensors + +- `gh pr view $1 --json reviews,comments,reviewThreads` — list every + **UNRESOLVED** thread. +- `gh pr checks $1` — current gate status. + +## Triage each unresolved thread + +- **CHANGE REQUESTED** ("do X", "rename Y", "handle case Z"): make the smallest + correct change, commit, push, and reply in the thread linking the commit. Do + NOT resolve the thread — the reviewer does that. +- **QUESTION** ("why did you…?"): reply in the thread with an explanation. Change + code only if the honest answer is "you're right, fixing it". +- **DISAGREEMENT or ambiguous intent**: STOP and escalate to your human. Do not + argue in the thread, and do not silently comply. + +After any code change, re-run the ship-loop gates (`pr-quality-gate`) so a review +fix does not re-break CI. + +## Success criteria (the goal) + +No unresolved actionable threads remain AND all gates are green. Then STOP. Do not +merge. + +## Guardrails — non-negotiable + +- NEVER merge, approve, or resolve a reviewer's thread. +- NEVER push back on a reviewer; escalate disagreements to your human. +- One logical change per commit; never force-push. +- If a comment needs a product or architecture decision, escalate instead of + deciding it inside the loop. + +## Cadence (the loop wrapper) + +Poll on an interval, because the sensor is human review activity: + +```text +/loop 15m /pr-review-response 1234 +``` + +The hard part is not the editing, it is the **triage**. A loop that treats every +comment as "change requested" will rewrite code in response to a reviewer who was +only asking a question; a loop that treats every comment as a question will ignore +real change requests. Spelling out the difference, and forcing an escalation on +disagreement, is what keeps the loop from arguing with your reviewer on your +behalf. On Copilot or Windsurf, re-invoke this skill as new comments arrive. diff --git a/.github/skills/sdd-build-loop/SKILL.md b/.github/skills/sdd-build-loop/SKILL.md new file mode 100644 index 0000000..5ca435b --- /dev/null +++ b/.github/skills/sdd-build-loop/SKILL.md @@ -0,0 +1,97 @@ +--- +name: sdd-build-loop +description: Drive an SDD feature from its task queue to a clean, reviewed branch — drain every `/build` task, then converge `/validate` and `/review` to green, auto-committing on the feature branch. The build loop. Trust-gated; runs only on a feature branch and never opens a PR or merges. +when_to_use: + - A feature has a sharp spec, `/plan` has produced `.tdd-state.json`, and you want the task queue driven to a reviewed branch. + - Only after you have earned trust in the harness (see the trust checklist below). Otherwise run `/build` task-by-task with your eyes on every diff. + - Run as a self-paced loop: `/loop /sdd-build-loop ` in Claude Code, or re-invoke this skill until the success criteria are met on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .github/skills/loop-engineering/SKILL.md + - .github/skills/tdd-red-green-refactor/SKILL.md + - .github/skills/harness-report-parsing/SKILL.md + - .github/skills/spring-code-review-rubric/SKILL.md +--- + +# SDD build loop (the build loop) + +> Loop shape: a **queue** (drain `/build` tasks) wrapping a **converge-to-green** +> (fix until `/validate` and `/review` are clean). +> Cadence: **self-paced** — no CI to wait on; run the next step the moment the +> last one finishes. +> Budget: a cap on review-fix rounds, then escalate. + +You are building the feature in `.specs/$1` to a reviewed, committable state on +the **current feature branch**. Confirm you are **NOT on `main`** before doing +anything. The framework *is* the sensor layer here: `.tdd-state.json`, the +`/validate` output, and the `/review` verdict. + +## Phase 1 — drain the task queue (queue shape) + +1. Read `.specs/$1/.tdd-state.json` for the next task whose phase is `pending`. +2. Run `/build` for that task. Let it move `pending → red → green → refactor → + done` and commit on completion (one task per commit). +3. Repeat until no `pending` tasks remain. +4. If one task fails its gates **twice in a row**, STOP and escalate. Do not skip + it silently — a missing task is not a done feature. + +## Phase 2 — validate and review (converge-to-green) + +1. Run `/validate`. If anything fails (tests, coverage, lint, traceability), fix + the root cause, commit, and run `/validate` again. +2. Run `/review`. If the verdict has ANY must-fix findings, fix each with the + smallest correct change, commit, and run `/review` again. +3. Stop Phase 2 when `/validate` is clean AND `/review` returns APPROVE with zero + must-fix findings. Leave should-fix and nit findings in the handoff summary for + the human — do not gold-plate. + +## Success criteria (the goal) + +All tasks done, `/validate` clean, `/review` APPROVE with zero must-fix. Then STOP +and write a handoff summary listing the should-fix and nit items you deliberately +left. **DO NOT open a PR or merge** — that is the human's call. + +## Guardrails — non-negotiable + +- Work ONLY on the feature branch. NEVER commit to `main`. +- NEVER merge, and NEVER open the PR. +- NEVER edit the spec, weaken a test, lower a threshold, or add a suppression to + make a gate pass. +- One logical change per commit; conventional messages; never force-push. +- After **5 review-fix rounds** without reaching APPROVE, STOP and escalate with + what is still failing and what you tried. + +## The trust you have to earn + +This loop does not *remove* human review — it **relocates** it. It replaces the +per-task checkpoint with automated sensors (`/validate`, `/review`) on the small +stuff and one deliberate human validation round at the end. That trade is only +safe once the sensors are trustworthy enough to stand in for you. A feature is +allowed through this loop only when: + +- You have shipped enough features with the framework that `/review` rarely + surfaces something you disagree with. +- Your skills, agents, and review rubric encode *your* team's standards. +- Your coverage threshold, lint rules, and `/review` checklist fail loudly on the + mistakes you care about — a green `/validate` actually *means* something. +- The spec passed `/spec-review` with no open questions. The loop amplifies + whatever the spec says; a vague spec produces confident, wrong code faster. +- The feature is a pattern you have done before (a CRUD slice with a clear + contract), not a novel architectural decision. + +If those conditions are not met, run the framework the normal way, task by task. +The loop is an optimization for the cases you already understand. + +## Cadence and handoff (the loop wrapper) + +Because there is no CI to wait on, this loop self-paces: + +```text +/loop /sdd-build-loop 2026-05-09-create-new-customer +``` + +When it stops you have a feature branch with a drained task queue, a clean +`/validate`, an `APPROVE` from `/review`, and a handoff summary — exactly the +input the human PR gate wants. The human runs the app, reads the diff, and +decides whether to open the PR. Once the PR exists, the `pr-quality-gate` loop +takes over. diff --git a/.github/skills/spec-sharpen-loop/SKILL.md b/.github/skills/spec-sharpen-loop/SKILL.md new file mode 100644 index 0000000..c649589 --- /dev/null +++ b/.github/skills/spec-sharpen-loop/SKILL.md @@ -0,0 +1,74 @@ +--- +name: spec-sharpen-loop +description: Turn a raw GitHub issue into a spec that passes `/spec-review` with zero open questions — resolving what it can from existing material and escalating genuine product decisions to a human. The front of the pipeline. Produces a spec and nothing else. +when_to_use: + - A GitHub issue needs to become a sharp spec before `/plan` and the build loop. + - Whenever `/spec` surfaces open questions and you want them triaged and driven to zero in rounds. + - Run as a human-paced loop: `/loop /spec-sharpen-loop ` in Claude Code, or re-invoke this skill each time you answer the product questions on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .github/skills/loop-engineering/SKILL.md + - .github/skills/ears-spec-authoring/SKILL.md + - .github/skills/issue-tracker-ingestion/SKILL.md +--- + +# Spec-sharpen loop (front of the pipeline) + +> Loop shape: **converge-to-green** on the `/spec-review` verdict, gated by a +> human triage step. +> Cadence: **human-paced** — *you* are the sensor it waits on. It advances a +> round each time you answer the product questions it surfaces, then blocks again. +> Budget: escalate if the same question reappears after being answered. + +You are turning issue **#$1** into a spec that passes `/spec-review` with zero +open questions. Work in rounds. The build loop downstream is only safe if the spec +going in is sharp: a vague spec does not produce vague code, it produces +confident, wrong code, faster. + +## Sensors + +- The open-question list `/spec` surfaces for this issue. +- The `/spec-review` verdict (PASS/FAIL) and its findings. + +## Each round + +1. Run `/spec` for the issue (or re-run after answers have been added). +2. Triage **every** open question into one of two buckets: + - **RESOLVABLE** from the issue, linked tickets, or existing specs — atomic + acceptance criteria, missing non-goals, ambiguous wording, contract or + format gaps. Resolve these and update the spec. + - **PRODUCT DECISION** you cannot ground in existing material — pricing, + policy, UX intent, scope trade-offs. Collect these. +3. If any PRODUCT DECISION questions remain, **STOP** and post them to the human + as a numbered list. Wait for answers; do not invent them. +4. Run `/spec-review`. If FAIL, address the structural findings and start another + round. + +## Success criteria (the goal) + +`/spec-review` returns **PASS** with zero open questions remaining. Then STOP — +the spec is ready for `/plan` and the build loop. This loop produces a spec and +nothing else. + +## Guardrails — non-negotiable + +- NEVER answer a product-decision question yourself to force a PASS. A clean spec + built on guessed intent is the most expensive failure in the whole pipeline — + every downstream loop trusts this spec, and will build, test, and ship the wrong + thing with great efficiency. +- NEVER start design or code. No `/plan`, no `/build`. +- If the same question reappears after being answered, escalate; the issue itself + may be underspecified. + +## Cadence (the loop wrapper) + +This is the one loop that pauses inside itself for a human on purpose: + +```text +/loop /spec-sharpen-loop 1 +``` + +It is human-paced by design. You are not waiting on a clock or on CI; the loop is +waiting on *your* product decisions. Answer them, and it advances a round; until +you do, it blocks. On Copilot or Windsurf, the cadence is the same — re-invoke the +skill after you have answered the questions it raised. diff --git a/.windsurf/skills/loop-engineering/SKILL.md b/.windsurf/skills/loop-engineering/SKILL.md new file mode 100644 index 0000000..49af004 --- /dev/null +++ b/.windsurf/skills/loop-engineering/SKILL.md @@ -0,0 +1,188 @@ +--- +name: loop-engineering +description: Design bounded agentic loops that converge on a checkable goal instead of running forever or giving up too early. Use when wrapping any repetitive "drive this artifact to a desired state" task in a loop, and to compose the four SDD loops (spec-sharpen, build, ship, review-response) into an issue-to-merged pipeline. +when_to_use: + - Before building any loop — to name the six parts and pick the right shape. + - When composing the spec-sharpen, build, ship, and review-response loops into one pipeline. + - When a hand-run sequence (re-run `/spec` until clean, fix-push-wait until CI is green) is begging to be automated. +authoritative_references: + - docs/loop-engineering.md + - .windsurf/skills/spec-sharpen-loop/SKILL.md + - .windsurf/skills/sdd-build-loop/SKILL.md + - .windsurf/skills/pr-quality-gate/SKILL.md + - .windsurf/skills/pr-review-response/SKILL.md +--- + +# Loop engineering + +> A loop is not "run the agent again." A loop is a control system: observe the +> world, compare it to a goal, act to close the gap, observe again. The skill is +> not in the acting — agents are already good at acting. The skill is in defining +> *what the agent observes* and *when it is allowed to stop*. + +## The six parts of every loop + +If you can name all six for your task, you have a loop. If you cannot, you have a +prompt you are running by hand. + +- **Goal / success criteria.** A precise, checkable definition of "done" that a + script could answer yes or no to. Not "make the PR good" — "all checks green, + coverage ≥ 90%, zero new static-analysis issues". +- **Sensors.** How the agent observes current state. This is the part people skip + and the part that matters most; a loop is only as good as what it can measure. + CI status, a coverage report, a Sonar API response, the output of a test run + repeated twenty times. +- **Action.** What the agent changes when state does not match the goal. Fix the + code, write the missing test, bump the dependency, commit, push. +- **Cadence.** How often the loop runs. Event-driven (wait for CI), self-paced + (drain a local queue as fast as the work allows), or human-paced (wait on an + answer). The wrong cadence either wastes money or misses the signal. +- **Bounded budget.** The escape hatch: maximum iterations, a wall-clock + deadline, or a token cap. This is what prevents a loop from grinding forever on + a problem it cannot solve. +- **Guardrails.** The rules about what the agent must *not* do. Never merge. + Never force-push. Never touch files outside the change. Never disable a failing + test to make it pass. + +Most failed loops have a clear goal and good sensors, then run away because nobody +defined the **budget** or the **guardrails**. Those last two are non-negotiable. + +## The shape of a loop + +```text + ┌──────────────────────┐ + │ Start: artifact open │ + └──────────┬───────────┘ + ▼ + ┌──────────────────────────┐ + ┌────▶│ OBSERVE (run sensors) │ + │ └────────────┬─────────────┘ + │ ▼ + │ ┌─────────────┐ yes ┌─────────────────────┐ + │ │ goal met? │──────▶│ STOP: done ✅ │ + │ └──────┬──────┘ └─────────────────────┘ + │ no │ + │ ▼ + │ ┌──────────────────┐ + │ │ ACT: smallest │ + │ │ correct change │ + │ └────────┬─────────┘ + │ ▼ + │ ┌─────────────┐ no ┌─────────────────────┐ + │ │ budget left?│──────▶│ STOP: escalate 🚩 │ + │ └──────┬──────┘ └─────────────────────┘ + │ yes │ + │ ▼ + │ ┌──────────────────┐ + └─────────│ WAIT (cadence) │ + └──────────────────┘ +``` + +The escalation branch is not a failure — it is the loop doing its job. A loop +that knows when to hand the problem back is more useful than one that pretends it +can solve everything. + +## Four loop shapes + +The outer structure (observe → compare → act → check budget → repeat) is the +same. What differs is the definition of *progress*. Tune each shape to its own +failure mode. + +| Shape | "Done" means | Failure mode to watch | +| --- | --- | --- | +| Converge-to-green | every binary check passes | oscillating: fixing A breaks B | +| Incremental-to-a-number | a metric crosses a threshold | gaming the metric (assertion-free tests) | +| Queue / batch | the work queue is empty | one hard item blocks the rest | +| Repeat-until-confident | statistical confidence over many runs | declaring victory after one run | + +### Converge-to-green + +State is binary per check. Re-run **all** sensors after each change, not just the +one you were working on, so you notice when you traded one red for another. + +### Incremental-to-a-number + +Always attack the lowest-scoring item next. Demand *meaningful* change — an agent +told to "increase coverage" will write tests that execute code without asserting +anything, because that moves the metric. Your success criteria must forbid it. + +### Queue / batch + +Drain a work queue one item at a time; "done" is an empty queue. Give each item +its own small iteration cap. When an item exhausts it, **skip it, leave it in the +queue, and report it** — do not let one hard item block the easy ones. + +### Repeat-until-confident + +The signal is non-deterministic, so a single observation tells you almost +nothing. "Done" is *passed N consecutive runs*, where N grows with how +intermittent the original failure was. Never declare victory after one run. + +## The four SDD loops + +These wrap the existing SDD commands, which already supply the sensors: + +1. **Spec-sharpen** (`spec-sharpen-loop`) — issue → a spec that passes + `/spec-review` with zero open questions. Human-paced. +2. **Build** (`sdd-build-loop`) — sharp spec → a clean, `/review`-APPROVED + feature branch. Self-paced. A queue (drain `/build` tasks) wrapping a + converge-to-green (fix until `/validate` and `/review` are clean). +3. **Ship** (`pr-quality-gate`) — open PR → merge-ready. Polled on CI. +4. **Review-response** (`pr-review-response`) — green PR → every actionable + reviewer thread addressed. Polled on review activity. + +## Composing them: issue → merged + +```text + issue + │ LOOP: spec-sharpen (human-paced) → escalate product decisions + ▼ sharp spec + │ LOOP: build (self-paced) → auto-commit on a feature branch + ▼ clean branch + │ HUMAN GATE: run the app, read the diff, decide — open the PR? + ▼ pull request + │ LOOP: ship (polled on CI) → all gates green + ▼ green PR, under review + │ LOOP: review-response (polled on review) → no threads remain + ▼ + human merges +``` + +Human judgment lands at exactly two points: answering genuine product questions +at the front, and deciding a branch is worth a PR in the middle. Everything else +is a loop converging on a checkable goal. Each loop's cadence is dictated by how +fast the thing it watches can actually change — match cadence to your sensors and +you never pay for a check that cannot tell you anything new. + +## Guardrails for any loop + +- **Bound everything.** Every loop needs a maximum iteration count, ideally a + wall-clock deadline too. A loop without a budget is a runaway process. +- **Never let a loop merge.** Loops drive to merge-*ready* and stop. A human owns + the merge. +- **Forbid metric-gaming explicitly.** Any loop pointed at a number or a verdict + will try the easy way — disabling a flaky test, writing assertion-free tests, + editing the spec so `/review` stops complaining. Spell out the forbidden + shortcuts; the agent will not infer them. +- **Make every action reviewable and reversible.** One logical change per commit, + conventional messages, branch only, never force-push. +- **Escalate on low confidence.** "I tried these three fixes and none worked, + here is what I observed" beats a tenth desperate commit. +- **Match cadence to your sensors.** Poll on an interval only when waiting on + something slow like CI. When sensors are local, run as fast as the work allows. + +## When to use a loop, and when not to + +Use a loop when three things are true: the goal is **objectively checkable** (a +sensor returns yes/no), progress is **incremental** (each iteration gets +measurably closer), and the work is **tedious enough** that convergence dominates +the thinking. + +A loop is the *wrong* tool when the goal is subjective ("make this API design +elegant"), when there is no reliable sensor, or when the hard part is a single +decision rather than a sequence of mechanical steps. If you cannot write the +"are we done?" check as something that returns a boolean, you do not have a loop +yet — you have a conversation, and you should just have the conversation. + +The criteria *is* the work. Once you can state precisely what "done" means and how +a sensor would detect it, the loop almost writes itself. diff --git a/.windsurf/skills/pr-quality-gate/SKILL.md b/.windsurf/skills/pr-quality-gate/SKILL.md new file mode 100644 index 0000000..0eb07a1 --- /dev/null +++ b/.windsurf/skills/pr-quality-gate/SKILL.md @@ -0,0 +1,92 @@ +--- +name: pr-quality-gate +description: Drive one GitHub PR to merge-ready by checking every quality gate and fixing whatever is red, then committing, pushing, and waiting for the next CI run. The ship loop. Stops when the PR is fully green or the iteration budget runs out. Never merges. +when_to_use: + - A PR is open and you want it driven to all-green without babysitting each CI round. + - After the build loop hands you a feature branch and you have opened the PR. + - Run as a recurring loop: `/loop 10m /pr-quality-gate ` in Claude Code, or re-invoke this skill each CI cycle on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .windsurf/skills/loop-engineering/SKILL.md + - .windsurf/skills/jacoco-coverage-policy/SKILL.md + - .windsurf/skills/pr-review-response/SKILL.md +--- + +# PR quality gate (the ship loop) + +> Loop shape: **converge-to-green**, with a **coverage climber** nested inside. +> Cadence: **polled on CI** (the sensor only changes once per CI run). +> Budget: **10 iterations**, then escalate. + +You are driving pull request **#$1** to merge-ready. Do exactly **one pass** per +invocation. The loop runner (`/loop`) supplies the cadence and the budget; this +skill supplies the goal, the sensors, the action, and the guardrails. + +## Sensors — gather current state first + +Re-run *all* of these every pass, not just the gate you fixed last time, so you +catch a fix that traded one red for another: + +- `gh pr checks $1` — CI check status (unit, integration, architecture tests). +- `gh pr view $1 --json statusCheckRollup,mergeable` — the rollup and mergeability. +- The coverage report from the latest CI run artifacts (or + `./.github/scripts/check-new-code-coverage.sh` locally). +- The static-analysis report (Checkstyle / SpotBugs / Sonar) for **new** issues + introduced by this PR. + +## Success criteria (the goal) + +The PR is merge-ready ONLY when ALL of these are true: + +- Every CI check is passing. +- Checkstyle / static analysis reports zero new violations on this PR. +- Line coverage on new code is at or above the project threshold (see + `jacoco-coverage-policy`). +- Sonar (if configured) reports zero new issues on this PR. + +When all are true, post a comment `✅ All gates green — ready to merge` and +**STOP**. Do not merge the PR yourself. + +## Action — fix exactly what is failing + +For each failing gate, make the **smallest correct change that fixes the root +cause**, then commit with a conventional-commit message and push to the PR branch. + +- **Tests / Checkstyle / Sonar** (converge-to-green): fix the code, then re-run + *all* sensors so a fix for one gate has not broken another. +- **Coverage** (incremental-to-a-number): find the class furthest below the line, + write *meaningful* tests for its uncovered branches, re-check the number, + repeat with the next-lowest file. Tests must assert behavior, not just execute + lines. + +## Guardrails — non-negotiable + +- NEVER merge, approve, or close the PR. +- NEVER force-push or rewrite history. +- NEVER disable, skip, `@Disabled`, or delete a test to make a gate pass. +- NEVER lower a coverage threshold or add a Checkstyle/SpotBugs suppression to + silence a finding instead of fixing it. +- NEVER touch files unrelated to this PR's diff. +- If a failure is ambiguous, or you are not confident in the fix, post a comment + explaining what you observed and what you tried, then STOP for a human. + +## Cadence and budget (the loop wrapper) + +Poll roughly as often as CI returns — there is no point checking every thirty +seconds when the signal only changes once per CI run: + +```text +/loop 10m /pr-quality-gate 1234 +``` + +Stop when the PR posts the "ready to merge" comment, **or** after **10 +iterations**, whichever comes first. On the tenth iteration without success, +summarize what is still failing and what you tried, then stop and escalate. + +On Copilot or Windsurf (no native `/loop`), drive the cadence yourself: invoke +this skill once per CI cycle and stop on the same two conditions. + +## Handoff + +A green PR is the input the `pr-review-response` loop wants. When a human reviewer +leaves comments, switch to that loop to work them to zero. diff --git a/.windsurf/skills/pr-review-response/SKILL.md b/.windsurf/skills/pr-review-response/SKILL.md new file mode 100644 index 0000000..703c2bc --- /dev/null +++ b/.windsurf/skills/pr-review-response/SKILL.md @@ -0,0 +1,72 @@ +--- +name: pr-review-response +description: Resolve actionable human review comments on a GitHub PR until none remain and the gates stay green — make the change, push, reply, repeat. The end of the pipeline. Triages each thread, escalates disagreements, and never merges or resolves a reviewer's thread. +when_to_use: + - A green PR is under human review and the comments need to be worked to zero. + - After the ship loop (`pr-quality-gate`) has driven the PR to all-green and a reviewer has left feedback. + - Run as a loop polled on review activity: `/loop 15m /pr-review-response ` in Claude Code, or re-invoke this skill as new comments arrive on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .windsurf/skills/loop-engineering/SKILL.md + - .windsurf/skills/pr-quality-gate/SKILL.md + - .windsurf/skills/spring-code-review-rubric/SKILL.md +--- + +# PR review-response loop (end of the pipeline) + +> Loop shape: **queue** — drain the unresolved review threads one at a time. +> The mirror image of the ship loop: that one converges the PR on CI, this one +> converges it on the humans. +> Cadence: **polled on review activity** (reviewers comment over hours, not +> seconds). +> Budget: escalate on disagreement or ambiguity; never argue in a thread. + +You are responding to human review on PR **#$1**. Do exactly **one pass** per +invocation. + +## Sensors + +- `gh pr view $1 --json reviews,comments,reviewThreads` — list every + **UNRESOLVED** thread. +- `gh pr checks $1` — current gate status. + +## Triage each unresolved thread + +- **CHANGE REQUESTED** ("do X", "rename Y", "handle case Z"): make the smallest + correct change, commit, push, and reply in the thread linking the commit. Do + NOT resolve the thread — the reviewer does that. +- **QUESTION** ("why did you…?"): reply in the thread with an explanation. Change + code only if the honest answer is "you're right, fixing it". +- **DISAGREEMENT or ambiguous intent**: STOP and escalate to your human. Do not + argue in the thread, and do not silently comply. + +After any code change, re-run the ship-loop gates (`pr-quality-gate`) so a review +fix does not re-break CI. + +## Success criteria (the goal) + +No unresolved actionable threads remain AND all gates are green. Then STOP. Do not +merge. + +## Guardrails — non-negotiable + +- NEVER merge, approve, or resolve a reviewer's thread. +- NEVER push back on a reviewer; escalate disagreements to your human. +- One logical change per commit; never force-push. +- If a comment needs a product or architecture decision, escalate instead of + deciding it inside the loop. + +## Cadence (the loop wrapper) + +Poll on an interval, because the sensor is human review activity: + +```text +/loop 15m /pr-review-response 1234 +``` + +The hard part is not the editing, it is the **triage**. A loop that treats every +comment as "change requested" will rewrite code in response to a reviewer who was +only asking a question; a loop that treats every comment as a question will ignore +real change requests. Spelling out the difference, and forcing an escalation on +disagreement, is what keeps the loop from arguing with your reviewer on your +behalf. On Copilot or Windsurf, re-invoke this skill as new comments arrive. diff --git a/.windsurf/skills/sdd-build-loop/SKILL.md b/.windsurf/skills/sdd-build-loop/SKILL.md new file mode 100644 index 0000000..9234cdc --- /dev/null +++ b/.windsurf/skills/sdd-build-loop/SKILL.md @@ -0,0 +1,97 @@ +--- +name: sdd-build-loop +description: Drive an SDD feature from its task queue to a clean, reviewed branch — drain every `/build` task, then converge `/validate` and `/review` to green, auto-committing on the feature branch. The build loop. Trust-gated; runs only on a feature branch and never opens a PR or merges. +when_to_use: + - A feature has a sharp spec, `/plan` has produced `.tdd-state.json`, and you want the task queue driven to a reviewed branch. + - Only after you have earned trust in the harness (see the trust checklist below). Otherwise run `/build` task-by-task with your eyes on every diff. + - Run as a self-paced loop: `/loop /sdd-build-loop ` in Claude Code, or re-invoke this skill until the success criteria are met on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .windsurf/skills/loop-engineering/SKILL.md + - .windsurf/skills/tdd-red-green-refactor/SKILL.md + - .windsurf/skills/harness-report-parsing/SKILL.md + - .windsurf/skills/spring-code-review-rubric/SKILL.md +--- + +# SDD build loop (the build loop) + +> Loop shape: a **queue** (drain `/build` tasks) wrapping a **converge-to-green** +> (fix until `/validate` and `/review` are clean). +> Cadence: **self-paced** — no CI to wait on; run the next step the moment the +> last one finishes. +> Budget: a cap on review-fix rounds, then escalate. + +You are building the feature in `.specs/$1` to a reviewed, committable state on +the **current feature branch**. Confirm you are **NOT on `main`** before doing +anything. The framework *is* the sensor layer here: `.tdd-state.json`, the +`/validate` output, and the `/review` verdict. + +## Phase 1 — drain the task queue (queue shape) + +1. Read `.specs/$1/.tdd-state.json` for the next task whose phase is `pending`. +2. Run `/build` for that task. Let it move `pending → red → green → refactor → + done` and commit on completion (one task per commit). +3. Repeat until no `pending` tasks remain. +4. If one task fails its gates **twice in a row**, STOP and escalate. Do not skip + it silently — a missing task is not a done feature. + +## Phase 2 — validate and review (converge-to-green) + +1. Run `/validate`. If anything fails (tests, coverage, lint, traceability), fix + the root cause, commit, and run `/validate` again. +2. Run `/review`. If the verdict has ANY must-fix findings, fix each with the + smallest correct change, commit, and run `/review` again. +3. Stop Phase 2 when `/validate` is clean AND `/review` returns APPROVE with zero + must-fix findings. Leave should-fix and nit findings in the handoff summary for + the human — do not gold-plate. + +## Success criteria (the goal) + +All tasks done, `/validate` clean, `/review` APPROVE with zero must-fix. Then STOP +and write a handoff summary listing the should-fix and nit items you deliberately +left. **DO NOT open a PR or merge** — that is the human's call. + +## Guardrails — non-negotiable + +- Work ONLY on the feature branch. NEVER commit to `main`. +- NEVER merge, and NEVER open the PR. +- NEVER edit the spec, weaken a test, lower a threshold, or add a suppression to + make a gate pass. +- One logical change per commit; conventional messages; never force-push. +- After **5 review-fix rounds** without reaching APPROVE, STOP and escalate with + what is still failing and what you tried. + +## The trust you have to earn + +This loop does not *remove* human review — it **relocates** it. It replaces the +per-task checkpoint with automated sensors (`/validate`, `/review`) on the small +stuff and one deliberate human validation round at the end. That trade is only +safe once the sensors are trustworthy enough to stand in for you. A feature is +allowed through this loop only when: + +- You have shipped enough features with the framework that `/review` rarely + surfaces something you disagree with. +- Your skills, agents, and review rubric encode *your* team's standards. +- Your coverage threshold, lint rules, and `/review` checklist fail loudly on the + mistakes you care about — a green `/validate` actually *means* something. +- The spec passed `/spec-review` with no open questions. The loop amplifies + whatever the spec says; a vague spec produces confident, wrong code faster. +- The feature is a pattern you have done before (a CRUD slice with a clear + contract), not a novel architectural decision. + +If those conditions are not met, run the framework the normal way, task by task. +The loop is an optimization for the cases you already understand. + +## Cadence and handoff (the loop wrapper) + +Because there is no CI to wait on, this loop self-paces: + +```text +/loop /sdd-build-loop 2026-05-09-create-new-customer +``` + +When it stops you have a feature branch with a drained task queue, a clean +`/validate`, an `APPROVE` from `/review`, and a handoff summary — exactly the +input the human PR gate wants. The human runs the app, reads the diff, and +decides whether to open the PR. Once the PR exists, the `pr-quality-gate` loop +takes over. diff --git a/.windsurf/skills/spec-sharpen-loop/SKILL.md b/.windsurf/skills/spec-sharpen-loop/SKILL.md new file mode 100644 index 0000000..4d12a40 --- /dev/null +++ b/.windsurf/skills/spec-sharpen-loop/SKILL.md @@ -0,0 +1,74 @@ +--- +name: spec-sharpen-loop +description: Turn a raw GitHub issue into a spec that passes `/spec-review` with zero open questions — resolving what it can from existing material and escalating genuine product decisions to a human. The front of the pipeline. Produces a spec and nothing else. +when_to_use: + - A GitHub issue needs to become a sharp spec before `/plan` and the build loop. + - Whenever `/spec` surfaces open questions and you want them triaged and driven to zero in rounds. + - Run as a human-paced loop: `/loop /spec-sharpen-loop ` in Claude Code, or re-invoke this skill each time you answer the product questions on Copilot/Windsurf. +authoritative_references: + - docs/loop-engineering.md + - .windsurf/skills/loop-engineering/SKILL.md + - .windsurf/skills/ears-spec-authoring/SKILL.md + - .windsurf/skills/issue-tracker-ingestion/SKILL.md +--- + +# Spec-sharpen loop (front of the pipeline) + +> Loop shape: **converge-to-green** on the `/spec-review` verdict, gated by a +> human triage step. +> Cadence: **human-paced** — *you* are the sensor it waits on. It advances a +> round each time you answer the product questions it surfaces, then blocks again. +> Budget: escalate if the same question reappears after being answered. + +You are turning issue **#$1** into a spec that passes `/spec-review` with zero +open questions. Work in rounds. The build loop downstream is only safe if the spec +going in is sharp: a vague spec does not produce vague code, it produces +confident, wrong code, faster. + +## Sensors + +- The open-question list `/spec` surfaces for this issue. +- The `/spec-review` verdict (PASS/FAIL) and its findings. + +## Each round + +1. Run `/spec` for the issue (or re-run after answers have been added). +2. Triage **every** open question into one of two buckets: + - **RESOLVABLE** from the issue, linked tickets, or existing specs — atomic + acceptance criteria, missing non-goals, ambiguous wording, contract or + format gaps. Resolve these and update the spec. + - **PRODUCT DECISION** you cannot ground in existing material — pricing, + policy, UX intent, scope trade-offs. Collect these. +3. If any PRODUCT DECISION questions remain, **STOP** and post them to the human + as a numbered list. Wait for answers; do not invent them. +4. Run `/spec-review`. If FAIL, address the structural findings and start another + round. + +## Success criteria (the goal) + +`/spec-review` returns **PASS** with zero open questions remaining. Then STOP — +the spec is ready for `/plan` and the build loop. This loop produces a spec and +nothing else. + +## Guardrails — non-negotiable + +- NEVER answer a product-decision question yourself to force a PASS. A clean spec + built on guessed intent is the most expensive failure in the whole pipeline — + every downstream loop trusts this spec, and will build, test, and ship the wrong + thing with great efficiency. +- NEVER start design or code. No `/plan`, no `/build`. +- If the same question reappears after being answered, escalate; the issue itself + may be underspecified. + +## Cadence (the loop wrapper) + +This is the one loop that pauses inside itself for a human on purpose: + +```text +/loop /spec-sharpen-loop 1 +``` + +It is human-paced by design. You are not waiting on a clock or on CI; the loop is +waiting on *your* product decisions. Answer them, and it advances a round; until +you do, it blocks. On Copilot or Windsurf, the cadence is the same — re-invoke the +skill after you have answered the questions it raised. diff --git a/CHANGELOG.md b/CHANGELOG.md index 14bcc99..3bc407e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,12 @@ What counts as a breaking change for this toolkit is defined in [CONTRIBUTING.md ## [Unreleased] ### Added +- **Loop engineering** — the `loop-engineering` skill (the six-part loop pattern, + four loop shapes, and the issue-to-merged pipeline) plus four concrete loop + skills: `spec-sharpen-loop`, `sdd-build-loop`, `pr-quality-gate`, and + `pr-review-response`. Mirrored across `.claude/`, `.github/`, and `.windsurf/`. +- `docs/loop-engineering.md` documenting the loop pattern, shapes, composition, + and guardrails, linked from the README. - `LICENSE` (MIT) at repo root. - `.gitignore` covering macOS, IDE, Java/Maven, Node/Angular, and harness artifacts. - `.editorconfig` for consistent line endings and indentation. diff --git a/README.md b/README.md index c93f0ca..5aae53b 100644 --- a/README.md +++ b/README.md @@ -171,10 +171,34 @@ The same gates the agent runs are reachable from a normal terminal: ./.github/scripts/traceability.sh ``` +## Loop engineering (optional) + +Once you trust the harness, you can wrap the per-feature commands in **bounded +agentic loops** that converge on a checkable goal instead of running forever or +giving up too early. The toolkit ships the pattern as the `loop-engineering` +skill, plus four concrete loops that compose into an issue-to-merged pipeline: + +| Loop | Skill | Drives | Cadence | +| --- | --- | --- | --- | +| Spec-sharpen | `spec-sharpen-loop` | issue → a spec that passes `/spec-review` | human-paced | +| Build | `sdd-build-loop` | sharp spec → a clean, `/review`-approved branch | self-paced | +| Ship | `pr-quality-gate` | open PR → all CI gates green | polled on CI | +| Review-response | `pr-review-response` | green PR → every reviewer thread addressed | polled on review | + +Human judgment lands at exactly two points — answering genuine product questions +at the front, and deciding a branch is worth a PR in the middle. Every loop is +bounded, never merges on its own, and escalates when it is stuck. + +In Claude Code these run via the native `/loop` command (for example +`/loop 10m /pr-quality-gate 1234`). On Copilot and Windsurf you drive the same +cadence by re-invoking the skill. See +[docs/loop-engineering.md](docs/loop-engineering.md) for the full pattern, the +four loop shapes, and the trust you have to earn before running the build loop. + ## Repository layout ```text -docs/ methodology · harness-principles · spec-format · platform-mapping · artifact-contract +docs/ methodology · harness-principles · spec-format · platform-mapping · artifact-contract · loop-engineering .claude/ agents · skills · commands · hooks · templates · checklists · maven · settings.json (Claude Code) .github/ chatmodes · prompts · instructions · skills · templates · checklists · maven · scripts · workflows/ (Copilot + CI) .windsurf/ rules · workflows · skills · templates · checklists · maven (Windsurf) @@ -222,6 +246,7 @@ The routing contract is documented in each command's `## Stack routing` section. - [docs/spec-format.md](docs/spec-format.md) — EARS-lite spec format with examples - [docs/platform-mapping.md](docs/platform-mapping.md) — how Claude/Copilot/Windsurf artifacts map - [docs/artifact-contract.md](docs/artifact-contract.md) — `.specs//` file layout and `.tdd-state.json` schema +- [docs/loop-engineering.md](docs/loop-engineering.md) — bounded agentic loops and the issue-to-merged pipeline - [examples/greenfield/README.md](examples/greenfield/README.md) — full worked feature - [examples/brownfield/README.md](examples/brownfield/README.md) — onboarding-only walkthrough diff --git a/docs/loop-engineering.md b/docs/loop-engineering.md new file mode 100644 index 0000000..945c96f --- /dev/null +++ b/docs/loop-engineering.md @@ -0,0 +1,112 @@ +# Loop Engineering — Driving an Issue to Merged + +Prompting an agent to fix a failing test is easy. Designing a system where one +loop sharpens the spec, another builds the feature to a clean reviewed branch, a +human makes the two calls that matter, and two more loops drive the pull request +through CI and human review — that is the engineering. + +**Loop engineering** is designing agentic loops that converge on a goal instead of +running forever or giving up too early. This toolkit ships the pattern as the +[`loop-engineering`](../.claude/skills/loop-engineering/SKILL.md) skill, plus four +concrete loops that wrap the existing SDD commands. + +## What an agentic loop actually is + +A loop is not "run the agent again." A loop is a control system: the agent +observes the world, compares it to a goal, acts to close the gap, then observes +again. The skill is not in the acting — coding agents are already good at acting. +The skill is in defining *what the agent observes* and *when it is allowed to +stop*. + +Every loop worth building has six parts. If you can name all six for your task, +you have a loop. If you cannot, you have a prompt you are running by hand. + +- **Goal / success criteria** — a precise, checkable definition of "done". +- **Sensors** — how the agent observes current state. +- **Action** — what the agent changes when state does not match the goal. +- **Cadence** — how often the loop runs. +- **Bounded budget** — the escape hatch: max iterations, a deadline, a token cap. +- **Guardrails** — the rules about what the agent must *not* do. + +Most failed loops have a clear goal and good sensors, then run away because nobody +defined the budget or the guardrails. + +## Four loop shapes + +The outer structure is always the same; the definition of *progress* differs. + +| Shape | "Done" means | Failure mode to watch | +| --- | --- | --- | +| Converge-to-green | every binary check passes | oscillating: fixing A breaks B | +| Incremental-to-a-number | a metric crosses a threshold | gaming the metric | +| Queue / batch | the work queue is empty | one hard item blocks the rest | +| Repeat-until-confident | confidence over many runs | declaring victory after one run | + +See the [`loop-engineering`](../.claude/skills/loop-engineering/SKILL.md) skill for +the detail on each shape and its tuning. + +## The four SDD loops + +Each loop wraps SDD commands that already supply the sensors. They map one-to-one +onto Claude Code's `/loop` command and Skills; on Copilot and Windsurf you drive +the same cadence by re-invoking the skill. + +| Loop | Skill | Wraps | Shape | Cadence | +| --- | --- | --- | --- | --- | +| 3 — Spec-sharpen | [`spec-sharpen-loop`](../.claude/skills/spec-sharpen-loop/SKILL.md) | `/spec`, `/spec-review` | converge-to-green | human-paced | +| 2 — Build | [`sdd-build-loop`](../.claude/skills/sdd-build-loop/SKILL.md) | `/plan`, `/build`, `/validate`, `/review` | queue + converge | self-paced | +| 1 — Ship | [`pr-quality-gate`](../.claude/skills/pr-quality-gate/SKILL.md) | CI gates on a PR | converge + climber | polled on CI | +| 4 — Review-response | [`pr-review-response`](../.claude/skills/pr-review-response/SKILL.md) | human review threads | queue | polled on review | + +## Composing them: issue → merged + +```text + issue + │ LOOP 3: spec-sharpen (human-paced) → escalate product decisions + ▼ sharp spec + │ LOOP 2: build (self-paced) → auto-commit on a feature branch + ▼ clean branch + │ HUMAN GATE: run the app, read the diff, decide — open the PR? + ▼ pull request + │ LOOP 1: ship (polled on CI) → all gates green + ▼ green PR, under review + │ LOOP 4: review-response (polled on review) → no threads remain + ▼ + human merges +``` + +Human judgment lands at exactly two points: answering genuine product questions at +the front, and deciding a branch is worth a PR in the middle. Everything else is a +loop converging on a checkable goal. Each loop's cadence is dictated by how fast +the thing it watches can actually change — match cadence to your sensors and you +never pay for a check that cannot tell you anything new. + +## The trust you have to earn + +The build loop is the one to be careful with, because it writes the feature +instead of polishing it. It does not *remove* human review — it **relocates** it, +replacing the per-task checkpoint with automated sensors (`/validate`, `/review`) +and one deliberate human validation round at the end. That trade is only safe once +the sensors are trustworthy enough to stand in for you on the small stuff. You +graduate into this loop; you do not start here. The trust checklist lives in the +[`sdd-build-loop`](../.claude/skills/sdd-build-loop/SKILL.md) skill. + +## Guardrails for any loop + +- **Bound everything** — a maximum iteration count, ideally a wall-clock deadline. +- **Never let a loop merge** — loops drive to merge-*ready* and stop. +- **Forbid metric-gaming explicitly** — spell out the forbidden shortcuts. +- **Make every action reviewable and reversible** — one change per commit, branch + only, never force-push. +- **Escalate on low confidence** — stopping and saying so beats a tenth desperate + commit. +- **Match cadence to your sensors** — poll only when waiting on something slow. + +## When to use a loop, and when not to + +Use a loop when the goal is **objectively checkable**, progress is +**incremental**, and the work is **tedious enough** that convergence dominates the +thinking. A loop is the wrong tool when the goal is subjective, when there is no +reliable sensor, or when the hard part is a single decision. If you cannot write +the "are we done?" check as something that returns a boolean, you do not have a +loop yet — you have a conversation, and you should just have the conversation. From e03c703a76d2b3426081644f2b446e7351e32813 Mon Sep 17 00:00:00 2001 From: Loiane Date: Sun, 28 Jun 2026 12:01:26 -0400 Subject: [PATCH 2/3] Soft-wrap loop-engineering markdown to match repo convention Prose and list items were hard-wrapped at ~80 chars, which fights markdown. Reflow to one line per paragraph/list item across the five loop skills (all three platform mirrors), docs/loop-engineering.md, the README section, and the CHANGELOG entry. Tables, code blocks, and diagrams are unchanged. --- .claude/skills/loop-engineering/SKILL.md | 112 +++++-------------- .claude/skills/pr-quality-gate/SKILL.md | 53 +++------ .claude/skills/pr-review-response/SKILL.md | 39 ++----- .claude/skills/sdd-build-loop/SKILL.md | 64 +++-------- .claude/skills/spec-sharpen-loop/SKILL.md | 41 ++----- .github/skills/loop-engineering/SKILL.md | 112 +++++-------------- .github/skills/pr-quality-gate/SKILL.md | 53 +++------ .github/skills/pr-review-response/SKILL.md | 39 ++----- .github/skills/sdd-build-loop/SKILL.md | 64 +++-------- .github/skills/spec-sharpen-loop/SKILL.md | 41 ++----- .windsurf/skills/loop-engineering/SKILL.md | 112 +++++-------------- .windsurf/skills/pr-quality-gate/SKILL.md | 53 +++------ .windsurf/skills/pr-review-response/SKILL.md | 39 ++----- .windsurf/skills/sdd-build-loop/SKILL.md | 64 +++-------- .windsurf/skills/spec-sharpen-loop/SKILL.md | 41 ++----- CHANGELOG.md | 8 +- README.md | 15 +-- docs/loop-engineering.md | 56 ++-------- 18 files changed, 266 insertions(+), 740 deletions(-) diff --git a/.claude/skills/loop-engineering/SKILL.md b/.claude/skills/loop-engineering/SKILL.md index 9b12ec5..a1d8b53 100644 --- a/.claude/skills/loop-engineering/SKILL.md +++ b/.claude/skills/loop-engineering/SKILL.md @@ -15,37 +15,20 @@ authoritative_references: # Loop engineering -> A loop is not "run the agent again." A loop is a control system: observe the -> world, compare it to a goal, act to close the gap, observe again. The skill is -> not in the acting — agents are already good at acting. The skill is in defining -> *what the agent observes* and *when it is allowed to stop*. +> A loop is not "run the agent again." A loop is a control system: observe the world, compare it to a goal, act to close the gap, observe again. The skill is not in the acting — agents are already good at acting. The skill is in defining *what the agent observes* and *when it is allowed to stop*. ## The six parts of every loop -If you can name all six for your task, you have a loop. If you cannot, you have a -prompt you are running by hand. - -- **Goal / success criteria.** A precise, checkable definition of "done" that a - script could answer yes or no to. Not "make the PR good" — "all checks green, - coverage ≥ 90%, zero new static-analysis issues". -- **Sensors.** How the agent observes current state. This is the part people skip - and the part that matters most; a loop is only as good as what it can measure. - CI status, a coverage report, a Sonar API response, the output of a test run - repeated twenty times. -- **Action.** What the agent changes when state does not match the goal. Fix the - code, write the missing test, bump the dependency, commit, push. -- **Cadence.** How often the loop runs. Event-driven (wait for CI), self-paced - (drain a local queue as fast as the work allows), or human-paced (wait on an - answer). The wrong cadence either wastes money or misses the signal. -- **Bounded budget.** The escape hatch: maximum iterations, a wall-clock - deadline, or a token cap. This is what prevents a loop from grinding forever on - a problem it cannot solve. -- **Guardrails.** The rules about what the agent must *not* do. Never merge. - Never force-push. Never touch files outside the change. Never disable a failing - test to make it pass. - -Most failed loops have a clear goal and good sensors, then run away because nobody -defined the **budget** or the **guardrails**. Those last two are non-negotiable. +If you can name all six for your task, you have a loop. If you cannot, you have a prompt you are running by hand. + +- **Goal / success criteria.** A precise, checkable definition of "done" that a script could answer yes or no to. Not "make the PR good" — "all checks green, coverage ≥ 90%, zero new static-analysis issues". +- **Sensors.** How the agent observes current state. This is the part people skip and the part that matters most; a loop is only as good as what it can measure. CI status, a coverage report, a Sonar API response, the output of a test run repeated twenty times. +- **Action.** What the agent changes when state does not match the goal. Fix the code, write the missing test, bump the dependency, commit, push. +- **Cadence.** How often the loop runs. Event-driven (wait for CI), self-paced (drain a local queue as fast as the work allows), or human-paced (wait on an answer). The wrong cadence either wastes money or misses the signal. +- **Bounded budget.** The escape hatch: maximum iterations, a wall-clock deadline, or a token cap. This is what prevents a loop from grinding forever on a problem it cannot solve. +- **Guardrails.** The rules about what the agent must *not* do. Never merge. Never force-push. Never touch files outside the change. Never disable a failing test to make it pass. + +Most failed loops have a clear goal and good sensors, then run away because nobody defined the **budget** or the **guardrails**. Those last two are non-negotiable. ## The shape of a loop @@ -78,15 +61,11 @@ defined the **budget** or the **guardrails**. Those last two are non-negotiable. └──────────────────┘ ``` -The escalation branch is not a failure — it is the loop doing its job. A loop -that knows when to hand the problem back is more useful than one that pretends it -can solve everything. +The escalation branch is not a failure — it is the loop doing its job. A loop that knows when to hand the problem back is more useful than one that pretends it can solve everything. ## Four loop shapes -The outer structure (observe → compare → act → check budget → repeat) is the -same. What differs is the definition of *progress*. Tune each shape to its own -failure mode. +The outer structure (observe → compare → act → check budget → repeat) is the same. What differs is the definition of *progress*. Tune each shape to its own failure mode. | Shape | "Done" means | Failure mode to watch | | --- | --- | --- | @@ -97,39 +76,28 @@ failure mode. ### Converge-to-green -State is binary per check. Re-run **all** sensors after each change, not just the -one you were working on, so you notice when you traded one red for another. +State is binary per check. Re-run **all** sensors after each change, not just the one you were working on, so you notice when you traded one red for another. ### Incremental-to-a-number -Always attack the lowest-scoring item next. Demand *meaningful* change — an agent -told to "increase coverage" will write tests that execute code without asserting -anything, because that moves the metric. Your success criteria must forbid it. +Always attack the lowest-scoring item next. Demand *meaningful* change — an agent told to "increase coverage" will write tests that execute code without asserting anything, because that moves the metric. Your success criteria must forbid it. ### Queue / batch -Drain a work queue one item at a time; "done" is an empty queue. Give each item -its own small iteration cap. When an item exhausts it, **skip it, leave it in the -queue, and report it** — do not let one hard item block the easy ones. +Drain a work queue one item at a time; "done" is an empty queue. Give each item its own small iteration cap. When an item exhausts it, **skip it, leave it in the queue, and report it** — do not let one hard item block the easy ones. ### Repeat-until-confident -The signal is non-deterministic, so a single observation tells you almost -nothing. "Done" is *passed N consecutive runs*, where N grows with how -intermittent the original failure was. Never declare victory after one run. +The signal is non-deterministic, so a single observation tells you almost nothing. "Done" is *passed N consecutive runs*, where N grows with how intermittent the original failure was. Never declare victory after one run. ## The four SDD loops These wrap the existing SDD commands, which already supply the sensors: -1. **Spec-sharpen** (`spec-sharpen-loop`) — issue → a spec that passes - `/spec-review` with zero open questions. Human-paced. -2. **Build** (`sdd-build-loop`) — sharp spec → a clean, `/review`-APPROVED - feature branch. Self-paced. A queue (drain `/build` tasks) wrapping a - converge-to-green (fix until `/validate` and `/review` are clean). +1. **Spec-sharpen** (`spec-sharpen-loop`) — issue → a spec that passes `/spec-review` with zero open questions. Human-paced. +2. **Build** (`sdd-build-loop`) — sharp spec → a clean, `/review`-APPROVED feature branch. Self-paced. A queue (drain `/build` tasks) wrapping a converge-to-green (fix until `/validate` and `/review` are clean). 3. **Ship** (`pr-quality-gate`) — open PR → merge-ready. Polled on CI. -4. **Review-response** (`pr-review-response`) — green PR → every actionable - reviewer thread addressed. Polled on review activity. +4. **Review-response** (`pr-review-response`) — green PR → every actionable reviewer thread addressed. Polled on review activity. ## Composing them: issue → merged @@ -148,41 +116,21 @@ These wrap the existing SDD commands, which already supply the sensors: human merges ``` -Human judgment lands at exactly two points: answering genuine product questions -at the front, and deciding a branch is worth a PR in the middle. Everything else -is a loop converging on a checkable goal. Each loop's cadence is dictated by how -fast the thing it watches can actually change — match cadence to your sensors and -you never pay for a check that cannot tell you anything new. +Human judgment lands at exactly two points: answering genuine product questions at the front, and deciding a branch is worth a PR in the middle. Everything else is a loop converging on a checkable goal. Each loop's cadence is dictated by how fast the thing it watches can actually change — match cadence to your sensors and you never pay for a check that cannot tell you anything new. ## Guardrails for any loop -- **Bound everything.** Every loop needs a maximum iteration count, ideally a - wall-clock deadline too. A loop without a budget is a runaway process. -- **Never let a loop merge.** Loops drive to merge-*ready* and stop. A human owns - the merge. -- **Forbid metric-gaming explicitly.** Any loop pointed at a number or a verdict - will try the easy way — disabling a flaky test, writing assertion-free tests, - editing the spec so `/review` stops complaining. Spell out the forbidden - shortcuts; the agent will not infer them. -- **Make every action reviewable and reversible.** One logical change per commit, - conventional messages, branch only, never force-push. -- **Escalate on low confidence.** "I tried these three fixes and none worked, - here is what I observed" beats a tenth desperate commit. -- **Match cadence to your sensors.** Poll on an interval only when waiting on - something slow like CI. When sensors are local, run as fast as the work allows. +- **Bound everything.** Every loop needs a maximum iteration count, ideally a wall-clock deadline too. A loop without a budget is a runaway process. +- **Never let a loop merge.** Loops drive to merge-*ready* and stop. A human owns the merge. +- **Forbid metric-gaming explicitly.** Any loop pointed at a number or a verdict will try the easy way — disabling a flaky test, writing assertion-free tests, editing the spec so `/review` stops complaining. Spell out the forbidden shortcuts; the agent will not infer them. +- **Make every action reviewable and reversible.** One logical change per commit, conventional messages, branch only, never force-push. +- **Escalate on low confidence.** "I tried these three fixes and none worked, here is what I observed" beats a tenth desperate commit. +- **Match cadence to your sensors.** Poll on an interval only when waiting on something slow like CI. When sensors are local, run as fast as the work allows. ## When to use a loop, and when not to -Use a loop when three things are true: the goal is **objectively checkable** (a -sensor returns yes/no), progress is **incremental** (each iteration gets -measurably closer), and the work is **tedious enough** that convergence dominates -the thinking. +Use a loop when three things are true: the goal is **objectively checkable** (a sensor returns yes/no), progress is **incremental** (each iteration gets measurably closer), and the work is **tedious enough** that convergence dominates the thinking. -A loop is the *wrong* tool when the goal is subjective ("make this API design -elegant"), when there is no reliable sensor, or when the hard part is a single -decision rather than a sequence of mechanical steps. If you cannot write the -"are we done?" check as something that returns a boolean, you do not have a loop -yet — you have a conversation, and you should just have the conversation. +A loop is the *wrong* tool when the goal is subjective ("make this API design elegant"), when there is no reliable sensor, or when the hard part is a single decision rather than a sequence of mechanical steps. If you cannot write the "are we done?" check as something that returns a boolean, you do not have a loop yet — you have a conversation, and you should just have the conversation. -The criteria *is* the work. Once you can state precisely what "done" means and how -a sensor would detect it, the loop almost writes itself. +The criteria *is* the work. Once you can state precisely what "done" means and how a sensor would detect it, the loop almost writes itself. diff --git a/.claude/skills/pr-quality-gate/SKILL.md b/.claude/skills/pr-quality-gate/SKILL.md index 963a025..f27cfc9 100644 --- a/.claude/skills/pr-quality-gate/SKILL.md +++ b/.claude/skills/pr-quality-gate/SKILL.md @@ -14,25 +14,18 @@ authoritative_references: # PR quality gate (the ship loop) -> Loop shape: **converge-to-green**, with a **coverage climber** nested inside. -> Cadence: **polled on CI** (the sensor only changes once per CI run). -> Budget: **10 iterations**, then escalate. +> Loop shape: **converge-to-green**, with a **coverage climber** nested inside. Cadence: **polled on CI** (the sensor only changes once per CI run). Budget: **10 iterations**, then escalate. -You are driving pull request **#$1** to merge-ready. Do exactly **one pass** per -invocation. The loop runner (`/loop`) supplies the cadence and the budget; this -skill supplies the goal, the sensors, the action, and the guardrails. +You are driving pull request **#$1** to merge-ready. Do exactly **one pass** per invocation. The loop runner (`/loop`) supplies the cadence and the budget; this skill supplies the goal, the sensors, the action, and the guardrails. ## Sensors — gather current state first -Re-run *all* of these every pass, not just the gate you fixed last time, so you -catch a fix that traded one red for another: +Re-run *all* of these every pass, not just the gate you fixed last time, so you catch a fix that traded one red for another: - `gh pr checks $1` — CI check status (unit, integration, architecture tests). - `gh pr view $1 --json statusCheckRollup,mergeable` — the rollup and mergeability. -- The coverage report from the latest CI run artifacts (or - `./.github/scripts/check-new-code-coverage.sh` locally). -- The static-analysis report (Checkstyle / SpotBugs / Sonar) for **new** issues - introduced by this PR. +- The coverage report from the latest CI run artifacts (or `./.github/scripts/check-new-code-coverage.sh` locally). +- The static-analysis report (Checkstyle / SpotBugs / Sonar) for **new** issues introduced by this PR. ## Success criteria (the goal) @@ -40,53 +33,39 @@ The PR is merge-ready ONLY when ALL of these are true: - Every CI check is passing. - Checkstyle / static analysis reports zero new violations on this PR. -- Line coverage on new code is at or above the project threshold (see - `jacoco-coverage-policy`). +- Line coverage on new code is at or above the project threshold (see `jacoco-coverage-policy`). - Sonar (if configured) reports zero new issues on this PR. -When all are true, post a comment `✅ All gates green — ready to merge` and -**STOP**. Do not merge the PR yourself. +When all are true, post a comment `✅ All gates green — ready to merge` and **STOP**. Do not merge the PR yourself. ## Action — fix exactly what is failing -For each failing gate, make the **smallest correct change that fixes the root -cause**, then commit with a conventional-commit message and push to the PR branch. +For each failing gate, make the **smallest correct change that fixes the root cause**, then commit with a conventional-commit message and push to the PR branch. -- **Tests / Checkstyle / Sonar** (converge-to-green): fix the code, then re-run - *all* sensors so a fix for one gate has not broken another. -- **Coverage** (incremental-to-a-number): find the class furthest below the line, - write *meaningful* tests for its uncovered branches, re-check the number, - repeat with the next-lowest file. Tests must assert behavior, not just execute - lines. +- **Tests / Checkstyle / Sonar** (converge-to-green): fix the code, then re-run *all* sensors so a fix for one gate has not broken another. +- **Coverage** (incremental-to-a-number): find the class furthest below the line, write *meaningful* tests for its uncovered branches, re-check the number, repeat with the next-lowest file. Tests must assert behavior, not just execute lines. ## Guardrails — non-negotiable - NEVER merge, approve, or close the PR. - NEVER force-push or rewrite history. - NEVER disable, skip, `@Disabled`, or delete a test to make a gate pass. -- NEVER lower a coverage threshold or add a Checkstyle/SpotBugs suppression to - silence a finding instead of fixing it. +- NEVER lower a coverage threshold or add a Checkstyle/SpotBugs suppression to silence a finding instead of fixing it. - NEVER touch files unrelated to this PR's diff. -- If a failure is ambiguous, or you are not confident in the fix, post a comment - explaining what you observed and what you tried, then STOP for a human. +- If a failure is ambiguous, or you are not confident in the fix, post a comment explaining what you observed and what you tried, then STOP for a human. ## Cadence and budget (the loop wrapper) -Poll roughly as often as CI returns — there is no point checking every thirty -seconds when the signal only changes once per CI run: +Poll roughly as often as CI returns — there is no point checking every thirty seconds when the signal only changes once per CI run: ```text /loop 10m /pr-quality-gate 1234 ``` -Stop when the PR posts the "ready to merge" comment, **or** after **10 -iterations**, whichever comes first. On the tenth iteration without success, -summarize what is still failing and what you tried, then stop and escalate. +Stop when the PR posts the "ready to merge" comment, **or** after **10 iterations**, whichever comes first. On the tenth iteration without success, summarize what is still failing and what you tried, then stop and escalate. -On Copilot or Windsurf (no native `/loop`), drive the cadence yourself: invoke -this skill once per CI cycle and stop on the same two conditions. +On Copilot or Windsurf (no native `/loop`), drive the cadence yourself: invoke this skill once per CI cycle and stop on the same two conditions. ## Handoff -A green PR is the input the `pr-review-response` loop wants. When a human reviewer -leaves comments, switch to that loop to work them to zero. +A green PR is the input the `pr-review-response` loop wants. When a human reviewer leaves comments, switch to that loop to work them to zero. diff --git a/.claude/skills/pr-review-response/SKILL.md b/.claude/skills/pr-review-response/SKILL.md index 29c0471..1ca1ee1 100644 --- a/.claude/skills/pr-review-response/SKILL.md +++ b/.claude/skills/pr-review-response/SKILL.md @@ -14,47 +14,33 @@ authoritative_references: # PR review-response loop (end of the pipeline) -> Loop shape: **queue** — drain the unresolved review threads one at a time. -> The mirror image of the ship loop: that one converges the PR on CI, this one -> converges it on the humans. -> Cadence: **polled on review activity** (reviewers comment over hours, not -> seconds). -> Budget: escalate on disagreement or ambiguity; never argue in a thread. +> Loop shape: **queue** — drain the unresolved review threads one at a time. The mirror image of the ship loop: that one converges the PR on CI, this one converges it on the humans. Cadence: **polled on review activity** (reviewers comment over hours, not seconds). Budget: escalate on disagreement or ambiguity; never argue in a thread. -You are responding to human review on PR **#$1**. Do exactly **one pass** per -invocation. +You are responding to human review on PR **#$1**. Do exactly **one pass** per invocation. ## Sensors -- `gh pr view $1 --json reviews,comments,reviewThreads` — list every - **UNRESOLVED** thread. +- `gh pr view $1 --json reviews,comments,reviewThreads` — list every **UNRESOLVED** thread. - `gh pr checks $1` — current gate status. ## Triage each unresolved thread -- **CHANGE REQUESTED** ("do X", "rename Y", "handle case Z"): make the smallest - correct change, commit, push, and reply in the thread linking the commit. Do - NOT resolve the thread — the reviewer does that. -- **QUESTION** ("why did you…?"): reply in the thread with an explanation. Change - code only if the honest answer is "you're right, fixing it". -- **DISAGREEMENT or ambiguous intent**: STOP and escalate to your human. Do not - argue in the thread, and do not silently comply. +- **CHANGE REQUESTED** ("do X", "rename Y", "handle case Z"): make the smallest correct change, commit, push, and reply in the thread linking the commit. Do NOT resolve the thread — the reviewer does that. +- **QUESTION** ("why did you…?"): reply in the thread with an explanation. Change code only if the honest answer is "you're right, fixing it". +- **DISAGREEMENT or ambiguous intent**: STOP and escalate to your human. Do not argue in the thread, and do not silently comply. -After any code change, re-run the ship-loop gates (`pr-quality-gate`) so a review -fix does not re-break CI. +After any code change, re-run the ship-loop gates (`pr-quality-gate`) so a review fix does not re-break CI. ## Success criteria (the goal) -No unresolved actionable threads remain AND all gates are green. Then STOP. Do not -merge. +No unresolved actionable threads remain AND all gates are green. Then STOP. Do not merge. ## Guardrails — non-negotiable - NEVER merge, approve, or resolve a reviewer's thread. - NEVER push back on a reviewer; escalate disagreements to your human. - One logical change per commit; never force-push. -- If a comment needs a product or architecture decision, escalate instead of - deciding it inside the loop. +- If a comment needs a product or architecture decision, escalate instead of deciding it inside the loop. ## Cadence (the loop wrapper) @@ -64,9 +50,4 @@ Poll on an interval, because the sensor is human review activity: /loop 15m /pr-review-response 1234 ``` -The hard part is not the editing, it is the **triage**. A loop that treats every -comment as "change requested" will rewrite code in response to a reviewer who was -only asking a question; a loop that treats every comment as a question will ignore -real change requests. Spelling out the difference, and forcing an escalation on -disagreement, is what keeps the loop from arguing with your reviewer on your -behalf. On Copilot or Windsurf, re-invoke this skill as new comments arrive. +The hard part is not the editing, it is the **triage**. A loop that treats every comment as "change requested" will rewrite code in response to a reviewer who was only asking a question; a loop that treats every comment as a question will ignore real change requests. Spelling out the difference, and forcing an escalation on disagreement, is what keeps the loop from arguing with your reviewer on your behalf. On Copilot or Windsurf, re-invoke this skill as new comments arrive. diff --git a/.claude/skills/sdd-build-loop/SKILL.md b/.claude/skills/sdd-build-loop/SKILL.md index b4d2416..4439109 100644 --- a/.claude/skills/sdd-build-loop/SKILL.md +++ b/.claude/skills/sdd-build-loop/SKILL.md @@ -15,72 +15,46 @@ authoritative_references: # SDD build loop (the build loop) -> Loop shape: a **queue** (drain `/build` tasks) wrapping a **converge-to-green** -> (fix until `/validate` and `/review` are clean). -> Cadence: **self-paced** — no CI to wait on; run the next step the moment the -> last one finishes. -> Budget: a cap on review-fix rounds, then escalate. +> Loop shape: a **queue** (drain `/build` tasks) wrapping a **converge-to-green** (fix until `/validate` and `/review` are clean). Cadence: **self-paced** — no CI to wait on; run the next step the moment the last one finishes. Budget: a cap on review-fix rounds, then escalate. -You are building the feature in `.specs/$1` to a reviewed, committable state on -the **current feature branch**. Confirm you are **NOT on `main`** before doing -anything. The framework *is* the sensor layer here: `.tdd-state.json`, the -`/validate` output, and the `/review` verdict. +You are building the feature in `.specs/$1` to a reviewed, committable state on the **current feature branch**. Confirm you are **NOT on `main`** before doing anything. The framework *is* the sensor layer here: `.tdd-state.json`, the `/validate` output, and the `/review` verdict. ## Phase 1 — drain the task queue (queue shape) 1. Read `.specs/$1/.tdd-state.json` for the next task whose phase is `pending`. -2. Run `/build` for that task. Let it move `pending → red → green → refactor → - done` and commit on completion (one task per commit). +2. Run `/build` for that task. Let it move `pending → red → green → refactor → done` and commit on completion (one task per commit). 3. Repeat until no `pending` tasks remain. -4. If one task fails its gates **twice in a row**, STOP and escalate. Do not skip - it silently — a missing task is not a done feature. +4. If one task fails its gates **twice in a row**, STOP and escalate. Do not skip it silently — a missing task is not a done feature. ## Phase 2 — validate and review (converge-to-green) -1. Run `/validate`. If anything fails (tests, coverage, lint, traceability), fix - the root cause, commit, and run `/validate` again. -2. Run `/review`. If the verdict has ANY must-fix findings, fix each with the - smallest correct change, commit, and run `/review` again. -3. Stop Phase 2 when `/validate` is clean AND `/review` returns APPROVE with zero - must-fix findings. Leave should-fix and nit findings in the handoff summary for - the human — do not gold-plate. +1. Run `/validate`. If anything fails (tests, coverage, lint, traceability), fix the root cause, commit, and run `/validate` again. +2. Run `/review`. If the verdict has ANY must-fix findings, fix each with the smallest correct change, commit, and run `/review` again. +3. Stop Phase 2 when `/validate` is clean AND `/review` returns APPROVE with zero must-fix findings. Leave should-fix and nit findings in the handoff summary for the human — do not gold-plate. ## Success criteria (the goal) -All tasks done, `/validate` clean, `/review` APPROVE with zero must-fix. Then STOP -and write a handoff summary listing the should-fix and nit items you deliberately -left. **DO NOT open a PR or merge** — that is the human's call. +All tasks done, `/validate` clean, `/review` APPROVE with zero must-fix. Then STOP and write a handoff summary listing the should-fix and nit items you deliberately left. **DO NOT open a PR or merge** — that is the human's call. ## Guardrails — non-negotiable - Work ONLY on the feature branch. NEVER commit to `main`. - NEVER merge, and NEVER open the PR. -- NEVER edit the spec, weaken a test, lower a threshold, or add a suppression to - make a gate pass. +- NEVER edit the spec, weaken a test, lower a threshold, or add a suppression to make a gate pass. - One logical change per commit; conventional messages; never force-push. -- After **5 review-fix rounds** without reaching APPROVE, STOP and escalate with - what is still failing and what you tried. +- After **5 review-fix rounds** without reaching APPROVE, STOP and escalate with what is still failing and what you tried. ## The trust you have to earn -This loop does not *remove* human review — it **relocates** it. It replaces the -per-task checkpoint with automated sensors (`/validate`, `/review`) on the small -stuff and one deliberate human validation round at the end. That trade is only -safe once the sensors are trustworthy enough to stand in for you. A feature is -allowed through this loop only when: +This loop does not *remove* human review — it **relocates** it. It replaces the per-task checkpoint with automated sensors (`/validate`, `/review`) on the small stuff and one deliberate human validation round at the end. That trade is only safe once the sensors are trustworthy enough to stand in for you. A feature is allowed through this loop only when: -- You have shipped enough features with the framework that `/review` rarely - surfaces something you disagree with. +- You have shipped enough features with the framework that `/review` rarely surfaces something you disagree with. - Your skills, agents, and review rubric encode *your* team's standards. -- Your coverage threshold, lint rules, and `/review` checklist fail loudly on the - mistakes you care about — a green `/validate` actually *means* something. -- The spec passed `/spec-review` with no open questions. The loop amplifies - whatever the spec says; a vague spec produces confident, wrong code faster. -- The feature is a pattern you have done before (a CRUD slice with a clear - contract), not a novel architectural decision. +- Your coverage threshold, lint rules, and `/review` checklist fail loudly on the mistakes you care about — a green `/validate` actually *means* something. +- The spec passed `/spec-review` with no open questions. The loop amplifies whatever the spec says; a vague spec produces confident, wrong code faster. +- The feature is a pattern you have done before (a CRUD slice with a clear contract), not a novel architectural decision. -If those conditions are not met, run the framework the normal way, task by task. -The loop is an optimization for the cases you already understand. +If those conditions are not met, run the framework the normal way, task by task. The loop is an optimization for the cases you already understand. ## Cadence and handoff (the loop wrapper) @@ -90,8 +64,4 @@ Because there is no CI to wait on, this loop self-paces: /loop /sdd-build-loop 2026-05-09-create-new-customer ``` -When it stops you have a feature branch with a drained task queue, a clean -`/validate`, an `APPROVE` from `/review`, and a handoff summary — exactly the -input the human PR gate wants. The human runs the app, reads the diff, and -decides whether to open the PR. Once the PR exists, the `pr-quality-gate` loop -takes over. +When it stops you have a feature branch with a drained task queue, a clean `/validate`, an `APPROVE` from `/review`, and a handoff summary — exactly the input the human PR gate wants. The human runs the app, reads the diff, and decides whether to open the PR. Once the PR exists, the `pr-quality-gate` loop takes over. diff --git a/.claude/skills/spec-sharpen-loop/SKILL.md b/.claude/skills/spec-sharpen-loop/SKILL.md index 81733c2..b279c06 100644 --- a/.claude/skills/spec-sharpen-loop/SKILL.md +++ b/.claude/skills/spec-sharpen-loop/SKILL.md @@ -14,16 +14,9 @@ authoritative_references: # Spec-sharpen loop (front of the pipeline) -> Loop shape: **converge-to-green** on the `/spec-review` verdict, gated by a -> human triage step. -> Cadence: **human-paced** — *you* are the sensor it waits on. It advances a -> round each time you answer the product questions it surfaces, then blocks again. -> Budget: escalate if the same question reappears after being answered. +> Loop shape: **converge-to-green** on the `/spec-review` verdict, gated by a human triage step. Cadence: **human-paced** — *you* are the sensor it waits on. It advances a round each time you answer the product questions it surfaces, then blocks again. Budget: escalate if the same question reappears after being answered. -You are turning issue **#$1** into a spec that passes `/spec-review` with zero -open questions. Work in rounds. The build loop downstream is only safe if the spec -going in is sharp: a vague spec does not produce vague code, it produces -confident, wrong code, faster. +You are turning issue **#$1** into a spec that passes `/spec-review` with zero open questions. Work in rounds. The build loop downstream is only safe if the spec going in is sharp: a vague spec does not produce vague code, it produces confident, wrong code, faster. ## Sensors @@ -34,31 +27,20 @@ confident, wrong code, faster. 1. Run `/spec` for the issue (or re-run after answers have been added). 2. Triage **every** open question into one of two buckets: - - **RESOLVABLE** from the issue, linked tickets, or existing specs — atomic - acceptance criteria, missing non-goals, ambiguous wording, contract or - format gaps. Resolve these and update the spec. - - **PRODUCT DECISION** you cannot ground in existing material — pricing, - policy, UX intent, scope trade-offs. Collect these. -3. If any PRODUCT DECISION questions remain, **STOP** and post them to the human - as a numbered list. Wait for answers; do not invent them. -4. Run `/spec-review`. If FAIL, address the structural findings and start another - round. + - **RESOLVABLE** from the issue, linked tickets, or existing specs — atomic acceptance criteria, missing non-goals, ambiguous wording, contract or format gaps. Resolve these and update the spec. + - **PRODUCT DECISION** you cannot ground in existing material — pricing, policy, UX intent, scope trade-offs. Collect these. +3. If any PRODUCT DECISION questions remain, **STOP** and post them to the human as a numbered list. Wait for answers; do not invent them. +4. Run `/spec-review`. If FAIL, address the structural findings and start another round. ## Success criteria (the goal) -`/spec-review` returns **PASS** with zero open questions remaining. Then STOP — -the spec is ready for `/plan` and the build loop. This loop produces a spec and -nothing else. +`/spec-review` returns **PASS** with zero open questions remaining. Then STOP — the spec is ready for `/plan` and the build loop. This loop produces a spec and nothing else. ## Guardrails — non-negotiable -- NEVER answer a product-decision question yourself to force a PASS. A clean spec - built on guessed intent is the most expensive failure in the whole pipeline — - every downstream loop trusts this spec, and will build, test, and ship the wrong - thing with great efficiency. +- NEVER answer a product-decision question yourself to force a PASS. A clean spec built on guessed intent is the most expensive failure in the whole pipeline — every downstream loop trusts this spec, and will build, test, and ship the wrong thing with great efficiency. - NEVER start design or code. No `/plan`, no `/build`. -- If the same question reappears after being answered, escalate; the issue itself - may be underspecified. +- If the same question reappears after being answered, escalate; the issue itself may be underspecified. ## Cadence (the loop wrapper) @@ -68,7 +50,4 @@ This is the one loop that pauses inside itself for a human on purpose: /loop /spec-sharpen-loop 1 ``` -It is human-paced by design. You are not waiting on a clock or on CI; the loop is -waiting on *your* product decisions. Answer them, and it advances a round; until -you do, it blocks. On Copilot or Windsurf, the cadence is the same — re-invoke the -skill after you have answered the questions it raised. +It is human-paced by design. You are not waiting on a clock or on CI; the loop is waiting on *your* product decisions. Answer them, and it advances a round; until you do, it blocks. On Copilot or Windsurf, the cadence is the same — re-invoke the skill after you have answered the questions it raised. diff --git a/.github/skills/loop-engineering/SKILL.md b/.github/skills/loop-engineering/SKILL.md index 442a87f..61548f2 100644 --- a/.github/skills/loop-engineering/SKILL.md +++ b/.github/skills/loop-engineering/SKILL.md @@ -15,37 +15,20 @@ authoritative_references: # Loop engineering -> A loop is not "run the agent again." A loop is a control system: observe the -> world, compare it to a goal, act to close the gap, observe again. The skill is -> not in the acting — agents are already good at acting. The skill is in defining -> *what the agent observes* and *when it is allowed to stop*. +> A loop is not "run the agent again." A loop is a control system: observe the world, compare it to a goal, act to close the gap, observe again. The skill is not in the acting — agents are already good at acting. The skill is in defining *what the agent observes* and *when it is allowed to stop*. ## The six parts of every loop -If you can name all six for your task, you have a loop. If you cannot, you have a -prompt you are running by hand. - -- **Goal / success criteria.** A precise, checkable definition of "done" that a - script could answer yes or no to. Not "make the PR good" — "all checks green, - coverage ≥ 90%, zero new static-analysis issues". -- **Sensors.** How the agent observes current state. This is the part people skip - and the part that matters most; a loop is only as good as what it can measure. - CI status, a coverage report, a Sonar API response, the output of a test run - repeated twenty times. -- **Action.** What the agent changes when state does not match the goal. Fix the - code, write the missing test, bump the dependency, commit, push. -- **Cadence.** How often the loop runs. Event-driven (wait for CI), self-paced - (drain a local queue as fast as the work allows), or human-paced (wait on an - answer). The wrong cadence either wastes money or misses the signal. -- **Bounded budget.** The escape hatch: maximum iterations, a wall-clock - deadline, or a token cap. This is what prevents a loop from grinding forever on - a problem it cannot solve. -- **Guardrails.** The rules about what the agent must *not* do. Never merge. - Never force-push. Never touch files outside the change. Never disable a failing - test to make it pass. - -Most failed loops have a clear goal and good sensors, then run away because nobody -defined the **budget** or the **guardrails**. Those last two are non-negotiable. +If you can name all six for your task, you have a loop. If you cannot, you have a prompt you are running by hand. + +- **Goal / success criteria.** A precise, checkable definition of "done" that a script could answer yes or no to. Not "make the PR good" — "all checks green, coverage ≥ 90%, zero new static-analysis issues". +- **Sensors.** How the agent observes current state. This is the part people skip and the part that matters most; a loop is only as good as what it can measure. CI status, a coverage report, a Sonar API response, the output of a test run repeated twenty times. +- **Action.** What the agent changes when state does not match the goal. Fix the code, write the missing test, bump the dependency, commit, push. +- **Cadence.** How often the loop runs. Event-driven (wait for CI), self-paced (drain a local queue as fast as the work allows), or human-paced (wait on an answer). The wrong cadence either wastes money or misses the signal. +- **Bounded budget.** The escape hatch: maximum iterations, a wall-clock deadline, or a token cap. This is what prevents a loop from grinding forever on a problem it cannot solve. +- **Guardrails.** The rules about what the agent must *not* do. Never merge. Never force-push. Never touch files outside the change. Never disable a failing test to make it pass. + +Most failed loops have a clear goal and good sensors, then run away because nobody defined the **budget** or the **guardrails**. Those last two are non-negotiable. ## The shape of a loop @@ -78,15 +61,11 @@ defined the **budget** or the **guardrails**. Those last two are non-negotiable. └──────────────────┘ ``` -The escalation branch is not a failure — it is the loop doing its job. A loop -that knows when to hand the problem back is more useful than one that pretends it -can solve everything. +The escalation branch is not a failure — it is the loop doing its job. A loop that knows when to hand the problem back is more useful than one that pretends it can solve everything. ## Four loop shapes -The outer structure (observe → compare → act → check budget → repeat) is the -same. What differs is the definition of *progress*. Tune each shape to its own -failure mode. +The outer structure (observe → compare → act → check budget → repeat) is the same. What differs is the definition of *progress*. Tune each shape to its own failure mode. | Shape | "Done" means | Failure mode to watch | | --- | --- | --- | @@ -97,39 +76,28 @@ failure mode. ### Converge-to-green -State is binary per check. Re-run **all** sensors after each change, not just the -one you were working on, so you notice when you traded one red for another. +State is binary per check. Re-run **all** sensors after each change, not just the one you were working on, so you notice when you traded one red for another. ### Incremental-to-a-number -Always attack the lowest-scoring item next. Demand *meaningful* change — an agent -told to "increase coverage" will write tests that execute code without asserting -anything, because that moves the metric. Your success criteria must forbid it. +Always attack the lowest-scoring item next. Demand *meaningful* change — an agent told to "increase coverage" will write tests that execute code without asserting anything, because that moves the metric. Your success criteria must forbid it. ### Queue / batch -Drain a work queue one item at a time; "done" is an empty queue. Give each item -its own small iteration cap. When an item exhausts it, **skip it, leave it in the -queue, and report it** — do not let one hard item block the easy ones. +Drain a work queue one item at a time; "done" is an empty queue. Give each item its own small iteration cap. When an item exhausts it, **skip it, leave it in the queue, and report it** — do not let one hard item block the easy ones. ### Repeat-until-confident -The signal is non-deterministic, so a single observation tells you almost -nothing. "Done" is *passed N consecutive runs*, where N grows with how -intermittent the original failure was. Never declare victory after one run. +The signal is non-deterministic, so a single observation tells you almost nothing. "Done" is *passed N consecutive runs*, where N grows with how intermittent the original failure was. Never declare victory after one run. ## The four SDD loops These wrap the existing SDD commands, which already supply the sensors: -1. **Spec-sharpen** (`spec-sharpen-loop`) — issue → a spec that passes - `/spec-review` with zero open questions. Human-paced. -2. **Build** (`sdd-build-loop`) — sharp spec → a clean, `/review`-APPROVED - feature branch. Self-paced. A queue (drain `/build` tasks) wrapping a - converge-to-green (fix until `/validate` and `/review` are clean). +1. **Spec-sharpen** (`spec-sharpen-loop`) — issue → a spec that passes `/spec-review` with zero open questions. Human-paced. +2. **Build** (`sdd-build-loop`) — sharp spec → a clean, `/review`-APPROVED feature branch. Self-paced. A queue (drain `/build` tasks) wrapping a converge-to-green (fix until `/validate` and `/review` are clean). 3. **Ship** (`pr-quality-gate`) — open PR → merge-ready. Polled on CI. -4. **Review-response** (`pr-review-response`) — green PR → every actionable - reviewer thread addressed. Polled on review activity. +4. **Review-response** (`pr-review-response`) — green PR → every actionable reviewer thread addressed. Polled on review activity. ## Composing them: issue → merged @@ -148,41 +116,21 @@ These wrap the existing SDD commands, which already supply the sensors: human merges ``` -Human judgment lands at exactly two points: answering genuine product questions -at the front, and deciding a branch is worth a PR in the middle. Everything else -is a loop converging on a checkable goal. Each loop's cadence is dictated by how -fast the thing it watches can actually change — match cadence to your sensors and -you never pay for a check that cannot tell you anything new. +Human judgment lands at exactly two points: answering genuine product questions at the front, and deciding a branch is worth a PR in the middle. Everything else is a loop converging on a checkable goal. Each loop's cadence is dictated by how fast the thing it watches can actually change — match cadence to your sensors and you never pay for a check that cannot tell you anything new. ## Guardrails for any loop -- **Bound everything.** Every loop needs a maximum iteration count, ideally a - wall-clock deadline too. A loop without a budget is a runaway process. -- **Never let a loop merge.** Loops drive to merge-*ready* and stop. A human owns - the merge. -- **Forbid metric-gaming explicitly.** Any loop pointed at a number or a verdict - will try the easy way — disabling a flaky test, writing assertion-free tests, - editing the spec so `/review` stops complaining. Spell out the forbidden - shortcuts; the agent will not infer them. -- **Make every action reviewable and reversible.** One logical change per commit, - conventional messages, branch only, never force-push. -- **Escalate on low confidence.** "I tried these three fixes and none worked, - here is what I observed" beats a tenth desperate commit. -- **Match cadence to your sensors.** Poll on an interval only when waiting on - something slow like CI. When sensors are local, run as fast as the work allows. +- **Bound everything.** Every loop needs a maximum iteration count, ideally a wall-clock deadline too. A loop without a budget is a runaway process. +- **Never let a loop merge.** Loops drive to merge-*ready* and stop. A human owns the merge. +- **Forbid metric-gaming explicitly.** Any loop pointed at a number or a verdict will try the easy way — disabling a flaky test, writing assertion-free tests, editing the spec so `/review` stops complaining. Spell out the forbidden shortcuts; the agent will not infer them. +- **Make every action reviewable and reversible.** One logical change per commit, conventional messages, branch only, never force-push. +- **Escalate on low confidence.** "I tried these three fixes and none worked, here is what I observed" beats a tenth desperate commit. +- **Match cadence to your sensors.** Poll on an interval only when waiting on something slow like CI. When sensors are local, run as fast as the work allows. ## When to use a loop, and when not to -Use a loop when three things are true: the goal is **objectively checkable** (a -sensor returns yes/no), progress is **incremental** (each iteration gets -measurably closer), and the work is **tedious enough** that convergence dominates -the thinking. +Use a loop when three things are true: the goal is **objectively checkable** (a sensor returns yes/no), progress is **incremental** (each iteration gets measurably closer), and the work is **tedious enough** that convergence dominates the thinking. -A loop is the *wrong* tool when the goal is subjective ("make this API design -elegant"), when there is no reliable sensor, or when the hard part is a single -decision rather than a sequence of mechanical steps. If you cannot write the -"are we done?" check as something that returns a boolean, you do not have a loop -yet — you have a conversation, and you should just have the conversation. +A loop is the *wrong* tool when the goal is subjective ("make this API design elegant"), when there is no reliable sensor, or when the hard part is a single decision rather than a sequence of mechanical steps. If you cannot write the "are we done?" check as something that returns a boolean, you do not have a loop yet — you have a conversation, and you should just have the conversation. -The criteria *is* the work. Once you can state precisely what "done" means and how -a sensor would detect it, the loop almost writes itself. +The criteria *is* the work. Once you can state precisely what "done" means and how a sensor would detect it, the loop almost writes itself. diff --git a/.github/skills/pr-quality-gate/SKILL.md b/.github/skills/pr-quality-gate/SKILL.md index 22a42f1..50afbd4 100644 --- a/.github/skills/pr-quality-gate/SKILL.md +++ b/.github/skills/pr-quality-gate/SKILL.md @@ -14,25 +14,18 @@ authoritative_references: # PR quality gate (the ship loop) -> Loop shape: **converge-to-green**, with a **coverage climber** nested inside. -> Cadence: **polled on CI** (the sensor only changes once per CI run). -> Budget: **10 iterations**, then escalate. +> Loop shape: **converge-to-green**, with a **coverage climber** nested inside. Cadence: **polled on CI** (the sensor only changes once per CI run). Budget: **10 iterations**, then escalate. -You are driving pull request **#$1** to merge-ready. Do exactly **one pass** per -invocation. The loop runner (`/loop`) supplies the cadence and the budget; this -skill supplies the goal, the sensors, the action, and the guardrails. +You are driving pull request **#$1** to merge-ready. Do exactly **one pass** per invocation. The loop runner (`/loop`) supplies the cadence and the budget; this skill supplies the goal, the sensors, the action, and the guardrails. ## Sensors — gather current state first -Re-run *all* of these every pass, not just the gate you fixed last time, so you -catch a fix that traded one red for another: +Re-run *all* of these every pass, not just the gate you fixed last time, so you catch a fix that traded one red for another: - `gh pr checks $1` — CI check status (unit, integration, architecture tests). - `gh pr view $1 --json statusCheckRollup,mergeable` — the rollup and mergeability. -- The coverage report from the latest CI run artifacts (or - `./.github/scripts/check-new-code-coverage.sh` locally). -- The static-analysis report (Checkstyle / SpotBugs / Sonar) for **new** issues - introduced by this PR. +- The coverage report from the latest CI run artifacts (or `./.github/scripts/check-new-code-coverage.sh` locally). +- The static-analysis report (Checkstyle / SpotBugs / Sonar) for **new** issues introduced by this PR. ## Success criteria (the goal) @@ -40,53 +33,39 @@ The PR is merge-ready ONLY when ALL of these are true: - Every CI check is passing. - Checkstyle / static analysis reports zero new violations on this PR. -- Line coverage on new code is at or above the project threshold (see - `jacoco-coverage-policy`). +- Line coverage on new code is at or above the project threshold (see `jacoco-coverage-policy`). - Sonar (if configured) reports zero new issues on this PR. -When all are true, post a comment `✅ All gates green — ready to merge` and -**STOP**. Do not merge the PR yourself. +When all are true, post a comment `✅ All gates green — ready to merge` and **STOP**. Do not merge the PR yourself. ## Action — fix exactly what is failing -For each failing gate, make the **smallest correct change that fixes the root -cause**, then commit with a conventional-commit message and push to the PR branch. +For each failing gate, make the **smallest correct change that fixes the root cause**, then commit with a conventional-commit message and push to the PR branch. -- **Tests / Checkstyle / Sonar** (converge-to-green): fix the code, then re-run - *all* sensors so a fix for one gate has not broken another. -- **Coverage** (incremental-to-a-number): find the class furthest below the line, - write *meaningful* tests for its uncovered branches, re-check the number, - repeat with the next-lowest file. Tests must assert behavior, not just execute - lines. +- **Tests / Checkstyle / Sonar** (converge-to-green): fix the code, then re-run *all* sensors so a fix for one gate has not broken another. +- **Coverage** (incremental-to-a-number): find the class furthest below the line, write *meaningful* tests for its uncovered branches, re-check the number, repeat with the next-lowest file. Tests must assert behavior, not just execute lines. ## Guardrails — non-negotiable - NEVER merge, approve, or close the PR. - NEVER force-push or rewrite history. - NEVER disable, skip, `@Disabled`, or delete a test to make a gate pass. -- NEVER lower a coverage threshold or add a Checkstyle/SpotBugs suppression to - silence a finding instead of fixing it. +- NEVER lower a coverage threshold or add a Checkstyle/SpotBugs suppression to silence a finding instead of fixing it. - NEVER touch files unrelated to this PR's diff. -- If a failure is ambiguous, or you are not confident in the fix, post a comment - explaining what you observed and what you tried, then STOP for a human. +- If a failure is ambiguous, or you are not confident in the fix, post a comment explaining what you observed and what you tried, then STOP for a human. ## Cadence and budget (the loop wrapper) -Poll roughly as often as CI returns — there is no point checking every thirty -seconds when the signal only changes once per CI run: +Poll roughly as often as CI returns — there is no point checking every thirty seconds when the signal only changes once per CI run: ```text /loop 10m /pr-quality-gate 1234 ``` -Stop when the PR posts the "ready to merge" comment, **or** after **10 -iterations**, whichever comes first. On the tenth iteration without success, -summarize what is still failing and what you tried, then stop and escalate. +Stop when the PR posts the "ready to merge" comment, **or** after **10 iterations**, whichever comes first. On the tenth iteration without success, summarize what is still failing and what you tried, then stop and escalate. -On Copilot or Windsurf (no native `/loop`), drive the cadence yourself: invoke -this skill once per CI cycle and stop on the same two conditions. +On Copilot or Windsurf (no native `/loop`), drive the cadence yourself: invoke this skill once per CI cycle and stop on the same two conditions. ## Handoff -A green PR is the input the `pr-review-response` loop wants. When a human reviewer -leaves comments, switch to that loop to work them to zero. +A green PR is the input the `pr-review-response` loop wants. When a human reviewer leaves comments, switch to that loop to work them to zero. diff --git a/.github/skills/pr-review-response/SKILL.md b/.github/skills/pr-review-response/SKILL.md index 0738a3a..b7ad2f1 100644 --- a/.github/skills/pr-review-response/SKILL.md +++ b/.github/skills/pr-review-response/SKILL.md @@ -14,47 +14,33 @@ authoritative_references: # PR review-response loop (end of the pipeline) -> Loop shape: **queue** — drain the unresolved review threads one at a time. -> The mirror image of the ship loop: that one converges the PR on CI, this one -> converges it on the humans. -> Cadence: **polled on review activity** (reviewers comment over hours, not -> seconds). -> Budget: escalate on disagreement or ambiguity; never argue in a thread. +> Loop shape: **queue** — drain the unresolved review threads one at a time. The mirror image of the ship loop: that one converges the PR on CI, this one converges it on the humans. Cadence: **polled on review activity** (reviewers comment over hours, not seconds). Budget: escalate on disagreement or ambiguity; never argue in a thread. -You are responding to human review on PR **#$1**. Do exactly **one pass** per -invocation. +You are responding to human review on PR **#$1**. Do exactly **one pass** per invocation. ## Sensors -- `gh pr view $1 --json reviews,comments,reviewThreads` — list every - **UNRESOLVED** thread. +- `gh pr view $1 --json reviews,comments,reviewThreads` — list every **UNRESOLVED** thread. - `gh pr checks $1` — current gate status. ## Triage each unresolved thread -- **CHANGE REQUESTED** ("do X", "rename Y", "handle case Z"): make the smallest - correct change, commit, push, and reply in the thread linking the commit. Do - NOT resolve the thread — the reviewer does that. -- **QUESTION** ("why did you…?"): reply in the thread with an explanation. Change - code only if the honest answer is "you're right, fixing it". -- **DISAGREEMENT or ambiguous intent**: STOP and escalate to your human. Do not - argue in the thread, and do not silently comply. +- **CHANGE REQUESTED** ("do X", "rename Y", "handle case Z"): make the smallest correct change, commit, push, and reply in the thread linking the commit. Do NOT resolve the thread — the reviewer does that. +- **QUESTION** ("why did you…?"): reply in the thread with an explanation. Change code only if the honest answer is "you're right, fixing it". +- **DISAGREEMENT or ambiguous intent**: STOP and escalate to your human. Do not argue in the thread, and do not silently comply. -After any code change, re-run the ship-loop gates (`pr-quality-gate`) so a review -fix does not re-break CI. +After any code change, re-run the ship-loop gates (`pr-quality-gate`) so a review fix does not re-break CI. ## Success criteria (the goal) -No unresolved actionable threads remain AND all gates are green. Then STOP. Do not -merge. +No unresolved actionable threads remain AND all gates are green. Then STOP. Do not merge. ## Guardrails — non-negotiable - NEVER merge, approve, or resolve a reviewer's thread. - NEVER push back on a reviewer; escalate disagreements to your human. - One logical change per commit; never force-push. -- If a comment needs a product or architecture decision, escalate instead of - deciding it inside the loop. +- If a comment needs a product or architecture decision, escalate instead of deciding it inside the loop. ## Cadence (the loop wrapper) @@ -64,9 +50,4 @@ Poll on an interval, because the sensor is human review activity: /loop 15m /pr-review-response 1234 ``` -The hard part is not the editing, it is the **triage**. A loop that treats every -comment as "change requested" will rewrite code in response to a reviewer who was -only asking a question; a loop that treats every comment as a question will ignore -real change requests. Spelling out the difference, and forcing an escalation on -disagreement, is what keeps the loop from arguing with your reviewer on your -behalf. On Copilot or Windsurf, re-invoke this skill as new comments arrive. +The hard part is not the editing, it is the **triage**. A loop that treats every comment as "change requested" will rewrite code in response to a reviewer who was only asking a question; a loop that treats every comment as a question will ignore real change requests. Spelling out the difference, and forcing an escalation on disagreement, is what keeps the loop from arguing with your reviewer on your behalf. On Copilot or Windsurf, re-invoke this skill as new comments arrive. diff --git a/.github/skills/sdd-build-loop/SKILL.md b/.github/skills/sdd-build-loop/SKILL.md index 5ca435b..10456b3 100644 --- a/.github/skills/sdd-build-loop/SKILL.md +++ b/.github/skills/sdd-build-loop/SKILL.md @@ -15,72 +15,46 @@ authoritative_references: # SDD build loop (the build loop) -> Loop shape: a **queue** (drain `/build` tasks) wrapping a **converge-to-green** -> (fix until `/validate` and `/review` are clean). -> Cadence: **self-paced** — no CI to wait on; run the next step the moment the -> last one finishes. -> Budget: a cap on review-fix rounds, then escalate. +> Loop shape: a **queue** (drain `/build` tasks) wrapping a **converge-to-green** (fix until `/validate` and `/review` are clean). Cadence: **self-paced** — no CI to wait on; run the next step the moment the last one finishes. Budget: a cap on review-fix rounds, then escalate. -You are building the feature in `.specs/$1` to a reviewed, committable state on -the **current feature branch**. Confirm you are **NOT on `main`** before doing -anything. The framework *is* the sensor layer here: `.tdd-state.json`, the -`/validate` output, and the `/review` verdict. +You are building the feature in `.specs/$1` to a reviewed, committable state on the **current feature branch**. Confirm you are **NOT on `main`** before doing anything. The framework *is* the sensor layer here: `.tdd-state.json`, the `/validate` output, and the `/review` verdict. ## Phase 1 — drain the task queue (queue shape) 1. Read `.specs/$1/.tdd-state.json` for the next task whose phase is `pending`. -2. Run `/build` for that task. Let it move `pending → red → green → refactor → - done` and commit on completion (one task per commit). +2. Run `/build` for that task. Let it move `pending → red → green → refactor → done` and commit on completion (one task per commit). 3. Repeat until no `pending` tasks remain. -4. If one task fails its gates **twice in a row**, STOP and escalate. Do not skip - it silently — a missing task is not a done feature. +4. If one task fails its gates **twice in a row**, STOP and escalate. Do not skip it silently — a missing task is not a done feature. ## Phase 2 — validate and review (converge-to-green) -1. Run `/validate`. If anything fails (tests, coverage, lint, traceability), fix - the root cause, commit, and run `/validate` again. -2. Run `/review`. If the verdict has ANY must-fix findings, fix each with the - smallest correct change, commit, and run `/review` again. -3. Stop Phase 2 when `/validate` is clean AND `/review` returns APPROVE with zero - must-fix findings. Leave should-fix and nit findings in the handoff summary for - the human — do not gold-plate. +1. Run `/validate`. If anything fails (tests, coverage, lint, traceability), fix the root cause, commit, and run `/validate` again. +2. Run `/review`. If the verdict has ANY must-fix findings, fix each with the smallest correct change, commit, and run `/review` again. +3. Stop Phase 2 when `/validate` is clean AND `/review` returns APPROVE with zero must-fix findings. Leave should-fix and nit findings in the handoff summary for the human — do not gold-plate. ## Success criteria (the goal) -All tasks done, `/validate` clean, `/review` APPROVE with zero must-fix. Then STOP -and write a handoff summary listing the should-fix and nit items you deliberately -left. **DO NOT open a PR or merge** — that is the human's call. +All tasks done, `/validate` clean, `/review` APPROVE with zero must-fix. Then STOP and write a handoff summary listing the should-fix and nit items you deliberately left. **DO NOT open a PR or merge** — that is the human's call. ## Guardrails — non-negotiable - Work ONLY on the feature branch. NEVER commit to `main`. - NEVER merge, and NEVER open the PR. -- NEVER edit the spec, weaken a test, lower a threshold, or add a suppression to - make a gate pass. +- NEVER edit the spec, weaken a test, lower a threshold, or add a suppression to make a gate pass. - One logical change per commit; conventional messages; never force-push. -- After **5 review-fix rounds** without reaching APPROVE, STOP and escalate with - what is still failing and what you tried. +- After **5 review-fix rounds** without reaching APPROVE, STOP and escalate with what is still failing and what you tried. ## The trust you have to earn -This loop does not *remove* human review — it **relocates** it. It replaces the -per-task checkpoint with automated sensors (`/validate`, `/review`) on the small -stuff and one deliberate human validation round at the end. That trade is only -safe once the sensors are trustworthy enough to stand in for you. A feature is -allowed through this loop only when: +This loop does not *remove* human review — it **relocates** it. It replaces the per-task checkpoint with automated sensors (`/validate`, `/review`) on the small stuff and one deliberate human validation round at the end. That trade is only safe once the sensors are trustworthy enough to stand in for you. A feature is allowed through this loop only when: -- You have shipped enough features with the framework that `/review` rarely - surfaces something you disagree with. +- You have shipped enough features with the framework that `/review` rarely surfaces something you disagree with. - Your skills, agents, and review rubric encode *your* team's standards. -- Your coverage threshold, lint rules, and `/review` checklist fail loudly on the - mistakes you care about — a green `/validate` actually *means* something. -- The spec passed `/spec-review` with no open questions. The loop amplifies - whatever the spec says; a vague spec produces confident, wrong code faster. -- The feature is a pattern you have done before (a CRUD slice with a clear - contract), not a novel architectural decision. +- Your coverage threshold, lint rules, and `/review` checklist fail loudly on the mistakes you care about — a green `/validate` actually *means* something. +- The spec passed `/spec-review` with no open questions. The loop amplifies whatever the spec says; a vague spec produces confident, wrong code faster. +- The feature is a pattern you have done before (a CRUD slice with a clear contract), not a novel architectural decision. -If those conditions are not met, run the framework the normal way, task by task. -The loop is an optimization for the cases you already understand. +If those conditions are not met, run the framework the normal way, task by task. The loop is an optimization for the cases you already understand. ## Cadence and handoff (the loop wrapper) @@ -90,8 +64,4 @@ Because there is no CI to wait on, this loop self-paces: /loop /sdd-build-loop 2026-05-09-create-new-customer ``` -When it stops you have a feature branch with a drained task queue, a clean -`/validate`, an `APPROVE` from `/review`, and a handoff summary — exactly the -input the human PR gate wants. The human runs the app, reads the diff, and -decides whether to open the PR. Once the PR exists, the `pr-quality-gate` loop -takes over. +When it stops you have a feature branch with a drained task queue, a clean `/validate`, an `APPROVE` from `/review`, and a handoff summary — exactly the input the human PR gate wants. The human runs the app, reads the diff, and decides whether to open the PR. Once the PR exists, the `pr-quality-gate` loop takes over. diff --git a/.github/skills/spec-sharpen-loop/SKILL.md b/.github/skills/spec-sharpen-loop/SKILL.md index c649589..00edc6e 100644 --- a/.github/skills/spec-sharpen-loop/SKILL.md +++ b/.github/skills/spec-sharpen-loop/SKILL.md @@ -14,16 +14,9 @@ authoritative_references: # Spec-sharpen loop (front of the pipeline) -> Loop shape: **converge-to-green** on the `/spec-review` verdict, gated by a -> human triage step. -> Cadence: **human-paced** — *you* are the sensor it waits on. It advances a -> round each time you answer the product questions it surfaces, then blocks again. -> Budget: escalate if the same question reappears after being answered. +> Loop shape: **converge-to-green** on the `/spec-review` verdict, gated by a human triage step. Cadence: **human-paced** — *you* are the sensor it waits on. It advances a round each time you answer the product questions it surfaces, then blocks again. Budget: escalate if the same question reappears after being answered. -You are turning issue **#$1** into a spec that passes `/spec-review` with zero -open questions. Work in rounds. The build loop downstream is only safe if the spec -going in is sharp: a vague spec does not produce vague code, it produces -confident, wrong code, faster. +You are turning issue **#$1** into a spec that passes `/spec-review` with zero open questions. Work in rounds. The build loop downstream is only safe if the spec going in is sharp: a vague spec does not produce vague code, it produces confident, wrong code, faster. ## Sensors @@ -34,31 +27,20 @@ confident, wrong code, faster. 1. Run `/spec` for the issue (or re-run after answers have been added). 2. Triage **every** open question into one of two buckets: - - **RESOLVABLE** from the issue, linked tickets, or existing specs — atomic - acceptance criteria, missing non-goals, ambiguous wording, contract or - format gaps. Resolve these and update the spec. - - **PRODUCT DECISION** you cannot ground in existing material — pricing, - policy, UX intent, scope trade-offs. Collect these. -3. If any PRODUCT DECISION questions remain, **STOP** and post them to the human - as a numbered list. Wait for answers; do not invent them. -4. Run `/spec-review`. If FAIL, address the structural findings and start another - round. + - **RESOLVABLE** from the issue, linked tickets, or existing specs — atomic acceptance criteria, missing non-goals, ambiguous wording, contract or format gaps. Resolve these and update the spec. + - **PRODUCT DECISION** you cannot ground in existing material — pricing, policy, UX intent, scope trade-offs. Collect these. +3. If any PRODUCT DECISION questions remain, **STOP** and post them to the human as a numbered list. Wait for answers; do not invent them. +4. Run `/spec-review`. If FAIL, address the structural findings and start another round. ## Success criteria (the goal) -`/spec-review` returns **PASS** with zero open questions remaining. Then STOP — -the spec is ready for `/plan` and the build loop. This loop produces a spec and -nothing else. +`/spec-review` returns **PASS** with zero open questions remaining. Then STOP — the spec is ready for `/plan` and the build loop. This loop produces a spec and nothing else. ## Guardrails — non-negotiable -- NEVER answer a product-decision question yourself to force a PASS. A clean spec - built on guessed intent is the most expensive failure in the whole pipeline — - every downstream loop trusts this spec, and will build, test, and ship the wrong - thing with great efficiency. +- NEVER answer a product-decision question yourself to force a PASS. A clean spec built on guessed intent is the most expensive failure in the whole pipeline — every downstream loop trusts this spec, and will build, test, and ship the wrong thing with great efficiency. - NEVER start design or code. No `/plan`, no `/build`. -- If the same question reappears after being answered, escalate; the issue itself - may be underspecified. +- If the same question reappears after being answered, escalate; the issue itself may be underspecified. ## Cadence (the loop wrapper) @@ -68,7 +50,4 @@ This is the one loop that pauses inside itself for a human on purpose: /loop /spec-sharpen-loop 1 ``` -It is human-paced by design. You are not waiting on a clock or on CI; the loop is -waiting on *your* product decisions. Answer them, and it advances a round; until -you do, it blocks. On Copilot or Windsurf, the cadence is the same — re-invoke the -skill after you have answered the questions it raised. +It is human-paced by design. You are not waiting on a clock or on CI; the loop is waiting on *your* product decisions. Answer them, and it advances a round; until you do, it blocks. On Copilot or Windsurf, the cadence is the same — re-invoke the skill after you have answered the questions it raised. diff --git a/.windsurf/skills/loop-engineering/SKILL.md b/.windsurf/skills/loop-engineering/SKILL.md index 49af004..2137214 100644 --- a/.windsurf/skills/loop-engineering/SKILL.md +++ b/.windsurf/skills/loop-engineering/SKILL.md @@ -15,37 +15,20 @@ authoritative_references: # Loop engineering -> A loop is not "run the agent again." A loop is a control system: observe the -> world, compare it to a goal, act to close the gap, observe again. The skill is -> not in the acting — agents are already good at acting. The skill is in defining -> *what the agent observes* and *when it is allowed to stop*. +> A loop is not "run the agent again." A loop is a control system: observe the world, compare it to a goal, act to close the gap, observe again. The skill is not in the acting — agents are already good at acting. The skill is in defining *what the agent observes* and *when it is allowed to stop*. ## The six parts of every loop -If you can name all six for your task, you have a loop. If you cannot, you have a -prompt you are running by hand. - -- **Goal / success criteria.** A precise, checkable definition of "done" that a - script could answer yes or no to. Not "make the PR good" — "all checks green, - coverage ≥ 90%, zero new static-analysis issues". -- **Sensors.** How the agent observes current state. This is the part people skip - and the part that matters most; a loop is only as good as what it can measure. - CI status, a coverage report, a Sonar API response, the output of a test run - repeated twenty times. -- **Action.** What the agent changes when state does not match the goal. Fix the - code, write the missing test, bump the dependency, commit, push. -- **Cadence.** How often the loop runs. Event-driven (wait for CI), self-paced - (drain a local queue as fast as the work allows), or human-paced (wait on an - answer). The wrong cadence either wastes money or misses the signal. -- **Bounded budget.** The escape hatch: maximum iterations, a wall-clock - deadline, or a token cap. This is what prevents a loop from grinding forever on - a problem it cannot solve. -- **Guardrails.** The rules about what the agent must *not* do. Never merge. - Never force-push. Never touch files outside the change. Never disable a failing - test to make it pass. - -Most failed loops have a clear goal and good sensors, then run away because nobody -defined the **budget** or the **guardrails**. Those last two are non-negotiable. +If you can name all six for your task, you have a loop. If you cannot, you have a prompt you are running by hand. + +- **Goal / success criteria.** A precise, checkable definition of "done" that a script could answer yes or no to. Not "make the PR good" — "all checks green, coverage ≥ 90%, zero new static-analysis issues". +- **Sensors.** How the agent observes current state. This is the part people skip and the part that matters most; a loop is only as good as what it can measure. CI status, a coverage report, a Sonar API response, the output of a test run repeated twenty times. +- **Action.** What the agent changes when state does not match the goal. Fix the code, write the missing test, bump the dependency, commit, push. +- **Cadence.** How often the loop runs. Event-driven (wait for CI), self-paced (drain a local queue as fast as the work allows), or human-paced (wait on an answer). The wrong cadence either wastes money or misses the signal. +- **Bounded budget.** The escape hatch: maximum iterations, a wall-clock deadline, or a token cap. This is what prevents a loop from grinding forever on a problem it cannot solve. +- **Guardrails.** The rules about what the agent must *not* do. Never merge. Never force-push. Never touch files outside the change. Never disable a failing test to make it pass. + +Most failed loops have a clear goal and good sensors, then run away because nobody defined the **budget** or the **guardrails**. Those last two are non-negotiable. ## The shape of a loop @@ -78,15 +61,11 @@ defined the **budget** or the **guardrails**. Those last two are non-negotiable. └──────────────────┘ ``` -The escalation branch is not a failure — it is the loop doing its job. A loop -that knows when to hand the problem back is more useful than one that pretends it -can solve everything. +The escalation branch is not a failure — it is the loop doing its job. A loop that knows when to hand the problem back is more useful than one that pretends it can solve everything. ## Four loop shapes -The outer structure (observe → compare → act → check budget → repeat) is the -same. What differs is the definition of *progress*. Tune each shape to its own -failure mode. +The outer structure (observe → compare → act → check budget → repeat) is the same. What differs is the definition of *progress*. Tune each shape to its own failure mode. | Shape | "Done" means | Failure mode to watch | | --- | --- | --- | @@ -97,39 +76,28 @@ failure mode. ### Converge-to-green -State is binary per check. Re-run **all** sensors after each change, not just the -one you were working on, so you notice when you traded one red for another. +State is binary per check. Re-run **all** sensors after each change, not just the one you were working on, so you notice when you traded one red for another. ### Incremental-to-a-number -Always attack the lowest-scoring item next. Demand *meaningful* change — an agent -told to "increase coverage" will write tests that execute code without asserting -anything, because that moves the metric. Your success criteria must forbid it. +Always attack the lowest-scoring item next. Demand *meaningful* change — an agent told to "increase coverage" will write tests that execute code without asserting anything, because that moves the metric. Your success criteria must forbid it. ### Queue / batch -Drain a work queue one item at a time; "done" is an empty queue. Give each item -its own small iteration cap. When an item exhausts it, **skip it, leave it in the -queue, and report it** — do not let one hard item block the easy ones. +Drain a work queue one item at a time; "done" is an empty queue. Give each item its own small iteration cap. When an item exhausts it, **skip it, leave it in the queue, and report it** — do not let one hard item block the easy ones. ### Repeat-until-confident -The signal is non-deterministic, so a single observation tells you almost -nothing. "Done" is *passed N consecutive runs*, where N grows with how -intermittent the original failure was. Never declare victory after one run. +The signal is non-deterministic, so a single observation tells you almost nothing. "Done" is *passed N consecutive runs*, where N grows with how intermittent the original failure was. Never declare victory after one run. ## The four SDD loops These wrap the existing SDD commands, which already supply the sensors: -1. **Spec-sharpen** (`spec-sharpen-loop`) — issue → a spec that passes - `/spec-review` with zero open questions. Human-paced. -2. **Build** (`sdd-build-loop`) — sharp spec → a clean, `/review`-APPROVED - feature branch. Self-paced. A queue (drain `/build` tasks) wrapping a - converge-to-green (fix until `/validate` and `/review` are clean). +1. **Spec-sharpen** (`spec-sharpen-loop`) — issue → a spec that passes `/spec-review` with zero open questions. Human-paced. +2. **Build** (`sdd-build-loop`) — sharp spec → a clean, `/review`-APPROVED feature branch. Self-paced. A queue (drain `/build` tasks) wrapping a converge-to-green (fix until `/validate` and `/review` are clean). 3. **Ship** (`pr-quality-gate`) — open PR → merge-ready. Polled on CI. -4. **Review-response** (`pr-review-response`) — green PR → every actionable - reviewer thread addressed. Polled on review activity. +4. **Review-response** (`pr-review-response`) — green PR → every actionable reviewer thread addressed. Polled on review activity. ## Composing them: issue → merged @@ -148,41 +116,21 @@ These wrap the existing SDD commands, which already supply the sensors: human merges ``` -Human judgment lands at exactly two points: answering genuine product questions -at the front, and deciding a branch is worth a PR in the middle. Everything else -is a loop converging on a checkable goal. Each loop's cadence is dictated by how -fast the thing it watches can actually change — match cadence to your sensors and -you never pay for a check that cannot tell you anything new. +Human judgment lands at exactly two points: answering genuine product questions at the front, and deciding a branch is worth a PR in the middle. Everything else is a loop converging on a checkable goal. Each loop's cadence is dictated by how fast the thing it watches can actually change — match cadence to your sensors and you never pay for a check that cannot tell you anything new. ## Guardrails for any loop -- **Bound everything.** Every loop needs a maximum iteration count, ideally a - wall-clock deadline too. A loop without a budget is a runaway process. -- **Never let a loop merge.** Loops drive to merge-*ready* and stop. A human owns - the merge. -- **Forbid metric-gaming explicitly.** Any loop pointed at a number or a verdict - will try the easy way — disabling a flaky test, writing assertion-free tests, - editing the spec so `/review` stops complaining. Spell out the forbidden - shortcuts; the agent will not infer them. -- **Make every action reviewable and reversible.** One logical change per commit, - conventional messages, branch only, never force-push. -- **Escalate on low confidence.** "I tried these three fixes and none worked, - here is what I observed" beats a tenth desperate commit. -- **Match cadence to your sensors.** Poll on an interval only when waiting on - something slow like CI. When sensors are local, run as fast as the work allows. +- **Bound everything.** Every loop needs a maximum iteration count, ideally a wall-clock deadline too. A loop without a budget is a runaway process. +- **Never let a loop merge.** Loops drive to merge-*ready* and stop. A human owns the merge. +- **Forbid metric-gaming explicitly.** Any loop pointed at a number or a verdict will try the easy way — disabling a flaky test, writing assertion-free tests, editing the spec so `/review` stops complaining. Spell out the forbidden shortcuts; the agent will not infer them. +- **Make every action reviewable and reversible.** One logical change per commit, conventional messages, branch only, never force-push. +- **Escalate on low confidence.** "I tried these three fixes and none worked, here is what I observed" beats a tenth desperate commit. +- **Match cadence to your sensors.** Poll on an interval only when waiting on something slow like CI. When sensors are local, run as fast as the work allows. ## When to use a loop, and when not to -Use a loop when three things are true: the goal is **objectively checkable** (a -sensor returns yes/no), progress is **incremental** (each iteration gets -measurably closer), and the work is **tedious enough** that convergence dominates -the thinking. +Use a loop when three things are true: the goal is **objectively checkable** (a sensor returns yes/no), progress is **incremental** (each iteration gets measurably closer), and the work is **tedious enough** that convergence dominates the thinking. -A loop is the *wrong* tool when the goal is subjective ("make this API design -elegant"), when there is no reliable sensor, or when the hard part is a single -decision rather than a sequence of mechanical steps. If you cannot write the -"are we done?" check as something that returns a boolean, you do not have a loop -yet — you have a conversation, and you should just have the conversation. +A loop is the *wrong* tool when the goal is subjective ("make this API design elegant"), when there is no reliable sensor, or when the hard part is a single decision rather than a sequence of mechanical steps. If you cannot write the "are we done?" check as something that returns a boolean, you do not have a loop yet — you have a conversation, and you should just have the conversation. -The criteria *is* the work. Once you can state precisely what "done" means and how -a sensor would detect it, the loop almost writes itself. +The criteria *is* the work. Once you can state precisely what "done" means and how a sensor would detect it, the loop almost writes itself. diff --git a/.windsurf/skills/pr-quality-gate/SKILL.md b/.windsurf/skills/pr-quality-gate/SKILL.md index 0eb07a1..a51fc4a 100644 --- a/.windsurf/skills/pr-quality-gate/SKILL.md +++ b/.windsurf/skills/pr-quality-gate/SKILL.md @@ -14,25 +14,18 @@ authoritative_references: # PR quality gate (the ship loop) -> Loop shape: **converge-to-green**, with a **coverage climber** nested inside. -> Cadence: **polled on CI** (the sensor only changes once per CI run). -> Budget: **10 iterations**, then escalate. +> Loop shape: **converge-to-green**, with a **coverage climber** nested inside. Cadence: **polled on CI** (the sensor only changes once per CI run). Budget: **10 iterations**, then escalate. -You are driving pull request **#$1** to merge-ready. Do exactly **one pass** per -invocation. The loop runner (`/loop`) supplies the cadence and the budget; this -skill supplies the goal, the sensors, the action, and the guardrails. +You are driving pull request **#$1** to merge-ready. Do exactly **one pass** per invocation. The loop runner (`/loop`) supplies the cadence and the budget; this skill supplies the goal, the sensors, the action, and the guardrails. ## Sensors — gather current state first -Re-run *all* of these every pass, not just the gate you fixed last time, so you -catch a fix that traded one red for another: +Re-run *all* of these every pass, not just the gate you fixed last time, so you catch a fix that traded one red for another: - `gh pr checks $1` — CI check status (unit, integration, architecture tests). - `gh pr view $1 --json statusCheckRollup,mergeable` — the rollup and mergeability. -- The coverage report from the latest CI run artifacts (or - `./.github/scripts/check-new-code-coverage.sh` locally). -- The static-analysis report (Checkstyle / SpotBugs / Sonar) for **new** issues - introduced by this PR. +- The coverage report from the latest CI run artifacts (or `./.github/scripts/check-new-code-coverage.sh` locally). +- The static-analysis report (Checkstyle / SpotBugs / Sonar) for **new** issues introduced by this PR. ## Success criteria (the goal) @@ -40,53 +33,39 @@ The PR is merge-ready ONLY when ALL of these are true: - Every CI check is passing. - Checkstyle / static analysis reports zero new violations on this PR. -- Line coverage on new code is at or above the project threshold (see - `jacoco-coverage-policy`). +- Line coverage on new code is at or above the project threshold (see `jacoco-coverage-policy`). - Sonar (if configured) reports zero new issues on this PR. -When all are true, post a comment `✅ All gates green — ready to merge` and -**STOP**. Do not merge the PR yourself. +When all are true, post a comment `✅ All gates green — ready to merge` and **STOP**. Do not merge the PR yourself. ## Action — fix exactly what is failing -For each failing gate, make the **smallest correct change that fixes the root -cause**, then commit with a conventional-commit message and push to the PR branch. +For each failing gate, make the **smallest correct change that fixes the root cause**, then commit with a conventional-commit message and push to the PR branch. -- **Tests / Checkstyle / Sonar** (converge-to-green): fix the code, then re-run - *all* sensors so a fix for one gate has not broken another. -- **Coverage** (incremental-to-a-number): find the class furthest below the line, - write *meaningful* tests for its uncovered branches, re-check the number, - repeat with the next-lowest file. Tests must assert behavior, not just execute - lines. +- **Tests / Checkstyle / Sonar** (converge-to-green): fix the code, then re-run *all* sensors so a fix for one gate has not broken another. +- **Coverage** (incremental-to-a-number): find the class furthest below the line, write *meaningful* tests for its uncovered branches, re-check the number, repeat with the next-lowest file. Tests must assert behavior, not just execute lines. ## Guardrails — non-negotiable - NEVER merge, approve, or close the PR. - NEVER force-push or rewrite history. - NEVER disable, skip, `@Disabled`, or delete a test to make a gate pass. -- NEVER lower a coverage threshold or add a Checkstyle/SpotBugs suppression to - silence a finding instead of fixing it. +- NEVER lower a coverage threshold or add a Checkstyle/SpotBugs suppression to silence a finding instead of fixing it. - NEVER touch files unrelated to this PR's diff. -- If a failure is ambiguous, or you are not confident in the fix, post a comment - explaining what you observed and what you tried, then STOP for a human. +- If a failure is ambiguous, or you are not confident in the fix, post a comment explaining what you observed and what you tried, then STOP for a human. ## Cadence and budget (the loop wrapper) -Poll roughly as often as CI returns — there is no point checking every thirty -seconds when the signal only changes once per CI run: +Poll roughly as often as CI returns — there is no point checking every thirty seconds when the signal only changes once per CI run: ```text /loop 10m /pr-quality-gate 1234 ``` -Stop when the PR posts the "ready to merge" comment, **or** after **10 -iterations**, whichever comes first. On the tenth iteration without success, -summarize what is still failing and what you tried, then stop and escalate. +Stop when the PR posts the "ready to merge" comment, **or** after **10 iterations**, whichever comes first. On the tenth iteration without success, summarize what is still failing and what you tried, then stop and escalate. -On Copilot or Windsurf (no native `/loop`), drive the cadence yourself: invoke -this skill once per CI cycle and stop on the same two conditions. +On Copilot or Windsurf (no native `/loop`), drive the cadence yourself: invoke this skill once per CI cycle and stop on the same two conditions. ## Handoff -A green PR is the input the `pr-review-response` loop wants. When a human reviewer -leaves comments, switch to that loop to work them to zero. +A green PR is the input the `pr-review-response` loop wants. When a human reviewer leaves comments, switch to that loop to work them to zero. diff --git a/.windsurf/skills/pr-review-response/SKILL.md b/.windsurf/skills/pr-review-response/SKILL.md index 703c2bc..4d04925 100644 --- a/.windsurf/skills/pr-review-response/SKILL.md +++ b/.windsurf/skills/pr-review-response/SKILL.md @@ -14,47 +14,33 @@ authoritative_references: # PR review-response loop (end of the pipeline) -> Loop shape: **queue** — drain the unresolved review threads one at a time. -> The mirror image of the ship loop: that one converges the PR on CI, this one -> converges it on the humans. -> Cadence: **polled on review activity** (reviewers comment over hours, not -> seconds). -> Budget: escalate on disagreement or ambiguity; never argue in a thread. +> Loop shape: **queue** — drain the unresolved review threads one at a time. The mirror image of the ship loop: that one converges the PR on CI, this one converges it on the humans. Cadence: **polled on review activity** (reviewers comment over hours, not seconds). Budget: escalate on disagreement or ambiguity; never argue in a thread. -You are responding to human review on PR **#$1**. Do exactly **one pass** per -invocation. +You are responding to human review on PR **#$1**. Do exactly **one pass** per invocation. ## Sensors -- `gh pr view $1 --json reviews,comments,reviewThreads` — list every - **UNRESOLVED** thread. +- `gh pr view $1 --json reviews,comments,reviewThreads` — list every **UNRESOLVED** thread. - `gh pr checks $1` — current gate status. ## Triage each unresolved thread -- **CHANGE REQUESTED** ("do X", "rename Y", "handle case Z"): make the smallest - correct change, commit, push, and reply in the thread linking the commit. Do - NOT resolve the thread — the reviewer does that. -- **QUESTION** ("why did you…?"): reply in the thread with an explanation. Change - code only if the honest answer is "you're right, fixing it". -- **DISAGREEMENT or ambiguous intent**: STOP and escalate to your human. Do not - argue in the thread, and do not silently comply. +- **CHANGE REQUESTED** ("do X", "rename Y", "handle case Z"): make the smallest correct change, commit, push, and reply in the thread linking the commit. Do NOT resolve the thread — the reviewer does that. +- **QUESTION** ("why did you…?"): reply in the thread with an explanation. Change code only if the honest answer is "you're right, fixing it". +- **DISAGREEMENT or ambiguous intent**: STOP and escalate to your human. Do not argue in the thread, and do not silently comply. -After any code change, re-run the ship-loop gates (`pr-quality-gate`) so a review -fix does not re-break CI. +After any code change, re-run the ship-loop gates (`pr-quality-gate`) so a review fix does not re-break CI. ## Success criteria (the goal) -No unresolved actionable threads remain AND all gates are green. Then STOP. Do not -merge. +No unresolved actionable threads remain AND all gates are green. Then STOP. Do not merge. ## Guardrails — non-negotiable - NEVER merge, approve, or resolve a reviewer's thread. - NEVER push back on a reviewer; escalate disagreements to your human. - One logical change per commit; never force-push. -- If a comment needs a product or architecture decision, escalate instead of - deciding it inside the loop. +- If a comment needs a product or architecture decision, escalate instead of deciding it inside the loop. ## Cadence (the loop wrapper) @@ -64,9 +50,4 @@ Poll on an interval, because the sensor is human review activity: /loop 15m /pr-review-response 1234 ``` -The hard part is not the editing, it is the **triage**. A loop that treats every -comment as "change requested" will rewrite code in response to a reviewer who was -only asking a question; a loop that treats every comment as a question will ignore -real change requests. Spelling out the difference, and forcing an escalation on -disagreement, is what keeps the loop from arguing with your reviewer on your -behalf. On Copilot or Windsurf, re-invoke this skill as new comments arrive. +The hard part is not the editing, it is the **triage**. A loop that treats every comment as "change requested" will rewrite code in response to a reviewer who was only asking a question; a loop that treats every comment as a question will ignore real change requests. Spelling out the difference, and forcing an escalation on disagreement, is what keeps the loop from arguing with your reviewer on your behalf. On Copilot or Windsurf, re-invoke this skill as new comments arrive. diff --git a/.windsurf/skills/sdd-build-loop/SKILL.md b/.windsurf/skills/sdd-build-loop/SKILL.md index 9234cdc..b58aad8 100644 --- a/.windsurf/skills/sdd-build-loop/SKILL.md +++ b/.windsurf/skills/sdd-build-loop/SKILL.md @@ -15,72 +15,46 @@ authoritative_references: # SDD build loop (the build loop) -> Loop shape: a **queue** (drain `/build` tasks) wrapping a **converge-to-green** -> (fix until `/validate` and `/review` are clean). -> Cadence: **self-paced** — no CI to wait on; run the next step the moment the -> last one finishes. -> Budget: a cap on review-fix rounds, then escalate. +> Loop shape: a **queue** (drain `/build` tasks) wrapping a **converge-to-green** (fix until `/validate` and `/review` are clean). Cadence: **self-paced** — no CI to wait on; run the next step the moment the last one finishes. Budget: a cap on review-fix rounds, then escalate. -You are building the feature in `.specs/$1` to a reviewed, committable state on -the **current feature branch**. Confirm you are **NOT on `main`** before doing -anything. The framework *is* the sensor layer here: `.tdd-state.json`, the -`/validate` output, and the `/review` verdict. +You are building the feature in `.specs/$1` to a reviewed, committable state on the **current feature branch**. Confirm you are **NOT on `main`** before doing anything. The framework *is* the sensor layer here: `.tdd-state.json`, the `/validate` output, and the `/review` verdict. ## Phase 1 — drain the task queue (queue shape) 1. Read `.specs/$1/.tdd-state.json` for the next task whose phase is `pending`. -2. Run `/build` for that task. Let it move `pending → red → green → refactor → - done` and commit on completion (one task per commit). +2. Run `/build` for that task. Let it move `pending → red → green → refactor → done` and commit on completion (one task per commit). 3. Repeat until no `pending` tasks remain. -4. If one task fails its gates **twice in a row**, STOP and escalate. Do not skip - it silently — a missing task is not a done feature. +4. If one task fails its gates **twice in a row**, STOP and escalate. Do not skip it silently — a missing task is not a done feature. ## Phase 2 — validate and review (converge-to-green) -1. Run `/validate`. If anything fails (tests, coverage, lint, traceability), fix - the root cause, commit, and run `/validate` again. -2. Run `/review`. If the verdict has ANY must-fix findings, fix each with the - smallest correct change, commit, and run `/review` again. -3. Stop Phase 2 when `/validate` is clean AND `/review` returns APPROVE with zero - must-fix findings. Leave should-fix and nit findings in the handoff summary for - the human — do not gold-plate. +1. Run `/validate`. If anything fails (tests, coverage, lint, traceability), fix the root cause, commit, and run `/validate` again. +2. Run `/review`. If the verdict has ANY must-fix findings, fix each with the smallest correct change, commit, and run `/review` again. +3. Stop Phase 2 when `/validate` is clean AND `/review` returns APPROVE with zero must-fix findings. Leave should-fix and nit findings in the handoff summary for the human — do not gold-plate. ## Success criteria (the goal) -All tasks done, `/validate` clean, `/review` APPROVE with zero must-fix. Then STOP -and write a handoff summary listing the should-fix and nit items you deliberately -left. **DO NOT open a PR or merge** — that is the human's call. +All tasks done, `/validate` clean, `/review` APPROVE with zero must-fix. Then STOP and write a handoff summary listing the should-fix and nit items you deliberately left. **DO NOT open a PR or merge** — that is the human's call. ## Guardrails — non-negotiable - Work ONLY on the feature branch. NEVER commit to `main`. - NEVER merge, and NEVER open the PR. -- NEVER edit the spec, weaken a test, lower a threshold, or add a suppression to - make a gate pass. +- NEVER edit the spec, weaken a test, lower a threshold, or add a suppression to make a gate pass. - One logical change per commit; conventional messages; never force-push. -- After **5 review-fix rounds** without reaching APPROVE, STOP and escalate with - what is still failing and what you tried. +- After **5 review-fix rounds** without reaching APPROVE, STOP and escalate with what is still failing and what you tried. ## The trust you have to earn -This loop does not *remove* human review — it **relocates** it. It replaces the -per-task checkpoint with automated sensors (`/validate`, `/review`) on the small -stuff and one deliberate human validation round at the end. That trade is only -safe once the sensors are trustworthy enough to stand in for you. A feature is -allowed through this loop only when: +This loop does not *remove* human review — it **relocates** it. It replaces the per-task checkpoint with automated sensors (`/validate`, `/review`) on the small stuff and one deliberate human validation round at the end. That trade is only safe once the sensors are trustworthy enough to stand in for you. A feature is allowed through this loop only when: -- You have shipped enough features with the framework that `/review` rarely - surfaces something you disagree with. +- You have shipped enough features with the framework that `/review` rarely surfaces something you disagree with. - Your skills, agents, and review rubric encode *your* team's standards. -- Your coverage threshold, lint rules, and `/review` checklist fail loudly on the - mistakes you care about — a green `/validate` actually *means* something. -- The spec passed `/spec-review` with no open questions. The loop amplifies - whatever the spec says; a vague spec produces confident, wrong code faster. -- The feature is a pattern you have done before (a CRUD slice with a clear - contract), not a novel architectural decision. +- Your coverage threshold, lint rules, and `/review` checklist fail loudly on the mistakes you care about — a green `/validate` actually *means* something. +- The spec passed `/spec-review` with no open questions. The loop amplifies whatever the spec says; a vague spec produces confident, wrong code faster. +- The feature is a pattern you have done before (a CRUD slice with a clear contract), not a novel architectural decision. -If those conditions are not met, run the framework the normal way, task by task. -The loop is an optimization for the cases you already understand. +If those conditions are not met, run the framework the normal way, task by task. The loop is an optimization for the cases you already understand. ## Cadence and handoff (the loop wrapper) @@ -90,8 +64,4 @@ Because there is no CI to wait on, this loop self-paces: /loop /sdd-build-loop 2026-05-09-create-new-customer ``` -When it stops you have a feature branch with a drained task queue, a clean -`/validate`, an `APPROVE` from `/review`, and a handoff summary — exactly the -input the human PR gate wants. The human runs the app, reads the diff, and -decides whether to open the PR. Once the PR exists, the `pr-quality-gate` loop -takes over. +When it stops you have a feature branch with a drained task queue, a clean `/validate`, an `APPROVE` from `/review`, and a handoff summary — exactly the input the human PR gate wants. The human runs the app, reads the diff, and decides whether to open the PR. Once the PR exists, the `pr-quality-gate` loop takes over. diff --git a/.windsurf/skills/spec-sharpen-loop/SKILL.md b/.windsurf/skills/spec-sharpen-loop/SKILL.md index 4d12a40..b7386b3 100644 --- a/.windsurf/skills/spec-sharpen-loop/SKILL.md +++ b/.windsurf/skills/spec-sharpen-loop/SKILL.md @@ -14,16 +14,9 @@ authoritative_references: # Spec-sharpen loop (front of the pipeline) -> Loop shape: **converge-to-green** on the `/spec-review` verdict, gated by a -> human triage step. -> Cadence: **human-paced** — *you* are the sensor it waits on. It advances a -> round each time you answer the product questions it surfaces, then blocks again. -> Budget: escalate if the same question reappears after being answered. +> Loop shape: **converge-to-green** on the `/spec-review` verdict, gated by a human triage step. Cadence: **human-paced** — *you* are the sensor it waits on. It advances a round each time you answer the product questions it surfaces, then blocks again. Budget: escalate if the same question reappears after being answered. -You are turning issue **#$1** into a spec that passes `/spec-review` with zero -open questions. Work in rounds. The build loop downstream is only safe if the spec -going in is sharp: a vague spec does not produce vague code, it produces -confident, wrong code, faster. +You are turning issue **#$1** into a spec that passes `/spec-review` with zero open questions. Work in rounds. The build loop downstream is only safe if the spec going in is sharp: a vague spec does not produce vague code, it produces confident, wrong code, faster. ## Sensors @@ -34,31 +27,20 @@ confident, wrong code, faster. 1. Run `/spec` for the issue (or re-run after answers have been added). 2. Triage **every** open question into one of two buckets: - - **RESOLVABLE** from the issue, linked tickets, or existing specs — atomic - acceptance criteria, missing non-goals, ambiguous wording, contract or - format gaps. Resolve these and update the spec. - - **PRODUCT DECISION** you cannot ground in existing material — pricing, - policy, UX intent, scope trade-offs. Collect these. -3. If any PRODUCT DECISION questions remain, **STOP** and post them to the human - as a numbered list. Wait for answers; do not invent them. -4. Run `/spec-review`. If FAIL, address the structural findings and start another - round. + - **RESOLVABLE** from the issue, linked tickets, or existing specs — atomic acceptance criteria, missing non-goals, ambiguous wording, contract or format gaps. Resolve these and update the spec. + - **PRODUCT DECISION** you cannot ground in existing material — pricing, policy, UX intent, scope trade-offs. Collect these. +3. If any PRODUCT DECISION questions remain, **STOP** and post them to the human as a numbered list. Wait for answers; do not invent them. +4. Run `/spec-review`. If FAIL, address the structural findings and start another round. ## Success criteria (the goal) -`/spec-review` returns **PASS** with zero open questions remaining. Then STOP — -the spec is ready for `/plan` and the build loop. This loop produces a spec and -nothing else. +`/spec-review` returns **PASS** with zero open questions remaining. Then STOP — the spec is ready for `/plan` and the build loop. This loop produces a spec and nothing else. ## Guardrails — non-negotiable -- NEVER answer a product-decision question yourself to force a PASS. A clean spec - built on guessed intent is the most expensive failure in the whole pipeline — - every downstream loop trusts this spec, and will build, test, and ship the wrong - thing with great efficiency. +- NEVER answer a product-decision question yourself to force a PASS. A clean spec built on guessed intent is the most expensive failure in the whole pipeline — every downstream loop trusts this spec, and will build, test, and ship the wrong thing with great efficiency. - NEVER start design or code. No `/plan`, no `/build`. -- If the same question reappears after being answered, escalate; the issue itself - may be underspecified. +- If the same question reappears after being answered, escalate; the issue itself may be underspecified. ## Cadence (the loop wrapper) @@ -68,7 +50,4 @@ This is the one loop that pauses inside itself for a human on purpose: /loop /spec-sharpen-loop 1 ``` -It is human-paced by design. You are not waiting on a clock or on CI; the loop is -waiting on *your* product decisions. Answer them, and it advances a round; until -you do, it blocks. On Copilot or Windsurf, the cadence is the same — re-invoke the -skill after you have answered the questions it raised. +It is human-paced by design. You are not waiting on a clock or on CI; the loop is waiting on *your* product decisions. Answer them, and it advances a round; until you do, it blocks. On Copilot or Windsurf, the cadence is the same — re-invoke the skill after you have answered the questions it raised. diff --git a/CHANGELOG.md b/CHANGELOG.md index 3bc407e..afad296 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,12 +9,8 @@ What counts as a breaking change for this toolkit is defined in [CONTRIBUTING.md ## [Unreleased] ### Added -- **Loop engineering** — the `loop-engineering` skill (the six-part loop pattern, - four loop shapes, and the issue-to-merged pipeline) plus four concrete loop - skills: `spec-sharpen-loop`, `sdd-build-loop`, `pr-quality-gate`, and - `pr-review-response`. Mirrored across `.claude/`, `.github/`, and `.windsurf/`. -- `docs/loop-engineering.md` documenting the loop pattern, shapes, composition, - and guardrails, linked from the README. +- **Loop engineering** — the `loop-engineering` skill (the six-part loop pattern, four loop shapes, and the issue-to-merged pipeline) plus four concrete loop skills: `spec-sharpen-loop`, `sdd-build-loop`, `pr-quality-gate`, and `pr-review-response`. Mirrored across `.claude/`, `.github/`, and `.windsurf/`. +- `docs/loop-engineering.md` documenting the loop pattern, shapes, composition, and guardrails, linked from the README. - `LICENSE` (MIT) at repo root. - `.gitignore` covering macOS, IDE, Java/Maven, Node/Angular, and harness artifacts. - `.editorconfig` for consistent line endings and indentation. diff --git a/README.md b/README.md index 5aae53b..d9a057f 100644 --- a/README.md +++ b/README.md @@ -173,10 +173,7 @@ The same gates the agent runs are reachable from a normal terminal: ## Loop engineering (optional) -Once you trust the harness, you can wrap the per-feature commands in **bounded -agentic loops** that converge on a checkable goal instead of running forever or -giving up too early. The toolkit ships the pattern as the `loop-engineering` -skill, plus four concrete loops that compose into an issue-to-merged pipeline: +Once you trust the harness, you can wrap the per-feature commands in **bounded agentic loops** that converge on a checkable goal instead of running forever or giving up too early. The toolkit ships the pattern as the `loop-engineering` skill, plus four concrete loops that compose into an issue-to-merged pipeline: | Loop | Skill | Drives | Cadence | | --- | --- | --- | --- | @@ -185,15 +182,9 @@ skill, plus four concrete loops that compose into an issue-to-merged pipeline: | Ship | `pr-quality-gate` | open PR → all CI gates green | polled on CI | | Review-response | `pr-review-response` | green PR → every reviewer thread addressed | polled on review | -Human judgment lands at exactly two points — answering genuine product questions -at the front, and deciding a branch is worth a PR in the middle. Every loop is -bounded, never merges on its own, and escalates when it is stuck. +Human judgment lands at exactly two points — answering genuine product questions at the front, and deciding a branch is worth a PR in the middle. Every loop is bounded, never merges on its own, and escalates when it is stuck. -In Claude Code these run via the native `/loop` command (for example -`/loop 10m /pr-quality-gate 1234`). On Copilot and Windsurf you drive the same -cadence by re-invoking the skill. See -[docs/loop-engineering.md](docs/loop-engineering.md) for the full pattern, the -four loop shapes, and the trust you have to earn before running the build loop. +In Claude Code these run via the native `/loop` command (for example `/loop 10m /pr-quality-gate 1234`). On Copilot and Windsurf you drive the same cadence by re-invoking the skill. See [docs/loop-engineering.md](docs/loop-engineering.md) for the full pattern, the four loop shapes, and the trust you have to earn before running the build loop. ## Repository layout diff --git a/docs/loop-engineering.md b/docs/loop-engineering.md index 945c96f..1dd9696 100644 --- a/docs/loop-engineering.md +++ b/docs/loop-engineering.md @@ -1,25 +1,14 @@ # Loop Engineering — Driving an Issue to Merged -Prompting an agent to fix a failing test is easy. Designing a system where one -loop sharpens the spec, another builds the feature to a clean reviewed branch, a -human makes the two calls that matter, and two more loops drive the pull request -through CI and human review — that is the engineering. +Prompting an agent to fix a failing test is easy. Designing a system where one loop sharpens the spec, another builds the feature to a clean reviewed branch, a human makes the two calls that matter, and two more loops drive the pull request through CI and human review — that is the engineering. -**Loop engineering** is designing agentic loops that converge on a goal instead of -running forever or giving up too early. This toolkit ships the pattern as the -[`loop-engineering`](../.claude/skills/loop-engineering/SKILL.md) skill, plus four -concrete loops that wrap the existing SDD commands. +**Loop engineering** is designing agentic loops that converge on a goal instead of running forever or giving up too early. This toolkit ships the pattern as the [`loop-engineering`](../.claude/skills/loop-engineering/SKILL.md) skill, plus four concrete loops that wrap the existing SDD commands. ## What an agentic loop actually is -A loop is not "run the agent again." A loop is a control system: the agent -observes the world, compares it to a goal, acts to close the gap, then observes -again. The skill is not in the acting — coding agents are already good at acting. -The skill is in defining *what the agent observes* and *when it is allowed to -stop*. +A loop is not "run the agent again." A loop is a control system: the agent observes the world, compares it to a goal, acts to close the gap, then observes again. The skill is not in the acting — coding agents are already good at acting. The skill is in defining *what the agent observes* and *when it is allowed to stop*. -Every loop worth building has six parts. If you can name all six for your task, -you have a loop. If you cannot, you have a prompt you are running by hand. +Every loop worth building has six parts. If you can name all six for your task, you have a loop. If you cannot, you have a prompt you are running by hand. - **Goal / success criteria** — a precise, checkable definition of "done". - **Sensors** — how the agent observes current state. @@ -28,8 +17,7 @@ you have a loop. If you cannot, you have a prompt you are running by hand. - **Bounded budget** — the escape hatch: max iterations, a deadline, a token cap. - **Guardrails** — the rules about what the agent must *not* do. -Most failed loops have a clear goal and good sensors, then run away because nobody -defined the budget or the guardrails. +Most failed loops have a clear goal and good sensors, then run away because nobody defined the budget or the guardrails. ## Four loop shapes @@ -42,14 +30,11 @@ The outer structure is always the same; the definition of *progress* differs. | Queue / batch | the work queue is empty | one hard item blocks the rest | | Repeat-until-confident | confidence over many runs | declaring victory after one run | -See the [`loop-engineering`](../.claude/skills/loop-engineering/SKILL.md) skill for -the detail on each shape and its tuning. +See the [`loop-engineering`](../.claude/skills/loop-engineering/SKILL.md) skill for the detail on each shape and its tuning. ## The four SDD loops -Each loop wraps SDD commands that already supply the sensors. They map one-to-one -onto Claude Code's `/loop` command and Skills; on Copilot and Windsurf you drive -the same cadence by re-invoking the skill. +Each loop wraps SDD commands that already supply the sensors. They map one-to-one onto Claude Code's `/loop` command and Skills; on Copilot and Windsurf you drive the same cadence by re-invoking the skill. | Loop | Skill | Wraps | Shape | Cadence | | --- | --- | --- | --- | --- | @@ -75,38 +60,21 @@ the same cadence by re-invoking the skill. human merges ``` -Human judgment lands at exactly two points: answering genuine product questions at -the front, and deciding a branch is worth a PR in the middle. Everything else is a -loop converging on a checkable goal. Each loop's cadence is dictated by how fast -the thing it watches can actually change — match cadence to your sensors and you -never pay for a check that cannot tell you anything new. +Human judgment lands at exactly two points: answering genuine product questions at the front, and deciding a branch is worth a PR in the middle. Everything else is a loop converging on a checkable goal. Each loop's cadence is dictated by how fast the thing it watches can actually change — match cadence to your sensors and you never pay for a check that cannot tell you anything new. ## The trust you have to earn -The build loop is the one to be careful with, because it writes the feature -instead of polishing it. It does not *remove* human review — it **relocates** it, -replacing the per-task checkpoint with automated sensors (`/validate`, `/review`) -and one deliberate human validation round at the end. That trade is only safe once -the sensors are trustworthy enough to stand in for you on the small stuff. You -graduate into this loop; you do not start here. The trust checklist lives in the -[`sdd-build-loop`](../.claude/skills/sdd-build-loop/SKILL.md) skill. +The build loop is the one to be careful with, because it writes the feature instead of polishing it. It does not *remove* human review — it **relocates** it, replacing the per-task checkpoint with automated sensors (`/validate`, `/review`) and one deliberate human validation round at the end. That trade is only safe once the sensors are trustworthy enough to stand in for you on the small stuff. You graduate into this loop; you do not start here. The trust checklist lives in the [`sdd-build-loop`](../.claude/skills/sdd-build-loop/SKILL.md) skill. ## Guardrails for any loop - **Bound everything** — a maximum iteration count, ideally a wall-clock deadline. - **Never let a loop merge** — loops drive to merge-*ready* and stop. - **Forbid metric-gaming explicitly** — spell out the forbidden shortcuts. -- **Make every action reviewable and reversible** — one change per commit, branch - only, never force-push. -- **Escalate on low confidence** — stopping and saying so beats a tenth desperate - commit. +- **Make every action reviewable and reversible** — one change per commit, branch only, never force-push. +- **Escalate on low confidence** — stopping and saying so beats a tenth desperate commit. - **Match cadence to your sensors** — poll only when waiting on something slow. ## When to use a loop, and when not to -Use a loop when the goal is **objectively checkable**, progress is -**incremental**, and the work is **tedious enough** that convergence dominates the -thinking. A loop is the wrong tool when the goal is subjective, when there is no -reliable sensor, or when the hard part is a single decision. If you cannot write -the "are we done?" check as something that returns a boolean, you do not have a -loop yet — you have a conversation, and you should just have the conversation. +Use a loop when the goal is **objectively checkable**, progress is **incremental**, and the work is **tedious enough** that convergence dominates the thinking. A loop is the wrong tool when the goal is subjective, when there is no reliable sensor, or when the hard part is a single decision. If you cannot write the "are we done?" check as something that returns a boolean, you do not have a loop yet — you have a conversation, and you should just have the conversation. From 748e516bcffafba9d64f3ec371331ac04cc8c860 Mon Sep 17 00:00:00 2001 From: Loiane Date: Sun, 28 Jun 2026 12:03:04 -0400 Subject: [PATCH 3/3] Add per-tool loop usage examples to README Show how to run the loops on each platform: Claude Code's native /loop with worked commands for all four loops, and the one-pass-per-prompt pattern for Copilot Chat and Windsurf Cascade where there is no native /loop. --- README.md | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/README.md b/README.md index d9a057f..c599380 100644 --- a/README.md +++ b/README.md @@ -186,6 +186,38 @@ Human judgment lands at exactly two points — answering genuine product questio In Claude Code these run via the native `/loop` command (for example `/loop 10m /pr-quality-gate 1234`). On Copilot and Windsurf you drive the same cadence by re-invoking the skill. See [docs/loop-engineering.md](docs/loop-engineering.md) for the full pattern, the four loop shapes, and the trust you have to earn before running the build loop. +### Running a loop on each tool + +The skills are identical across platforms; only the way you *drive the cadence* differs. Claude Code has a native `/loop`; on Copilot and Windsurf the skill runs one pass per prompt and you re-invoke it each cycle (or stop when it reports done). + +**Claude Code** — the native `/loop` command supplies the cadence and budget: + +```text +/loop /spec-sharpen-loop 42 # issue #42 → a reviewed spec (human-paced) +/loop /sdd-build-loop 2026-05-09-create-customer # spec → clean reviewed branch (self-paced) +/loop 10m /pr-quality-gate 1234 # PR #1234 → all gates green, polling CI +/loop 15m /pr-review-response 1234 # work reviewer comments to zero +``` + +**GitHub Copilot** — no native `/loop`; prompt the skill in Copilot Chat and re-run it each cycle: + +```text +Run the pr-quality-gate loop on PR 1234: do one pass — check every gate, fix what +is red, commit and push, then stop. I'll re-run this after each CI cycle until it +posts "ready to merge". +``` + +Copilot loads `.github/skills/pr-quality-gate/SKILL.md`, does a single pass, and stops. Repeat the prompt once per CI run. + +**Windsurf** — no native `/loop`; ask Cascade to use the skill, one pass at a time: + +```text +Use the pr-quality-gate skill to drive PR 1234 to merge-ready. Do one pass: read +the gates, fix failures, commit, push, then stop. Re-invoke me on the next CI run. +``` + +Cascade activates `.windsurf/skills/pr-quality-gate/SKILL.md` by model decision. The same pattern works for all four loops — swap in `spec-sharpen-loop`, `sdd-build-loop`, or `pr-review-response` with the matching issue/feature/PR id. + ## Repository layout ```text