Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
136 changes: 136 additions & 0 deletions .claude/skills/loop-engineering/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,136 @@
---
name: loop-engineering
description: Design bounded agentic loops that converge on a checkable goal instead of running forever or giving up too early. Use when wrapping any repetitive "drive this artifact to a desired state" task in a loop, and to compose the four SDD loops (spec-sharpen, build, ship, review-response) into an issue-to-merged pipeline.
when_to_use:
- Before building any loop — to name the six parts and pick the right shape.
- When composing the spec-sharpen, build, ship, and review-response loops into one pipeline.
- When a hand-run sequence (re-run `/spec` until clean, fix-push-wait until CI is green) is begging to be automated.
authoritative_references:
- docs/loop-engineering.md
- .claude/skills/spec-sharpen-loop/SKILL.md
- .claude/skills/sdd-build-loop/SKILL.md
- .claude/skills/pr-quality-gate/SKILL.md
- .claude/skills/pr-review-response/SKILL.md
---

# Loop engineering

> A loop is not "run the agent again." A loop is a control system: observe the world, compare it to a goal, act to close the gap, observe again. The skill is not in the acting — agents are already good at acting. The skill is in defining *what the agent observes* and *when it is allowed to stop*.

## The six parts of every loop

If you can name all six for your task, you have a loop. If you cannot, you have a prompt you are running by hand.

- **Goal / success criteria.** A precise, checkable definition of "done" that a script could answer yes or no to. Not "make the PR good" — "all checks green, coverage ≥ 90%, zero new static-analysis issues".
- **Sensors.** How the agent observes current state. This is the part people skip and the part that matters most; a loop is only as good as what it can measure. CI status, a coverage report, a Sonar API response, the output of a test run repeated twenty times.
- **Action.** What the agent changes when state does not match the goal. Fix the code, write the missing test, bump the dependency, commit, push.
- **Cadence.** How often the loop runs. Event-driven (wait for CI), self-paced (drain a local queue as fast as the work allows), or human-paced (wait on an answer). The wrong cadence either wastes money or misses the signal.
- **Bounded budget.** The escape hatch: maximum iterations, a wall-clock deadline, or a token cap. This is what prevents a loop from grinding forever on a problem it cannot solve.
- **Guardrails.** The rules about what the agent must *not* do. Never merge. Never force-push. Never touch files outside the change. Never disable a failing test to make it pass.

Most failed loops have a clear goal and good sensors, then run away because nobody defined the **budget** or the **guardrails**. Those last two are non-negotiable.

## The shape of a loop

```text
┌──────────────────────┐
│ Start: artifact open │
└──────────┬───────────┘
┌──────────────────────────┐
┌────▶│ OBSERVE (run sensors) │
│ └────────────┬─────────────┘
│ ▼
│ ┌─────────────┐ yes ┌─────────────────────┐
│ │ goal met? │──────▶│ STOP: done ✅ │
│ └──────┬──────┘ └─────────────────────┘
│ no │
│ ▼
│ ┌──────────────────┐
│ │ ACT: smallest │
│ │ correct change │
│ └────────┬─────────┘
│ ▼
│ ┌─────────────┐ no ┌─────────────────────┐
│ │ budget left?│──────▶│ STOP: escalate 🚩 │
│ └──────┬──────┘ └─────────────────────┘
│ yes │
│ ▼
│ ┌──────────────────┐
└─────────│ WAIT (cadence) │
└──────────────────┘
```

The escalation branch is not a failure — it is the loop doing its job. A loop that knows when to hand the problem back is more useful than one that pretends it can solve everything.

## Four loop shapes

The outer structure (observe → compare → act → check budget → repeat) is the same. What differs is the definition of *progress*. Tune each shape to its own failure mode.

| Shape | "Done" means | Failure mode to watch |
| --- | --- | --- |
| Converge-to-green | every binary check passes | oscillating: fixing A breaks B |
| Incremental-to-a-number | a metric crosses a threshold | gaming the metric (assertion-free tests) |
| Queue / batch | the work queue is empty | one hard item blocks the rest |
| Repeat-until-confident | statistical confidence over many runs | declaring victory after one run |

### Converge-to-green

State is binary per check. Re-run **all** sensors after each change, not just the one you were working on, so you notice when you traded one red for another.

### Incremental-to-a-number

Always attack the lowest-scoring item next. Demand *meaningful* change — an agent told to "increase coverage" will write tests that execute code without asserting anything, because that moves the metric. Your success criteria must forbid it.

### Queue / batch

Drain a work queue one item at a time; "done" is an empty queue. Give each item its own small iteration cap. When an item exhausts it, **skip it, leave it in the queue, and report it** — do not let one hard item block the easy ones.

### Repeat-until-confident

The signal is non-deterministic, so a single observation tells you almost nothing. "Done" is *passed N consecutive runs*, where N grows with how intermittent the original failure was. Never declare victory after one run.

## The four SDD loops

These wrap the existing SDD commands, which already supply the sensors:

1. **Spec-sharpen** (`spec-sharpen-loop`) — issue → a spec that passes `/spec-review` with zero open questions. Human-paced.
2. **Build** (`sdd-build-loop`) — sharp spec → a clean, `/review`-APPROVED feature branch. Self-paced. A queue (drain `/build` tasks) wrapping a converge-to-green (fix until `/validate` and `/review` are clean).
3. **Ship** (`pr-quality-gate`) — open PR → merge-ready. Polled on CI.
4. **Review-response** (`pr-review-response`) — green PR → every actionable reviewer thread addressed. Polled on review activity.

## Composing them: issue → merged

```text
issue
│ LOOP: spec-sharpen (human-paced) → escalate product decisions
▼ sharp spec
│ LOOP: build (self-paced) → auto-commit on a feature branch
▼ clean branch
│ HUMAN GATE: run the app, read the diff, decide — open the PR?
▼ pull request
│ LOOP: ship (polled on CI) → all gates green
▼ green PR, under review
│ LOOP: review-response (polled on review) → no threads remain
human merges
```

Human judgment lands at exactly two points: answering genuine product questions at the front, and deciding a branch is worth a PR in the middle. Everything else is a loop converging on a checkable goal. Each loop's cadence is dictated by how fast the thing it watches can actually change — match cadence to your sensors and you never pay for a check that cannot tell you anything new.

## Guardrails for any loop

- **Bound everything.** Every loop needs a maximum iteration count, ideally a wall-clock deadline too. A loop without a budget is a runaway process.
- **Never let a loop merge.** Loops drive to merge-*ready* and stop. A human owns the merge.
- **Forbid metric-gaming explicitly.** Any loop pointed at a number or a verdict will try the easy way — disabling a flaky test, writing assertion-free tests, editing the spec so `/review` stops complaining. Spell out the forbidden shortcuts; the agent will not infer them.
- **Make every action reviewable and reversible.** One logical change per commit, conventional messages, branch only, never force-push.
- **Escalate on low confidence.** "I tried these three fixes and none worked, here is what I observed" beats a tenth desperate commit.
- **Match cadence to your sensors.** Poll on an interval only when waiting on something slow like CI. When sensors are local, run as fast as the work allows.

## When to use a loop, and when not to

Use a loop when three things are true: the goal is **objectively checkable** (a sensor returns yes/no), progress is **incremental** (each iteration gets measurably closer), and the work is **tedious enough** that convergence dominates the thinking.

A loop is the *wrong* tool when the goal is subjective ("make this API design elegant"), when there is no reliable sensor, or when the hard part is a single decision rather than a sequence of mechanical steps. If you cannot write the "are we done?" check as something that returns a boolean, you do not have a loop yet — you have a conversation, and you should just have the conversation.

The criteria *is* the work. Once you can state precisely what "done" means and how a sensor would detect it, the loop almost writes itself.
71 changes: 71 additions & 0 deletions .claude/skills/pr-quality-gate/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
---
name: pr-quality-gate
description: Drive one GitHub PR to merge-ready by checking every quality gate and fixing whatever is red, then committing, pushing, and waiting for the next CI run. The ship loop. Stops when the PR is fully green or the iteration budget runs out. Never merges.
when_to_use:
- A PR is open and you want it driven to all-green without babysitting each CI round.
- After the build loop hands you a feature branch and you have opened the PR.
- Run as a recurring loop: `/loop 10m /pr-quality-gate <pr-number>` in Claude Code, or re-invoke this skill each CI cycle on Copilot/Windsurf.
authoritative_references:
- docs/loop-engineering.md
- .claude/skills/loop-engineering/SKILL.md
- .claude/skills/jacoco-coverage-policy/SKILL.md
- .claude/skills/pr-review-response/SKILL.md
---

# PR quality gate (the ship loop)

> Loop shape: **converge-to-green**, with a **coverage climber** nested inside. Cadence: **polled on CI** (the sensor only changes once per CI run). Budget: **10 iterations**, then escalate.

You are driving pull request **#$1** to merge-ready. Do exactly **one pass** per invocation. The loop runner (`/loop`) supplies the cadence and the budget; this skill supplies the goal, the sensors, the action, and the guardrails.

## Sensors — gather current state first

Re-run *all* of these every pass, not just the gate you fixed last time, so you catch a fix that traded one red for another:

- `gh pr checks $1` — CI check status (unit, integration, architecture tests).
- `gh pr view $1 --json statusCheckRollup,mergeable` — the rollup and mergeability.
- The coverage report from the latest CI run artifacts (or `./.github/scripts/check-new-code-coverage.sh` locally).
- The static-analysis report (Checkstyle / SpotBugs / Sonar) for **new** issues introduced by this PR.

## Success criteria (the goal)

The PR is merge-ready ONLY when ALL of these are true:

- Every CI check is passing.
- Checkstyle / static analysis reports zero new violations on this PR.
- Line coverage on new code is at or above the project threshold (see `jacoco-coverage-policy`).
- Sonar (if configured) reports zero new issues on this PR.

When all are true, post a comment `✅ All gates green — ready to merge` and **STOP**. Do not merge the PR yourself.

## Action — fix exactly what is failing

For each failing gate, make the **smallest correct change that fixes the root cause**, then commit with a conventional-commit message and push to the PR branch.

- **Tests / Checkstyle / Sonar** (converge-to-green): fix the code, then re-run *all* sensors so a fix for one gate has not broken another.
- **Coverage** (incremental-to-a-number): find the class furthest below the line, write *meaningful* tests for its uncovered branches, re-check the number, repeat with the next-lowest file. Tests must assert behavior, not just execute lines.

## Guardrails — non-negotiable

- NEVER merge, approve, or close the PR.
- NEVER force-push or rewrite history.
- NEVER disable, skip, `@Disabled`, or delete a test to make a gate pass.
- NEVER lower a coverage threshold or add a Checkstyle/SpotBugs suppression to silence a finding instead of fixing it.
- NEVER touch files unrelated to this PR's diff.
- If a failure is ambiguous, or you are not confident in the fix, post a comment explaining what you observed and what you tried, then STOP for a human.

## Cadence and budget (the loop wrapper)

Poll roughly as often as CI returns — there is no point checking every thirty seconds when the signal only changes once per CI run:

```text
/loop 10m /pr-quality-gate 1234
```

Stop when the PR posts the "ready to merge" comment, **or** after **10 iterations**, whichever comes first. On the tenth iteration without success, summarize what is still failing and what you tried, then stop and escalate.

On Copilot or Windsurf (no native `/loop`), drive the cadence yourself: invoke this skill once per CI cycle and stop on the same two conditions.

## Handoff

A green PR is the input the `pr-review-response` loop wants. When a human reviewer leaves comments, switch to that loop to work them to zero.
53 changes: 53 additions & 0 deletions .claude/skills/pr-review-response/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
---
name: pr-review-response
description: Resolve actionable human review comments on a GitHub PR until none remain and the gates stay green — make the change, push, reply, repeat. The end of the pipeline. Triages each thread, escalates disagreements, and never merges or resolves a reviewer's thread.
when_to_use:
- A green PR is under human review and the comments need to be worked to zero.
- After the ship loop (`pr-quality-gate`) has driven the PR to all-green and a reviewer has left feedback.
- Run as a loop polled on review activity: `/loop 15m /pr-review-response <pr-number>` in Claude Code, or re-invoke this skill as new comments arrive on Copilot/Windsurf.
authoritative_references:
- docs/loop-engineering.md
- .claude/skills/loop-engineering/SKILL.md
- .claude/skills/pr-quality-gate/SKILL.md
- .claude/skills/spring-code-review-rubric/SKILL.md
---

# PR review-response loop (end of the pipeline)

> Loop shape: **queue** — drain the unresolved review threads one at a time. The mirror image of the ship loop: that one converges the PR on CI, this one converges it on the humans. Cadence: **polled on review activity** (reviewers comment over hours, not seconds). Budget: escalate on disagreement or ambiguity; never argue in a thread.

You are responding to human review on PR **#$1**. Do exactly **one pass** per invocation.

## Sensors

- `gh pr view $1 --json reviews,comments,reviewThreads` — list every **UNRESOLVED** thread.
- `gh pr checks $1` — current gate status.

## Triage each unresolved thread

- **CHANGE REQUESTED** ("do X", "rename Y", "handle case Z"): make the smallest correct change, commit, push, and reply in the thread linking the commit. Do NOT resolve the thread — the reviewer does that.
- **QUESTION** ("why did you…?"): reply in the thread with an explanation. Change code only if the honest answer is "you're right, fixing it".
- **DISAGREEMENT or ambiguous intent**: STOP and escalate to your human. Do not argue in the thread, and do not silently comply.

After any code change, re-run the ship-loop gates (`pr-quality-gate`) so a review fix does not re-break CI.

## Success criteria (the goal)

No unresolved actionable threads remain AND all gates are green. Then STOP. Do not merge.

## Guardrails — non-negotiable

- NEVER merge, approve, or resolve a reviewer's thread.
- NEVER push back on a reviewer; escalate disagreements to your human.
- One logical change per commit; never force-push.
- If a comment needs a product or architecture decision, escalate instead of deciding it inside the loop.

## Cadence (the loop wrapper)

Poll on an interval, because the sensor is human review activity:

```text
/loop 15m /pr-review-response 1234
```

The hard part is not the editing, it is the **triage**. A loop that treats every comment as "change requested" will rewrite code in response to a reviewer who was only asking a question; a loop that treats every comment as a question will ignore real change requests. Spelling out the difference, and forcing an escalation on disagreement, is what keeps the loop from arguing with your reviewer on your behalf. On Copilot or Windsurf, re-invoke this skill as new comments arrive.
Loading