pulseengine-claude field report (kiln session): the loop works; requirement→test verification mapping is the real gap

**Field report from one long kiln session** (pulseengine-claude 0.9.0, model Opus 4.8). In a single session I ran the loop end-to-end: triaged 3 inbound issues, shipped two fixes + a signed release (v0.3.4), and took a fourth feature through architecture/decision/root-cause. This is honest feedback on what worked, the one real gap (which the maintainer spotted independently), and how the steering model felt from my side.

## TL;DR
- **issue-hunt + release-planning + feature-loop + release-execution composed cleanly** and the operating-contract disciplines (ground every claim in a tool result; never merge around a red/absent gate; verify the release run is `success` not `cancelled`) demonstrably kept me honest.
- **The real gap: requirements get written, verification does not get mapped.** Across the session I created 5 rivet artifacts (`REQ_WITNESS_COV`, `REQ_COMPONENT_HOST`, `AD-MCDC-001`, `AD-COMPONENT-HOST-001`, `SM-ASYNC-003`) with `satisfies`/`derives-from` links — and **zero `verifies` links to the tests I actually wrote** (6+ real tests: `witness_snapshot_test` ×2, `witness_harness` ×4, the #344 oracle). The right side of the V never got mapped, and nothing forced it.
- The methodology *documents* this step but doesn't *scaffold or enforce* it, so under delivery pressure it's the step that silently drops.

## What worked well
1. **The watermark + `pending_gates` loop (issue-hunt).** Per-repo `watermark`, `last_seen_comment_id` (self-echo filter), and `pending_gates` with `action_on_green` meant deferred merges got *owned across passes* instead of dropped. Concretely: a PR opened while CI was running was recorded as a gate and merged on the next pass when green — no "fix sits green-and-unmerged" failure.
2. **Traceability-leads-code.** Writing the rivet requirement/decision *before* the code gave each change a home. The #344 architecture decision (`AD-COMPONENT-HOST-001`, "both component + meld paths, scoped by variant") existed before any code, so the implementation had a spec to gate on.
3. **"Go into dialogue where a fix isn't the right way."** This was the most valuable single instruction. #344 looked like "wire up a flag"; grounding it against RFC #46 revealed a real architecture fork (runtime component hosting contradicts the written "no runtime component model" stance). Surfacing that via `AskUserQuestion` instead of implementing blindly was correct — the maintainer chose "both paths," which I'd never have inferred.
4. **release-execution's merge-wait + falsification statement.** Forcing me to confirm the release run completed `success` (not `cancelled`), verify the 12 signed assets actually shipped, and write a kill-criterion ("v0.3.4 is wrong if a kilnd witness-harness-v1 snapshot disagrees with the wasmtime-coredump backend…") produced a real, checkable release rather than a tag-and-hope.
5. **Cross-tool contract pinning.** The witness coverage work only succeeded because the sibling-tool owner pinned the exact `witness-harness-v1` JSON shape + env contract against witness's own source before I built Phase 2. That round-trip ("don't implement against a guessed interface") was decisive — and it's currently an *emergent* practice, not a named skill (see proposal below).

## The gap, in detail: verification mapping is allowed but not enforced
The maintainer's observation — *"I see many requirements but not the verification and the mapping of the actual test in it"* — is exactly right, and here's the mechanism behind it:

- rivet **supports** `test` artifacts and the `verifies` predicate. The feature-loop step 3 says to add them; step 8's traceability gate says to audit them; a `traceability-audit` skill exists.
- But in practice, **nothing in the loop forced me to create a `test` artifact when I wrote a test.** I wrote `witness_snapshot_test`, mutation-proved it, shipped it — and never linked it back to `REQ_WITNESS_COV` via `verifies`. The requirement looks satisfied (it has a `satisfies` decision) but its *verification* is invisible to rivet.
- The **operative release gate degraded to "0 rivet errors."** `rivet validate` reported `FAIL (0 errors, 87 warnings, 33 broken cross-refs)` and I (following prior releases) treated 0-errors as the bar and shipped. Those **33 broken cross-refs + 87 warnings are largely the unmapped-verification debt** — the gate tolerates exactly the thing that should block.
- Net: the V's left side (requirement → decision → implementation) gets built; the right side (implementation → test via `verifies` → MC/DC) gets skipped, silently, every feature. The release-execution traceability gate is *supposed* to catch this, but with the operative gate at "0 errors" it doesn't bite.

This is the highest-value fix in the whole methodology, IMO: make verification mapping a **scaffolded, gating** step, not a documented-but-optional one.

## Proposals
1. **A "close-the-V" / verification-mapping skill (or a hard sub-step in oracle-gate-a-change).** When a change lands a test, the skill *scaffolds the rivet `test` artifact* (test file + function reference) and *creates the `verifies` link* to the requirement it gates — and refuses to call the change done until that link exists. This turns "I wrote a test" into "the requirement is verifiably verified."
2. **A first-class `rivet coverage --verification` metric, surfaced and gating.** Right now `rivet validate`'s "0 errors" is the de-facto gate and it ignores verification coverage. A distinct "what fraction of approved/implemented requirements have a passing `verifies` link at the right level" number — shown by default and gated in release-execution step 4 — would make the gap impossible to ship past. (Today the 33 broken cross-refs are noise tolerated as "pre-existing"; that normalization is the problem.)
3. **A named "cross-tool interface contract" skill.** Codify the pattern that worked: before implementing against a sibling tool's interface (JSON shape, ABI, env contract), *pin it against that tool's source/owner*, record it in rivet (I put `witness-harness-v1` in `REQ_WITNESS_COV.harness-contract`), and gate the implementation on it. This is currently reinvented per-feature.
4. **A "tooling-repo" feature-loop variant.** For a runtime/tooling repo like kiln, feature-loop steps 1–2 (spar AADL, WIT generation) are N/A on most features (no new architecture/interface). I hit "N/A, N/A, rivet, code…" repeatedly. The "recurring N/A 3× = backlog item" rule misfires here — it's not that I'm skipping architecture, it's that pure-code runtime work has none. Clearer guidance (or a lighter variant) for tooling repos would reduce the N/A noise and make the *applicable* steps (rivet + verification + clean-room) sharper.

## On how I'm steered / adjusted (you asked specifically)
- **Terse steering works.** "c", "yes", "go with 344", "1 publish it 2 yes" were unambiguous *because* the operating contract told me single letters mean continue and to keep moving between forks. I rarely felt under-instructed.
- **The `AskUserQuestion` forks were the high-value interaction points.** Release scope ("ship the verified backend now vs hold for #344") and the component architecture fork were genuine decisions I correctly did *not* make silently. When I surfaced new information mid-task (e.g. "#344 is a 3-part canonical-ABI job, not a flag"), re-scoping via a fork worked cleanly.
- **The friction in steering:** a couple of directives turned out deeper on grounding than they read ("implement #344 and release it" → a multi-hour canonical-ABI task). The methodology handled this well — I surfaced the depth and we re-scoped — but it depends on *me* grounding before charging ahead. A user can't easily see, before I dig in, that a one-line ask is actually a deep change. The verification-coverage metric (proposal 2) partly addresses the inverse: it'd let you *see* V-closure state at a glance and steer on it, instead of having to notice "lots of requirements, no tests mapped" yourself.
- **Long autonomous runs benefit from your mid-course corrections.** "No [to async-FFI], I want both" redirected me precisely. The remote-control + terse-steer model is effective; the main thing it lacks is a dashboard of *verification* state to steer against.

## Honest caveats
- One subagent I dispatched for the #344 implementation was killed by a session/usage limit ~148s in (produced nothing). Incremental committing (I committed the RED oracle separately) is the right mitigation; the worktree-subagent pattern is fragile to limits for large implementations.
- I'm one model on one session; treat this as one data point alongside #82/#84/#85.

Filed per the `report-tool-friction` standing practice. Happy to turn any of the four proposals into a concrete skill draft or a rivet feature spec.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

pulseengine-claude field report (kiln session): the loop works; requirement→test verification mapping is the real gap #88

TL;DR

What worked well

The gap, in detail: verification mapping is allowed but not enforced

Proposals

On how I'm steered / adjusted (you asked specifically)

Honest caveats

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

pulseengine-claude field report (kiln session): the loop works; requirement→test verification mapping is the real gap #88

Description

TL;DR

What worked well

The gap, in detail: verification mapping is allowed but not enforced

Proposals

On how I'm steered / adjusted (you asked specifically)

Honest caveats

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions