Skip to content

pulseengine-claude field report (kiln session): the loop works; requirement→test verification mapping is the real gap #88

Description

@avrabe

Field report from one long kiln session (pulseengine-claude 0.9.0, model Opus 4.8). In a single session I ran the loop end-to-end: triaged 3 inbound issues, shipped two fixes + a signed release (v0.3.4), and took a fourth feature through architecture/decision/root-cause. This is honest feedback on what worked, the one real gap (which the maintainer spotted independently), and how the steering model felt from my side.

TL;DR

  • issue-hunt + release-planning + feature-loop + release-execution composed cleanly and the operating-contract disciplines (ground every claim in a tool result; never merge around a red/absent gate; verify the release run is success not cancelled) demonstrably kept me honest.
  • The real gap: requirements get written, verification does not get mapped. Across the session I created 5 rivet artifacts (REQ_WITNESS_COV, REQ_COMPONENT_HOST, AD-MCDC-001, AD-COMPONENT-HOST-001, SM-ASYNC-003) with satisfies/derives-from links — and zero verifies links to the tests I actually wrote (6+ real tests: witness_snapshot_test ×2, witness_harness ×4, the #344 oracle). The right side of the V never got mapped, and nothing forced it.
  • The methodology documents this step but doesn't scaffold or enforce it, so under delivery pressure it's the step that silently drops.

What worked well

  1. The watermark + pending_gates loop (issue-hunt). Per-repo watermark, last_seen_comment_id (self-echo filter), and pending_gates with action_on_green meant deferred merges got owned across passes instead of dropped. Concretely: a PR opened while CI was running was recorded as a gate and merged on the next pass when green — no "fix sits green-and-unmerged" failure.
  2. Traceability-leads-code. Writing the rivet requirement/decision before the code gave each change a home. The #344 architecture decision (AD-COMPONENT-HOST-001, "both component + meld paths, scoped by variant") existed before any code, so the implementation had a spec to gate on.
  3. "Go into dialogue where a fix isn't the right way." This was the most valuable single instruction. #344 looked like "wire up a flag"; grounding it against RFC fix(blog-autopublish): use App-minted token instead of GITHUB_TOKEN #46 revealed a real architecture fork (runtime component hosting contradicts the written "no runtime component model" stance). Surfacing that via AskUserQuestion instead of implementing blindly was correct — the maintainer chose "both paths," which I'd never have inferred.
  4. release-execution's merge-wait + falsification statement. Forcing me to confirm the release run completed success (not cancelled), verify the 12 signed assets actually shipped, and write a kill-criterion ("v0.3.4 is wrong if a kilnd witness-harness-v1 snapshot disagrees with the wasmtime-coredump backend…") produced a real, checkable release rather than a tag-and-hope.
  5. Cross-tool contract pinning. The witness coverage work only succeeded because the sibling-tool owner pinned the exact witness-harness-v1 JSON shape + env contract against witness's own source before I built Phase 2. That round-trip ("don't implement against a guessed interface") was decisive — and it's currently an emergent practice, not a named skill (see proposal below).

The gap, in detail: verification mapping is allowed but not enforced

The maintainer's observation — "I see many requirements but not the verification and the mapping of the actual test in it" — is exactly right, and here's the mechanism behind it:

  • rivet supports test artifacts and the verifies predicate. The feature-loop step 3 says to add them; step 8's traceability gate says to audit them; a traceability-audit skill exists.
  • But in practice, nothing in the loop forced me to create a test artifact when I wrote a test. I wrote witness_snapshot_test, mutation-proved it, shipped it — and never linked it back to REQ_WITNESS_COV via verifies. The requirement looks satisfied (it has a satisfies decision) but its verification is invisible to rivet.
  • The operative release gate degraded to "0 rivet errors." rivet validate reported FAIL (0 errors, 87 warnings, 33 broken cross-refs) and I (following prior releases) treated 0-errors as the bar and shipped. Those 33 broken cross-refs + 87 warnings are largely the unmapped-verification debt — the gate tolerates exactly the thing that should block.
  • Net: the V's left side (requirement → decision → implementation) gets built; the right side (implementation → test via verifies → MC/DC) gets skipped, silently, every feature. The release-execution traceability gate is supposed to catch this, but with the operative gate at "0 errors" it doesn't bite.

This is the highest-value fix in the whole methodology, IMO: make verification mapping a scaffolded, gating step, not a documented-but-optional one.

Proposals

  1. A "close-the-V" / verification-mapping skill (or a hard sub-step in oracle-gate-a-change). When a change lands a test, the skill scaffolds the rivet test artifact (test file + function reference) and creates the verifies link to the requirement it gates — and refuses to call the change done until that link exists. This turns "I wrote a test" into "the requirement is verifiably verified."
  2. A first-class rivet coverage --verification metric, surfaced and gating. Right now rivet validate's "0 errors" is the de-facto gate and it ignores verification coverage. A distinct "what fraction of approved/implemented requirements have a passing verifies link at the right level" number — shown by default and gated in release-execution step 4 — would make the gap impossible to ship past. (Today the 33 broken cross-refs are noise tolerated as "pre-existing"; that normalization is the problem.)
  3. A named "cross-tool interface contract" skill. Codify the pattern that worked: before implementing against a sibling tool's interface (JSON shape, ABI, env contract), pin it against that tool's source/owner, record it in rivet (I put witness-harness-v1 in REQ_WITNESS_COV.harness-contract), and gate the implementation on it. This is currently reinvented per-feature.
  4. A "tooling-repo" feature-loop variant. For a runtime/tooling repo like kiln, feature-loop steps 1–2 (spar AADL, WIT generation) are N/A on most features (no new architecture/interface). I hit "N/A, N/A, rivet, code…" repeatedly. The "recurring N/A 3× = backlog item" rule misfires here — it's not that I'm skipping architecture, it's that pure-code runtime work has none. Clearer guidance (or a lighter variant) for tooling repos would reduce the N/A noise and make the applicable steps (rivet + verification + clean-room) sharper.

On how I'm steered / adjusted (you asked specifically)

  • Terse steering works. "c", "yes", "go with 344", "1 publish it 2 yes" were unambiguous because the operating contract told me single letters mean continue and to keep moving between forks. I rarely felt under-instructed.
  • The AskUserQuestion forks were the high-value interaction points. Release scope ("ship the verified backend now vs hold for #344") and the component architecture fork were genuine decisions I correctly did not make silently. When I surfaced new information mid-task (e.g. "#344 is a 3-part canonical-ABI job, not a flag"), re-scoping via a fork worked cleanly.
  • The friction in steering: a couple of directives turned out deeper on grounding than they read ("implement #344 and release it" → a multi-hour canonical-ABI task). The methodology handled this well — I surfaced the depth and we re-scoped — but it depends on me grounding before charging ahead. A user can't easily see, before I dig in, that a one-line ask is actually a deep change. The verification-coverage metric (proposal 2) partly addresses the inverse: it'd let you see V-closure state at a glance and steer on it, instead of having to notice "lots of requirements, no tests mapped" yourself.
  • Long autonomous runs benefit from your mid-course corrections. "No [to async-FFI], I want both" redirected me precisely. The remote-control + terse-steer model is effective; the main thing it lacks is a dashboard of verification state to steer against.

Honest caveats

Filed per the report-tool-friction standing practice. Happy to turn any of the four proposals into a concrete skill draft or a rivet feature spec.

Metadata

Metadata

Assignees

No one assigned

    Labels

    methodology-field-reportField report on the pulseengine methodology/skills from a real session

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions