Experience report: pulseengine-claude skills in a long multi-repo session (issue-hunt, /loop, cross-repo coordination, requirement→test visibility, instruction style) #91

@avrabe

Experience report — PulseEngine Claude skills in a long multi-repo session

Filed at the maintainer's request: an honest account of how the
pulseengine-claude skills (v0.8.0) performed across a multi-day,
multi-repo session driving meld, with attention to issue-hunt, the
/loop self-pacing, cross-repo coordination, the
requirements→verification→test mapping, and — explicitly requested —
how the human instructs/adjusts the agent. Grounded in one concrete
arc: meld#298 (drop vestigial cabi_realloc to unblock --memory shared)
→ meld#300 (commit the inter-component isolation model) and the
coordination with the wit-bindgen fork / gale / synth around it.

TL;DR

  • The standing-contract memory (operating-contract, philosophy, oracle-gating, "hallucinations cost more than silence", Tier-5-Mythos-before-merge) is the strongest part — it reliably stopped me from shipping unsound code and kept claims tied to tool output. Keep it.
  • issue-hunt and /loop compose well for bounded work but have no good termination/parking story when work becomes externally gated — the loop keeps firing with nothing to do, which pressures toward make-work.
  • Cross-repo coordination has no skill. The most consequential work this session (meld#300, an ASIL-D isolation-model decision spanning meld/gale/synth) was done ad-hoc. This is the clearest gap.
  • The requirements→verification→test mapping exists and is complete in coverage (all 41 SRs), but is invisible (separate file from the requirements) and honestly partial (24/41 status: partial). The skill allows it; the gap is surfacing and completion, not capability.

1. What worked well

  • Memory/situational-awareness hooks: session-start situational awareness + working-context checkpoints survived compaction cleanly. After a context summary I resumed the #298 design (the rewriter.rs coupling, the over-drop hazard) without re-deriving it. High value.
  • operating-contract as a guardrail: "ground every claim in a tool result", "never merge around a red/absent gate", "fail-safe toward keeping". Concretely: I paused the corruption-critical #298 drop wiring rather than rush it autonomously — the contract made that the obvious call, not a judgment I had to invent.
  • oracle-gate-a-change: the "write the failing oracle first" discipline produced the most durable artifacts of the session (the #298 kill-criterion + real-artifact blocker tests) before any risky code.
  • clean-room / Mythos Tier-5 gate: knowing "this can't merge without a clean-room pass" kept me from treating a green local test as done.

2. issue-hunt — how it works and how it felt

The watermark model (per-repo .json, "pull what changed since last pass, triage, land in rivet, advance watermark only on success") is sound and the anti-patterns list is good. Friction:

  • No convergence when the tracker is quiet. A pass that finds "only the issues I already worked" has no defined action other than "park". The skill says the exit condition is a release, but in a quiet window with an in-flight decision (meld#300) there's no release to cut and no new issues — the skill doesn't describe what a no-op pass should do, so the loop tends to manufacture activity. A "nothing actionable; park and report why" branch would help.
  • Triage→rivet friction: landing an architecture decision (meld#300) didn't fit the requirement/bug/optimization taxonomy cleanly. I recorded it as an ADR (safety/adr/ADR-4) instead — which felt right, but the skill points at rivet requirements, not ADRs, for decisions. The decision-vs-requirement distinction could be explicit.
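
A minimal sketch of the "nothing actionable; park and report why" branch suggested above. It is not the skill's implementation: `Issue`, `PassResult`, `run_pass`, and the `already_worked` flag are hypothetical names, and leaving the watermark unchanged on a parked pass is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Issue:
    number: int
    updated_at: datetime
    already_worked: bool = False  # hypothetical flag, not a field from the skill


@dataclass
class PassResult:
    action: str          # "triage" or "park"
    reason: str          # one-line report for the human
    watermark: datetime  # value to persist once the pass has succeeded
    actionable: list = field(default_factory=list)


def run_pass(issues, watermark):
    """One pass: pull what changed since the watermark, then triage or park."""
    changed = [i for i in issues if i.updated_at > watermark]
    actionable = [i for i in changed if not i.already_worked]
    if not actionable:
        # Proposed no-op branch: say why and stop instead of manufacturing work.
        reason = f"nothing actionable ({len(changed)} changed since watermark, none new)"
        return PassResult("park", reason, watermark)
    # The caller persists the new watermark only after triage lands successfully.
    newest = max(i.updated_at for i in actionable)
    return PassResult("triage", f"{len(actionable)} issue(s) to triage", newest, actionable)


# Made-up dates: a quiet tracker where the only change is an issue already worked.
since = datetime(2024, 1, 1, tzinfo=timezone.utc)
quiet = [Issue(300, datetime(2024, 1, 2, tzinfo=timezone.utc), already_worked=True)]
print(run_pass(quiet, since).action)  # park
```

The point of returning a `reason` is that a parked pass still produces a report, so parking reads as a defined outcome and not as the loop under-performing.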

3. /loop (dynamic mode) — how it felt to self-pace

  • Cache-window guidance is good and I used it (270s vs 1200–1800s reasoning). But there's a termination gap: the loop is built to continue, and "stop when work is genuinely done/gated" requires me to override the momentum. Several iterations this session were honestly parked (waiting on a human decision + another repo), and the right move was a long fallback + "no make-work" — but nothing in the skill blesses parking, so it reads as under-performing the loop.
  • Bare-token ambiguity: the human typed 300 mid-loop. I parsed it as a 300s interval (the loop's rule-1 grammar primes that reading); they meant issue #300. The loop's leading-token rule (^\d+[smhd]$) made "a bare number = time" the salient interpretation. Suggestion: when a bare token is ambiguous (no unit, could be an issue ref in an active issue context), ask rather than assume — or the loop skill could note that bare integers are ambiguous.
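
To illustrate the suggestion, here is a hedged sketch of token classification that asks when a bare integer collides with a known issue number. Only the pattern `^\d+[smhd]$` comes from the loop skill as quoted above; `classify_token`, the `open_issues` set, and the reading of `s`/`m`/`h`/`d` as seconds, minutes, hours, and days are assumptions.

```python
import re

INTERVAL = re.compile(r"^(\d+)([smhd])$")  # the leading-token rule quoted above
BARE_INT = re.compile(r"^\d+$")
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}  # assumed unit meanings


def classify_token(token, open_issues):
    """Return (kind, value); kind is 'interval', 'ask', or 'other'."""
    token = token.strip()
    match = INTERVAL.match(token)
    if match:
        # Unit present: unambiguous interval, normalised to seconds.
        return ("interval", int(match.group(1)) * UNIT_SECONDS[match.group(2)])
    if BARE_INT.match(token) and int(token) in open_issues:
        # No unit, and the number is a live issue: ask rather than assume.
        return ("ask", int(token))
    # Everything else is left to whatever handling the loop already has.
    return ("other", token)


print(classify_token("300", {298, 300}))   # ('ask', 300)
print(classify_token("300s", {298, 300}))  # ('interval', 300)
```

Note that the quoted pattern requires a unit, so a bare `300` never matches it; the ambiguity comes from the interval reading being primed, which is why the sketch treats the unit-less case separately.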

4. Cross-repo coordination — the biggest gap

meld#300 was a genuine multi-repo architecture decision: synth (maintainer avrabe) asked meld to commit to an isolation model (single shared memory + MPU-carve vs multi-memory structural isolation) because gale#86's design forked on it. Resolving it meant: reading the cross-repo context, grounding meld's actual capability (it already supports both via MemoryStrategy), surfacing the decision to the human, posting a committed answer, and propagating consequences (reclassifying meld#298/#299 as secondary, recording ADR-4, noting the synth#369/#404 follow-up).

None of the 13 skills covers this. issue-hunt and release-planning are single-repo. There's no skill for: detect a cross-repo coordination issue → gather each repo's real state → identify the decision that's actually mine vs theirs → surface it → record the commitment with traceability → propagate to the dependent issues/repos. This is recurring in this toolchain (the rivet.yaml externals: block — kiln/synth/etc. — shows cross-repo trace is a first-class concern) and was the highest-stakes work of the session, done entirely ad-hoc.

Proposed skill: cross-repo-coordination (or coordinate-decision). Inputs: a coordination issue + the externals map. Steps: (1) ground each repo's real state (don't take the issue's framing at face value — I had to correct an overstated gap this session); (2) classify each open sub-decision as ours / theirs / joint; (3) for ours, surface to the human (it's usually a judgment call with safety implications); (4) record the commitment as an ADR + propagate (relabel dependent issues, cross-link); (5) hand the rest back with a crisp "what we need from you". This would have structured exactly what I did manually.
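
The five steps above could be sketched roughly as follows. This is only an illustration of the proposed flow, not an existing skill: `Owner`, `SubDecision`, `plan`, and the step strings are hypothetical, and the ownership labels in the example are illustrative.

```python
from dataclasses import dataclass
from enum import Enum

GROUND = "ground against the repo's real state first"
SURFACE = "surface to the human, then record an ADR and propagate"
HAND_BACK = "hand back with a crisp 'what we need from you'"


class Owner(Enum):
    OURS = "ours"
    THEIRS = "theirs"
    JOINT = "joint"


@dataclass
class SubDecision:
    summary: str
    owner: Owner
    grounded: bool  # step 1: has the issue's framing been checked against reality?


def plan(sub_decisions):
    """Map each open sub-decision to its next step in the proposed skill."""
    steps = []
    for decision in sub_decisions:
        if not decision.grounded:
            steps.append((decision.summary, GROUND))
        elif decision.owner is Owner.OURS:
            steps.append((decision.summary, SURFACE))
        else:
            # "Theirs" and "joint" both go back to the other repo (step 5).
            steps.append((decision.summary, HAND_BACK))
    return steps


for summary, step in plan([
    SubDecision("isolation model commitment (meld#300)", Owner.OURS, True),
    SubDecision("design that forked on it (gale#86)", Owner.THEIRS, True),
]):
    print(f"{summary}: {step}")
```

Keeping "grounded" as an explicit gate before classification reflects step (1): an ungrounded sub-decision is not routed anywhere until its framing has been verified.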

5. Requirements → verification → test mapping (the maintainer's specific concern)

The concern was: "I see many requirements but not the verification and the mapping of the actual test — does the skill allow this?"

It does, and it's already there — it's just not visible from where you read requirements. Grounded in meld today:

  • safety/requirements/safety-requirements.yaml — each SR carries verification-method (test | proof) + verification-description.
  • safety/requirements/traceability.yaml, verification-status: all 41 SRs map to implementation-files, tests (fully-qualified test names), proofs, and a status.
  • traceability-audit (skill) + release-execution's "traceability completeness gate" consume this; rivet is a real binary and there's a verification-gate.yml CI workflow.

So capability is not the gap. Two real gaps explain the "I don't see it" perception:

  1. Discoverability / co-location. The SR→test mapping lives in a different file (traceability.yaml) from the SR text (safety-requirements.yaml). Reading the requirements file, you see requirements with no tests beside them. A rendered view (or a rivet command) that shows, per requirement: text · verification-method · mapped test names · status inline would close this immediately. Proposed: a small skill/command traceability-render (or extend traceability-audit with a human-facing report) that emits the V-model view per requirement.
  2. Honesty of completion. Of 41 SRs, 24 are status: partial (e.g. "test exists; no count-based completeness verification"), 10 verified, 6 implemented, 4 resolved, 1 not-verified. So for most requirements the test binding is real but weak — which is honest, but means "the mapping" a reader wants (requirement ⇒ a test that actually discharges it) genuinely is incomplete for the majority. The skill should make partial loud, because it's the true state and the thing to burn down.
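
A minimal sketch of what such a render could look like, assuming the two YAML files have already been parsed into dictionaries keyed by SR id. The SR ids, requirement texts, test and proof names, and the `status` key are made up for illustration; only `verification-method`, `tests`, and `proofs` are field names taken from the files described above. The tally helper also makes it easy to check that per-status counts add up to the SR total.

```python
from collections import Counter

# Hypothetical, already-parsed stand-ins for safety-requirements.yaml
# and traceability.yaml. Every value here is made up.
requirements = {
    "SR-1": {"text": "example requirement text", "verification-method": "test"},
    "SR-2": {"text": "another example requirement", "verification-method": "proof"},
}
traceability = {
    "SR-1": {"tests": ["crate::module::test_name"], "proofs": [], "status": "partial"},
    "SR-2": {"tests": [], "proofs": ["example_proof"], "status": "verified"},
}


def render(requirements, traceability):
    """One line per requirement: marker, id, status, method, evidence, text."""
    lines = []
    for sr_id in sorted(requirements):
        req = requirements[sr_id]
        trace = traceability.get(sr_id, {})
        status = trace.get("status", "unmapped")
        evidence = trace.get("tests", []) + trace.get("proofs", [])
        loud = status != "verified"
        marker = "!!" if loud else "  "
        shown = status.upper() if loud else status  # make non-verified states loud
        lines.append(
            f"{marker} {sr_id} [{shown}] {req['verification-method']} "
            f"({', '.join(evidence) or 'no evidence'}): {req['text']}"
        )
    return lines


def status_counts(traceability):
    """Tally statuses so the totals can be compared with the number of SRs."""
    return Counter(t.get("status", "unmapped") for t in traceability.values())


for line in render(requirements, traceability):
    print(line)
print(dict(status_counts(traceability)))
```

Upper-casing and flagging every status other than verified is one possible way to make partial loud; a real render might sort those entries first or fail a gate above a threshold.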

6. How you instruct/adjust me — what works, what's ambiguous (explicitly requested)

Candid, since you asked:

  • Standing directives work best. "concentrate on meld", "Tier-5 needs Mythos before merge", "fail-safe toward keeping", "hallucinations cost more than silence" — these are durable and I applied them across many turns without re-prompting. More of these up front > per-step correction.
  • Per-step yes / "do it as you said" is efficient but compounds scope ambiguity. "Start 298 and integration/embedded-rt-no-grow" bundled two tasks; I had to decide ordering and how far to take each. When an instruction bundles work, an explicit "do X first, then Y" or "X is the priority" saves me guessing.
  • Mid-course corrections were high-value and I'd welcome more, earlier. "we partially do already together with gale and jess" corrected an overstated gap; "concentrate on meld" pulled me back from fork drift. These caught real drift. A periodic "is your current scope still right?" from me (or a skill that re-confirms scope each loop iteration) would catch it sooner.
  • One concrete miscommunication: 300 (issue ref) read as 300 (seconds) — see §3. Bare tokens mid-loop are ambiguous; I should ask, and/or the loop grammar should flag it.
  • What would help me serve you better: (a) when you start a loop, a one-line "definition of done / when to stop" so parking is unambiguous; (b) priority ordering when bundling tasks; (c) treat my AskUserQuestion as the place to put the decision (the Auto-default policy question from #300 is still open) — a couple of pending decisions accumulate because the loop keeps moving.

Concrete asks out of this report

  1. Add a cross-repo-coordination skill (§4).
  2. Add a human-facing traceability render (or extend traceability-audit) so requirement→test→status is visible co-located with requirements (§5.1); and make partial prominent (§5.2).
  3. Give /loop + issue-hunt an explicit park / no-actionable-work branch and a definition-of-done input, so a gated/quiet loop stops cleanly instead of pressuring toward make-work (§2, §3).
  4. Disambiguate bare integer tokens in /loop (interval vs issue ref) — ask, don't assume (§3, §6).

Happy to turn any of these into PRs against claude-tooling/plugins/pulseengine-claude.
