Skip to content

fix(#230): explicit-delegation precheck — alias equality + self-exclusion + LLM fall-through#231

Closed
vansin wants to merge 6 commits into
mainfrom
fix/230-explicit-delegation-self-reflection
Closed

fix(#230): explicit-delegation precheck — alias equality + self-exclusion + LLM fall-through#231
vansin wants to merge 6 commits into
mainfrom
fix/230-explicit-delegation-self-reflection

Conversation

@vansin

@vansin vansin commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Author

Agent: 通信SDK马
Reviewer: 通信牛 (mutual reviewer pair; PR-4 / #228 just closed the loop, this PR continues the alternation)

Summary

Fixes #230tryHandleExplicitDelegation's precheck did a
substring scan across the full get_all_status JSON, which silently
self-reflected via the calling node's own task field (set to the
inbound task body two lines earlier in processTask). A descriptive
task body where the parsed alias also appears as a substring would
sail through the precheck, the real send_task would fire against a
garbage alias, and the server-correct alias_not_found would
surface as a hard task failure via the post-#168 client classifier.

The bug pre-dates the recent hub upgrade window — what changed in the
upgrade window was honest error reporting end-to-end
(server-side ea212fc + client-side 5b61d1c #168). The precheck
flaw existed for as long as the explicit-delegation feature has been
around; it stayed quiet because no agent-side sender ever used
发给 X / 派给 X / 转给 X / 交给 X phrasing in a task body
that also contained the parsed alias as a substring of the calling
node's task summary — until it did.

Two commits, intentionally split

Commit A — 292364c Minimum fix (alias equality + self-exclusion)

Pure helper in agent-node/src/runtime/delegation-precheck.ts:

delegationTargetExists(sessions, target, self) 
  { exists: true, source: "session" }
  | { exists: false, reason: "not_in_sessions" | "self_only" | "empty_sessions" | "no_sessions_field" }
  • Exact alias equality on the sessions array (not substring on full JSON)
  • Self-exclusion so the calling node's own row cannot satisfy the precheck even if a future code path makes self-delegation meaningful
  • Whitespace trimming so accidental padding doesn't mask a hit
  • Defensive skip on rows with missing / non-string alias fields

9 unit cases in delegation-precheck.test.ts covering happy path, the self-reflection repro, self-only match, typo, empty sessions, missing sessions field, empty target alias, whitespace, and malformed rows.

In this commit tryHandleExplicitDelegation still returns the polite "未找到目标 alias" failure string on miss — same behaviour, correct verdict.

Commit B — f9bd8a3 Orthogonal improvement (fall-through to LLM)

Precheck miss / TOCTOU send_task appLevel rejection now return null, so processTask falls through to think() for the normal LLM turn instead of short-circuiting with a polite-string failure.

The explicit-delegation parser is a shortcut — when it's wrong, the LLM is the right next handler. Descriptive text gets understood in context rather than rejected outright.

Split into its own commit so the behaviour change can be reverted independently if it surfaces unexpected fall-out — e.g. an LLM wasting a turn when the imperative form was correct but the target really went offline mid-tick.

Tests

agent-node/src/runtime/delegation-precheck.test.ts — 9 cases as listed above.

Full agent-node bun test src/: 173 pass / 0 fail (164 + 9 new).

The wire-up in cli.ts is a 3-line policy swap whose correctness is structurally implied by the helper's typed return — no separate cli.ts integration test is added.

Repro phrasing (redacted)

The triggering pattern is descriptive past-tense in the task body:

"...刚才发给 <Person> 的 <artifact>..."

Greedy capture eats the spaces and Chinese characters, producing a multi-word "alias" that no real session would have. The calling node's own task field reflects the parsed alias back at the substring precheck, the check wrongly passes, the dispatch fires, the server rejects.

<Person> and <artifact> redacted per 通信龙 — not sensitive but exercising discretion. The structural pattern is sufficient to reproduce.

Ship vehicle

Per #230 + the rename-family promote plan: rides in agent-node 2.4.11 Phase B (the next batch after the rename family chain). Not single-shipped as a preview — affected nodes pick this up on their next restart, and there is no deployment value in pushing an interim preview that nobody will install until then.

Affected node operator note (for 通信龙 to relay through Vincent UAT): when restarting nodes to pick up the 2.4.11 bundle, the new precheck takes effect immediately, and any task body that previously tripped the false positive (descriptive 发给/派给/转给/交给 X phrasing) will now be handled by the LLM instead of bouncing back as alias_not_found.

Refs

…tion

Rename family unifying-fix runtime lane (per #146 verdict). PR-1+2 (commit
2dc166c) gave us the server-side handles — tasks.from_node_id /
inbox.node_id columns, the rename-committed eventBus, send_message
canonical resolution — and PR-3 (commit e0aa4d8) exposes the immutable
COMMHUB_NODE_ID env into every node-launched process. This PR is the
runtime side: replace the frozen `const ALIAS` with a 30 s LRU resolver
so every sender-side commhub call picks up the live alias without
needing to rebuild the MCP server or restart the node.

ROOT CAUSE this PR addresses

`agent-node/src/cli.ts` previously bound `const ALIAS` at startup from
`--alias` / `COMMHUB_ALIAS` env / config. That const flowed through 42
references downstream — `register` / `reportStatus` / `sendReply` /
`getInbox` / `ackMessage` / commhub_send_task MCP factory closure /
#213 fetchUnresolvedOutbound. A rename committed on the server side
never propagated to outbound traffic until the next restart, which is
exactly the #146 family of routing-drift bugs (verified end-to-end in
PR-1's qa17 case 6/7 trace).

WHAT SHIPS

A. agent-node/src/runtime/current-alias.ts (NEW, ~150 LOC pure module).

   `CurrentAliasResolver` wraps a 30 s LRU around the canonical alias
   the server reports for this node_id:

   - `current()` — synchronous, returns the last-known alias. Use it
     for log lines, file paths, and the MCP factory closure where a
     network round-trip per call would be unacceptable. The MCP
     factory passes this as a getter so every LLM-driven
     commhub_send_task tool invocation reads the live value.
   - `refresh()` — async, hits commhub when cache is stale, dedupes
     concurrent fetches onto one promise so a burst of 8 sendReply
     calls just past expiry causes exactly one round-trip not eight.
     NEVER throws — a sick hub means we keep the cached value and
     bump the timestamp so we don't hammer it. Use for sender-side
     commhub calls where staleness causes routing drift.
   - `set()` — force-install when a server response (e.g. report_status)
     authoritatively confirms the canonical alias. Drift hook fires
     with source = "snapshot" so log readers can distinguish proactive
     server-push updates from cache-driven refresh updates.
   - `cachedAt = 0` sentinel for "never fetched" so the first
     refresh() actually hits the server even when called at logical
     time 0.

B. agent-node/src/cli.ts wire-up.

   - NODE_ID resolution gains a `process.env.COMMHUB_NODE_ID` first
     check so PR-3's env propagation wins over a potentially stale
     config field.
   - Single module-level `aliasResolver` instance. Fetch hook uses
     `GET /api/status?network_id=...` with a 2.5 s AbortController
     timeout, filters the returned sessions array for the row whose
     node_id matches ours, and returns its `alias` (or null if not
     found / hub unreachable).
   - `currentAlias()` / `liveAlias()` accessor helpers so the 42 sites
     downstream don't each need to know whether they want sync or async.
   - Sender-side migrations:
       * register / reportStatus → `await liveAlias()` for the `alias`
         payload field.
       * register also calls `aliasResolver.set(result.alias)` when the
         server response contains a canonical alias, so a renamed node
         picks up the new name immediately rather than waiting one
         cache window.
       * getInbox / ackMessage → `await liveAlias()`.
       * sendReply (cli.ts:677) → `from_session: await liveAlias()`.
       * explicit-delegation wrapper send_task (cli.ts:1738) → same.
       * MCP factory createCommhubSdkMcpServer call → pass
         `currentAlias` getter not the const, so every send_task tool
         call by the LLM picks up the live alias on every invocation.
   - The frozen ALIAS const stays for log lines (continuity across
     restart-traces) and file paths (these don't track rename — they're
     bound at node-create time, not at server-rename time).
   - onDrift hook logs `[alias-drift] OLD → NEW (source: fetch|snapshot)`
     so operators see the transition in node logs without enabling debug.

C. agent-node/src/commhub-mcp.ts.

   `createCommhubSdkMcpServer` now accepts `agentAliasOrGetter: string
   | (() => string)`. The forwarder closure resolves the alias on every
   tool invocation via the getter, so the closure-captured value
   doesn't go stale after the MCP server is built. Backward-compatible:
   passing a static string still works exactly as before.

D. agent-node/src/runtime/grok-build-acp/resume-hint.ts migration.

   `ListTasksHook` params type gains `from_node_id?: string`.
   `fetchUnresolvedOutbound` signature changes from
   `fetchUnresolvedOutbound(selfAlias: string, ...)` to
   `fetchUnresolvedOutbound({ nodeId, alias }, ...)`. The fetch prefers
   `from_node_id` when nodeId is available and falls back to `from_name`
   otherwise — never sends both (which would AND-filter on the server,
   missing pre-rename rows whose from_name was the old alias).
   cli.ts call site updated to pass `{ nodeId: NODE_ID || null, alias:
   currentAlias() }`.

TESTS

E. agent-node/src/runtime/current-alias.test.ts (NEW, 15 cases).

   Coverage matrix:
     - current() returns startup snapshot before any refresh
     - ageMs() reports Infinity / isFresh() false before first fetch
     - warm cache short-circuits — no fetch fired within TTL
     - expired cache hits the server, updates the alias, fires
       onDrift("fetch")
     - concurrent refresh() dedupe onto one fetch (verified via
       maxInflight counter)
     - fetch throwing keeps cached value + emits warn (includes
       cached alias name for ops visibility)
     - fetch returning null / empty string treated as "server doesn't
       know yet"
     - after failed fetch the cache timestamp bumps so we don't hammer
       a sick hub
     - set() updates the alias and fires onDrift("snapshot")
     - set() with same value is no-op for drift but bumps timestamp
     - set("") is ignored (defends against caller forgetting validation)
     - nodeId = null short-circuits refresh() — no fetch hook called
     - cacheTtlMs = 0 disables caching — every refresh() fetches
     - ageMs() / isFresh() reflect actual cache state for assertions

F. agent-node/src/runtime/grok-build-acp/resume-hint.test.ts.

   Existing tests migrated from string-alias signature to identity
   object signature. Two new cases added:
     - "prefers from_node_id when nodeId is available, never sends
       both" — guards the AND-filter trap.
     - "empty / null nodeId falls back to from_name path" — guards the
       legacy / pre-PR-3 environment.

Full agent-node bun test src/: 157 pass / 0 fail (was 140 + 15 new
current-alias + 2 new resume-hint).

BOUNDARY WITH PR-1+2 + PR-3

This PR depends on (already merged on main):
- PR-1+2 (2dc166c): tasks.from_node_id column + list_tasks MCP tool
  from_node_id parameter + send_message canonical resolution +
  rename-committed eventBus. Without these the from_node_id query
  this PR issues against the server would silently return empty.
- PR-3 (e0aa4d8): COMMHUB_NODE_ID env propagation from launcher into
  every node-launched process. Without this, the NODE_ID source would
  be limited to the node's config.json which can go stale.

Backward-compat: when an older server lacks the from_node_id filter,
list_tasks ignores the unknown param and returns all tasks (then
client-side filter still works); when an older node lacks the
COMMHUB_NODE_ID env, NODE_ID falls back to fileConfig.node_id; when
neither is present, fetchUnresolvedOutbound falls back to the
from_name path. Behaviour degrades gracefully one layer at a time.

REVIEWER

通信牛 (mutual reviewer pair per #146 staffing). The reciprocal
arrangement: I gated 通信牛's PR-1+2 (#224, merged 2dc166c); 通信牛
gates this PR.

REFS

- #146 verdict (this PR's spec): #146 (comment)
- PR-1+2 merge commit: 2dc166c
- PR-3 merge commit: e0aa4d8
- prior runtime fixes in the same area: 5b61d1c (#168 reply
  reliability), d206fdd (#213 resume hint), 21b0aa2 (#214 dim 5 A6
  startup log)
…ent-side identity filter

Address the two REQUEST_CHANGES from 通信牛's PR #228 二审 (verdict
comment 4677600473). Both substantive, both fixed in this iteration.

Issue ① — MCP closure unbounded staleness

The first revision wired the MCP factory closure to the SYNC
`currentAlias()` accessor. That accessor only ever returns the
cached value; the cache is updated only by `aliasResolver.set()`
(at register) or by an explicit `aliasResolver.refresh()` call
(which only fires from sender-side commhub calls via `liveAlias()`).
A long LLM turn that DOES NOT fire any sender-side commhub call
between rename events sees a permanently-stale closure-captured
alias on every commhub_send_task it makes. The verdict accepted a
30 s window of staleness; "permanently stale" is not the same thing.

Fix — three layers, each defending a different failure shape:

a) commhub-mcp.ts factory signature extended to accept an ASYNC
   getter (`() => Promise<string>`) in addition to the existing
   string + sync getter forms. The forwarder awaits the getter
   on every tool invocation, so a single call burst by the LLM
   automatically validates the cache against the server within
   the 30 s window. Cache hit is microseconds; cache miss is one
   round-trip with a 2.5 s budget. Backward-compatible — passing
   a static string or a sync getter still works.

b) cli.ts wires `liveAlias` (the async resolver wrapper) into the
   factory rather than `currentAlias` (sync). Every LLM-driven
   send_task / send_message tool call now revalidates.

c) cli.ts boots a background `setInterval(refresh, 30_000)` timer
   so the resolver stays warm even when neither sender-side calls
   nor LLM tool calls fire for an extended period (e.g. an idle
   node waiting for inbox). `.unref()` so the timer doesn't keep
   the process alive on its own. Cheap — one HTTP round-trip every
   30 s. Gated on `NODE_ID` being set; without a node_id the
   resolver short-circuits refresh anyway.

Issue ② — from_node_id query against an old commhub is unsafe

The first revision sent `from_node_id` as a `list_tasks` query
param whenever the node had a node_id. Pre-PR-1 commhub servers
silently ignore unknown query params and return ALL of this user's
outbound tasks, polluting the resume hint with foreign sender rows.
Without a guard, a stable v0.10.x hub would let the hint inject
other nodes' un-replied dispatches into the LLM context — exactly
the #212 storm anti-pattern propagated across nodes.

Fix — capability probe + client-side per-row identity filter (both
applied, defence in depth):

a) cli.ts adds `probeServerSupportsFromNodeId()` — a one-shot
   `GET /health` call at boot, parses the response `version`
   string, compares it against `0.8.6-preview.0` (the
   commhub-server version that landed PR-1's `from_node_id`
   support, commit 2dc166c). Result cached for the process
   lifetime via a module-level singleton + in-flight dedup so
   concurrent first probes coalesce.

   The version comparator handles `MAJOR.MINOR.PATCH` and
   `MAJOR.MINOR.PATCH-preview.N` shapes. Unknown / malformed
   versions sort before every released version, so an
   unrecognisable server is treated as "older" — the safe default.

b) `fetchUnresolvedOutbound` opts gain `serverSupportsFromNodeId`.
   When `true` AND the caller has a node_id, send `from_node_id`.
   Otherwise send `from_name` (the pre-PR-1 path). Never send both.

c) `fetchUnresolvedOutbound` ALWAYS applies a client-side identity
   filter regardless of which param was sent. Each returned row is
   checked: prefer `row.from_node_id === selfIdentity.nodeId` when
   the row has a node_id; otherwise demand
   `row.from_name === selfIdentity.alias`. Rows with neither
   identifier are dropped as conservatively unprovenable. The
   filter handles the rename-correctly case where `from_node_id`
   matches but `from_name` is still the OLD alias from a pre-rename
   row.

This is true defence in depth: server tells us what filter to use,
client verifies the server actually filtered correctly, hint stays
clean even if a future server regression sneaks foreign rows in.

TESTS

agent-node/src/runtime/grok-build-acp/resume-hint.test.ts gains
7 new cases:

- "sends from_node_id ONLY when probe confirmed server supports it"
- "without probe confirmation, never sends from_node_id (old-server
   safety)"
- "when probe explicitly returned false, falls back even with
   node_id available"
- "drops rows whose from_node_id does not match ours (server bug
   defence)"
- "when row has no from_node_id, falls back to from_name match"
- "drops rows with NEITHER from_node_id nor from_name (conservative)"
- "prefers from_node_id over from_name when both present (handles
   rename correctly)"
- "when WE have no nodeId, identity check uses from_name only"

Existing row() factory bumped to default `from_name: "self"` so the
new identity filter lets pre-existing test rows through; tests that
specifically exercise the filter override `from_node_id` /
`from_name` explicitly.

Full agent-node bun test src/: 164 pass / 0 fail (was 157 + 7 new).

REVIEWER

通信牛 — please re-verify both blocking items addressed plus the
new defence-in-depth filter behaviour. Mutual reviewer pair
continues; I gated your #224, you gate this one. Re-review
turnaround whenever you have a slot, no urgency.

REFS

- 通信牛 二审 verdict: #228 (comment)
- 通信龙 relay + accept both blocking items as substantive: dispatch
  d4995e5b
- Prior PR-4 commit: 65bd54a (squashed in this branch)
- PR-1+2 server merge: 2dc166c (provides /health version + from_node_id
  filter support starting at commhub-server 0.8.6-preview.0)
- PR-3 CLI merge: e0aa4d8 (COMMHUB_NODE_ID env propagation)
- tests/docker-config-priority.sh +1: `mkdir -p /root/.anet` before
  first `echo > /root/.anet/config.json` write. Pre-fix: `set -e` exits
  on permission error, Config Priority suite never reached Results line.
- tests/test-all.sh +11: sanity guard treats `0 passed + 0 failed` as
  CRASHED instead of silent green (issue #65 author recommendation
  to prevent future silent-skip rot).

Verified in Docker (anet-tests:65 build):
- Pre-fix: suite crashes at line 22 (`echo > config.json` fails),
  test-all shows `✅ Config Priority: 0 passed, 0 failed` (silent).
- Post-fix: suite reaches Step 1 ("Global hub fallback..."), then
  crashes on V3 auth (`anet node create` requires login) — this is
  #63's territory, properly out of scope here. test-all now shows
  `⚠️ Config Priority (16): SKIPPED/CRASHED (0 ran — ...)` and adds 1
  to TOTAL_FAIL.

Evidence: docs/tests/p-65-config-priority/{test-config-run.log, test-all-run.log}

红线: Docker only, host hub 未碰, /tmp/cfg-test workdir 即清。
Author-Agent: 通信测试马
Refs: #65, depends-on #63 for full suite green
V3 auth (RFC-001) rejects anonymous + master-token writes. Pre-fix §4 had:
- mcp_call() anonymous (no Authorization header) → send_task 401
- Query GETs anonymous → empty network-scoped responses

Fix (tests/docker-e2e-networks.sh):
- mcp_call() takes user token as $3, injects `Authorization: Bearer <tok>`
  on both initialize and tools/call legs
- §4 send_task calls pass matching owner token (TOKEN_A for alpha, TOKEN_B
  for beta)
- §4 query GETs `?network_id=...` add owner-matching Bearer header
- §5 stats GET adds owner Bearer header

Docker verify (anet-tests:64):
- Pre-fix: ~4 fail in §4/§5 (mcp_call anon + query anon)
- Post-fix: 20 passed, 2 failed
  - 2 remaining fails ("alpha/beta task missing") are deeper isolation
    semantic (send_task body's network_id may not bind tasks correctly to
    non-default networks when from_session="tester" has no node in that
    network) — beyond #64 fixture-auth scope, may be product-side or test
    redesign. Documented in evidence log.

Evidence: docs/tests/p-64-networks/{test-networks-run.log, fixed-script.sh}

红线: Docker only, host hub 未碰, /tmp workdir 即清。
Author-Agent: 通信测试马
Refs: #64 (partial close — 2/4 mcp_call auth fails fixed cleanly; 2 isolation
semantic fails deferred), depends-on #63 for any remaining anon REST sites
…sion

The previous precheck did
  `JSON.stringify(get_all_status).includes(parsed.alias)`
to decide whether the parsed target alias actually exists before
firing a real `send_task`. That substring scan covers every
session's `task` field — including the calling node's own row,
which it has just set to the inbound task body via
`reportStatus("working", task.slice(0, 200))` two lines earlier in
`processTask`.

When the parsed alias is itself a substring of the original task
text (the natural case for a descriptive rather than imperative
trigger phrase, e.g. `"...刚才发给 <某人> 的 <某物>..."`), the
calling node's own `task` field reflects the parsed alias back at
the precheck, the check wrongly passes, and the real `send_task`
fires against a garbage alias. The server correctly rejects with
`alias_not_found`, the post-#168 client classifier surfaces it as
an `appLevel` rejection, and `processTask` returns
`grok 错误: app-level rejection: alias_not_found` as the failed
task result — usually within ~1 s of receipt, before any LLM turn
has a chance to handle the descriptive text properly.

The bug pre-dates the recent hub upgrade window; it just stayed
quiet because no agent-side sender to the affected node had ever
used `发给 X` / `派给 X` / `转给 X` / `交给 X` imperative-like
phrasing in their task body. What changed in the upgrade window
was honest error reporting end-to-end (server-side `ea212fc` +
client-side `5b61d1c` #168), which surfaced the precheck false
positive instead of swallowing it.

Fix — agent-node/src/runtime/delegation-precheck.ts (new pure
module, 9 unit cases):

  delegationTargetExists(sessions, target, self) returns a typed
  result distinguishing:
    - exists: true  (target alias is online and is not us)
    - exists: false, reason: "not_in_sessions"
    - exists: false, reason: "self_only"
    - exists: false, reason: "empty_sessions"
    - exists: false, reason: "no_sessions_field"

  Exact alias equality on the `sessions` array, self-exclusion so
  the calling node's own row cannot satisfy the precheck even if a
  future code path makes self-delegation meaningful, whitespace
  trimming so accidental padding doesn't mask a real hit, and
  defensive skip on rows with missing / non-string alias fields.

agent-node/src/cli.ts wire-up swaps the inline substring check for
the helper call. The polite-string failure behaviour is preserved
in this commit — a follow-up commit on this PR converts the miss
into a `return null` so `processTask` can fall through to the LLM
instead of short-circuiting descriptive-text tasks entirely.

Test coverage in agent-node/src/runtime/delegation-precheck.test.ts:
  - imperative happy path (real other session is found)
  - the #230 self-reflection repro (descriptive-text false positive
    with the calling node's own `task` field containing the parsed
    alias as a substring; old substring check would pass, new
    equality check correctly reports not_in_sessions)
  - self-only match (target alias equals the caller's own alias →
    self_only)
  - typo alias
  - empty sessions / missing sessions field
  - empty target alias defended
  - whitespace trimming
  - sessions with missing / non-string alias fields skipped

Full agent-node bun test src/: 173 pass / 0 fail (164 + 9 new).

Refs:
- #230 (the issue)
- #168 / 5b61d1c (the client classifier whose honest error
  reporting surfaced the latent precheck bug)
- ea212fc (the server-side companion that started returning real
  alias_not_found instead of swallowing)
Orthogonal improvement on top of the alias-equality fix from the
previous commit. With the equality check in place, precheck misses
are now structurally correct ("the parsed alias really isn't an
online session"). But the response was still to short-circuit
`processTask` with a polite string failure, which means a
descriptive task body like
  `"...刚才发给 <某人> 的 <某物>..."`
gets parsed → precheck correctly misses → user gets a polite
"未找到目标 alias" reply, and the LLM never sees the actual task.

That's the wrong default. The parser is a heuristic that should
SHORTCUT the LLM when it's confident, not REPLACE the LLM when
it's wrong. When the precheck says the parsed target isn't there,
the LLM is the right next handler — it can read the description,
understand the user's actual intent, and either decline politely
in context or take a reasonable next step.

Behaviour change on this commit:

  - `tryHandleExplicitDelegation` returns `null` instead of a polite
    failure string when:
      a) the precheck reports the parsed alias isn't online, OR
      b) the precheck passes but the real `send_task` itself comes
         back with an `appLevel` CommHubError (TOCTOU: alias went
         offline between the precheck and the dispatch, or some
         future server rejection we haven't anticipated).
  - On (a) and (b), `processTask` falls through to `think()` →
    normal LLM turn handles the task.
  - On a non-app-level error (transport / 5xx), the throw still
    bubbles up — those are real infrastructure failures, not
    parser-disagreement signals, and the LLM can't do anything
    useful about them.

The `processTask` catch block at cli.ts:1948 still classifies any
throw from `think()` itself as `failed: true`, so the dispatcher
still sees a real error response when the LLM genuinely can't run.

Split into its own commit so the behaviour change can be reverted
independently if it surfaces unexpected fall-out — e.g. an LLM
that wastes a turn on a task whose imperative form was correct
and whose target really is offline. Test coverage for the parser
side stays in delegation-precheck.test.ts (covers the upstream
precheck result types); the wire-up here is a 3-line policy choice
whose correctness is implied by the helper's typed return.

Refs:
- #230 (the issue)
- Companion commit on this PR: alias equality + self-exclusion
@vansin

vansin commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Closing in favour of #232 — clean branch rebased on origin/main after 通信龙 catch (the original branch was rooted on fix/146-pr4-alias-getter, and PR-4 squash-merged as 2e9fae0 meant the squash dropped the original commits from main's history, causing GitHub to surface PR-4's full diff in this review face). Same two commits cherry-picked clean.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f9bd8a306c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread agent-node/src/cli.ts
// every launched node, fall back to the config field for back-compat with
// nodes started before PR-3 landed. The env wins because the launcher
// always knows the canonical node id; a stale config file would mislead.
const NODE_ID = process.env.COMMHUB_NODE_ID || fileConfig.node_id || "";

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep config node_id ahead of COMMHUB_NODE_ID

When a user starts agent-node with an explicit config while their shell still has COMMHUB_NODE_ID exported from another node, this line makes the process adopt the stale env identity instead of fileConfig.node_id. That changes RESUME_ID, report_status.node_id, and the alias resolver lookup to the wrong node, so the agent can overwrite another session or start sending/acking under the other node's canonical alias. The launcher path you added does set the env correctly, but manual/debug starts and inherited environments still need the loaded config to win when it contains a node_id.

Useful? React with 👍 / 👎.

Comment thread agent-node/src/cli.ts
Comment on lines +748 to +749
const alias = await liveAlias();
return (await callCommHub("get_inbox", { alias, limit: 20 }))?.messages || [];

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Poll old-alias inbox rows after rename

If a task was already queued before a rename and not yet acked, switching get_inbox to only the live canonical alias strands that message: commitRename explicitly keeps inbox history under the old alias, and get_inbox still filters only inbox.session_name = alias rather than node_id. In that rename-with-pending-inbox scenario the node stops seeing the old row as soon as the resolver refreshes, so the task is never processed unless another path migrates or queries by node_id/old alias.

Useful? React with 👍 / 👎.

Comment thread agent-node/src/cli.ts
return false;
}
const body = (await res.json()) as { version?: string };
const supported = compareCommhubVersion(body.version, "0.8.6-preview.0") >= 0;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't disable node-id resume hints on current hubs

The capability probe currently treats this repo's own server as unsupported: /health returns SERVER_VERSION from server/package.json, which is still 0.8.5, while list_tasks already accepts from_node_id. Against that hub this branch always falls back to from_name, so after a node rename the Grok resume hint misses pre-rename outbound tasks and the duplicate-dispatch guardrail it was meant to provide does not run.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug][P1] explicit-delegation precheck self-reflection — own task field pollutes JSON substring match → alias_not_found false failure

2 participants