fix(#146 PR-4): runtime alias getter + node_id-based identity propagation#228
Conversation
…tion Rename family unifying-fix runtime lane (per #146 verdict). PR-1+2 (commit 2dc166c) gave us the server-side handles — tasks.from_node_id / inbox.node_id columns, the rename-committed eventBus, send_message canonical resolution — and PR-3 (commit e0aa4d8) exposes the immutable COMMHUB_NODE_ID env into every node-launched process. This PR is the runtime side: replace the frozen `const ALIAS` with a 30 s LRU resolver so every sender-side commhub call picks up the live alias without needing to rebuild the MCP server or restart the node. ROOT CAUSE this PR addresses `agent-node/src/cli.ts` previously bound `const ALIAS` at startup from `--alias` / `COMMHUB_ALIAS` env / config. That const flowed through 42 references downstream — `register` / `reportStatus` / `sendReply` / `getInbox` / `ackMessage` / commhub_send_task MCP factory closure / #213 fetchUnresolvedOutbound. A rename committed on the server side never propagated to outbound traffic until the next restart, which is exactly the #146 family of routing-drift bugs (verified end-to-end in PR-1's qa17 case 6/7 trace). WHAT SHIPS A. agent-node/src/runtime/current-alias.ts (NEW, ~150 LOC pure module). `CurrentAliasResolver` wraps a 30 s LRU around the canonical alias the server reports for this node_id: - `current()` — synchronous, returns the last-known alias. Use it for log lines, file paths, and the MCP factory closure where a network round-trip per call would be unacceptable. The MCP factory passes this as a getter so every LLM-driven commhub_send_task tool invocation reads the live value. - `refresh()` — async, hits commhub when cache is stale, dedupes concurrent fetches onto one promise so a burst of 8 sendReply calls just past expiry causes exactly one round-trip not eight. NEVER throws — a sick hub means we keep the cached value and bump the timestamp so we don't hammer it. Use for sender-side commhub calls where staleness causes routing drift. - `set()` — force-install when a server response (e.g. report_status) authoritatively confirms the canonical alias. Drift hook fires with source = "snapshot" so log readers can distinguish proactive server-push updates from cache-driven refresh updates. - `cachedAt = 0` sentinel for "never fetched" so the first refresh() actually hits the server even when called at logical time 0. B. agent-node/src/cli.ts wire-up. - NODE_ID resolution gains a `process.env.COMMHUB_NODE_ID` first check so PR-3's env propagation wins over a potentially stale config field. - Single module-level `aliasResolver` instance. Fetch hook uses `GET /api/status?network_id=...` with a 2.5 s AbortController timeout, filters the returned sessions array for the row whose node_id matches ours, and returns its `alias` (or null if not found / hub unreachable). - `currentAlias()` / `liveAlias()` accessor helpers so the 42 sites downstream don't each need to know whether they want sync or async. - Sender-side migrations: * register / reportStatus → `await liveAlias()` for the `alias` payload field. * register also calls `aliasResolver.set(result.alias)` when the server response contains a canonical alias, so a renamed node picks up the new name immediately rather than waiting one cache window. * getInbox / ackMessage → `await liveAlias()`. * sendReply (cli.ts:677) → `from_session: await liveAlias()`. * explicit-delegation wrapper send_task (cli.ts:1738) → same. * MCP factory createCommhubSdkMcpServer call → pass `currentAlias` getter not the const, so every send_task tool call by the LLM picks up the live alias on every invocation. - The frozen ALIAS const stays for log lines (continuity across restart-traces) and file paths (these don't track rename — they're bound at node-create time, not at server-rename time). - onDrift hook logs `[alias-drift] OLD → NEW (source: fetch|snapshot)` so operators see the transition in node logs without enabling debug. C. agent-node/src/commhub-mcp.ts. `createCommhubSdkMcpServer` now accepts `agentAliasOrGetter: string | (() => string)`. The forwarder closure resolves the alias on every tool invocation via the getter, so the closure-captured value doesn't go stale after the MCP server is built. Backward-compatible: passing a static string still works exactly as before. D. agent-node/src/runtime/grok-build-acp/resume-hint.ts migration. `ListTasksHook` params type gains `from_node_id?: string`. `fetchUnresolvedOutbound` signature changes from `fetchUnresolvedOutbound(selfAlias: string, ...)` to `fetchUnresolvedOutbound({ nodeId, alias }, ...)`. The fetch prefers `from_node_id` when nodeId is available and falls back to `from_name` otherwise — never sends both (which would AND-filter on the server, missing pre-rename rows whose from_name was the old alias). cli.ts call site updated to pass `{ nodeId: NODE_ID || null, alias: currentAlias() }`. TESTS E. agent-node/src/runtime/current-alias.test.ts (NEW, 15 cases). Coverage matrix: - current() returns startup snapshot before any refresh - ageMs() reports Infinity / isFresh() false before first fetch - warm cache short-circuits — no fetch fired within TTL - expired cache hits the server, updates the alias, fires onDrift("fetch") - concurrent refresh() dedupe onto one fetch (verified via maxInflight counter) - fetch throwing keeps cached value + emits warn (includes cached alias name for ops visibility) - fetch returning null / empty string treated as "server doesn't know yet" - after failed fetch the cache timestamp bumps so we don't hammer a sick hub - set() updates the alias and fires onDrift("snapshot") - set() with same value is no-op for drift but bumps timestamp - set("") is ignored (defends against caller forgetting validation) - nodeId = null short-circuits refresh() — no fetch hook called - cacheTtlMs = 0 disables caching — every refresh() fetches - ageMs() / isFresh() reflect actual cache state for assertions F. agent-node/src/runtime/grok-build-acp/resume-hint.test.ts. Existing tests migrated from string-alias signature to identity object signature. Two new cases added: - "prefers from_node_id when nodeId is available, never sends both" — guards the AND-filter trap. - "empty / null nodeId falls back to from_name path" — guards the legacy / pre-PR-3 environment. Full agent-node bun test src/: 157 pass / 0 fail (was 140 + 15 new current-alias + 2 new resume-hint). BOUNDARY WITH PR-1+2 + PR-3 This PR depends on (already merged on main): - PR-1+2 (2dc166c): tasks.from_node_id column + list_tasks MCP tool from_node_id parameter + send_message canonical resolution + rename-committed eventBus. Without these the from_node_id query this PR issues against the server would silently return empty. - PR-3 (e0aa4d8): COMMHUB_NODE_ID env propagation from launcher into every node-launched process. Without this, the NODE_ID source would be limited to the node's config.json which can go stale. Backward-compat: when an older server lacks the from_node_id filter, list_tasks ignores the unknown param and returns all tasks (then client-side filter still works); when an older node lacks the COMMHUB_NODE_ID env, NODE_ID falls back to fileConfig.node_id; when neither is present, fetchUnresolvedOutbound falls back to the from_name path. Behaviour degrades gracefully one layer at a time. REVIEWER 通信牛 (mutual reviewer pair per #146 staffing). The reciprocal arrangement: I gated 通信牛's PR-1+2 (#224, merged 2dc166c); 通信牛 gates this PR. REFS - #146 verdict (this PR's spec): #146 (comment) - PR-1+2 merge commit: 2dc166c - PR-3 merge commit: e0aa4d8 - prior runtime fixes in the same area: 5b61d1c (#168 reply reliability), d206fdd (#213 resume hint), 21b0aa2 (#214 dim 5 A6 startup log)
|
我这边二审结论:REQUEST_CHANGES。方向是对的,
验证情况:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 65bd54adb5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const alias = await liveAlias(); | ||
| return (await callCommHub("get_inbox", { alias, limit: 20 }))?.messages || []; |
There was a problem hiding this comment.
Keep polling pre-rename inbox rows
When a node is renamed while it still has unacked tasks queued under the old alias, this switches the inbox lookup to the canonical alias returned by liveAlias(). I checked server/src/rename.ts: commitRename only updates sessions/nodes and explicitly preserves inbox/tasks history, while server/src/tools.ts:get_inbox still filters by WHERE session_name = ?; once the resolver refreshes, those old rows are no longer returned and the node silently stops processing tasks that arrived before the rename. Poll/ack by node_id, or drain both old and current aliases during the transition.
Useful? React with 👍 / 👎.
| const params = selfIdentity.nodeId | ||
| ? { from_node_id: selfIdentity.nodeId, limit } | ||
| : { from_name: selfIdentity.alias, limit }; |
There was a problem hiding this comment.
Preserve alias fallback for older hubs
In the documented compatibility case where the hub's list_tasks tool does not yet accept from_node_id, this sends only the unknown from_node_id field and no from_name. Older hubs will ignore/strip the unknown field and return unscoped tasks, and the client-side filter below only checks shape/status, so Grok can prepend other agents' outstanding tasks as if this node had already dispatched them after resume. Include from_name as a fallback request or filter returned rows by from_name when the response is not scoped by node id.
Useful? React with 👍 / 👎.
| return callCommHub("report_status", { | ||
| resume_id: RESUME_ID, alias, status, task, | ||
| node_id: NODE_ID || undefined, | ||
| session_id: claudeSessionId || grokSessionId || SESSION_ID || undefined, | ||
| config_path: configFilePath || undefined, | ||
| channels: channelSpecs.length ? JSON.stringify(channelSpecs) : undefined, | ||
| network_id: NETWORK_ID || undefined, | ||
| host: getHostTelemetry(), | ||
| process_telemetry: getProcessTelemetry(), | ||
| }); |
There was a problem hiding this comment.
Apply canonical aliases returned by heartbeats
When a rename is committed while the resolver cache is still fresh, this report_status call can be sent with the old alias; the server canonicalizes that and returns the new alias, but unlike register() this path discards the response instead of calling aliasResolver.set(...). I checked server/src/rename.ts/server/src/tools.ts: commit also renames the node token label, so subsequent send_reply/send_task calls that keep using the cached old from_session can hit from_session_identity_mismatch until the TTL expires. Capture the result here and update the resolver the same way as the initial register path.
Useful? React with 👍 / 👎.
| // #146 PR-4 — pass a getter, not the const, so every LLM-driven | ||
| // commhub_send_task tool call reads the live alias at invocation | ||
| // time rather than the closure-captured startup value. | ||
| mcpServers["commhub"] = await createCommhubSdkMcpServer(commhubUrl, commhubToken, currentAlias); |
There was a problem hiding this comment.
Refresh aliases before MCP tool dispatch
Passing currentAlias into the SDK MCP server only gives tool calls the last cached value; it never performs the 30s refresh. If a rename is committed during a long Claude turn after the initial reportStatus("working"), an LLM commhub_send_task still injects the old from_session, and the hub's ntok identity check in server/src/tools.ts rejects it once the token label has moved to the new alias. Use an async/live alias lookup at tool invocation time, or avoid injecting a stale from_session for network-token calls.
Useful? React with 👍 / 👎.
…ent-side identity filter Address the two REQUEST_CHANGES from 通信牛's PR #228 二审 (verdict comment 4677600473). Both substantive, both fixed in this iteration. Issue ① — MCP closure unbounded staleness The first revision wired the MCP factory closure to the SYNC `currentAlias()` accessor. That accessor only ever returns the cached value; the cache is updated only by `aliasResolver.set()` (at register) or by an explicit `aliasResolver.refresh()` call (which only fires from sender-side commhub calls via `liveAlias()`). A long LLM turn that DOES NOT fire any sender-side commhub call between rename events sees a permanently-stale closure-captured alias on every commhub_send_task it makes. The verdict accepted a 30 s window of staleness; "permanently stale" is not the same thing. Fix — three layers, each defending a different failure shape: a) commhub-mcp.ts factory signature extended to accept an ASYNC getter (`() => Promise<string>`) in addition to the existing string + sync getter forms. The forwarder awaits the getter on every tool invocation, so a single call burst by the LLM automatically validates the cache against the server within the 30 s window. Cache hit is microseconds; cache miss is one round-trip with a 2.5 s budget. Backward-compatible — passing a static string or a sync getter still works. b) cli.ts wires `liveAlias` (the async resolver wrapper) into the factory rather than `currentAlias` (sync). Every LLM-driven send_task / send_message tool call now revalidates. c) cli.ts boots a background `setInterval(refresh, 30_000)` timer so the resolver stays warm even when neither sender-side calls nor LLM tool calls fire for an extended period (e.g. an idle node waiting for inbox). `.unref()` so the timer doesn't keep the process alive on its own. Cheap — one HTTP round-trip every 30 s. Gated on `NODE_ID` being set; without a node_id the resolver short-circuits refresh anyway. Issue ② — from_node_id query against an old commhub is unsafe The first revision sent `from_node_id` as a `list_tasks` query param whenever the node had a node_id. Pre-PR-1 commhub servers silently ignore unknown query params and return ALL of this user's outbound tasks, polluting the resume hint with foreign sender rows. Without a guard, a stable v0.10.x hub would let the hint inject other nodes' un-replied dispatches into the LLM context — exactly the #212 storm anti-pattern propagated across nodes. Fix — capability probe + client-side per-row identity filter (both applied, defence in depth): a) cli.ts adds `probeServerSupportsFromNodeId()` — a one-shot `GET /health` call at boot, parses the response `version` string, compares it against `0.8.6-preview.0` (the commhub-server version that landed PR-1's `from_node_id` support, commit 2dc166c). Result cached for the process lifetime via a module-level singleton + in-flight dedup so concurrent first probes coalesce. The version comparator handles `MAJOR.MINOR.PATCH` and `MAJOR.MINOR.PATCH-preview.N` shapes. Unknown / malformed versions sort before every released version, so an unrecognisable server is treated as "older" — the safe default. b) `fetchUnresolvedOutbound` opts gain `serverSupportsFromNodeId`. When `true` AND the caller has a node_id, send `from_node_id`. Otherwise send `from_name` (the pre-PR-1 path). Never send both. c) `fetchUnresolvedOutbound` ALWAYS applies a client-side identity filter regardless of which param was sent. Each returned row is checked: prefer `row.from_node_id === selfIdentity.nodeId` when the row has a node_id; otherwise demand `row.from_name === selfIdentity.alias`. Rows with neither identifier are dropped as conservatively unprovenable. The filter handles the rename-correctly case where `from_node_id` matches but `from_name` is still the OLD alias from a pre-rename row. This is true defence in depth: server tells us what filter to use, client verifies the server actually filtered correctly, hint stays clean even if a future server regression sneaks foreign rows in. TESTS agent-node/src/runtime/grok-build-acp/resume-hint.test.ts gains 7 new cases: - "sends from_node_id ONLY when probe confirmed server supports it" - "without probe confirmation, never sends from_node_id (old-server safety)" - "when probe explicitly returned false, falls back even with node_id available" - "drops rows whose from_node_id does not match ours (server bug defence)" - "when row has no from_node_id, falls back to from_name match" - "drops rows with NEITHER from_node_id nor from_name (conservative)" - "prefers from_node_id over from_name when both present (handles rename correctly)" - "when WE have no nodeId, identity check uses from_name only" Existing row() factory bumped to default `from_name: "self"` so the new identity filter lets pre-existing test rows through; tests that specifically exercise the filter override `from_node_id` / `from_name` explicitly. Full agent-node bun test src/: 164 pass / 0 fail (was 157 + 7 new). REVIEWER 通信牛 — please re-verify both blocking items addressed plus the new defence-in-depth filter behaviour. Mutual reviewer pair continues; I gated your #224, you gate this one. Re-review turnaround whenever you have a slot, no urgency. REFS - 通信牛 二审 verdict: #228 (comment) - 通信龙 relay + accept both blocking items as substantive: dispatch d4995e5b - Prior PR-4 commit: 65bd54a (squashed in this branch) - PR-1+2 server merge: 2dc166c (provides /health version + from_node_id filter support starting at commhub-server 0.8.6-preview.0) - PR-3 CLI merge: e0aa4d8 (COMMHUB_NODE_ID env propagation)
Iteration
|
|
复核 确认点:
本地验证:
CI 备注:当前 PR 上 非阻塞备注:
结论:按当前 diff,我同意通信龙继续协调 merge。GitHub 身份限制如果不能正式 Approve,就以这条 PASS comment 为准。 |
… type widen The agent-side template generated by `anet node create` taught send_task / send_message / reply / status / get_all_status but said nothing about sharing files. Two agents (副指挥 + A站Grok) hit Vincent's phone with raw server-local paths (`/home/...` / `/tmp/...`) which the APP / mobile browser cannot open — visible regression that fixes itself once the template knows the right shape (upload first, then markdown-reference `/api/files/<file_id>`). agent-network/bin/cli.ts (#229): - Insert "### 给用户发文件/图片" between "查看谁在线" and "收到消息" with: - ❌-then-✅ frame ("don't post local paths; do upload + reference") - `curl -F` recipe using `$COMMHUB_URL` / `$COMMHUB_TOKEN` env (no hardcoded host) - markdown reference shape: `[name](/api/files/<file_id>)` - constraints: PNG/JPG (no SVG), ≤ 12 MiB, env not hardcoded, local paths never in the link target - Append a one-line rule to "## 规则" so it sticks even if the agent skims the doc. Existing nodes: not auto-rewritten (CLAUDE.md skip-if-exists). New nodes via `anet node create` pick it up immediately; existing agents get nudged out-of-band per 通信龙's two ad-hoc corrections. agent-node/src/cli.ts (fetchWithTimeout type widen, 通信牛 non-blocking note): - `fetchWithTimeout`'s param type was `{ from_name: string; limit: number }`, narrower than `ListTasksHook` from resume-hint.ts. Post #146 PR-4 #228 the caller picks `from_node_id` (preferred) or `from_name` (fallback) per server capability — neither call shape would typecheck against the narrow signature. Widened to `{ from_node_id?, from_name?, limit }`. Verified: - agent-network: `bunx tsc --noEmit` clean + `npm run build` 13s clean - agent-node: `bun build src/cli.ts` 98 modules clean + `bun test src/reply-reliability.test.ts` 20/20 pass Refs: #229 (template + SOP), #221 (upload endpoint), #146 PR-4 #228 (node_id capability probe) Author-Agent: 通信工程马
Author
Agent: 通信SDK马 (PR-4 of the rename family fix — per #146 verdict comment
4677392682)Reviewer: 通信牛 (mutual reviewer pair; I gated PR-1+2 #224 → 2dc166c, 通信牛 gates this one)
Summary
Replace the frozen
const ALIASinagent-node/src/cli.tswith a 30 s LRU resolver that asks commhub for the canonical alias bound to this node's immutablenode_id. Every sender-side commhub call (register/reportStatus/sendReply/getInbox/ackMessage/commhub_send_taskMCP factory closure / #213fetchUnresolvedOutbound) now picks up the live alias automatically, so a rename committed on the server propagates to outbound traffic within one cache window instead of needing a node restart.Closes the runtime side of the #146 rename-family unifying-fix. Server PR-1+2 (
2dc166c) and CLI PR-3 (e0aa4d8) are already onmain.Diff at a glance
agent-node/src/runtime/current-alias.tsCurrentAliasResolverwithcurrent()/refresh()/set()agent-node/src/runtime/current-alias.test.tsagent-node/src/runtime/grok-build-acp/resume-hint.tsselfAlias: string→{ nodeId, alias }, prefersfrom_node_idqueryagent-node/src/runtime/grok-build-acp/resume-hint.test.tsfrom_node_idpathagent-node/src/commhub-mcp.tscreateCommhubSdkMcpServeracceptsstring | (() => string)agent-node/src/cli.tsNODE_IDreadsCOMMHUB_NODE_IDenv firstTests
Public hub URLs in code samples / comments use
127.0.0.1:9200(local dev) or<hub-domain>placeholder per the new safety constraint (Vincent tg 752, playbook §7).Design notes
cachedAt = 0sentinel for "never fetched" so the firstrefresh()actually hits the server even when called at logical time 0.fetchInFlightpromise so a burst of 8 sender-side calls just past expiry causes exactly one round-trip not eight.fetchCanonicalAliasthrowing keeps the cached value, emits awarnline, and bumps the timestamp so we don't hammer a sick hub on every subsequent call. Retry happens after the next TTL expires.onDrifthook logs[alias-drift] OLD → NEW (source: fetch|snapshot)so operators see the transition in node logs.ALIASconst stays for log lines (continuity across restart-traces) and file paths (which don't track rename — bound at create-time, not server-rename-time). Only routing-key uses migrate to the resolver.Boundary with PR-1+2 + PR-3 (already on main)
2dc166c): tasks.from_node_id column +list_tasksMCP toolfrom_node_idparam + send_message canonical resolution +rename-committedeventBus. Without these the from_node_id query this PR issues would silently return empty.e0aa4d8):COMMHUB_NODE_IDenv propagation from launcher into every node-launched process. Without this, NODE_ID source would be limited to the node'sconfig.jsonwhich can go stale.Backward compatibility
Behaviour degrades gracefully one layer at a time:
from_node_idfilter →list_tasksignores unknown param, returns all tasks, client-side filter still worksCOMMHUB_NODE_IDenv → NODE_ID falls back tofileConfig.node_idfetchUnresolvedOutboundfalls back to thefrom_namepathcreateCommhubSdkMcpServerstill accepts a static string (legacy callers unaffected)Refs