From a5edae6e4618617aefd4ae51234b9cfdc1424261 Mon Sep 17 00:00:00 2001 From: "SYM.BOT" Date: Wed, 29 Apr 2026 17:45:57 +0100 Subject: [PATCH 1/2] 0.5.3: TCP keepalive + lastSeen-aware stale detection MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit v0.5.2 added stale-prior detection via the `_closed` flag. That fired when transport.close() had been called explicitly, but missed the common dead-but-ESTABLISHED case: a peer process is killed, its OS sends FIN on its sockets, but on the survivor side the TCP socket sits in ESTABLISHED state until OS keepalive reaps it. On Wi-Fi this isn't very visible — mobile/Wi-Fi TCP routes are noisy (route flaps, ARP timeouts, AP transitions, carrier hand-offs) and keepalive idle defaults are short. Stale sockets die in seconds. On loopback (two processes on the same host) it's a hard block. macOS default TCP_KEEPALIVE = 7200s (2 hours) before the first probe. A killed-and-relaunched same-host peer stays "ESTABLISHED" on the survivor for 2h, and the dedup logic against this zombie entry rejects every redial. Observed: Mac Catalyst MeloMove ↔ claude-code-mac (Node) on the same Mac. Mac restart for rebuild → claude-code-mac retains dead ESTABLISHED socket → new Mac dial rejected forever. iPhone↔ claude-code-mac on Wi-Fi works because Wi-Fi noise reaps stale sockets quickly. Three changes: 1. TcpTransport sets socket.setKeepAlive(true, 1000) — 1s initial idle delay, OS-default probe cadence after. macOS detects dead remote within ~10s instead of ~2h. 2. inbound-connection handler and _createPeer now treat a peer entry with lastSeen older than _heartbeatInterval (default 10s) as stale, regardless of the _closed flag. The remote re-dialling is itself strong evidence the prior is dead — it wouldn't dial again if it had a healthy connection. Catches the gap before keepalive fires + any future dead-ESTABLISHED scenarios. 3. Identity-aware close handlers. When a stale prior is closed and replaced, its eventual close handler must NOT clobber the new transport entry. Both close handlers (existingPeer branch and new-peer branch) now check `transports.get(source) === transport` before mutating the dict. 150/150 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) --- lib/node.js | 57 ++++++++++++++++++++++++++++++++++++++++++++---- lib/transport.js | 10 +++++++++ package.json | 2 +- 3 files changed, 64 insertions(+), 5 deletions(-) diff --git a/lib/node.js b/lib/node.js index fd5d604..75b68d6 100644 --- a/lib/node.js +++ b/lib/node.js @@ -535,7 +535,24 @@ class SymNode extends EventEmitter { const existingPeer = this._peers.get(peerId); if (existingPeer && existingPeer.transports.has('bonjour')) { const prev = existingPeer.transports.get('bonjour'); - if (prev && !prev._closed) { + // Stale-prior detection. Two stale signals — either is sufficient + // to treat the prior as dead and let the new connection replace it: + // * `_closed` set: socket has explicitly been closed. + // * `lastSeen` older than one heartbeat interval: socket may be + // ESTABLISHED-but-half-open (peer process killed without FIN + // reaching us). Loopback in particular can keep dead sockets in + // ESTABLISHED state for ~2h before macOS keepalive reaps them, + // blocking every redial against the zombie entry. The remote + // re-dialling is itself strong evidence the prior is dead — it + // wouldn't dial again if it had a healthy connection. Bound by + // `_heartbeatInterval` (default 10s) — short enough that a + // restarting peer reconnects within seconds, long enough that + // a momentary lull during initial handshake doesn't trip it. + const staleByFlag = !prev || prev._closed; + const staleByLastSeen = existingPeer.lastSeen + && (Date.now() - existingPeer.lastSeen) > this._heartbeatInterval; + const stalePrior = staleByFlag || staleByLastSeen; + if (!stalePrior) { // Determine prior direction. The peer's `isOutbound` field reflects // the direction at first transport creation; for a single-transport // bonjour peer (the common case) this matches the live transport. @@ -561,8 +578,13 @@ class SymNode extends EventEmitter { // existingPeer.transports; _createPeer below will then add the new // transport without the "same source already open" guard rejecting it. prev.close(); + } else if (prev && !prev._closed) { + // Stale by lastSeen but flag not yet set — close it explicitly so + // its close handler runs and removes the dict entry before we add + // the new transport. + prev.close(); } - // else: (a) stale prior — fall through to accept + // else: stale-by-flag prior — fall through to accept } if (handshakeMsg.e2ePublicKey) { this._deriveAndStoreSecret(peerId, handshakeMsg.e2ePublicKey); @@ -1167,13 +1189,21 @@ class SymNode extends EventEmitter { // Section 4.6: add as secondary transport (different source — e.g. relay // on top of bonjour). For the SAME source, three cases: // - // (a) stale prior (_closed set) — replace with new + // (a) stale prior — replace with new. Stale = either `_closed` flag + // set OR `lastSeen` older than heartbeat interval (the second + // catches dead-but-ESTABLISHED loopback sockets that the OS + // hasn't yet reaped via keepalive — see inbound-connection + // handler comment for the long-form rationale). // (b) same-direction duplicate (prior and new share isOutbound) — // keep prior alive, drop new // (c) dual-dial — opposite directions. Tie-break by nodeId: lower // nodeId acts as client and keeps outbound; higher keeps inbound const existingTransport = existingPeer.transports.get(source); - if (existingTransport && !existingTransport._closed) { + const staleByFlag = !existingTransport || existingTransport._closed; + const staleByLastSeen = existingPeer.lastSeen + && (Date.now() - existingPeer.lastSeen) > this._heartbeatInterval; + const stalePrior = staleByFlag || staleByLastSeen; + if (existingTransport && !stalePrior) { const prevIsOutbound = !!existingPeer.isOutbound; const isDualDial = prevIsOutbound !== isOutbound; if (!isDualDial) { @@ -1193,11 +1223,25 @@ class SymNode extends EventEmitter { // removes the source entry, then fall through to set the new one. existingTransport.close(); existingPeer.isOutbound = isOutbound; + } else if (existingTransport && stalePrior && !existingTransport._closed) { + // Stale prior detected by lastSeen-age. Close it explicitly so its + // close handler can't fire later and `transports.delete(source)` away + // the new transport we're about to register. (If the close handler + // races and runs after the .set() below, it would silently revert + // this peer to a transport-less state.) + existingTransport.close(); + existingPeer.isOutbound = isOutbound; } existingPeer.transports.set(source, transport); this._log(`Transport added for ${peerName}: ${source} (${existingPeer.transports.size} transports)`); + // Identity-aware close handler. A previously-closed transport for the + // same source has its own pending close handler that, if it fires after + // we replace the entry here, would `transports.delete(source)` away + // this fresh transport. Guard by checking the current entry IS this + // transport before mutating. transport.on('close', () => { + if (existingPeer.transports.get(source) !== transport) return; existingPeer.transports.delete(source); this._log(`Transport closed for ${peerName}: ${source} (${existingPeer.transports.size} remaining)`); @@ -1225,6 +1269,11 @@ class SymNode extends EventEmitter { const peer = { transport, transports, peerId, name: peerName, isOutbound, source, lastSeen: Date.now() }; transport.on('close', () => { + // Identity-aware: only delete if this transport is still the registered + // one. A later replacement (dual-dial / stale-prior swap) may have set + // a different transport at this source key; a stale close here must + // not clobber it. (Same guard as the existingPeer branch above.) + if (transports.get(source) !== transport) return; transports.delete(source); this._log(`Transport closed for ${peerName}: ${source} (${transports.size} remaining)`); diff --git a/lib/transport.js b/lib/transport.js index c08b89b..1e35928 100644 --- a/lib/transport.js +++ b/lib/transport.js @@ -32,6 +32,16 @@ class TcpTransport extends EventEmitter { this._parser = new FrameParser(); this._closed = false; + // TCP keepalive: detect half-open connections (peer process killed + // without graceful FIN) faster than the OS default (~2h on macOS). + // Without this, a peer that crashes leaves us with an ESTABLISHED + // socket that survives long enough to block legitimate redials — + // dedup logic against the stale entry rejects the live new connection. + // 1s initial delay then OS-default probe cadence (typically detects + // dead remote within ~10s on macOS, faster than waiting for the + // application heartbeat timeout). + try { socket.setKeepAlive(true, 1000); } catch {} + socket.on('data', (chunk) => this._parser.feed(chunk)); this._parser.on('message', (msg) => this.emit('message', msg)); this._parser.on('error', (err) => this.emit('error', err)); diff --git a/package.json b/package.json index ed85ca6..af0f76f 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "@sym-bot/sym", - "version": "0.5.2", + "version": "0.5.3", "description": "Infrastructure and protocol for multi-agent collective intelligence", "main": "lib/node.js", "bin": { From 8a682168a0799ca3670a05a212b431a5eccdf997 Mon Sep 17 00:00:00 2001 From: "SYM.BOT" Date: Wed, 29 Apr 2026 17:46:36 +0100 Subject: [PATCH 2/2] docs: changelog for 0.5.3 (TCP keepalive + lastSeen stale detection) --- CHANGELOG.md | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 52 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index ac89a8e..542df1d 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,58 @@ > **Note:** Versions 0.3.26 – 0.3.55 were released as git tags without changelog entries. Changelog resumes at 0.3.56 below. +## 0.5.3 + +### Fixed + +- **Same-host loopback peers stayed permanently rejected after one peer + restarted.** Companion fix to 0.5.2's same-host dedup. v0.5.2's stale-prior + check looked only at the transport's `_closed` flag — set when + `transport.close()` had been called explicitly. But the common + dead-but-ESTABLISHED case (peer process killed; OS doesn't deliver FIN to + the survivor before macOS keepalive reaps it) leaves `_closed=false` + forever. On loopback this is a hard block — macOS default TCP_KEEPALIVE is + 7200 seconds (2 hours) before the first probe. The survivor sees the dead + socket as alive, and the dedup logic against this zombie entry rejects + every redial from the restarted peer. + + On Wi-Fi the same logical bug is much less visible — mobile TCP routes are + noisy (route flaps, ARP timeouts, AP transitions) and keepalive idle + defaults are short, so stale sockets die in seconds. On loopback there's + zero noise; the dead socket sits in ESTABLISHED indefinitely. + + Observed: Mac Catalyst MeloMove ↔ claude-code-mac (Node) on the same Mac. + Each Mac MeloMove rebuild → claude-code-mac retains a dead ESTABLISHED + socket → new Mac MeloMove's redial is rejected for 2h. iPhone ↔ + claude-code-mac on Wi-Fi recovers within seconds because Wi-Fi noise + reaps stale sockets quickly. + + Three-part fix: + + 1. **`TcpTransport` enables TCP keepalive on the socket** — + `socket.setKeepAlive(true, 1000)`. 1-second initial idle delay before + OS keepalive probes start, then OS-default probe cadence. macOS detects + dead remote in ~10s instead of ~2h. + + 2. **`inbound-connection` handler and `_createPeer` now treat stale-by- + `lastSeen` as stale.** A prior peer entry whose `lastSeen` is older + than `_heartbeatInterval` (default 10s) is now considered stale + regardless of the `_closed` flag. The remote re-dialling is itself + strong evidence its prior is dead — a healthy peer wouldn't dial + again. Closes the dead prior explicitly so its close handler runs and + removes the dict entry before the new transport is registered. + + 3. **Identity-aware close handlers.** When a stale prior is closed and + replaced, its eventual close handler must NOT clobber the new transport + entry. Both close handlers in `_createPeer` now guard with + `transports.get(source) === transport` before mutating the transports + dict. Prevents a late-firing close from a swapped-out prior tearing + down its replacement. + + Affects all peers running on the same host as another sym instance, and + any peer-restart scenario where the peer's TCP socket on the survivor + side stays in ESTABLISHED state past the OS-level FIN. + ## 0.5.2 ### Fixed