Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
52 changes: 52 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,58 @@

> **Note:** Versions 0.3.26 – 0.3.55 were released as git tags without changelog entries. Changelog resumes at 0.3.56 below.

## 0.5.3

### Fixed

- **Same-host loopback peers stayed permanently rejected after one peer
restarted.** Companion fix to 0.5.2's same-host dedup. v0.5.2's stale-prior
check looked only at the transport's `_closed` flag — set when
`transport.close()` had been called explicitly. But the common
dead-but-ESTABLISHED case (peer process killed; OS doesn't deliver FIN to
the survivor before macOS keepalive reaps it) leaves `_closed=false`
forever. On loopback this is a hard block — macOS default TCP_KEEPALIVE is
7200 seconds (2 hours) before the first probe. The survivor sees the dead
socket as alive, and the dedup logic against this zombie entry rejects
every redial from the restarted peer.

On Wi-Fi the same logical bug is much less visible — mobile TCP routes are
noisy (route flaps, ARP timeouts, AP transitions) and keepalive idle
defaults are short, so stale sockets die in seconds. On loopback there's
zero noise; the dead socket sits in ESTABLISHED indefinitely.

Observed: Mac Catalyst MeloMove ↔ claude-code-mac (Node) on the same Mac.
Each Mac MeloMove rebuild → claude-code-mac retains a dead ESTABLISHED
socket → new Mac MeloMove's redial is rejected for 2h. iPhone ↔
claude-code-mac on Wi-Fi recovers within seconds because Wi-Fi noise
reaps stale sockets quickly.

Three-part fix:

1. **`TcpTransport` enables TCP keepalive on the socket** —
`socket.setKeepAlive(true, 1000)`. 1-second initial idle delay before
OS keepalive probes start, then OS-default probe cadence. macOS detects
dead remote in ~10s instead of ~2h.

2. **`inbound-connection` handler and `_createPeer` now treat stale-by-
`lastSeen` as stale.** A prior peer entry whose `lastSeen` is older
than `_heartbeatInterval` (default 10s) is now considered stale
regardless of the `_closed` flag. The remote re-dialling is itself
strong evidence its prior is dead — a healthy peer wouldn't dial
again. Closes the dead prior explicitly so its close handler runs and
removes the dict entry before the new transport is registered.

3. **Identity-aware close handlers.** When a stale prior is closed and
replaced, its eventual close handler must NOT clobber the new transport
entry. Both close handlers in `_createPeer` now guard with
`transports.get(source) === transport` before mutating the transports
dict. Prevents a late-firing close from a swapped-out prior tearing
down its replacement.

Affects all peers running on the same host as another sym instance, and
any peer-restart scenario where the peer's TCP socket on the survivor
side stays in ESTABLISHED state past the OS-level FIN.

## 0.5.2

### Fixed
Expand Down
57 changes: 53 additions & 4 deletions lib/node.js
Original file line number Diff line number Diff line change
Expand Up @@ -535,7 +535,24 @@ class SymNode extends EventEmitter {
const existingPeer = this._peers.get(peerId);
if (existingPeer && existingPeer.transports.has('bonjour')) {
const prev = existingPeer.transports.get('bonjour');
if (prev && !prev._closed) {
// Stale-prior detection. Two stale signals — either is sufficient
// to treat the prior as dead and let the new connection replace it:
// * `_closed` set: socket has explicitly been closed.
// * `lastSeen` older than one heartbeat interval: socket may be
// ESTABLISHED-but-half-open (peer process killed without FIN
// reaching us). Loopback in particular can keep dead sockets in
// ESTABLISHED state for ~2h before macOS keepalive reaps them,
// blocking every redial against the zombie entry. The remote
// re-dialling is itself strong evidence the prior is dead — it
// wouldn't dial again if it had a healthy connection. Bound by
// `_heartbeatInterval` (default 10s) — short enough that a
// restarting peer reconnects within seconds, long enough that
// a momentary lull during initial handshake doesn't trip it.
const staleByFlag = !prev || prev._closed;
const staleByLastSeen = existingPeer.lastSeen
&& (Date.now() - existingPeer.lastSeen) > this._heartbeatInterval;
const stalePrior = staleByFlag || staleByLastSeen;
if (!stalePrior) {
// Determine prior direction. The peer's `isOutbound` field reflects
// the direction at first transport creation; for a single-transport
// bonjour peer (the common case) this matches the live transport.
Expand All @@ -561,8 +578,13 @@ class SymNode extends EventEmitter {
// existingPeer.transports; _createPeer below will then add the new
// transport without the "same source already open" guard rejecting it.
prev.close();
} else if (prev && !prev._closed) {
// Stale by lastSeen but flag not yet set — close it explicitly so
// its close handler runs and removes the dict entry before we add
// the new transport.
prev.close();
}
// else: (a) stale prior — fall through to accept
// else: stale-by-flag prior — fall through to accept
}
if (handshakeMsg.e2ePublicKey) {
this._deriveAndStoreSecret(peerId, handshakeMsg.e2ePublicKey);
Expand Down Expand Up @@ -1167,13 +1189,21 @@ class SymNode extends EventEmitter {
// Section 4.6: add as secondary transport (different source — e.g. relay
// on top of bonjour). For the SAME source, three cases:
//
// (a) stale prior (_closed set) — replace with new
// (a) stale prior — replace with new. Stale = either `_closed` flag
// set OR `lastSeen` older than heartbeat interval (the second
// catches dead-but-ESTABLISHED loopback sockets that the OS
// hasn't yet reaped via keepalive — see inbound-connection
// handler comment for the long-form rationale).
// (b) same-direction duplicate (prior and new share isOutbound) —
// keep prior alive, drop new
// (c) dual-dial — opposite directions. Tie-break by nodeId: lower
// nodeId acts as client and keeps outbound; higher keeps inbound
const existingTransport = existingPeer.transports.get(source);
if (existingTransport && !existingTransport._closed) {
const staleByFlag = !existingTransport || existingTransport._closed;
const staleByLastSeen = existingPeer.lastSeen
&& (Date.now() - existingPeer.lastSeen) > this._heartbeatInterval;
const stalePrior = staleByFlag || staleByLastSeen;
if (existingTransport && !stalePrior) {
const prevIsOutbound = !!existingPeer.isOutbound;
const isDualDial = prevIsOutbound !== isOutbound;
if (!isDualDial) {
Expand All @@ -1193,11 +1223,25 @@ class SymNode extends EventEmitter {
// removes the source entry, then fall through to set the new one.
existingTransport.close();
existingPeer.isOutbound = isOutbound;
} else if (existingTransport && stalePrior && !existingTransport._closed) {
// Stale prior detected by lastSeen-age. Close it explicitly so its
// close handler can't fire later and `transports.delete(source)` away
// the new transport we're about to register. (If the close handler
// races and runs after the .set() below, it would silently revert
// this peer to a transport-less state.)
existingTransport.close();
existingPeer.isOutbound = isOutbound;
}
existingPeer.transports.set(source, transport);
this._log(`Transport added for ${peerName}: ${source} (${existingPeer.transports.size} transports)`);

// Identity-aware close handler. A previously-closed transport for the
// same source has its own pending close handler that, if it fires after
// we replace the entry here, would `transports.delete(source)` away
// this fresh transport. Guard by checking the current entry IS this
// transport before mutating.
transport.on('close', () => {
if (existingPeer.transports.get(source) !== transport) return;
existingPeer.transports.delete(source);
this._log(`Transport closed for ${peerName}: ${source} (${existingPeer.transports.size} remaining)`);

Expand Down Expand Up @@ -1225,6 +1269,11 @@ class SymNode extends EventEmitter {
const peer = { transport, transports, peerId, name: peerName, isOutbound, source, lastSeen: Date.now() };

transport.on('close', () => {
// Identity-aware: only delete if this transport is still the registered
// one. A later replacement (dual-dial / stale-prior swap) may have set
// a different transport at this source key; a stale close here must
// not clobber it. (Same guard as the existingPeer branch above.)
if (transports.get(source) !== transport) return;
transports.delete(source);
this._log(`Transport closed for ${peerName}: ${source} (${transports.size} remaining)`);

Expand Down
10 changes: 10 additions & 0 deletions lib/transport.js
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,16 @@ class TcpTransport extends EventEmitter {
this._parser = new FrameParser();
this._closed = false;

// TCP keepalive: detect half-open connections (peer process killed
// without graceful FIN) faster than the OS default (~2h on macOS).
// Without this, a peer that crashes leaves us with an ESTABLISHED
// socket that survives long enough to block legitimate redials —
// dedup logic against the stale entry rejects the live new connection.
// 1s initial delay then OS-default probe cadence (typically detects
// dead remote within ~10s on macOS, faster than waiting for the
// application heartbeat timeout).
try { socket.setKeepAlive(true, 1000); } catch {}

socket.on('data', (chunk) => this._parser.feed(chunk));
this._parser.on('message', (msg) => this.emit('message', msg));
this._parser.on('error', (err) => this.emit('error', err));
Expand Down
2 changes: 1 addition & 1 deletion package.json
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
{
"name": "@sym-bot/sym",
"version": "0.5.2",
"version": "0.5.3",
"description": "Infrastructure and protocol for multi-agent collective intelligence",
"main": "lib/node.js",
"bin": {
Expand Down