Remote comms: Error pattern analysis and permanent failure detection #688

@sirtimid

Problem

isRetryableNetworkError treats every network error as retryable, so the reconnection logic cannot distinguish a peer that is not running right now from one that will never run again or one whose address is wrong. As a result, retries are wasted on permanently unreachable peers.

Expected Behavior

  • Track error patterns per peer over time (error codes, frequency, success rate)
  • Classify persistent failures as permanently non-retryable after threshold
  • Stop retrying when pattern indicates permanent failure (wrong address, dead peer)
  • Continue retrying for transient failures (temporary network issues)
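
To make the first bullet concrete, here is a minimal sketch of per-peer tracking over a sliding time window. Everything in it is an assumption for illustration: `PeerErrorHistory`, `AttemptEvent`, the 60-second default window, and the method names are hypothetical and do not come from the existing `ReconnectionManager`.

```typescript
// Hypothetical shapes; none of these names come from the existing codebase.
type AttemptEvent = {
  timestamp: number; // milliseconds since epoch
  errorCode: string | null; // null marks a successful attempt
};

class PeerErrorHistory {
  private readonly events = new Map<string, AttemptEvent[]>();

  // windowMs is an assumed, configurable look-back window.
  constructor(private readonly windowMs: number = 60000) {}

  recordError(peerId: string, errorCode: string, now: number = Date.now()): void {
    this.push(peerId, { timestamp: now, errorCode });
  }

  recordSuccess(peerId: string, now: number = Date.now()): void {
    this.push(peerId, { timestamp: now, errorCode: null });
  }

  // Fraction of attempts inside the window that succeeded,
  // or undefined when the window holds no attempts at all.
  successRate(peerId: string, now: number = Date.now()): number | undefined {
    const recent = this.recent(peerId, now);
    if (recent.length === 0) {
      return undefined;
    }
    const successes = recent.filter((e) => e.errorCode === null).length;
    return successes / recent.length;
  }

  // How often each error code occurred inside the window.
  errorCounts(peerId: string, now: number = Date.now()): Record<string, number> {
    const counts: Record<string, number> = {};
    this.recent(peerId, now).forEach((e) => {
      if (e.errorCode !== null) {
        counts[e.errorCode] = (counts[e.errorCode] ?? 0) + 1;
      }
    });
    return counts;
  }

  // Append an event and drop anything that has aged out of the window.
  private push(peerId: string, event: AttemptEvent): void {
    const kept = this.recent(peerId, event.timestamp);
    kept.push(event);
    this.events.set(peerId, kept);
  }

  private recent(peerId: string, now: number): AttemptEvent[] {
    const cutoff = now - this.windowMs;
    return (this.events.get(peerId) ?? []).filter((e) => e.timestamp >= cutoff);
  }
}

// Usage with explicit timestamps so the result does not depend on the clock.
const history = new PeerErrorHistory(10000);
history.recordError('peer-a', 'ECONNREFUSED', 1000);
history.recordError('peer-a', 'ECONNREFUSED', 2000);
history.recordSuccess('peer-a', 3000);
```

Passing `now` explicitly keeps the sketch easy to unit test; a real implementation would probably also cap the number of stored events per peer.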

Implementation

Files to Modify

| File | Changes |
| --- | --- |
| platform/reconnection.ts | Add error history tracking to ReconnectionManager |
| platform/reconnection-lifecycle.ts | Check for permanent failure before attempting reconnection |
| @metamask/kernel-errors | Update isRetryableNetworkError or add isPermanentlyFailed check |

Approach

  1. Add error tracking to ReconnectionManager (platform/reconnection.ts)

    • Track error history per peer (error codes, timestamps)
    • Track consecutive identical errors
    • Track success rate over time window
  2. Implement heuristics for permanent failure detection

    • Same error code N times consecutively without success = permanent
    • Specific error patterns: persistent ECONNREFUSED, EHOSTUNREACH, DNS failures
    • Configurable thresholds
  3. Add permanent failure state

    • New state in ReconnectionManager, queried via isPermanentlyFailed(peerId)
    • Clear permanent failure on explicit reconnect request
  4. Integrate with reconnection lifecycle (platform/reconnection-lifecycle.ts)

    • Check isPermanentlyFailed before attempting reconnection
    • Call onRemoteGiveUp when permanent failure detected
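
Steps 2 to 4 above could be sketched roughly as follows. This is an illustration under stated assumptions, not the intended implementation: `PermanentFailureDetector`, `FailureOptions`, `clearPermanentFailure`, `shouldAttemptReconnect`, and the default threshold of 5 are hypothetical. The sketch also assumes DNS failures surface with the Node.js code `ENOTFOUND`, and that `onRemoteGiveUp` accepts a peer ID; the issue names the callback but not its signature.

```typescript
// Hypothetical sketch; the real ReconnectionManager API may look different.
type FailureOptions = {
  consecutiveErrorThreshold: number; // the "N" in the heuristic
  permanentCandidateCodes: string[]; // only these codes may become permanent
};

const DEFAULT_OPTIONS: FailureOptions = {
  consecutiveErrorThreshold: 5, // assumed default, meant to be configurable
  permanentCandidateCodes: ['ECONNREFUSED', 'EHOSTUNREACH', 'ENOTFOUND'],
};

class PermanentFailureDetector {
  private readonly streaks = new Map<string, { code: string; count: number }>();
  private readonly failed = new Set<string>();

  constructor(private readonly options: FailureOptions = DEFAULT_OPTIONS) {}

  // Extend the streak when the code repeats, otherwise start a new one.
  recordError(peerId: string, code: string): void {
    const prev = this.streaks.get(peerId);
    const count = prev !== undefined && prev.code === code ? prev.count + 1 : 1;
    this.streaks.set(peerId, { code, count });

    const isCandidate = this.options.permanentCandidateCodes.indexOf(code) !== -1;
    if (isCandidate && count >= this.options.consecutiveErrorThreshold) {
      this.failed.add(peerId);
    }
  }

  // Any success breaks the streak and lifts the permanent-failure mark.
  recordSuccess(peerId: string): void {
    this.streaks.delete(peerId);
    this.failed.delete(peerId);
  }

  isPermanentlyFailed(peerId: string): boolean {
    return this.failed.has(peerId);
  }

  // Intended to be called on an explicit (manual) reconnect request.
  clearPermanentFailure(peerId: string): void {
    this.streaks.delete(peerId);
    this.failed.delete(peerId);
  }
}

// Lifecycle guard: report give-up instead of retrying a permanently failed peer.
function shouldAttemptReconnect(
  detector: PermanentFailureDetector,
  peerId: string,
  onRemoteGiveUp: (peerId: string) => void, // assumed signature
): boolean {
  if (detector.isPermanentlyFailed(peerId)) {
    onRemoteGiveUp(peerId);
    return false;
  }
  return true;
}

// Usage: three identical ECONNREFUSED errors trip a threshold of 3.
const detector = new PermanentFailureDetector({
  consecutiveErrorThreshold: 3,
  permanentCandidateCodes: ['ECONNREFUSED', 'EHOSTUNREACH', 'ENOTFOUND'],
});
const givenUp: string[] = [];
for (let i = 0; i < 3; i += 1) {
  detector.recordError('peer-a', 'ECONNREFUSED');
}
const retryPeerA = shouldAttemptReconnect(detector, 'peer-a', (id) => givenUp.push(id));
```

Restricting the permanent classification to an allow-list of codes is one possible design choice: an error outside the list, such as a timeout, keeps retrying no matter how often it repeats. Whether that matches the intended heuristic is left open by the issue.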

Acceptance Criteria

  • Error patterns tracked per peer in ReconnectionManager
  • Persistent failures classified as permanent after threshold
  • Permanent failures stop retry attempts
  • Transient failures continue to retry normally
  • Permanent failure state can be cleared for manual reconnection
  • Unit tests verify pattern detection and permanent failure classification
  • E2E test for permanent failure scenario
