From 7fd0f580095f48e5bbaf4e848475e404072d73a0 Mon Sep 17 00:00:00 2001 From: Reiase Date: Fri, 13 Feb 2026 15:16:52 +0800 Subject: [PATCH] Add Communication Evolution documentation and update mkdocs configuration - Introduced a new document detailing the evolution of in-cluster communication technologies, covering MPI, ZMQ, RPC, and Actors, with a focus on their complexities and applications in modern AI infrastructure. - Added a Chinese version of the communication evolution document to enhance accessibility for non-English speakers. - Updated the mkdocs configuration to include the new documentation in the navigation structure, ensuring it is easily discoverable for users. --- docs/mkdocs.yml | 2 + .../design/cluster-communication-evolution.md | 448 +++++++++++++++++ .../cluster-communication-evolution.zh.md | 450 ++++++++++++++++++ 3 files changed, 900 insertions(+) create mode 100644 docs/src/design/cluster-communication-evolution.md create mode 100644 docs/src/design/cluster-communication-evolution.zh.md diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index 4b54b8c5e..d46a09fc5 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -118,6 +118,7 @@ plugins: HTTP2 Transport: HTTP2 传输 Load Sync: 负载同步 AS Actor Decorator: AS Actor 装饰器 + Communication Evolution: 集群内通信技术演进 - mkdocstrings: handlers: python: @@ -174,6 +175,7 @@ nav: - HTTP2 Transport: design/http2-transport.md - Load Sync: design/load_sync.md - AS Actor Decorator: design/as-actor-decorator.md + - Communication Evolution: design/cluster-communication-evolution.md extra: generator: false diff --git a/docs/src/design/cluster-communication-evolution.md b/docs/src/design/cluster-communication-evolution.md new file mode 100644 index 000000000..4a2454b72 --- /dev/null +++ b/docs/src/design/cluster-communication-evolution.md @@ -0,0 +1,448 @@ +# Evolution of In-Cluster Communication: From MPI to Modern Actors + +When building distributed systems—especially modern AI infrastructure—one of the earliest architectural decisions is: **how should components talk to each other?** + +Communication paradigms shape a system’s DNA. This article reviews four generations of in-cluster communication—MPI, ZMQ, RPC, and Actors—focusing on what each generation tried to solve, and what it *cost* in terms of complexity and constraints. + +Finally, we explain why the **Actor model** is experiencing a modern renaissance at the intersection of the Rust ecosystem and AI orchestration—and why it forms the foundation of the Pulsing framework. + +--- + +## Overview: The Law of Conservation of Complexity + +The core challenge of distributed systems is **uncertainty** (latency, failures, reordering). The evolution of communication technology is essentially a shift of responsibility: **who absorbs and manages that uncertainty**. + +```mermaid +graph LR + A["MPI: Eliminate uncertainty
(Assume a deterministic world)"] --> B["ZMQ: Expose uncertainty
(Tools are given; you handle it)"] + B --> C["RPC: Mask uncertainty
(Pretend the network doesn't exist)"] + C --> D["Actor: Internalize uncertainty
(Make it part of the programming model)"] +``` + +| Dimension | MPI | ZMQ | RPC | Actor | +|------|-----|-----|-----|-------| +| **Core metaphor** | Army (marching in sync) | Walkie-talkie (free channels) | Phone call (1-to-1) | Mail system (async delivery) | +| **Control plane** | Static (fixed at launch) | Manual (built by developers) | Bolt-on (Service Mesh) | Built-in (Gossip/Supervision) | +| **State management** | Tightly coupled (SPMD) | None | Externalized (Redis/DB) | Memory-resident (stateful) | +| **Transport base** | TCP/RDMA (long-lived) | TCP (socket abstraction) | TCP/HTTP/2/QUIC (implementation-dependent; e.g. gRPC = HTTP/2) | **HTTP/2 (multiplexing)** | +| **Fault philosophy** | Crash-stop | Dev-dependent | Retry + circuit-break | Let it crash + restart | +| **Best fit** | Data plane (tensor sync) | Transport plumbing | Business plane (CRUD services) | Control plane (complex orchestration) | + +--- + +## Generation 1: MPI — Extreme Performance in a Static World + +### The monarch of rigid synchronization + +MPI (Message Passing Interface) was born in HPC. Its worldview is **static and perfect**: all nodes are ready at startup, the network is reliable, and compute loads are balanced. + +Under the BSP (Bulk Synchronous Parallel) model, MPI abstracts communication as **collectives**—all participants perform the same communication step at the same time: + +```text + Compute phase Barrier Communication phase +Rank 0: [██████████] ───── ▐ ──────────▶ AllReduce +Rank 1: [████████] ───── ▐ ──────────▶ AllReduce +Rank 2: [████████████] ──── ▐ ──────────▶ AllReduce +Rank 3: [████████] ───── ▐ ──────────▶ AllReduce + ▲ + Barrel effect: the slowest rank dictates the pace +``` + +This “everyone must arrive” rigidity is both the source of MPI’s peak performance and the root of its limitations. + +### Why MPI remains irreplaceable + +In the **data plane** of AI training—e.g. gradient synchronization via AllReduce—MPI (and its GPU-specialized descendant NCCL) still dominates. + +The reason is the optimization space enabled by **predefined patterns**: +- Topology is known (Ring, Tree, Recursive Halving-Doubling), enabling precise routing for hardware (NVLink, InfiniBand, PCIe). +- Buffer sizes are known, enabling DMA zero-copy and pipelined overlap (compute/comm overlap). +- Participant sets are fixed, enabling scheduling and tuning around a stable world. + +In short, MPI’s power comes from **maximally exploiting determinism**. + +### Limitation: when determinism crumbles + +As AI systems evolve beyond pure data parallelism, MPI’s “determinism assumptions” often stop holding: + +| Scenario | MPI assumption | Reality | +|------|------------|------| +| Pipeline parallelism | stages are homogeneous | stage compute differs; irregular point-to-point is needed | +| Inference serving | uniform load | request arrival is stochastic; dynamic load balancing is needed | +| Agent collaboration | fixed topology | collaboration relationships change at runtime | +| Elastic training | fixed node count | nodes may fail; scaling is required | + +More critically, MPI’s **fault tolerance is close to zero**—one rank failure often forces communicator rebuild. On thousand-GPU clusters, failures are not “if”, but “how often”. + +> MPI answers “how to communicate efficiently in a deterministic world”, but struggles with “what to do when the world is uncertain”. 
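
To make the collective pattern concrete, here is a minimal data-parallel sketch in the MPI style (a hedged illustration, assuming `mpi4py` and an MPI launcher such as `mpirun -n 4` are available; it is not Pulsing code):

```python
# Illustrative only: the BSP / collective pattern described above, not Pulsing code.
# Assumed setup: run with an MPI launcher, e.g. `mpirun -n 4 python allreduce_sketch.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank computes its own "gradient" shard...
local_grad = np.full(4, float(rank))

# ...then every rank must enter the same collective at the same time.
# A single slow or dead rank stalls everyone at this line (the barrel effect).
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)

if rank == 0:
    print("summed gradient:", global_grad)
```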
+ +--- + +## Generation 2: ZMQ — Freedom, with Sharp Edges + +### From collectives to point-to-point + +MPI’s core limitation is that it assumes everyone participates in each step. When you only need “A sends a message to B”, collectives are too heavy. + +ZeroMQ (ØMQ) emerged as “sockets with batteries”, offering flexible async messaging primitives for arbitrary topologies: + +```text +MPI world (regular): ZMQ world (free-form): + + 0 ←──AllReduce──▶ 1 0 ──push──▶ 1 + ▲ ▲ │ │ + │ AllReduce │ ▼ ▼ + ▼ ▼ 2 ◀──req──── 3 + 2 ←──AllReduce──▶ 3 pub + │ + Everyone must participate 4 ◀──sub── 5 + any connection, any time +``` + +REQ/REP, PUB/SUB, PUSH/PULL, DEALER/ROUTER, etc. make it easy to wire diverse patterns, with very high performance via async I/O and multiplexing. + +### Problem: mechanism without policy + +ZMQ intentionally stays at the **transport layer**. It provides mechanisms, but not application-level policy. Coordination becomes the developer’s job. + +In practice, this often shows up as: + +**Stalls / “waiting forever”**—easy to trigger when application logic doesn’t coordinate well (not always a kernel deadlock; often waiting for an event that will never happen): + +```text +❌ Scenario 1: REQ/REP finite-state rules get violated + + # A REQ socket must follow: send -> recv -> send -> recv ... + A(REQ): send(req1) → send(req2) # second send may block or raise EFSM (binding/config dependent) + A(REQ): recv(resp1) # response ordering assumptions are also easy to get wrong + +❌ Scenario 2: DIY correlation (DEALER/ROUTER) misses a branch + + A: send(req, id=42) → await(resp, id=42) + B: receives req(id=42) but crashes/reconnects/forgets to reply in some branch + A: waits forever for resp(id=42) (no built-in paradigm forcing “must reply” or “must timeout”) +``` + +**App-perceived “message loss”**—common with PUB/SUB and during reconnect / subscription propagation: + +```text +❌ Scenario: subscription not established yet (or reconnect window) + + Pub: send(msg) # publisher emits + Sub: [not connected / subscription filter not delivered yet] # subscriber not “ready” + Sub: recv() # msg is not replayed (PUB/SUB does not retain history for late joiners) + +Notes: + - In non-PUB/SUB patterns, ZMQ often buffers in in-process memory queues; but those queues are in-process. + Process crashes, HWM limits, or non-blocking sends (EAGAIN) can still look like “it never arrived”. +``` + +**Backpressure “gotchas”**—when consumption can’t keep up with production: + +- ZMQ provides **High Water Mark (HWM)** and related knobs, but behavior depends heavily on socket type and send mode: + it may block, it may return EAGAIN (non-blocking), and in PUB/SUB some situations can drop. +- Structurally, it lacks an end-to-end, semantics-aligned backpressure paradigm: how to express *must be ordered, must not drop tokens*, + plus consistent cancellation/timeout/retry semantics. +- In AI inference (e.g. LLM token streaming), sustained producer > consumer often becomes a trade-off: buffer memory, drop data, or block and risk cascaded stalls. + +Developers end up implementing heartbeats, ACKs, retry queues, and connection state machines—often the most fragile part of the system. + +> ZMQ answers “how to let any node talk to any node”, but pushes “how to be reliable” onto every developer. 
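
A minimal `pyzmq` client sketch shows where that burden lands; the endpoint, the 2-second timeout, and the close-and-reconnect recovery below are assumptions made for the example, not anything ZMQ prescribes:

```python
# Illustrative only: a REQ client that must implement its own liveness policy.
# The endpoint, timeout, and recovery strategy are assumptions for this sketch.
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.connect("tcp://127.0.0.1:5555")   # hypothetical REP server

sock.send(b"do-work")

# ZMQ will not tell us whether a reply is ever coming; the application has to
# poll with a timeout and decide for itself what "give up" means.
poller = zmq.Poller()
poller.register(sock, zmq.POLLIN)
if dict(poller.poll(timeout=2000)).get(sock):
    print("reply:", sock.recv())
else:
    # The REQ socket is now stuck mid state-machine (it expects a recv);
    # the usual workaround is to discard it and reconnect before retrying.
    sock.close(linger=0)
    print("no reply within 2s: retry/backoff is the caller's problem")
```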
+ +--- + +## Generation 3: RPC — The Cost of Pretending to be Local + +### Call semantics: rescuing structure from ZMQ chaos + +Facing the “I sent a request, will I get a response?” problem, RPC wraps remote communication into **call semantics**. + +```python +# ZMQ: two steps; manual pairing; forget recv and you stall +socket.send(request) +response = socket.recv() + +# RPC: request/response are bound by language semantics +response = service.compute(request) +``` + +This is more than syntax sugar. Call semantics impose three constraints: + +1. **Request-response pairing**: you can’t “send and forget to receive”. +2. **Deadline/timeout mechanisms**: most frameworks support deadlines, but you typically must **set them explicitly** (many defaults can wait “forever”). +3. **IDL contracts**: Protobuf/IDL fixes the protocol at build time. + +### The fallacies of distributed computing + +RPC’s promise—**remote calls feel local**—is a leaky abstraction. Peter Deutsch’s “Eight Fallacies of Distributed Computing” remain true: + +- The network is reliable +- Latency is zero — in practice often \(10^3\) to \(10^6\)× gaps depending on serialization, kernel, network, and queuing +- Bandwidth is infinite +- The network is secure +- Topology doesn’t change +- There is one administrator +- Transport cost is zero +- The network is homogeneous + +RPC tends to encourage code that “pretends the network doesn’t exist”; when it fails, error handling becomes a patch afterwards. + +### The “microservice tax” + +RPC is fundamentally point-to-point and lacks a unified **cluster view**. In practice, as systems scale, teams often stack supporting infrastructure (not every system needs all of it, but it’s common at scale): + +```text +What a “simple” RPC often looks like in reality: + +[Service A] ──IPC── [Sidecar/Envoy] ──network── [Sidecar/Envoy] ──IPC── [Service B] + ▲ ▲ + │ Control Plane │ + └──── [Etcd] [Consul] ───────┘ + ▲ + [Prometheus] [Jaeger] [Sentinel] +``` + +- **Service discovery**: Etcd / Consul / ZooKeeper — where is the service? +- **Traffic governance**: Envoy / Nginx — LB, circuit breaking, rate limiting +- **Sidecars**: separating governance from business code + +Each component is a system you deploy and operate—this is the “microservice tax”. + +### The stateless fallacy + +For AI systems, RPC often conflicts with **stateless-first architecture**. + +Microservices frequently push stateless services (state in Redis/DB) for horizontal scaling and failover. This is great for CRUD, but often very costly for AI: + +- **KV cache**: LLM inference context must reside in GPU memory. Pulling from Redis, deserializing, and loading into GPU per request is too slow. +- **Agent memory**: multi-turn state and reasoning traces need to persist. +- **Model weights**: loading large models takes tens of seconds; per-request reload is impossible. + +> RPC answers “how to make remote calls reliable”, but its ecosystem tends to rely on bolt-on clustering, and stateless-first designs clash with AI reality. + +--- + +## Generation 4: Actor — Embracing Uncertainty + +### From masking to internalizing + +The first three generations respond to network uncertainty by eliminating it (MPI), exposing it (ZMQ), or masking it (RPC). The Actor model takes a fourth path: **make uncertainty part of the programming model**. + +Originating with Carl Hewitt (1973) and popularized by Erlang (1986) and Akka (2009), its core principles are: + +**1. 
Message passing, not function calls** + +Actors don’t “call” each other; they **deliver messages**. This small shift changes everything: + +- Sending is async and non-blocking—senders don’t stall because receivers are slow. +- Each Actor has a **mailbox** as a natural buffer, decoupling producers and consumers. +- Delivery is “best-effort”, forcing failure to be part of the design, not an afterthought. + +**2. Private state, not shared memory** + +Actors encapsulate private state, accessible only via messages: + +- No locks, no races—concurrency safety is architecturally guaranteed. +- State is memory-resident—naturally fitting KV caches and agent memory. + +**3. “Let it crash”** + +Instead of preventing all failures, accept failures as normal and build **supervision**: parent actors monitor children and apply policies (restart/stop/escalate) when failures happen. + +```text + [Supervisor] + / \ + [Worker A] [Worker B] + ↓ ↓ + crash! running + ↓ + auto-restart ✓ +``` + +This “isolate + restart” pattern gives the system self-healing properties: a single actor crash doesn’t take down the whole system. + +### Why actors synthesize the strengths of the previous three + +| Previous pain point | Actor answer | +|----------|-------------| +| MPI: hard to express irregular comms | any actors can communicate anytime | +| ZMQ: no coordination paradigm | Ask/Tell/Stream provide semantic constraints | +| RPC: lacks cluster capability | built-in discovery + location transparency | +| RPC: stateless limitation | actors are naturally stateful, memory-resident | +| All: fragile fault tolerance | supervision + automatic restart | + +--- + +## Pulsing: Modern Actors Powered by Rust + +The Actor model is conceptually strong, but earlier implementations had trade-offs—Erlang’s performance ceiling, Akka’s JVM GC pauses, Ray’s Python+C++ complexity. + +Pulsing modernizes the Actor model using **Rust + Tokio + HTTP/2**, with concrete engineering decisions across five dimensions. + +### 1. Zero-dependency clustering: Gossip + SWIM + +RPC stacks often need bolt-on Etcd/Consul for discovery. Pulsing builds clustering into the Actor System. + +Each node only needs a seed address. Membership is exchanged via Gossip; failures are detected via SWIM: + +```text +Startup (Kubernetes example): + + New Pod Service IP Existing Pods + │ │ │ + ├── Probe 1 (Join) ────▶ ├── routed to Pod A ─▶ │ + │◀── Welcome [A] ───────┤ │ + │ │ │ + ├── Probe 2 (Join) ────▶ ├── routed to Pod B ─▶ │ + │◀── Welcome [A,B] ─────┤ │ + │ │ │ + └── normal Gossip begins ────────────────────────┘ + then every ~200ms sync with random peers +``` + +On failure, SWIM’s Ping → Suspect → Dead state machine cleans up membership and actor registrations: + +- Ping timeout → Suspect +- Suspect timeout → Dead (removed from members) +- actor registrations on the dead node are cleaned up + +**Key design**: all communication (actor messages + gossip + health checks) shares a single HTTP port—simplifying networking and firewall rules. + +### 2. Location-transparent addressing: a unified Actor URI scheme + +In RPC, clients must know service endpoints. 
Pulsing provides a URI-style address scheme so physical location is transparent: + +```text +actor:///services/llm/router → named actor (cluster-routed) +actor:///services/llm/router@node_a → specific node instance +actor://node_a/worker_123 → globally precise address +actor://localhost/worker_123 → local shorthand +``` + +The same named actor can run multiple instances across nodes; the system load-balances automatically: + +```python +# deployed on Node A +await system.spawn(LLMRouter(), name="services/llm/router") + +# deployed on Node B (same name) +await system.spawn(LLMRouter(), name="services/llm/router") + +# resolved from any node; instance chosen automatically +router = await system.resolve("services/llm/router") +result = await router.ask(request) # routed to A or B +``` + +This removes the need for external LBs while keeping the API minimal. + +### 3. Stateful orchestration: Actors as owners of state + +This is the most fundamental advantage of Actors over stateless RPC. In Pulsing, actors are **owners** of state: + +```python +@pul.remote +class InferenceWorker: + def __init__(self, model_path: str): + self.model = load_model(model_path) # weights stay in memory + self.kv_cache = {} # KV cache stays in memory + + async def generate(self, prompt: str): + # reuse in-memory model and cache; no per-request DB load + for token in self.model.generate(prompt, cache=self.kv_cache): + yield {"token": token} +``` + +Implications: +- **KV cache residency**: no per-request serialize/deserialize round-trips. +- **Weight residency**: load once, serve for the lifetime of the process. +- **Agent memory residency**: multi-turn context lives inside actor state. + +### 4. Streaming & backpressure: HTTP/2 end-to-end flow control + +LLM token streaming is a canonical AI communication pattern. Traditional RPC streaming can be awkward; ZMQ can be tricky under backpressure. + +Pulsing uses **h2c (HTTP/2 over cleartext)** to leverage HTTP/2 multiplexing and flow control: + +**Connection multiplexing**: nodes keep one long-lived TCP connection; Ask/Tell/Stream map to independent HTTP/2 streams. No per-actor connection explosion. + +**End-to-end backpressure**: HTTP/2 Flow Control Windows automatically throttle producers when consumers slow down: + +```text +When producer rate > consumer rate: + + [LLM Actor] [Network] [Client] + │ │ │ + │── Token ────────▶ │── Token ────────▶ │ + │── Token ────────▶ │── Token ────────▶ │ ← client slows + │── Token ────────▶ │ H2 window full │ + │ send() Pending │◀─ stop ACKs ─────│ ← window exhausted + │ ← generation pauses automatically │ + │ │ │ ← client catches up + │ send() resumes │◀─ Window Update ─│ ← window freed + │── Token ────────▶ │── Token ────────▶ │ +``` + +No user-written flow-control code is required: Rust `Future` Pending semantics and HTTP/2 flow control compose naturally, propagating backpressure from network to application. + +### 5. Type safety: compile-time communication contracts + +MPI often manipulates `void*`; ZMQ sends blobs; type errors surface at runtime. Pulsing pushes contracts to compile time via Rust’s type system. 
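
For contrast, here is a hedged sketch of the blob-style baseline (the field name and payloads are invented for illustration), where a malformed message is only caught when the receiver decodes it:

```python
# Illustrative only: a blob-style handler whose "contract" exists nowhere but in
# the handler's head; the field name and payloads are made up for this sketch.
import json

def handle(raw: bytes) -> int:
    msg = json.loads(raw)        # nothing constrains what the sender actually put here
    return msg["increment"] + 1  # a wrong shape fails here, at runtime, on the receiver

print(handle(b'{"increment": 5}'))   # 6: happy path

try:
    handle(b'"hello"')               # the transport delivered it without complaint...
except (TypeError, KeyError) as exc:
    print("runtime failure:", exc)   # ...and the violation only surfaces here
```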
+ +With the Behavior API (inspired by Akka Typed), message types are checked at compile time: + +```rust +// A type-safe counter actor +fn counter(initial: i32) -> Behavior { + stateful(initial, |count, msg, ctx| match msg { + CounterMsg::Increment(n) => { + *count += n; + BehaviorAction::Same // keep current behavior + } + CounterMsg::Reset => { + *count = 0; + BehaviorAction::Same + } + }) +} + +// TypedRef ensures only CounterMsg can be sent +let counter: TypedRef = system.spawn(counter(0)); +counter.tell(CounterMsg::Increment(5)).await?; // ✅ compiles +// counter.tell("hello").await?; // ❌ compile error +``` + +More importantly, `BehaviorAction::Become` allows safe behavior switches—natural for building agent state machines like `Idle → Thinking → Answering → Idle`. + +--- + +## Summary: Positioning & Collaboration + +The evolution of these four generations is about answering progressively deeper questions: + +| Generation | Core question | Answer | +|------|----------|------| +| MPI | How do many processes synchronize data efficiently? | collectives + BSP | +| ZMQ | How can any two processes talk at any time? | async messaging primitives | +| RPC | How can remote communication feel reliable like local calls? | call semantics + IDL | +| Actor | How can a group of processes behave like one system? | message passing + supervision + cluster awareness | + +Each generation doesn’t negate the previous; it complements it in new dimensions. + +If **MPI/NCCL** is the **superhighway** of AI infrastructure—the data plane for massive tensor transport—then **Pulsing** is the **intelligent traffic control system**—the control plane for complex orchestration: + +- not rigid like MPI; adapts to dynamic agent topologies +- not primitive like ZMQ; provides governance and clear paradigms +- not heavy like RPC stacks; remains lightweight and low-latency +- naturally stateful; fits KV cache and agent memory realities + +In the spiral of technology, the Actor model—powered by Rust—returns as a strong primitive for the next generation of distributed intelligent systems. + +--- + +## Further Reading + +- [Actor System Design](actor-system.md) — Pulsing core architecture +- [HTTP/2 Transport](http2-transport.md) — streaming and backpressure details +- [Node Discovery](node-discovery.md) — Gossip + SWIM clustering +- [Actor Addressing](actor-addressing.md) — URI address system +- [Behavior API](behavior.md) — type-safe functional actors diff --git a/docs/src/design/cluster-communication-evolution.zh.md b/docs/src/design/cluster-communication-evolution.zh.md new file mode 100644 index 000000000..fc5f9426f --- /dev/null +++ b/docs/src/design/cluster-communication-evolution.zh.md @@ -0,0 +1,450 @@ +# 集群内通信技术演进:从 MPI 到现代 Actor + +在构建分布式系统(特别是现代 AI 基础设施)时,架构师面临的首要决策往往是:**我们该如何对话?** + +通信范式的选择决定了系统的基因。本文将从技术演进的视角,深度剖析分布式通信技术的四个代际:MPI、ZMQ、RPC 以及 Actor。我们将深入探讨每一代技术试图解决的根本问题,以及它们为此付出的代价。 + +最终,我们将揭示为什么在 Rust 生态和 AI 编排的交汇点上,**Actor 模型**正在经历一次现代化的复兴,并成为 Pulsing 框架的核心基石。 + +--- + +## 演进总览:复杂性守恒定律 + +分布式系统的核心挑战在于**不确定性**(网络延迟、节点故障、消息乱序)。通信技术的演进,本质上是关于"谁来处理这些不确定性"的权力转移过程。 + +```mermaid +graph LR + A["MPI: 消除不确定性
(假设世界是确定的)"] --> B["ZMQ: 暴露不确定性
(给你工具,自己处理)"] + B --> C["RPC: 掩盖不确定性
(假装网络不存在)"] + C --> D["Actor: 内化不确定性
(将其纳入编程模型)"] +``` + +| 维度 | MPI | ZMQ | RPC | Actor | +|------|-----|-----|-----|-------| +| **核心隐喻** | 军队 (同步行进) | 对讲机 (自由频道) | 电话 (一对一呼叫) | 邮件系统 (异步投递) | +| **控制平面** | 静态 (启动即固定) | 手动 (开发者自建) | 外挂 (Service Mesh) | 内建 (Gossip/成员管理/寻址) | +| **状态管理** | 紧耦合 (SPMD) | 无 | 外部化 (Redis/DB) | 内存驻留 (Stateful) | +| **通信基座** | TCP/RDMA (长连接) | TCP (Socket抽象) | TCP/HTTP/2/QUIC(取决于实现;如 gRPC=HTTP/2) | **HTTP/2 (多路复用)** | +| **容错哲学** | 崩溃即停止 | 依赖开发者 | 重试 + 熔断 | Let it crash +(按策略恢复) | +| **适用领域** | 数据面 (Tensor 同步) | 传输层 (底层管道) | 业务面 (CRUD 服务) | 控制面 (复杂编排) | + +--- + +## 第一代:MPI —— 静态世界的极致性能 + +### 刚性同步的霸主 + +MPI (Message Passing Interface) 诞生于高性能计算 (HPC) 领域。它的世界观是**静态且完美**的:所有节点在启动时都已就绪,网络是可靠的,计算任务是均匀的。 + +在 BSP (Bulk Synchronous Parallel) 模型下,MPI 将通信抽象为**集合操作 (Collectives)**——所有参与者在同一时刻执行相同的通信操作: + +```text + 计算阶段 Barrier 通信阶段 +Rank 0: [██████████] ───── ▐ ──────────▶ AllReduce +Rank 1: [████████] ───── ▐ ──────────▶ AllReduce +Rank 2: [████████████] ──── ▐ ──────────▶ AllReduce +Rank 3: [████████] ───── ▐ ──────────▶ AllReduce + ▲ + 木桶效应:最慢的节点决定整体速度 +``` + +这种"所有人必须到齐"的刚性同步,是 MPI 性能极致的来源,也是它局限性的根源。 + +### 为什么 MPI 至今无法被替代? + +在 AI 训练的**数据面 (Data Plane)**——例如数据并行中的梯度同步(AllReduce)——MPI(及其 GPU 特化版本 NCCL)仍然是绝对的统治者。 + +原因在于**通信模式的预定义性**带来的优化空间: +- 通信拓扑是已知的(Ring, Tree, Recursive Halving-Doubling),可以针对硬件拓扑(NVLink, InfiniBand, PCIe)做精确的路径规划。 +- Buffer 大小是已知的,可以做 DMA 零拷贝、Pipeline 重叠(计算与通信重叠)。 +- 参与者集合是固定的,可以做编译期的通信调度优化。 + +换句话说,MPI 的威力来自于**对确定性世界的极致利用**。 + +### 局限:确定性假设的崩塌 + +然而当 AI 系统从单纯的数据并行演化到更复杂的形态时,MPI 的"确定性假设"开始崩塌: + +| 场景 | MPI 的假设 | 现实 | +|------|------------|------| +| 流水线并行 | 所有 Stage 同构 | Stage 间计算量差异大,需要不规则的点对点通信 | +| 推理服务 | 负载均匀 | 请求到达率随机,需要动态负载均衡 | +| Agent 协作 | 通信拓扑固定 | Agent 间的协作关系在运行时动态变化 | +| 弹性训练 | 节点数固定 | 节点可能故障、可能需要扩缩容 | + +更致命的是,MPI 的**容错几乎为零**——一个 Rank 挂掉,整个 Communicator 就需要重建。在千卡集群上,节点故障不是"如果"的问题,而是"多久一次"的问题。 + +> MPI 解决了"如何在确定世界中高效通信",但它无法回答"当世界变得不确定时怎么办"。 + +--- + +## 第二代:ZMQ —— 自由但危险的积木 + +### 从集合通信到点对点 + +MPI 的核心局限在于:它假设所有节点在每个通信步骤中都参与。当我们只需要"节点 A 给节点 B 发一条消息"时,MPI 的集合操作就显得笨重了。 + +ZeroMQ (ØMQ) 应运而生。它自称"带电池的 Socket",提供了一套灵活的异步消息原语,让任意两个节点可以随时通信: + +```text +MPI 的世界(规则通信): ZMQ 的世界(自由通信): + + 0 ←──AllReduce──▶ 1 0 ──push──▶ 1 + ▲ ▲ │ │ + │ AllReduce │ ▼ ▼ + ▼ ▼ 2 ◀──req──── 3 + 2 ←──AllReduce──▶ 3 pub + │ + 所有人必须参与 4 ◀──sub── 5 + 任意连接,任意时刻 +``` + +ZMQ 通过 REQ/REP、PUB/SUB、PUSH/PULL、DEALER/ROUTER 等 Socket 模式,让开发者能灵活地搭建任意通信拓扑。它实现了**真正的异步 IO 和多路复用**,性能极高。 + +### 问题:只有机制 (Mechanism),没有策略 (Policy) + +ZMQ 的伟大之处在于它解构了通信原语,但它刻意停留在**传输层**,拒绝提供应用层的通信策略。这意味着所有**协调的责任**被完全交给了开发者。 + +在实际项目中,这导致了几类典型的痛点: + +**卡死/等不到消息**——两端逻辑配合不当时很容易发生(这不一定是底层 socket 的“死锁”,更多是应用协议层面的“等待永远不会发生的事件”): + +```text +❌ 场景 1:REQ/REP 的状态机约束被破坏 + + # REQ socket 必须严格遵循:send -> recv -> send -> recv ... 
+ A(REQ): send(req1) → send(req2) # 第二次 send 可能阻塞或报 EFSM(取决于语言绑定/配置) + A(REQ): recv(resp1) # 逻辑上期望的 resp1/resp2 顺序也容易被写乱 + +❌ 场景 2:DEALER/ROUTER 自己实现相关性(correlation)时漏掉一种分支 + + A: send(req, id=42) → await(resp, id=42) + B: 收到 req(id=42),处理过程中异常退出/重连/逻辑分支漏回包 + A: 永远等不到 id=42 的 resp(没有统一的范式来强制“必回包/必超时”) +``` + +**消息“看似丢失”**——常见于 PUB/SUB 以及重连/订阅传播阶段: + +```text +❌ 场景:订阅尚未建立(或重连期间) + + Pub: send(msg) # 发布端发出 + Sub: [尚未 connect / 尚未发送订阅过滤器] # 订阅者还没“准备好” + Sub: recv() # 这条 msg 不会补发(PUB/SUB 默认不为晚到订阅者保留历史) + +说明: + - 对于非 PUB/SUB 模式,ZMQ 往往会在发送端/中间层做内存队列缓冲;但队列是内存态的, + 进程崩溃、队列达到 HWM、或使用 non-blocking 发送时,都可能表现为“消息没到/没了”。 +``` + +**背压 (Backpressure) 失控**——当消费速度跟不上生产速度时: + +- ZMQ 提供了 **高水位 (HWM)** 等机制,但其行为强烈依赖 socket 类型与发送方式:可能阻塞、可能返回 EAGAIN(non-blocking),PUB/SUB 在一些情况下还可能直接丢弃。 +- 更关键的是:它缺少一个“端到端”的、与业务语义对齐的背压范式(比如:如何表达 *必须有序、不丢 token*,以及 *取消/超时/重试* 的一致语义)。 +- 在 AI 推理场景中(如 LLM Token 流式生成),一旦生产速度持续高于消费速度,就容易陷入“要么堆内存、要么丢数据、要么阻塞卡死”的工程权衡。 + +开发者不得不手动实现心跳协议、ACK 确认、重试队列、连接状态机……这些 ad-hoc 代码往往成为系统中最脆弱的部分。 + +> ZMQ 解决了"如何让任意节点随时通信",但它把"如何保证通信可靠"这个更难的问题留给了每一个开发者。 + +--- + +## 第三代:RPC —— 伪装成本地调用的代价 + +### Call 语义:对 ZMQ 混乱的救赎 + +面对 ZMQ "发了请求不知道收不收得到响应"的困境,RPC (Remote Procedure Call) 给出了一个优雅的解法:**用函数调用的语义来封装远程通信**。 + +```python +# ZMQ:两步操作,需要手动配对,忘记 recv 就卡死 +socket.send(request) +response = socket.recv() + +# RPC:一行代码,请求和响应在语法层面就是绑定的 +response = service.compute(request) +``` + +这不仅仅是语法糖。RPC 的 Call 语义带来了三个关键约束: + +1. **请求-响应自动配对**:不可能出现"发了请求忘记收响应"。 +2. **超时/截止时间机制**:多数 RPC 框架提供 deadline/timeout,但通常需要**显式设置**(不少实现默认可以“无限等”)。 +3. **接口契约 (IDL)**:通过 Protobuf 等接口定义语言,编译期就确定了通信协议。 + +### "分布式计算的误区" + +然而 RPC 的美好愿景——**让远程调用看起来像本地调用**——本身就是一个危险的抽象泄漏。Peter Deutsch 提出的“分布式计算的八大误区 (Fallacies of Distributed Computing)”至今仍然成立: + +- 网络是可靠的(The network is reliable) +- 延迟为零(Latency is zero)——现实中常见差异在 \(10^3\) 甚至 \(10^6\) 倍量级(取决于序列化、内核、网络、排队) +- 带宽是无限的(Bandwidth is infinite) +- 网络是安全的(The network is secure) +- 拓扑不变(Topology doesn't change) +- 只有一个管理员(There is one administrator) +- 传输成本为零(Transport cost is zero) +- 网络是同质的(The network is homogeneous) + +RPC 试图掩盖这些现实,诱导开发者写出"假装网络不存在"的代码。当网络真的出问题时,错误处理往往是事后补丁。 + +### "微服务税" (The Microservice Tax) + +RPC 更深层的问题在于:它本质上只是一个**点对点的调用协议**,没有"集群"的统一视图。为了让 RPC 在大规模系统中可用,工程上经常会在它之上堆叠大量配套设施(不一定每个系统都需要,但一旦规模上来就很常见): + +```text +一个"简单"的 RPC 调用,实际经过的路径: + +[Service A] ──IPC── [Sidecar/Envoy] ──network── [Sidecar/Envoy] ──IPC── [Service B] + ▲ ▲ + │ Control Plane │ + └──── [Etcd] [Consul] ───────┘ + ▲ + [Prometheus] [Jaeger] [Sentinel] +``` + +- **服务发现**:Etcd / Consul / ZooKeeper —— 告诉客户端"服务在哪" +- **流量治理**:Envoy / Nginx —— 负载均衡、熔断限流 +- **Sidecar 模式**:把治理逻辑从业务代码中剥离出去 + +每一项都是独立的系统,都需要部署和运维。这就是"微服务税"——你以为你只是在写业务逻辑,但实际上你在运维一个庞大的基础设施帝国。 + +### 无状态的谬误 + +对于 AI 系统来说,RPC 还有一个更根本的冲突:**无状态 (Stateless) 设计**。 + +传统微服务架构常推崇无状态——每个请求都是独立的,状态存储在 Redis/DB 中,以便水平扩展与故障迁移。这对 CRUD 业务很合理,但对 AI 场景往往代价极高: + +- **KV Cache**:LLM 推理的上下文缓存需要驻留在 GPU 显存中。每次请求从 Redis 拉取、反序列化、加载到 GPU,延迟不可接受。 +- **Agent Memory**:智能体在多轮对话中积累的记忆和推理状态,天然需要持久驻留。 +- **模型权重**:加载一个大模型需要几十秒,不可能每个请求都重新加载。 + +> RPC 解决了"如何让远程通信可靠",但它缺乏集群视角,且它的无状态哲学与 AI 场景天然冲突。 + +--- + +## 第四代:Actor —— 拥抱不确定性 + +### 从掩盖到内化 + +前三代技术对网络不确定性的态度分别是:消除它 (MPI)、暴露它 (ZMQ)、掩盖它 (RPC)。Actor 模型选择了第四条路:**将不确定性内化为编程模型的一部分**。 + +这个思想最早由 Carl Hewitt 在 1973 年提出,后来被 Erlang (1986) 和 Akka (2009) 发扬光大。它的核心原则: + +**1. 消息传递,而非函数调用** + +Actor 之间不"调用"彼此,而是"投递消息"。这个看似微小的区别,带来了根本性的变化: + +- 发送是异步的、非阻塞的——发送端不会因为接收端慢而卡住。 +- 每个 Actor 有自己的**邮箱 (Mailbox)**,作为天然的缓冲区,解耦了生产者和消费者。 +- 消息的投递语义是"尽力而为",而不是"保证到达"——这迫使开发者从设计之初就考虑失败场景。 + +**2. 
私有状态,而非共享内存** + +每个 Actor 封装自己的状态,外部只能通过消息访问。这意味着: +- 没有锁、没有竞态条件——并发安全是架构层面保证的。 +- 状态是**内存驻留**的——天然适合 AI 场景中的 KV Cache、Agent Memory。 + +**3. "Let it crash" 哲学** + +与其试图预防所有故障(如 MPI 的做法),不如承认故障是常态,并建立**监督 (Supervision)** 机制:父 Actor 监控子 Actor,当子 Actor 崩溃时,按照预定义的策略(重启/停止/上报)进行处理。 + +```text + [Supervisor] + / \ + [Worker A] [Worker B] + ↓ ↓ + 崩溃! 正常运行 + ↓ + 自动重启 ✓ +``` + +这种"隔离 + 重启"的模式,让系统具备了**自愈能力**——单个 Actor 的失败不会拖垮整个系统。 + +!!! note + 这里的“监督/重启”描述的是 Actor 模型的经典做法(如 Erlang/Akka)。在 Pulsing 中,目前提供的是 **actor 级别的重启策略**(如 Python `@pul.remote(restart_policy=...)`),并不等同于完整的 supervision tree(后者通常涉及父子层级、策略传播与结构化的失败处理)。 + +### 为什么 Actor 能综合前三代的优势? + +回顾四代技术的痛点,Actor 模型的回答是系统性的: + +| 前代痛点 | Actor 的回答 | +|----------|-------------| +| MPI:无法不规则通信 | 任意 Actor 之间可以随时通信 | +| ZMQ:缺乏协调范式 | Ask/Tell/Stream 提供了明确的语义约束 | +| RPC:缺乏集群能力 | 内建服务发现、位置透明寻址 | +| RPC:无状态限制 | Actor 天然有状态,状态内存驻留 | +| 所有:容错脆弱 | 监督 + 自动重启 | + +--- + +## Pulsing:Rust 加持下的现代 Actor + +Actor 模型的思想很好,但早期的实现各有不足——Erlang 的性能天花板、Akka 受限于 JVM 的 GC 暂停、Ray 的 Python+C++ 双层架构的复杂性。 + +Pulsing 代表了 Actor 模型的一次**现代化重构**,它利用 **Rust + Tokio + HTTP/2** 技术栈,在四个核心维度上做出了具体的工程决策。 + +### 1. 零依赖组网:Gossip + SWIM 自动集群 + +RPC 系统需要外挂 Etcd/Consul 来做服务发现。Pulsing 将集群能力**内建**到了 Actor System 中。 + +每个 Pulsing 节点启动时,只需知道一个种子地址。节点通过 Gossip 协议自动交换成员信息,通过 SWIM 协议检测故障: + +```text +启动流程(以 Kubernetes 为例): + + New Pod Service IP Existing Pods + │ │ │ + ├── Probe 1 (Join) ────▶ ├── 路由到 Pod A ────▶ │ + │◀── Welcome [A] ───────┤ │ + │ │ │ + ├── Probe 2 (Join) ────▶ ├── 路由到 Pod B ────▶ │ + │◀── Welcome [A,B] ─────┤ │ + │ │ │ + └── 开始正常 Gossip ─────────────────────────────┘ + 此后每 200ms 与随机节点同步状态 +``` + +节点故障时,SWIM 协议通过 Ping → Suspect → Dead 状态机自动检测并清理: + +- Ping 超时 → 标记为 Suspect +- Suspect 超时 → 标记为 Dead,从成员列表移除 +- 该节点上的所有 Actor 注册信息被自动清理 + +**关键设计:所有通信(Actor 消息 + Gossip 协议 + 健康检查)共用一个 HTTP 端口**。这极大简化了网络配置和防火墙规则。 + +### 2. 位置透明寻址:统一的 Actor 地址体系 + +RPC 系统中,客户端必须知道服务的地址(IP:Port)。Pulsing 设计了一套 URI 风格的地址体系,让 Actor 的物理位置对使用者完全透明: + +```text +actor:///services/llm/router → 具名 Actor(集群自动路由) +actor:///services/llm/router@node_a → 指定节点实例 +actor://node_a/worker_123 → 全局精确地址 +actor://localhost/worker_123 → 本地快捷引用 +``` + +同一个具名 Actor 可以在多个节点上部署实例,系统自动做负载均衡: + +```python +# Node A 上部署 +await system.spawn(LLMRouter(), name="services/llm/router") + +# Node B 上也部署同名 Actor +await system.spawn(LLMRouter(), name="services/llm/router") + +# 从任意节点访问——系统自动选择实例 +router = await system.resolve("services/llm/router") +result = await router.ask(request) # 可能路由到 A 或 B +``` + +这解决了 RPC 需要外挂负载均衡器的问题,同时保持了 API 的极简。 + +### 3. 有状态编排:Actor 作为状态的 Owner + +这是 Actor 相对于无状态 RPC 的最根本优势。在 Pulsing 中,Actor 是状态的**所有者 (Owner)**: + +```python +@pul.remote +class InferenceWorker: + def __init__(self, model_path: str): + self.model = load_model(model_path) # 模型权重常驻内存 + self.kv_cache = {} # KV Cache 常驻内存 + + async def generate(self, prompt: str): + # 直接使用内存中的模型和缓存,无需每次从 DB 加载 + for token in self.model.generate(prompt, cache=self.kv_cache): + yield {"token": token} +``` + +这意味着: +- **KV Cache 常驻**:推理上下文不需要在请求间序列化/反序列化。 +- **模型权重常驻**:加载一次,服务终身。 +- **Agent 记忆常驻**:多轮对话的上下文直接存在 Actor 内存中。 + +### 4. 
流式与背压:HTTP/2 驱动的端到端流控 + +LLM 的 Token 流式生成是 AI 场景中最典型的通信模式。传统 RPC 处理流式响应往往比较笨拙,ZMQ 则容易在背压处理上失控。 + +Pulsing 基于 **h2c (HTTP/2 over cleartext)** 设计了传输层,天然利用 HTTP/2 的多路复用和流控机制: + +**连接复用**:节点间维持一条 TCP 长连接,所有 Actor 消息(Ask / Tell / Stream)作为独立的 HTTP/2 Streams 并行传输。不需要为每个 Actor 对建立独立连接。 + +**端到端背压**:利用 HTTP/2 的 Flow Control Window 机制,实现从消费端到生产端的自动减速: + +```text +Token 生成速度 > 消费速度时,自动减速过程: + + [LLM Actor] [Network] [Client] + │ │ │ + │── Token ────────▶ │── Token ────────▶ │ + │── Token ────────▶ │── Token ────────▶ │ ← 处理变慢 + │── Token ────────▶ │ H2 窗口填满 │ + │ send() Pending │◀─ 不再发送 WINDOW_UPDATE ─│ ← 窗口耗尽 + │ ← 自动暂停生成 │ │ + │ │ │ ← 处理完成 + │ send() 恢复 │◀─ Window Update ──│ ← 释放窗口 + │── Token ────────▶ │── Token ────────▶ │ +``` + +整个过程无需用户编写任何流控代码——HTTP/2 的流量控制通过 WINDOW_UPDATE 控制可发送字节额度;当额度耗尽,发送侧写入会自然阻塞/挂起。Rust `Future` 的 Pending 机制与 HTTP/2 Flow Control 能很好地组合,让背压从网络层自然传导到应用层。 + +### 5. 类型安全:编译期的通信契约 + +MPI 操作 `void*` 裸指针,ZMQ 传输二进制 Blob,消息类型错误只能在运行时发现。Pulsing 利用 Rust 的类型系统,将通信契约提前到了编译期。 + +通过 Behavior API(灵感来自 Akka Typed),Actor 的消息类型在编译时就被检查: + +```rust +// 定义一个类型安全的计数器 Actor +fn counter(initial: i32) -> Behavior { + stateful(initial, |count, msg, ctx| match msg { + CounterMsg::Increment(n) => { + *count += n; + BehaviorAction::Same // 保持当前行为 + } + CounterMsg::Reset => { + *count = 0; + BehaviorAction::Same + } + }) +} + +// TypedRef 保证只能发送 CounterMsg 类型的消息 +let counter: TypedRef = system.spawn(counter(0)); +counter.tell(CounterMsg::Increment(5)).await?; // ✅ 编译通过 +// counter.tell("hello").await?; // ❌ 编译错误 +``` + +更重要的是,`BehaviorAction::Become` 支持 Actor 在不同状态间安全切换——这对于实现 AI Agent 的状态机(如 `Idle → Thinking → Answering → Idle`)非常自然。 + +--- + +## 总结:定位与协作 + +四代通信技术的演进,本质上是在回答一个不断深化的问题: + +| 代际 | 核心问题 | 回答 | +|------|----------|------| +| MPI | 如何让一群进程高效地同步数据? | 集合操作 + BSP 同步 | +| ZMQ | 如何让任意两个进程随时通信? | 异步消息原语 | +| RPC | 如何让远程通信像本地调用一样可靠? | Call 语义 + IDL 契约 | +| Actor | 如何让一群进程作为一个系统协同工作? | 消息传递 + 监督 + 集群感知 | + +每一代技术都不是对上一代的否定,而是在新的维度上的补全。 + +如果说 **MPI/NCCL** 是 AI 基础设施的**高速公路**——负责大规模张量传输的数据面,那么 **Pulsing** 就是**智能交通指挥系统**——负责复杂编排逻辑的控制面: + +- 它不像 MPI 那样僵化,能适应动态的 Agent 拓扑。 +- 它不像 ZMQ 那样原始,提供了完善的集群治理和通信范式。 +- 它不像 RPC 那样依赖繁重的外挂设施,保持了极致的轻量和低延迟。 +- 它天然支持有状态编排,契合 AI 场景中 KV Cache 和 Agent Memory 的需求。 + +在技术演进的螺旋中,Actor 模型借由 Rust 的力量,再次成为构建下一代分布式智能系统的最佳原语。 + +--- + +## 进一步阅读 + +- [Actor 系统设计](actor-system.zh.md) —— Pulsing 核心架构 +- [HTTP/2 传输层](http2-transport.zh.md) —— 流式与背压的实现细节 +- [节点发现](node-discovery.zh.md) —— Gossip + SWIM 的组网机制 +- [Actor 寻址](actor-addressing.zh.md) —— URI 地址体系设计 +- [Behavior API](behavior.zh.md) —— 类型安全的函数式 Actor