diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
index 4b54b8c5e..d46a09fc5 100644
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -118,6 +118,7 @@ plugins:
HTTP2 Transport: HTTP2 传输
Load Sync: 负载同步
AS Actor Decorator: AS Actor 装饰器
+ Communication Evolution: 集群内通信技术演进
- mkdocstrings:
handlers:
python:
@@ -174,6 +175,7 @@ nav:
- HTTP2 Transport: design/http2-transport.md
- Load Sync: design/load_sync.md
- AS Actor Decorator: design/as-actor-decorator.md
+ - Communication Evolution: design/cluster-communication-evolution.md
extra:
generator: false
diff --git a/docs/src/design/cluster-communication-evolution.md b/docs/src/design/cluster-communication-evolution.md
new file mode 100644
index 000000000..4a2454b72
--- /dev/null
+++ b/docs/src/design/cluster-communication-evolution.md
@@ -0,0 +1,448 @@
+# Evolution of In-Cluster Communication: From MPI to Modern Actors
+
+When building distributed systems—especially modern AI infrastructure—one of the earliest architectural decisions is: **how should components talk to each other?**
+
+Communication paradigms shape a system’s DNA. This article reviews four generations of in-cluster communication—MPI, ZMQ, RPC, and Actors—focusing on what each generation tried to solve, and what it *cost* in terms of complexity and constraints.
+
+Finally, we explain why the **Actor model** is experiencing a modern renaissance at the intersection of the Rust ecosystem and AI orchestration—and why it forms the foundation of the Pulsing framework.
+
+---
+
+## Overview: The Law of Conservation of Complexity
+
+The core challenge of distributed systems is **uncertainty** (latency, failures, reordering). The evolution of communication technology is essentially a shift of responsibility: **who absorbs and manages that uncertainty**.
+
+```mermaid
+graph LR
+ A["MPI: Eliminate uncertainty
(Assume a deterministic world)"] --> B["ZMQ: Expose uncertainty
(Tools are given; you handle it)"]
+ B --> C["RPC: Mask uncertainty
(Pretend the network doesn't exist)"]
+ C --> D["Actor: Internalize uncertainty
(Make it part of the programming model)"]
+```
+
+| Dimension | MPI | ZMQ | RPC | Actor |
+|------|-----|-----|-----|-------|
+| **Core metaphor** | Army (marching in sync) | Walkie-talkie (open channels) | Phone call (1-to-1) | Mail system (async delivery) |
+| **Control plane** | Static (fixed at launch) | Manual (built by developers) | Bolt-on (Service Mesh) | Built-in (Gossip/Supervision) |
+| **State management** | Tightly coupled (SPMD) | None | Externalized (Redis/DB) | Memory-resident (stateful) |
+| **Transport base** | TCP/RDMA (long-lived) | TCP (socket abstraction) | TCP/HTTP/2/QUIC (implementation-dependent; e.g. gRPC = HTTP/2) | **HTTP/2 (multiplexing)** |
+| **Fault philosophy** | Crash-stop | Dev-dependent | Retry + circuit-break | Let it crash + restart |
+| **Best fit** | Data plane (tensor sync) | Transport plumbing | Business plane (CRUD services) | Control plane (complex orchestration) |
+
+---
+
+## Generation 1: MPI — Extreme Performance in a Static World
+
+### The monarch of rigid synchronization
+
+MPI (Message Passing Interface) was born in HPC. Its worldview is **static and perfect**: all nodes are ready at startup, the network is reliable, and compute loads are balanced.
+
+Under the BSP (Bulk Synchronous Parallel) model, MPI abstracts communication as **collectives**—all participants perform the same communication step at the same time:
+
+```text
+ Compute phase Barrier Communication phase
+Rank 0: [██████████] ───── ▐ ──────────▶ AllReduce
+Rank 1: [████████] ───── ▐ ──────────▶ AllReduce
+Rank 2: [████████████] ──── ▐ ──────────▶ AllReduce
+Rank 3: [████████] ───── ▐ ──────────▶ AllReduce
+ ▲
+         Straggler effect: the slowest rank dictates the pace
+```
+
+This “everyone must arrive” rigidity is both the source of MPI’s peak performance and the root of its limitations.
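+
+As a concrete reference point, a data-parallel gradient sync in this model takes only a few lines—every rank computes locally, then blocks in the same collective. A minimal sketch using mpi4py and NumPy (the array shape and the averaging step are illustrative):
+
+```python
+# Run with e.g.: mpirun -n 4 python allreduce_demo.py
+import numpy as np
+from mpi4py import MPI
+
+comm = MPI.COMM_WORLD
+rank = comm.Get_rank()
+
+# Each rank computes its local "gradient"...
+local_grad = np.full(4, float(rank), dtype=np.float64)
+
+# ...then every rank blocks at the collective until all have arrived.
+global_grad = np.empty_like(local_grad)
+comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
+global_grad /= comm.Get_size()  # average across ranks
+
+print(rank, global_grad)  # identical result on every rank
+```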
+
+### Why MPI remains irreplaceable
+
+In the **data plane** of AI training—e.g. gradient synchronization via AllReduce—MPI (and its GPU-specialized descendant NCCL) still dominates.
+
+The reason is the optimization space enabled by **predefined patterns**:
+- Topology is known (Ring, Tree, Recursive Halving-Doubling), enabling precise routing for hardware (NVLink, InfiniBand, PCIe).
+- Buffer sizes are known, enabling DMA zero-copy and pipelined overlap (compute/comm overlap).
+- Participant sets are fixed, enabling scheduling and tuning around a stable world.
+
+In short, MPI’s power comes from **maximally exploiting determinism**.
+
+### Limitation: when determinism crumbles
+
+As AI systems evolve beyond pure data parallelism, MPI’s “determinism assumptions” often stop holding:
+
+| Scenario | MPI assumption | Reality |
+|------|------------|------|
+| Pipeline parallelism | stages are homogeneous | stage compute differs; irregular point-to-point is needed |
+| Inference serving | uniform load | request arrival is stochastic; dynamic load balancing is needed |
+| Agent collaboration | fixed topology | collaboration relationships change at runtime |
+| Elastic training | fixed node count | nodes may fail; scaling is required |
+
+More critically, MPI’s **fault tolerance is close to zero**—a single rank failure often forces a full communicator rebuild. On thousand-GPU clusters, failures are a question not of “if” but of “how often”.
+
+> MPI answers “how to communicate efficiently in a deterministic world”, but struggles with “what to do when the world is uncertain”.
+
+---
+
+## Generation 2: ZMQ — Freedom, with Sharp Edges
+
+### From collectives to point-to-point
+
+MPI’s core limitation is that it assumes everyone participates in each step. When you only need “A sends a message to B”, collectives are too heavy.
+
+ZeroMQ (ØMQ) emerged as “batteries-included sockets”, offering flexible async messaging primitives for arbitrary topologies:
+
+```text
+MPI world (regular): ZMQ world (free-form):
+
+ 0 ←──AllReduce──▶ 1 0 ──push──▶ 1
+ ▲ ▲ │ │
+ │ AllReduce │ ▼ ▼
+ ▼ ▼ 2 ◀──req──── 3
+ 2 ←──AllReduce──▶ 3 pub
+ │
+ Everyone must participate 4 ◀──sub── 5
+ any connection, any time
+```
+
+REQ/REP, PUB/SUB, PUSH/PULL, DEALER/ROUTER, etc. make it easy to wire diverse patterns, with very high performance via async I/O and multiplexing.
+
+### Problem: mechanism without policy
+
+ZMQ intentionally stays at the **transport layer**. It provides mechanisms, but not application-level policy. Coordination becomes the developer’s job.
+
+In practice, this often shows up as:
+
+**Stalls / “waiting forever”**—easy to trigger when application-level logic falls out of step (often not a kernel-level deadlock, but a wait for an event that will never happen):
+
+```text
+❌ Scenario 1: REQ/REP finite-state rules get violated
+
+ # A REQ socket must follow: send -> recv -> send -> recv ...
+ A(REQ): send(req1) → send(req2) # second send may block or raise EFSM (binding/config dependent)
+ A(REQ): recv(resp1) # response ordering assumptions are also easy to get wrong
+
+❌ Scenario 2: DIY correlation (DEALER/ROUTER) misses a branch
+
+ A: send(req, id=42) → await(resp, id=42)
+ B: receives req(id=42) but crashes/reconnects/forgets to reply in some branch
+ A: waits forever for resp(id=42) (no built-in paradigm forcing “must reply” or “must timeout”)
+```
+
+**App-perceived “message loss”**—common with PUB/SUB and during reconnect / subscription propagation:
+
+```text
+❌ Scenario: subscription not established yet (or reconnect window)
+
+ Pub: send(msg) # publisher emits
+ Sub: [not connected / subscription filter not delivered yet] # subscriber not “ready”
+ Sub: recv() # msg is not replayed (PUB/SUB does not retain history for late joiners)
+
+Notes:
+  - In non-PUB/SUB patterns, ZMQ buffers messages in in-memory queues—but those queues live inside the process.
+    A process crash, an HWM limit, or a non-blocking send (EAGAIN) can still look like “it never arrived”.
+```
+
+**Backpressure “gotchas”**—when consumption can’t keep up with production:
+
+- ZMQ provides **High Water Mark (HWM)** and related knobs, but their behavior depends heavily on socket type and send mode:
+  it may block, it may return EAGAIN (non-blocking), and PUB/SUB may simply drop in some situations.
+- Structurally, it lacks an end-to-end backpressure paradigm aligned with application semantics: there is no way to express
+  *must stay ordered, must not drop tokens*, nor to get consistent cancellation/timeout/retry semantics.
+- In AI inference (e.g. LLM token streaming), sustained producer > consumer often becomes a trade-off: buffer memory, drop data, or block and risk cascaded stalls.
+
+Developers end up implementing heartbeats, ACKs, retry queues, and connection state machines—often the most fragile part of the system.
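+
+To make that cost concrete, here is roughly what the DIY reliability layer looks like for just one of these concerns—a client-side timeout with retry over REQ/REP, close to what the ZMQ guide calls the “Lazy Pirate” pattern. A sketch using pyzmq; the endpoint and policy constants are placeholders:
+
+```python
+import zmq
+
+ENDPOINT, TIMEOUT_MS, RETRIES = "tcp://server:5555", 2500, 3
+
+ctx = zmq.Context()
+for attempt in range(RETRIES):
+    sock = ctx.socket(zmq.REQ)
+    sock.connect(ENDPOINT)
+    sock.send(b"request")
+
+    # ZMQ itself would happily wait forever; the timeout is our policy.
+    if sock.poll(TIMEOUT_MS, zmq.POLLIN):
+        print(sock.recv())
+        break
+
+    # No reply: the REQ socket is now stuck in its send/recv state
+    # machine, so discard it and start over -- one of many DIY state machines.
+    sock.setsockopt(zmq.LINGER, 0)
+    sock.close()
+else:
+    raise TimeoutError("server did not reply")
+```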
+
+> ZMQ answers “how to let any node talk to any node”, but pushes “how to be reliable” onto every developer.
+
+---
+
+## Generation 3: RPC — The Cost of Pretending to be Local
+
+### Call semantics: rescuing structure from ZMQ chaos
+
+Facing the “I sent a request, will I get a response?” problem, RPC wraps remote communication into **call semantics**.
+
+```python
+# ZMQ: two steps; manual pairing; forget recv and you stall
+socket.send(request)
+response = socket.recv()
+
+# RPC: request/response are bound by language semantics
+response = service.compute(request)
+```
+
+This is more than syntactic sugar. Call semantics impose three constraints:
+
+1. **Request-response pairing**: you can’t “send and forget to receive”.
+2. **Deadline/timeout mechanisms**: most frameworks support deadlines, but you typically must **set them explicitly** (many defaults can wait “forever”).
+3. **IDL contracts**: Protobuf/IDL fixes the protocol at build time.
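+
+For illustration, here is how those constraints surface in gRPC’s Python API—note the explicit deadline. A sketch only: `ComputeStub`/`ComputeRequest` stand in for protobuf-generated code, and the service address is hypothetical:
+
+```python
+import grpc
+
+from compute_pb2 import ComputeRequest       # hypothetical generated message
+from compute_pb2_grpc import ComputeStub     # hypothetical generated stub
+
+channel = grpc.insecure_channel("compute-svc:50051")
+stub = ComputeStub(channel)
+
+try:
+    # Without timeout=, many stacks will wait indefinitely.
+    response = stub.Compute(ComputeRequest(x=42), timeout=2.0)
+except grpc.RpcError as err:
+    if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
+        ...  # the network existed after all: retry, degrade, or fail fast
+    raise
+```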
+
+### The fallacies of distributed computing
+
+RPC’s promise—**remote calls feel local**—is a leaky abstraction. Peter Deutsch’s “Eight Fallacies of Distributed Computing” remain true:
+
+- The network is reliable
+- Latency is zero — in practice often \(10^3\) to \(10^6\)× gaps depending on serialization, kernel, network, and queuing
+- Bandwidth is infinite
+- The network is secure
+- Topology doesn’t change
+- There is one administrator
+- Transport cost is zero
+- The network is homogeneous
+
+RPC tends to encourage code that “pretends the network doesn’t exist”; when the network does fail, error handling becomes an after-the-fact patch.
+
+### The “microservice tax”
+
+RPC is fundamentally point-to-point and lacks a unified **cluster view**. In practice, as systems scale, teams often stack supporting infrastructure (not every system needs all of it, but it’s common at scale):
+
+```text
+What a “simple” RPC often looks like in reality:
+
+[Service A] ──IPC── [Sidecar/Envoy] ──network── [Sidecar/Envoy] ──IPC── [Service B]
+ ▲ ▲
+ │ Control Plane │
+ └──── [Etcd] [Consul] ───────┘
+ ▲
+ [Prometheus] [Jaeger] [Sentinel]
+```
+
+- **Service discovery**: Etcd / Consul / ZooKeeper — where is the service?
+- **Traffic governance**: Envoy / Nginx — LB, circuit breaking, rate limiting
+- **Sidecars**: separating governance from business code
+
+Each component is a system you deploy and operate—this is the “microservice tax”.
+
+### The stateless fallacy
+
+For AI systems, RPC often conflicts with **stateless-first architecture**.
+
+Microservices frequently push stateless services (state in Redis/DB) for horizontal scaling and failover. This is great for CRUD, but often very costly for AI:
+
+- **KV cache**: LLM inference context must reside in GPU memory. Pulling from Redis, deserializing, and loading into GPU per request is too slow.
+- **Agent memory**: the memory and reasoning state an agent accumulates across turns naturally needs to stay resident.
+- **Model weights**: loading large models takes tens of seconds; per-request reload is impossible.
+
+> RPC answers “how to make remote calls reliable”, but its ecosystem tends to rely on bolt-on clustering, and stateless-first designs clash with AI reality.
+
+---
+
+## Generation 4: Actor — Embracing Uncertainty
+
+### From masking to internalizing
+
+The first three generations respond to network uncertainty by eliminating it (MPI), exposing it (ZMQ), or masking it (RPC). The Actor model takes a fourth path: **make uncertainty part of the programming model**.
+
+Originating with Carl Hewitt (1973) and popularized by Erlang (1986) and Akka (2009), its core principles are:
+
+**1. Message passing, not function calls**
+
+Actors don’t “call” each other; they **deliver messages**. This small shift changes everything:
+
+- Sending is async and non-blocking—senders don’t stall because receivers are slow.
+- Each Actor has a **mailbox** as a natural buffer, decoupling producers and consumers.
+- Delivery is “best-effort”, forcing failure to be part of the design, not an afterthought.
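+
+These mechanics are easy to see in a toy mailbox actor—a sketch on plain asyncio to illustrate the model (this is not Pulsing’s API):
+
+```python
+import asyncio
+
+class CounterActor:
+    def __init__(self):
+        self.mailbox: asyncio.Queue = asyncio.Queue()  # the buffer between sender and receiver
+        self.count = 0  # private state: only reachable via messages
+
+    async def run(self):
+        while True:
+            msg = await self.mailbox.get()  # process one message at a time
+            if msg == "stop":
+                return
+            self.count += msg
+
+async def main():
+    actor = CounterActor()
+    task = asyncio.create_task(actor.run())
+    actor.mailbox.put_nowait(5)       # "tell": fire-and-forget, never blocks the sender
+    actor.mailbox.put_nowait("stop")
+    await task
+    print(actor.count)  # 5
+
+asyncio.run(main())
+```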
+
+**2. Private state, not shared memory**
+
+Actors encapsulate private state, accessible only via messages:
+
+- No locks, no races—concurrency safety is architecturally guaranteed.
+- State is memory-resident—naturally fitting KV caches and agent memory.
+
+**3. “Let it crash”**
+
+Instead of preventing all failures, accept failures as normal and build **supervision**: parent actors monitor children and apply policies (restart/stop/escalate) when failures happen.
+
+```text
+ [Supervisor]
+ / \
+ [Worker A] [Worker B]
+ ↓ ↓
+ crash! running
+ ↓
+ auto-restart ✓
+```
+
+This “isolate + restart” pattern gives the system self-healing properties: a single actor crash doesn’t take down the whole system.
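+
+!!! note
+    The supervision/restart described here is the classic Actor-model approach (as in Erlang/Akka). What Pulsing currently provides is an **actor-level restart policy** (e.g. Python `@pul.remote(restart_policy=...)`), which is not the same as a full supervision tree—the latter typically involves parent/child hierarchies, policy propagation, and structured failure handling.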
+
+### Why actors synthesize the strengths of the previous three
+
+| Previous pain point | Actor answer |
+|----------|-------------|
+| MPI: hard to express irregular comms | any two actors can communicate at any time |
+| ZMQ: no coordination paradigm | Ask/Tell/Stream provide semantic constraints |
+| RPC: lacks cluster capability | built-in discovery + location transparency |
+| RPC: stateless limitation | actors are naturally stateful, memory-resident |
+| All: fragile fault tolerance | supervision + automatic restart |
+
+---
+
+## Pulsing: Modern Actors Powered by Rust
+
+The Actor model is conceptually strong, but earlier implementations had trade-offs—Erlang’s performance ceiling, Akka’s JVM GC pauses, Ray’s Python+C++ complexity.
+
+Pulsing modernizes the Actor model using **Rust + Tokio + HTTP/2**, with concrete engineering decisions across five dimensions.
+
+### 1. Zero-dependency clustering: Gossip + SWIM
+
+RPC stacks often need bolt-on Etcd/Consul for discovery. Pulsing builds clustering into the Actor System.
+
+Each node only needs a seed address. Membership is exchanged via Gossip; failures are detected via SWIM:
+
+```text
+Startup (Kubernetes example):
+
+ New Pod Service IP Existing Pods
+ │ │ │
+ ├── Probe 1 (Join) ────▶ ├── routed to Pod A ─▶ │
+ │◀── Welcome [A] ───────┤ │
+ │ │ │
+ ├── Probe 2 (Join) ────▶ ├── routed to Pod B ─▶ │
+ │◀── Welcome [A,B] ─────┤ │
+ │ │ │
+ └── normal Gossip begins ────────────────────────┘
+ then every ~200ms sync with random peers
+```
+
+On failure, SWIM’s Ping → Suspect → Dead state machine cleans up membership and actor registrations:
+
+- Ping timeout → Suspect
+- Suspect timeout → Dead (removed from members)
+- actor registrations on the dead node are cleaned up
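+
+The heart of that failure detector fits in a few lines. An illustrative sketch of the per-member state machine (names and timeouts are hypothetical, not Pulsing’s actual implementation):
+
+```python
+import time
+from enum import Enum
+
+class State(Enum):
+    ALIVE = 1
+    SUSPECT = 2
+    DEAD = 3
+
+class Member:
+    PING_TIMEOUT = 1.0     # seconds without an ack before suspicion (tunable)
+    SUSPECT_TIMEOUT = 5.0  # grace period before declaring death (tunable)
+
+    def __init__(self):
+        self.state = State.ALIVE
+        self.last_ack = time.monotonic()
+
+    def on_ack(self):  # any ping reply revives the member
+        self.state, self.last_ack = State.ALIVE, time.monotonic()
+
+    def tick(self):  # called periodically by the failure detector
+        silent = time.monotonic() - self.last_ack
+        if self.state is State.ALIVE and silent > self.PING_TIMEOUT:
+            self.state = State.SUSPECT  # missed a ping: suspect, don't kill yet
+        elif self.state is State.SUSPECT and silent > self.SUSPECT_TIMEOUT:
+            self.state = State.DEAD     # remove member + clean actor registrations
+```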
+
+**Key design**: all communication (actor messages + gossip + health checks) shares a single HTTP port—simplifying networking and firewall rules.
+
+### 2. Location-transparent addressing: a unified Actor URI scheme
+
+In RPC, clients must know service endpoints. Pulsing provides a URI-style address scheme so physical location is transparent:
+
+```text
+actor:///services/llm/router → named actor (cluster-routed)
+actor:///services/llm/router@node_a → specific node instance
+actor://node_a/worker_123 → globally precise address
+actor://localhost/worker_123 → local shorthand
+```
+
+The same named actor can run multiple instances across nodes; the system load-balances automatically:
+
+```python
+# deployed on Node A
+await system.spawn(LLMRouter(), name="services/llm/router")
+
+# deployed on Node B (same name)
+await system.spawn(LLMRouter(), name="services/llm/router")
+
+# resolved from any node; instance chosen automatically
+router = await system.resolve("services/llm/router")
+result = await router.ask(request) # routed to A or B
+```
+
+This removes the need for external LBs while keeping the API minimal.
+
+### 3. Stateful orchestration: Actors as owners of state
+
+This is the most fundamental advantage of Actors over stateless RPC. In Pulsing, actors are **owners** of state:
+
+```python
+@pul.remote
+class InferenceWorker:
+ def __init__(self, model_path: str):
+ self.model = load_model(model_path) # weights stay in memory
+ self.kv_cache = {} # KV cache stays in memory
+
+ async def generate(self, prompt: str):
+ # reuse in-memory model and cache; no per-request DB load
+ for token in self.model.generate(prompt, cache=self.kv_cache):
+ yield {"token": token}
+```
+
+Implications:
+- **KV cache residency**: no per-request serialize/deserialize round-trips.
+- **Weight residency**: load once, serve for the lifetime of the process.
+- **Agent memory residency**: multi-turn context lives inside actor state.
+
+### 4. Streaming & backpressure: HTTP/2 end-to-end flow control
+
+LLM token streaming is a canonical AI communication pattern. Traditional RPC streaming can be awkward; ZMQ can be tricky under backpressure.
+
+Pulsing uses **h2c (HTTP/2 over cleartext)** to leverage HTTP/2 multiplexing and flow control:
+
+**Connection multiplexing**: nodes keep one long-lived TCP connection; Ask/Tell/Stream map to independent HTTP/2 streams. No per-actor connection explosion.
+
+**End-to-end backpressure**: HTTP/2 Flow Control Windows automatically throttle producers when consumers slow down:
+
+```text
+When producer rate > consumer rate:
+
+ [LLM Actor] [Network] [Client]
+ │ │ │
+ │── Token ────────▶ │── Token ────────▶ │
+ │── Token ────────▶ │── Token ────────▶ │ ← client slows
+ │── Token ────────▶ │ H2 window full │
+  │   send() Pending │◀─ no WINDOW_UPDATE ─│  ← window exhausted
+ │ ← generation pauses automatically │
+ │ │ │ ← client catches up
+ │ send() resumes │◀─ Window Update ─│ ← window freed
+ │── Token ────────▶ │── Token ────────▶ │
+```
+
+No user-written flow-control code is required: HTTP/2 meters the sendable byte budget via WINDOW_UPDATE frames, so when the budget is exhausted, sender-side writes naturally suspend. Rust `Future` Pending semantics and HTTP/2 flow control compose, propagating backpressure from the network to the application.
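+
+The accounting behind this is simple enough to model in a few lines—a toy model of HTTP/2 flow-control credit, not real transport code:
+
+```python
+class FlowControlledStream:
+    """Sender-side view of one HTTP/2 stream's flow-control window."""
+
+    def __init__(self, initial_window: int = 65_535):
+        self.window = initial_window  # bytes we are allowed to send
+
+    def try_send(self, n_bytes: int) -> bool:
+        if n_bytes > self.window:
+            return False              # window exhausted -> send() stays Pending
+        self.window -= n_bytes
+        return True
+
+    def on_window_update(self, increment: int):
+        self.window += increment      # consumer caught up -> producer may resume
+
+stream = FlowControlledStream(10)
+assert stream.try_send(8)             # ok: 2 bytes of credit left
+assert not stream.try_send(4)         # blocked: backpressure reaches the producer
+stream.on_window_update(10)           # WINDOW_UPDATE arrives from the consumer
+assert stream.try_send(4)             # sending resumes
+```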
+
+### 5. Type safety: compile-time communication contracts
+
+MPI often manipulates `void*`; ZMQ sends blobs; type errors surface at runtime. Pulsing pushes contracts to compile time via Rust’s type system.
+
+With the Behavior API (inspired by Akka Typed), message types are checked at compile time:
+
+```rust
+// Messages this actor accepts (the compile-time contract)
+enum CounterMsg {
+    Increment(i32),
+    Reset,
+}
+
+// A type-safe counter actor
+fn counter(initial: i32) -> Behavior<CounterMsg> {
+    stateful(initial, |count, msg, _ctx| match msg {
+        CounterMsg::Increment(n) => {
+            *count += n;
+            BehaviorAction::Same // keep current behavior
+        }
+        CounterMsg::Reset => {
+            *count = 0;
+            BehaviorAction::Same
+        }
+    })
+}
+
+// TypedRef<CounterMsg> ensures only CounterMsg can be sent
+let counter: TypedRef<CounterMsg> = system.spawn(counter(0));
+counter.tell(CounterMsg::Increment(5)).await?; // ✅ compiles
+// counter.tell("hello").await?;               // ❌ compile error
+```
+
+More importantly, `BehaviorAction::Become` allows safe behavior switches—natural for building agent state machines like `Idle → Thinking → Answering → Idle`.
+
+---
+
+## Summary: Positioning & Collaboration
+
+The evolution of these four generations is about answering progressively deeper questions:
+
+| Generation | Core question | Answer |
+|------|----------|------|
+| MPI | How do many processes synchronize data efficiently? | collectives + BSP |
+| ZMQ | How can any two processes talk at any time? | async messaging primitives |
+| RPC | How can remote communication feel reliable like local calls? | call semantics + IDL |
+| Actor | How can a group of processes behave like one system? | message passing + supervision + cluster awareness |
+
+Each generation doesn’t negate the previous; it complements it in new dimensions.
+
+If **MPI/NCCL** is the **superhighway** of AI infrastructure—the data plane for massive tensor transport—then **Pulsing** is the **intelligent traffic control system**—the control plane for complex orchestration:
+
+- not rigid like MPI; adapts to dynamic agent topologies
+- not primitive like ZMQ; provides governance and clear paradigms
+- not heavy like RPC stacks; remains lightweight and low-latency
+- naturally stateful; fits KV cache and agent memory realities
+
+In the spiral of technology, the Actor model—powered by Rust—returns as a strong primitive for the next generation of distributed intelligent systems.
+
+---
+
+## Further Reading
+
+- [Actor System Design](actor-system.md) — Pulsing core architecture
+- [HTTP/2 Transport](http2-transport.md) — streaming and backpressure details
+- [Node Discovery](node-discovery.md) — Gossip + SWIM clustering
+- [Actor Addressing](actor-addressing.md) — URI address system
+- [Behavior API](behavior.md) — type-safe functional actors
diff --git a/docs/src/design/cluster-communication-evolution.zh.md b/docs/src/design/cluster-communication-evolution.zh.md
new file mode 100644
index 000000000..fc5f9426f
--- /dev/null
+++ b/docs/src/design/cluster-communication-evolution.zh.md
@@ -0,0 +1,450 @@
+# Evolution of In-Cluster Communication: From MPI to Modern Actors
+
+When building distributed systems (especially modern AI infrastructure), the first decision an architect faces is often: **how should we talk to each other?**
+
+The choice of communication paradigm determines a system's DNA. This article takes an evolutionary view of four generations of distributed communication technology—MPI, ZMQ, RPC, and Actors—digging into the fundamental problem each generation tried to solve, and the price it paid to do so.
+
+Finally, we show why, at the intersection of the Rust ecosystem and AI orchestration, the **Actor model** is going through a modern renaissance—and why it has become the core foundation of the Pulsing framework.
+
+---
+
+## Overview: The Law of Conservation of Complexity
+
+The core challenge of distributed systems is **uncertainty** (network latency, node failures, message reordering). The evolution of communication technology is, at heart, a transfer of power over the question "who handles this uncertainty".
+
+```mermaid
+graph LR
+    A["MPI: Eliminate uncertainty
+(Assume a deterministic world)"] --> B["ZMQ: Expose uncertainty
+(Tools are given; you handle it)"]
+    B --> C["RPC: Mask uncertainty
+(Pretend the network doesn't exist)"]
+    C --> D["Actor: Internalize uncertainty
+(Make it part of the programming model)"]
+```
+
+| Dimension | MPI | ZMQ | RPC | Actor |
+|------|-----|-----|-----|-------|
+| **Core metaphor** | Army (marching in sync) | Walkie-talkie (open channels) | Phone call (1-to-1) | Mail system (async delivery) |
+| **Control plane** | Static (fixed at launch) | Manual (built by developers) | Bolt-on (Service Mesh) | Built-in (gossip/membership/addressing) |
+| **State management** | Tightly coupled (SPMD) | None | Externalized (Redis/DB) | Memory-resident (stateful) |
+| **Transport base** | TCP/RDMA (long-lived) | TCP (socket abstraction) | TCP/HTTP/2/QUIC (implementation-dependent; e.g. gRPC = HTTP/2) | **HTTP/2 (multiplexing)** |
+| **Fault philosophy** | Crash-stop | Dev-dependent | Retry + circuit-break | Let it crash + policy-driven recovery |
+| **Best fit** | Data plane (tensor sync) | Transport plumbing | Business plane (CRUD services) | Control plane (complex orchestration) |
+
+---
+
+## Generation 1: MPI — Extreme Performance in a Static World
+
+### The monarch of rigid synchronization
+
+MPI (Message Passing Interface) was born in the high-performance computing (HPC) world. Its worldview is **static and perfect**: all nodes are ready at startup, the network is reliable, and compute tasks are uniform.
+
+Under the BSP (Bulk Synchronous Parallel) model, MPI abstracts communication as **collectives**—all participants execute the same communication operation at the same moment:
+
+```text
+          Compute phase      Barrier    Communication phase
+Rank 0: [██████████] ───── ▐ ──────────▶ AllReduce
+Rank 1: [████████] ───── ▐ ──────────▶ AllReduce
+Rank 2: [████████████] ──── ▐ ──────────▶ AllReduce
+Rank 3: [████████] ───── ▐ ──────────▶ AllReduce
+ ▲
+              Straggler effect: the slowest rank dictates the pace
+```
+
+This "everyone must arrive" rigid synchronization is the source of MPI's extreme performance—and the root of its limitations.
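+
+As a concrete reference point, a data-parallel gradient sync in this model takes only a few lines—every rank computes locally, then blocks in the same collective. A minimal sketch using mpi4py and NumPy (the array shape and the averaging step are illustrative):
+
+```python
+# Run with e.g.: mpirun -n 4 python allreduce_demo.py
+import numpy as np
+from mpi4py import MPI
+
+comm = MPI.COMM_WORLD
+rank = comm.Get_rank()
+
+# Each rank computes its local "gradient"...
+local_grad = np.full(4, float(rank), dtype=np.float64)
+
+# ...then every rank blocks at the collective until all have arrived.
+global_grad = np.empty_like(local_grad)
+comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
+global_grad /= comm.Get_size()  # average across ranks
+
+print(rank, global_grad)  # identical result on every rank
+```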
+
+### Why has MPI still not been replaced?
+
+In the **data plane** of AI training—e.g. gradient synchronization (AllReduce) in data parallelism—MPI (and its GPU-specialized descendant NCCL) remains the absolute ruler.
+
+The reason is the optimization space created by **predefined communication patterns**:
+- The communication topology is known (Ring, Tree, Recursive Halving-Doubling), so routes can be planned precisely against the hardware topology (NVLink, InfiniBand, PCIe).
+- Buffer sizes are known, enabling DMA zero-copy and pipelined overlap of compute and communication.
+- The participant set is fixed, enabling communication scheduling optimized ahead of time.
+
+In other words, MPI's power comes from **maximally exploiting a deterministic world**.
+
+### Limitation: when the determinism assumptions crumble
+
+Yet as AI systems evolve from pure data parallelism into more complex forms, MPI's "determinism assumptions" begin to collapse:
+
+| Scenario | MPI's assumption | Reality |
+|------|------------|------|
+| Pipeline parallelism | all stages are homogeneous | per-stage compute differs widely; irregular point-to-point communication is needed |
+| Inference serving | uniform load | request arrival is stochastic; dynamic load balancing is needed |
+| Agent collaboration | fixed communication topology | collaboration relationships change dynamically at runtime |
+| Elastic training | fixed node count | nodes may fail and scaling up/down may be required |
+
+More fatally, MPI's **fault tolerance is close to zero**—if one rank dies, the whole communicator must be rebuilt. On thousand-GPU clusters, node failure is a question not of "if" but of "how often".
+
+> MPI solved "how to communicate efficiently in a deterministic world", but it cannot answer "what to do when the world becomes uncertain".
+
+---
+
+## Generation 2: ZMQ — Free but Dangerous Building Blocks
+
+### From collectives to point-to-point
+
+MPI's core limitation: it assumes every node participates in every communication step. When all we need is "node A sends one message to node B", MPI's collectives feel heavyweight.
+
+ZeroMQ (ØMQ) emerged to fill this gap. Billing itself as "batteries-included sockets", it provides a set of flexible asynchronous messaging primitives that let any two nodes communicate at any time:
+
+```text
+MPI world (regular):              ZMQ world (free-form):
+
+ 0 ←──AllReduce──▶ 1 0 ──push──▶ 1
+ ▲ ▲ │ │
+ │ AllReduce │ ▼ ▼
+ ▼ ▼ 2 ◀──req──── 3
+ 2 ←──AllReduce──▶ 3 pub
+ │
+  Everyone must participate          4 ◀──sub── 5
+                                     any connection, any time
+```
+
+Through socket patterns such as REQ/REP, PUB/SUB, PUSH/PULL, and DEALER/ROUTER, ZMQ lets developers flexibly build arbitrary communication topologies. It delivers **true asynchronous I/O and multiplexing**, with very high performance.
+
+### The problem: mechanism without policy
+
+ZMQ's greatness lies in deconstructing communication into primitives—but it deliberately stays at the **transport layer** and refuses to provide application-level communication policy. All of the **responsibility for coordination** lands on the developer.
+
+In real projects, this produces a few classes of typical pain points:
+
+**Stalls / "waiting forever"**—easy to trigger when the two sides' logic falls out of step (not necessarily a "deadlock" in the underlying socket; more often the application protocol is waiting for an event that will never happen):
+
+```text
+❌ Scenario 1: the REQ/REP state-machine contract is violated
+
+  # A REQ socket must strictly follow: send -> recv -> send -> recv ...
+  A(REQ): send(req1) → send(req2)   # second send may block or raise EFSM (binding/config dependent)
+  A(REQ): recv(resp1)               # the logically expected resp1/resp2 ordering is also easy to get wrong
+
+❌ Scenario 2: DIY correlation over DEALER/ROUTER misses a branch
+
+  A: send(req, id=42) → await(resp, id=42)
+  B: receives req(id=42), then crashes mid-processing / reconnects / a code branch forgets to reply
+  A: waits forever for resp(id=42) (no unified paradigm enforcing "must reply" or "must time out")
+```
+
+**Messages that "seem lost"**—common with PUB/SUB and during reconnect / subscription propagation:
+
+```text
+❌ Scenario: the subscription is not established yet (or a reconnect window)
+
+  Pub:  send(msg)                                  # publisher emits
+  Sub:  [not connected / subscription filter not sent yet]   # subscriber not "ready"
+  Sub:  recv()                                     # this msg is never replayed (PUB/SUB keeps no history for late joiners)
+
+Notes:
+  - In non-PUB/SUB patterns, ZMQ buffers messages in in-memory queues on the sender/intermediary side—but those queues live inside the process.
+    A process crash, a queue hitting HWM, or a non-blocking send can all look like "the message never arrived / is gone".
+```
+
+**Backpressure out of control**—when consumption can't keep up with production:
+
+- ZMQ provides mechanisms such as the **High Water Mark (HWM)**, but their behavior depends heavily on socket type and send mode: it may block, it may return EAGAIN (non-blocking), and PUB/SUB may simply drop in some cases.
+- More critically, it lacks an end-to-end backpressure paradigm aligned with application semantics (e.g. how to express *must stay ordered, must not drop tokens*, with consistent *cancel/timeout/retry* semantics).
+- In AI inference (e.g. LLM token streaming), once production consistently outpaces consumption, you slide into the engineering trade-off of "buffer memory, drop data, or block and stall".
+
+Developers are forced to hand-roll heartbeat protocols, ACK acknowledgments, retry queues, connection state machines... and this ad-hoc code is often the most fragile part of the system.
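+
+To make that cost concrete, here is roughly what the DIY reliability layer looks like for just one of these concerns—a client-side timeout with retry over REQ/REP, close to what the ZMQ guide calls the "Lazy Pirate" pattern. A sketch using pyzmq; the endpoint and policy constants are placeholders:
+
+```python
+import zmq
+
+ENDPOINT, TIMEOUT_MS, RETRIES = "tcp://server:5555", 2500, 3
+
+ctx = zmq.Context()
+for attempt in range(RETRIES):
+    sock = ctx.socket(zmq.REQ)
+    sock.connect(ENDPOINT)
+    sock.send(b"request")
+
+    # ZMQ itself would happily wait forever; the timeout is our policy.
+    if sock.poll(TIMEOUT_MS, zmq.POLLIN):
+        print(sock.recv())
+        break
+
+    # No reply: the REQ socket is now stuck in its send/recv state
+    # machine, so discard it and start over -- one of many DIY state machines.
+    sock.setsockopt(zmq.LINGER, 0)
+    sock.close()
+else:
+    raise TimeoutError("server did not reply")
+```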
+
+> ZMQ solved "how to let any node communicate at any time", but it left the harder question—"how to make that communication reliable"—to every individual developer.
+
+---
+
+## Generation 3: RPC — The Cost of Pretending to be Local
+
+### Call semantics: rescuing structure from ZMQ chaos
+
+Facing ZMQ's "I sent a request—will I ever see a response?" dilemma, RPC (Remote Procedure Call) offers an elegant remedy: **wrap remote communication in function-call semantics**.
+
+```python
+# ZMQ: two steps; manual pairing; forget recv and you stall
+socket.send(request)
+response = socket.recv()
+
+# RPC: one line; request and response are bound at the syntax level
+response = service.compute(request)
+```
+
+This is more than syntactic sugar. RPC's call semantics bring three key constraints:
+
+1. **Automatic request-response pairing**: "sent the request, forgot to read the response" cannot happen.
+2. **Timeout/deadline mechanisms**: most RPC frameworks provide deadlines/timeouts, but they usually must be **set explicitly** (many implementations default to waiting "forever").
+3. **Interface contracts (IDL)**: with an interface definition language such as Protobuf, the communication protocol is fixed at build time.
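+
+For illustration, here is how those constraints surface in gRPC's Python API—note the explicit deadline. A sketch only: `ComputeStub`/`ComputeRequest` stand in for protobuf-generated code, and the service address is hypothetical:
+
+```python
+import grpc
+
+from compute_pb2 import ComputeRequest       # hypothetical generated message
+from compute_pb2_grpc import ComputeStub     # hypothetical generated stub
+
+channel = grpc.insecure_channel("compute-svc:50051")
+stub = ComputeStub(channel)
+
+try:
+    # Without timeout=, many stacks will wait indefinitely.
+    response = stub.Compute(ComputeRequest(x=42), timeout=2.0)
+except grpc.RpcError as err:
+    if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
+        ...  # the network existed after all: retry, degrade, or fail fast
+    raise
+```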
+
+### The fallacies of distributed computing
+
+Yet RPC's beautiful promise—**make remote calls look like local calls**—is itself a dangerous leaky abstraction. Peter Deutsch's "Fallacies of Distributed Computing" still hold today:
+
+- The network is reliable
+- Latency is zero — in practice the gap is commonly on the order of \(10^3\) to even \(10^6\)× (depending on serialization, kernel, network, and queuing)
+- Bandwidth is infinite
+- The network is secure
+- Topology doesn't change
+- There is one administrator
+- Transport cost is zero
+- The network is homogeneous
+
+RPC tries to paper over these realities, luring developers into writing code that "pretends the network doesn't exist". When the network really does fail, error handling becomes an after-the-fact patch.
+
+### The "microservice tax"
+
+RPC's deeper problem is that it is, at bottom, only a **point-to-point call protocol** with no unified view of the "cluster". To make RPC usable in large systems, engineering teams routinely stack supporting infrastructure on top of it (not every system needs all of it, but it becomes common once scale grows):
+
+```text
+The path a "simple" RPC call actually travels:
+
+[Service A] ──IPC── [Sidecar/Envoy] ──network── [Sidecar/Envoy] ──IPC── [Service B]
+                      ▲                                    ▲
+                      │           Control Plane            │
+                      └──── [Etcd] [Consul] ───────┘
+                                   ▲
+                      [Prometheus] [Jaeger] [Sentinel]
+```
+
+- **Service discovery**: Etcd / Consul / ZooKeeper — tell clients "where the service is"
+- **Traffic governance**: Envoy / Nginx — load balancing, circuit breaking, rate limiting
+- **Sidecar pattern**: peel governance logic out of business code
+
+Each of these is an independent system that must be deployed and operated. This is the "microservice tax"—you think you're just writing business logic, but you're actually operating an infrastructure empire.
+
+### The stateless fallacy
+
+For AI systems, RPC carries an even more fundamental conflict: **stateless design**.
+
+Traditional microservice architecture celebrates statelessness—every request is independent and state lives in Redis/DB—for the sake of horizontal scaling and failover. That is reasonable for CRUD workloads, but often prohibitively expensive for AI:
+
+- **KV cache**: the context cache of LLM inference must reside in GPU memory. Pulling from Redis, deserializing, and loading onto the GPU on every request has unacceptable latency.
+- **Agent memory**: the memory and reasoning state an agent accumulates across a multi-turn conversation naturally needs to stay resident.
+- **Model weights**: loading a large model takes tens of seconds; reloading per request is out of the question.
+
+> RPC solved "how to make remote communication reliable", but it lacks a cluster perspective, and its stateless philosophy is naturally at odds with AI workloads.
+
+---
+
+## Generation 4: Actor — Embracing Uncertainty
+
+### From masking to internalizing
+
+The first three generations' attitudes toward network uncertainty were, respectively: eliminate it (MPI), expose it (ZMQ), mask it (RPC). The Actor model chose a fourth path: **internalize uncertainty into the programming model itself**.
+
+The idea was first proposed by Carl Hewitt in 1973 and later carried forward by Erlang (1986) and Akka (2009). Its core principles:
+
+**1. Message passing, not function calls**
+
+Actors don't "call" each other; they "deliver messages". This seemingly small difference changes everything:
+
+- Sending is asynchronous and non-blocking—the sender never stalls because the receiver is slow.
+- Each Actor has its own **mailbox**, a natural buffer that decouples producers from consumers.
+- Delivery semantics are "best-effort" rather than "guaranteed arrival"—forcing developers to consider failure from the very start of the design.
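+
+These mechanics are easy to see in a toy mailbox actor—a sketch on plain asyncio to illustrate the model (this is not Pulsing's API):
+
+```python
+import asyncio
+
+class CounterActor:
+    def __init__(self):
+        self.mailbox: asyncio.Queue = asyncio.Queue()  # the buffer between sender and receiver
+        self.count = 0  # private state: only reachable via messages
+
+    async def run(self):
+        while True:
+            msg = await self.mailbox.get()  # process one message at a time
+            if msg == "stop":
+                return
+            self.count += msg
+
+async def main():
+    actor = CounterActor()
+    task = asyncio.create_task(actor.run())
+    actor.mailbox.put_nowait(5)       # "tell": fire-and-forget, never blocks the sender
+    actor.mailbox.put_nowait("stop")
+    await task
+    print(actor.count)  # 5
+
+asyncio.run(main())
+```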
+
+**2. Private state, not shared memory**
+
+Each Actor encapsulates its own state, which the outside world can reach only through messages. This means:
+- No locks, no race conditions—concurrency safety is guaranteed at the architectural level.
+- State is **memory-resident**—a natural fit for KV caches and agent memory in AI workloads.
+
+**3. The "Let it crash" philosophy**
+
+Rather than trying to prevent every failure (as MPI does), admit that failure is the norm and build a **supervision** mechanism: a parent Actor monitors its children and, when a child crashes, handles it according to a predefined policy (restart/stop/escalate).
+
+```text
+        [Supervisor]
+         /        \
+   [Worker A]  [Worker B]
+       ↓            ↓
+     crash!      running
+       ↓
+   auto-restart ✓
+```
+
+This "isolate + restart" pattern gives the system **self-healing** capability—a single Actor's failure doesn't drag down the whole system.
+
+!!! note
+    The supervision/restart described here is the classic Actor-model approach (as in Erlang/Akka). What Pulsing currently provides is an **actor-level restart policy** (e.g. Python `@pul.remote(restart_policy=...)`), which is not the same as a full supervision tree—the latter typically involves parent/child hierarchies, policy propagation, and structured failure handling.
+
+### Why can Actors synthesize the strengths of the previous three generations?
+
+Looking back at the pain points of the four generations, the Actor model's answer is systematic:
+
+| Previous pain point | Actor's answer |
+|----------|-------------|
+| MPI: no irregular communication | any two actors can communicate at any time |
+| ZMQ: no coordination paradigm | Ask/Tell/Stream provide clear semantic constraints |
+| RPC: no cluster capability | built-in service discovery and location-transparent addressing |
+| RPC: stateless limitation | actors are naturally stateful, with memory-resident state |
+| All: fragile fault tolerance | supervision + automatic restart |
+
+---
+
+## Pulsing: Modern Actors Powered by Rust
+
+The Actor model's ideas are sound, but the early implementations each had shortcomings—Erlang's performance ceiling, Akka's JVM GC pauses, and the complexity of Ray's two-layer Python+C++ architecture.
+
+Pulsing represents a **modern reconstruction** of the Actor model. Built on the **Rust + Tokio + HTTP/2** stack, it makes concrete engineering decisions along five core dimensions.
+
+### 1. Zero-dependency clustering: automatic Gossip + SWIM clusters
+
+RPC systems need bolt-on Etcd/Consul for service discovery. Pulsing **builds clustering into** the Actor System itself.
+
+At startup, each Pulsing node only needs to know one seed address. Nodes exchange membership automatically via the Gossip protocol and detect failures via the SWIM protocol:
+
+```text
+Startup flow (Kubernetes example):
+
+   New Pod                Service IP            Existing Pods
+      │                      │                      │
+      ├── Probe 1 (Join) ────▶ ├── routed to Pod A ─▶ │
+      │◀── Welcome [A] ───────┤                      │
+      │                      │                      │
+      ├── Probe 2 (Join) ────▶ ├── routed to Pod B ─▶ │
+      │◀── Welcome [A,B] ─────┤                      │
+      │                      │                      │
+      └── normal Gossip begins ──────────────────────┘
+          then sync state with random peers every 200ms
+```
+
+On node failure, the SWIM protocol detects and cleans up automatically via the Ping → Suspect → Dead state machine:
+
+- Ping timeout → marked Suspect
+- Suspect timeout → marked Dead and removed from the member list
+- all actor registrations on that node are cleaned up automatically
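+
+The heart of that failure detector fits in a few lines. An illustrative sketch of the per-member state machine (names and timeouts are hypothetical, not Pulsing's actual implementation):
+
+```python
+import time
+from enum import Enum
+
+class State(Enum):
+    ALIVE = 1
+    SUSPECT = 2
+    DEAD = 3
+
+class Member:
+    PING_TIMEOUT = 1.0     # seconds without an ack before suspicion (tunable)
+    SUSPECT_TIMEOUT = 5.0  # grace period before declaring death (tunable)
+
+    def __init__(self):
+        self.state = State.ALIVE
+        self.last_ack = time.monotonic()
+
+    def on_ack(self):  # any ping reply revives the member
+        self.state, self.last_ack = State.ALIVE, time.monotonic()
+
+    def tick(self):  # called periodically by the failure detector
+        silent = time.monotonic() - self.last_ack
+        if self.state is State.ALIVE and silent > self.PING_TIMEOUT:
+            self.state = State.SUSPECT  # missed a ping: suspect, don't kill yet
+        elif self.state is State.SUSPECT and silent > self.SUSPECT_TIMEOUT:
+            self.state = State.DEAD     # remove member + clean actor registrations
+```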
+
+**Key design: all communication (actor messages + gossip protocol + health checks) shares a single HTTP port.** This greatly simplifies network configuration and firewall rules.
+
+### 2. Location-transparent addressing: a unified Actor address scheme
+
+In RPC systems, the client must know the service's address (IP:Port). Pulsing designs a URI-style address scheme that makes an actor's physical location completely transparent to its users:
+
+```text
+actor:///services/llm/router          → named actor (cluster-routed)
+actor:///services/llm/router@node_a   → instance on a specific node
+actor://node_a/worker_123             → globally precise address
+actor://localhost/worker_123          → local shorthand
+```
+
+The same named actor can be deployed as instances on multiple nodes; the system load-balances automatically:
+
+```python
+# deployed on Node A
+await system.spawn(LLMRouter(), name="services/llm/router")
+
+# the same named actor deployed on Node B as well
+await system.spawn(LLMRouter(), name="services/llm/router")
+
+# access from any node—the system picks an instance automatically
+router = await system.resolve("services/llm/router")
+result = await router.ask(request)  # may be routed to A or B
+```
+
+This removes the RPC-style need for an external load balancer while keeping the API minimal.
+
+### 3. Stateful orchestration: the Actor as owner of its state
+
+This is the most fundamental advantage of Actors over stateless RPC. In Pulsing, an actor is the **owner** of its state:
+
+```python
+@pul.remote
+class InferenceWorker:
+ def __init__(self, model_path: str):
+        self.model = load_model(model_path)  # model weights stay resident in memory
+        self.kv_cache = {}                   # KV cache stays resident in memory
+
+    async def generate(self, prompt: str):
+        # use the in-memory model and cache directly; no per-request DB load
+ for token in self.model.generate(prompt, cache=self.kv_cache):
+ yield {"token": token}
+```
+
+This means:
+- **KV cache residency**: inference context needn't be serialized/deserialized between requests.
+- **Weight residency**: load once, serve for the lifetime of the process.
+- **Agent memory residency**: multi-turn conversational context lives directly in actor memory.
+
+### 4. Streaming & backpressure: HTTP/2-driven end-to-end flow control
+
+LLM token streaming is the most characteristic communication pattern in AI workloads. Traditional RPC tends to handle streaming responses awkwardly, while ZMQ easily loses control of backpressure.
+
+Pulsing's transport layer is built on **h2c (HTTP/2 over cleartext)**, leveraging HTTP/2's multiplexing and flow-control machinery directly:
+
+**Connection multiplexing**: nodes maintain one long-lived TCP connection, and all actor messages (Ask / Tell / Stream) travel in parallel as independent HTTP/2 streams. No per-actor-pair connections are needed.
+
+**End-to-end backpressure**: HTTP/2's Flow Control Window mechanism automatically slows the producer down to the consumer's pace:
+
+```text
+When token production outpaces consumption, the automatic slowdown:
+
+  [LLM Actor]        [Network]         [Client]
+      │                  │                 │
+      │── Token ────────▶ │── Token ───────▶ │
+      │── Token ────────▶ │── Token ───────▶ │  ← client slows down
+      │── Token ────────▶ │   H2 window full │
+      │   send() Pending │◀─ no WINDOW_UPDATE ─│  ← window exhausted
+      │   ← generation pauses automatically   │
+      │                  │                 │  ← client catches up
+      │   send() resumes │◀─ Window Update ──│  ← window freed
+ │── Token ────────▶ │── Token ────────▶ │
+```
+
+No user-written flow-control code is required anywhere in this process: HTTP/2 flow control meters the sendable byte budget via WINDOW_UPDATE frames, and when the budget is exhausted, sender-side writes naturally block/suspend. Rust `Future` Pending semantics and HTTP/2 flow control compose cleanly, letting backpressure propagate from the network layer to the application layer.
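+
+The accounting behind this is simple enough to model in a few lines—a toy model of HTTP/2 flow-control credit, not real transport code:
+
+```python
+class FlowControlledStream:
+    """Sender-side view of one HTTP/2 stream's flow-control window."""
+
+    def __init__(self, initial_window: int = 65_535):
+        self.window = initial_window  # bytes we are allowed to send
+
+    def try_send(self, n_bytes: int) -> bool:
+        if n_bytes > self.window:
+            return False              # window exhausted -> send() stays Pending
+        self.window -= n_bytes
+        return True
+
+    def on_window_update(self, increment: int):
+        self.window += increment      # consumer caught up -> producer may resume
+
+stream = FlowControlledStream(10)
+assert stream.try_send(8)             # ok: 2 bytes of credit left
+assert not stream.try_send(4)         # blocked: backpressure reaches the producer
+stream.on_window_update(10)           # WINDOW_UPDATE arrives from the consumer
+assert stream.try_send(4)             # sending resumes
+```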
+
+### 5. Type safety: compile-time communication contracts
+
+MPI manipulates raw `void*` pointers and ZMQ ships opaque binary blobs, so message-type errors surface only at runtime. Pulsing uses Rust's type system to move the communication contract up to compile time.
+
+With the Behavior API (inspired by Akka Typed), an actor's message types are checked at compile time:
+
+```rust
+// Messages this actor accepts (the compile-time contract)
+enum CounterMsg {
+    Increment(i32),
+    Reset,
+}
+
+// A type-safe counter actor
+fn counter(initial: i32) -> Behavior<CounterMsg> {
+    stateful(initial, |count, msg, _ctx| match msg {
+        CounterMsg::Increment(n) => {
+            *count += n;
+            BehaviorAction::Same // keep the current behavior
+        }
+        CounterMsg::Reset => {
+            *count = 0;
+            BehaviorAction::Same
+        }
+    })
+}
+
+// TypedRef<CounterMsg> guarantees only CounterMsg can be sent
+let counter: TypedRef<CounterMsg> = system.spawn(counter(0));
+counter.tell(CounterMsg::Increment(5)).await?; // ✅ compiles
+// counter.tell("hello").await?;               // ❌ compile error
+```
+
+More importantly, `BehaviorAction::Become` lets an actor switch safely between states—a natural fit for implementing AI-agent state machines such as `Idle → Thinking → Answering → Idle`.
+
+---
+
+## Summary: Positioning & Collaboration
+
+The evolution of these four generations of communication technology is, in essence, the answer to an ever-deepening question:
+
+| Generation | Core question | Answer |
+|------|----------|------|
+| MPI | How do a group of processes synchronize data efficiently? | collectives + BSP synchronization |
+| ZMQ | How can any two processes communicate at any time? | async messaging primitives |
+| RPC | How can remote communication be as reliable as a local call? | call semantics + IDL contracts |
+| Actor | How can a group of processes work together as one system? | message passing + supervision + cluster awareness |
+
+No generation negates the previous one; each completes it along a new dimension.
+
+If **MPI/NCCL** is the **superhighway** of AI infrastructure—the data plane responsible for massive tensor transport—then **Pulsing** is the **intelligent traffic control system**—the control plane responsible for complex orchestration logic:
+
+- Unlike MPI, it is not rigid: it adapts to dynamic agent topologies.
+- Unlike ZMQ, it is not primitive: it provides full cluster governance and clear communication paradigms.
+- Unlike RPC, it does not depend on heavyweight bolt-on infrastructure: it stays extremely lightweight and low-latency.
+- It natively supports stateful orchestration, matching the KV cache and agent memory needs of AI workloads.
+
+In the spiral of technological evolution, the Actor model—powered by Rust—once again becomes the best primitive for building the next generation of distributed intelligent systems.
+
+---
+
+## Further Reading
+
+- [Actor System Design](actor-system.zh.md) — Pulsing core architecture
+- [HTTP/2 Transport](http2-transport.zh.md) — streaming and backpressure implementation details
+- [Node Discovery](node-discovery.zh.md) — Gossip + SWIM clustering
+- [Actor Addressing](actor-addressing.zh.md) — URI address scheme design
+- [Behavior API](behavior.zh.md) — type-safe functional actors