diff --git a/docs/cloud/high-availability/deployment-patterns.mdx b/docs/cloud/high-availability/deployment-patterns.mdx new file mode 100644 index 0000000000..5c3af8e5d1 --- /dev/null +++ b/docs/cloud/high-availability/deployment-patterns.mdx @@ -0,0 +1,910 @@ +--- +id: deployment-patterns +title: Deployment patterns for High Availability +sidebar_label: Deployment patterns for High Availability +slug: /cloud/high-availability/deployment-patterns +description: Choose a Worker deployment pattern — Active / Passive (Cold), Active / Passive (Hot), or Active / Active — for a Namespace with Temporal Cloud High Availability features, and understand how the rest of the architecture fails over with it. +tags: + - Temporal Cloud + - High Availability +keywords: + - high availability + - failover + - worker deployment + - active passive + - active active + - hot standby + - temporal cloud +--- + +When an outage strikes, a Namespace with [High Availability](/cloud/high-availability) fails over to another region automatically, but it does not move the rest of the architecture. +Workers, Workflow starters, Codec Servers, databases, and the external systems that Workflows depend on each need their own failover story. + +A critical piece of the [recovery time](/cloud/rpo-rto) achieved in a real-world outage is the **Worker deployment pattern**: where Worker fleets run and which region (or regions) processes Workflows at any given moment. +This page describes common patterns for deploying Workers and the rest of the architecture to achieve an overall High Availability strategy. + +## What needs a failover story {/* #what-needs-a-failover-story */} + +Beyond the Namespace itself, these components live in the application environment and must be planned for: + +- **Workers** (the focus of this page) — execute Workflows and Activities. +- **Workflow starters and Clients** — start and signal Workflows. 
+- **Codec Servers** — encode and decode payloads for Workers, the Web UI, and the CLI. +- **Proxies between Workers and Temporal Cloud** — any forward proxy or mTLS terminator in the connection path between Workers / Starters / Clients → Namespace. +- **Databases and queues** — the systems that Activities read and write. + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef app fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef endpoint fill:transparent,stroke:#c2c8d2,stroke-width:1px; + classDef env fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + subgraph WFENV["Application environment"] + WFW["Workers"]:::app + WFCL["Workflow starters / Clients"]:::app + WFCODEC["Codec Server"]:::app + WFDB[("DB / queue")]:::app + WFPROXY["Proxy"]:::endpoint + end + WFNS["Namespace
(Temporal Cloud)"]:::ns + WFW <--> WFCODEC + WFW <--> WFDB + WFW --> WFPROXY + WFCL --> WFPROXY + WFPROXY --> WFNS + class WFENV env +``` + +Some systems must be active wherever Workers are running (for example, Codec Servers), while others might follow a different failover sequence (for example, databases). +Because the right choice for each of these usually depends on where Workers run, **this page focuses on Worker deployment patterns**. + +:::tip + +See [High Availability for Temporal Cloud Namespaces](/cloud/high-availability) to learn more about Namespace replicas, replication, and failover. + +::: + +## Worker deployment patterns {/* #worker-deployment-patterns */} + +This page covers three main patterns — **Active / Passive (Cold)**, **Active / Passive (Hot)**, and **Active / Active** — plus a rarely needed **Dual Active** variant. +They trade off **recovery time** after an outage, **cost during normal operation**, and **operational complexity**, and differ in where the Workers run and where Workflows process: + +- **Active / Passive** — Workflows process in one region at a time, the "active" region. The other region is "passive" and ready for failover. This pattern has two variants: + - **[Active / Passive (Cold)](#active-cold)** — a.k.a. Active / Cold — Workers run in only one region at a time. After a failover, Workers start in the secondary region. The region where Workers run == the region where Workflows process. To fail over, Workers need a "cold start" in the other region. + - **[Active / Passive (Hot)](#active-hot)** — a.k.a. Active / Hot — Workers run in **both regions** simultaneously, but Workflows still process in only one region at any given time. The other region's Workers are on "hot" standby. +- **[Active / Active](#active-active)** — Workflows process in both regions at the same time. Necessarily, Workers run in both regions at all times. 
+ +:::info + +**Namespaces are always Active / Passive, but can support an Active / Active pattern.** + +A Temporal Cloud Namespace with High Availability has exactly one active region at a time. The other region holds a replica that passively receives replicated state. + +However, since both regions can serve requests and Worker polls, **Workers don't need to run in the same region as the active replica**, and Temporal Cloud Namespaces can still fit into a broader "Active / Active" strategy, as described below. + +::: + +These patterns work across two cloud regions, which could be in the same cloud provider or different cloud providers: + +- **Primary region** — the region where the Namespace is active during normal operation, also called the "preferred region." +- **Secondary region** — the region the Namespace fails over to. It can be any [Temporal Cloud region](/cloud/regions) that supports replication from the primary region. + +:::tip + +Multi-region Replication and Multi-cloud Replication generally use the same set of Worker deployment patterns, so this page will not distinguish between multi-region and multi-cloud. 
+ +::: + +### Compare Worker deployment patterns at a glance (benefits and tradeoffs) {/* #compare-at-a-glance */} + +| Pattern | Best for | Major benefits | Major tradeoffs | +| --- | --- | --- | --- | +| **[Active / Passive (Cold)](#active-cold)** | Easy initial deployment | Acts like a single region; no special setup required | Failing over Workers is the user's responsibility | + +```mermaid +--- +title: Normal operation +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef empty fill:transparent,stroke:#9aa4b2,stroke-width:1px,stroke-dasharray:4 3,color:#9aa4b2; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph CPRIM["Primary"] + subgraph CWP["Worker Pool"] + CW1["Worker"]:::worker + CW2["Worker"]:::worker + CW3["Worker"]:::worker + end + CNS["Namespace"]:::ns + CWP <-->|Workflows| CNS + end + subgraph CSEC["Secondary"] + CR["Replica"]:::ns + subgraph CWP2["Worker Pool"] + CE["      Empty      "]:::empty + end + CR ~~~ CWP2 + end + CNS --> CR + class CPRIM,CSEC region + class CWP,CWP2 pool +``` + +```mermaid +--- +title: After failover +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region 
fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph FPRIM["Primary (outage)"] + subgraph FPP["Worker Pool"] + FPW1["Unavailable"]:::down + end + FPN["Namespace"]:::down + FPP ~~~ FPN + end + subgraph FSEC["Secondary"] + FSN["Namespace
(Active)"]:::ns + subgraph FSP["Worker Pool"] + FSW1["Worker
Cold start"]:::worker + FSW2["Worker
Cold start"]:::worker + FSW3["Worker
Cold start"]:::worker + end + FSN <-->|Workflows| FSP + end + FPN -->|"Failover"| FSN + class FPRIM regiondown + class FSEC region + class FPP,FSP pool +``` + +--- + +| Pattern | Best for | Major benefits | Major tradeoffs | +| --- | --- | --- | --- | +| **[Active / Passive (Hot)](#active-hot)** | Low RTO with strict single-region behavior | Fast Worker failover; guaranteed to act like a single region | More configuration and higher cost for the Worker fleet | + +```mermaid +--- +title: Normal operation +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph HPRIM["Primary"] + subgraph HWP["Worker Pool"] + HW1["Worker
Active"]:::worker + HW2["Worker
Active"]:::worker + HW3["Worker
Active"]:::worker + end + HNS["Namespace"]:::ns + HWP <-->|Workflows| HNS + end + subgraph HSEC["Secondary"] + HR["Replica"]:::ns + subgraph HWP2["Worker Pool"] + HS1["Worker
Standby"]:::worker + HS2["Worker
Standby"]:::worker + HS3["Worker
Standby"]:::worker + end + HR <-->|"Connected"| HWP2 + end + HNS --> HR + class HPRIM,HSEC region + class HWP,HWP2 pool +``` + +```mermaid +--- +title: After failover +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph HFPRIM["Primary (outage)"] + subgraph HFPP["Worker Pool"] + HFPW1["Unavailable"]:::down + end + HFPN["Namespace"]:::down + HFPP ~~~ HFPN + end + subgraph HFSEC["Secondary"] + HFSN["Namespace
(Active)"]:::ns + subgraph HFSP["Worker Pool"] + HFSW1["Worker
Active"]:::worker + HFSW2["Worker
Active"]:::worker + HFSW3["Worker
Active"]:::worker + end + HFSN <-->|Workflows| HFSP + end + HFPN -->|"Failover"| HFSN + class HFPRIM regiondown + class HFSEC region + class HFPP,HFSP pool +``` + +--- + +| Pattern | Best for | Major benefits | Major tradeoffs | +| --- | --- | --- | --- | +| **[Active / Active](#active-active)** | Low RTO with Workers active in multiple regions | Fast Worker failover; uses Worker fleet capacity (no standby Workers) | Cross-region requests add Workflow latency | + +```mermaid +--- +title: Normal operation +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph APRIM["Primary"] + subgraph AWP["Worker Pool"] + AW1["Worker
Active"]:::worker + AW2["Worker
Active"]:::worker + end + ANS["Namespace"]:::ns + AWP <-->|Workflows| ANS + end + subgraph ASEC["Secondary"] + AR["Replica"]:::ns + subgraph AWP2["Worker Pool"] + AS1["Worker
Active"]:::worker + AS2["Worker
Active"]:::worker + end + AR <-->|Workflows| AWP2 + end + ANS --> AR + class APRIM,ASEC region + class AWP,AWP2 pool +``` + +```mermaid +--- +title: After failover +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef workerhollow fill:transparent,stroke:#7C3AED,stroke-width:1px,stroke-dasharray:4 3; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph AFPRIM["Primary (outage)"] + subgraph AFPP["Worker Pool"] + AFPW1["Unavailable"]:::down + end + AFPN["Namespace"]:::down + AFPP ~~~ AFPN + end + subgraph AFSEC["Secondary"] + AFSN["Namespace
(Active)"]:::ns + subgraph AFSP["Worker Pool"] + AFSW1["Worker
Active"]:::worker + AFSW2["Worker
Active"]:::worker + AFSW3["Worker
Scaled up
(as needed)"]:::workerhollow + end + AFSN <-->|Workflows| AFSP + end + AFPN -->|"Failover"| AFSN + class AFPRIM regiondown + class AFSEC region + class AFPP,AFSP pool +``` + +### Active / Passive (Cold) {/* #active-cold */} + +_Also known as "Active / Cold Standby", "Active / Cold", or simply "Active / Passive"._ + +Active / Cold Pattern: **Normal operation** + +- **Workers run in only one region.** A single Worker fleet runs in the primary region and processes all Workflows. No Workers run in the secondary region. +- **The Namespace replicates to the secondary region.** A Namespace with High Availability has an active replica in the primary region and a passive replica in the secondary region. Temporal Cloud continuously replicates Workflow state to the passive replica, so it stays ready to become active. +- **Your databases and queues replicate too, if needed.** Workers read and write systems such as databases and queues. If your Workflows depend on that data, replicate it to the secondary region so it's available after a failover. Workflows that don't touch external state may not need this. +- **Setup is minimal.** Turn on Replication for your Namespace (see [High Availability for Temporal Cloud Namespaces](/cloud/high-availability)) and enable replication on any databases or queues your Workflows use. At that point you're technically already running Active / Passive (Cold): the secondary region holds a ready replica, and failing over is a matter of bringing your Workers up there. 
+ +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + subgraph DCNPRI["Primary"] + DCNWP["Workers"]:::worker + DCNCODEC["Codec Server"]:::ext + DCNNS["Namespace"]:::ns + DCNDB[("DB / queue")]:::ext + DCNWP <-->|Workflows| DCNNS + DCNWP <--> DCNDB + DCNWP <--> DCNCODEC + end + subgraph DCNSEC["Secondary"] + DCNR["Replica"]:::ns + DCNDB2[("DB / queue")]:::ext + DCNR ~~~ DCNDB2 + end + DCNNS -. replicates .-> DCNR + DCNDB <-.->|"replication (if needed)"| DCNDB2 + class DCNPRI,DCNSEC region +``` + +Active / Cold Pattern: **On failover** + +- **The Namespace fails over automatically.** Temporal Cloud promotes the secondary region's replica to active. No action is needed to fail over the Namespace itself. +- **You bring the Workers up in the secondary region.** Because no Workers were running there, they start from nothing — a "cold" start. Starting and scaling that fleet is your responsibility, ideally through tested automation. Until the Workers are running, no Workflows make progress. +- **Promote your databases and queues, if needed.** If your Workflows depend on external data, make the secondary region's copy active so the new Workers can read and write it. +- **Recovery time is dominated by Worker startup.** After Temporal detects the outage and triggers failover, the Namespace is active almost immediately, but throughput returns to normal only after container or VM startup, image pulls, and application warm-up complete. 
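The recovery-time point above can be made concrete with a little arithmetic. The sketch below is not a Temporal API: the `ColdStartBudget` class, its phase names, and every duration are hypothetical placeholders, chosen only to show how the cold-start phases add up next to the Namespace failover itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColdStartBudget:
    """Hypothetical per-phase durations, in seconds, for one failover."""

    detect_and_failover_namespace: float  # outage detection plus Namespace promotion
    vm_or_container_startup: float        # scheduling and booting Worker hosts
    image_pull: float                     # pulling Worker images in the secondary region
    application_warm_up: float            # caches, connections, first polls

    def worker_startup(self) -> float:
        """Time spent cold starting Workers after the Namespace is active."""
        return (
            self.vm_or_container_startup
            + self.image_pull
            + self.application_warm_up
        )

    def total_recovery(self) -> float:
        """End-to-end time until throughput returns to normal."""
        return self.detect_and_failover_namespace + self.worker_startup()


# Placeholder numbers for illustration only; measure your own in a failover drill.
budget = ColdStartBudget(
    detect_and_failover_namespace=120.0,
    vm_or_container_startup=180.0,
    image_pull=90.0,
    application_warm_up=60.0,
)
worker_share = budget.worker_startup() / budget.total_recovery()
print(f"total recovery: {budget.total_recovery():.0f}s")
print(f"worker startup share: {worker_share:.0%}")
```

Replacing the placeholders with timings from a real failover drill shows which phase is worth automating or pre-warming first.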
+ +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + subgraph DCFPRI["Primary (outage)"] + DCFNS["Namespace"]:::down + DCFDB[("DB / queue
Unavailable")]:::down + DCFNS ~~~ DCFDB + end + subgraph DCFSEC["Secondary"] + DCFN["Namespace
(Active)"]:::ns + DCFSP["Workers
Cold start"]:::worker + DCFCODEC["Codec Server
Cold start"]:::ext + DCFDB2[("DB / queue
Promoted")]:::ext + DCFN <-->|Workflows| DCFSP + DCFSP <--> DCFDB2 + DCFSP <--> DCFCODEC + end + DCFPRI -->|"Failover"| DCFSEC + class DCFPRI regiondown + class DCFSEC region +``` + +Active / Cold Pattern: **Benefits** + +- **Easy to reason about.** + - Only one region is active at a time, so traffic routing and interactions with systems (such as databases and queues) are simpler to understand, and the pattern pairs naturally with other active / passive systems. Active / Active, by contrast, requires deciding how Workers reach an active database: either a local active database in each region, or a single active / passive database that some Workers must reach cross-region. +- **Simple to operate.** + - During normal operation it resembles a single-region deployment. +- **Lowest overall architecture cost.** + - The size of the Worker fleet is simply the capacity needed to operate in one region. There are no standby Workers during steady state. + +Active / Cold Pattern: **Tradeoffs** + +- Highest overall recovery time of the three patterns, due to cold starting the Worker fleet after failover. +- Depends on tested automation to bring up the secondary-region fleet quickly. + +Active / Cold Pattern: **Recommendations and important constraints** + +- **Failing over the Workers is the operator's responsibility.** The Namespace fails over automatically, but bringing up the Workers in the secondary region is up to you. Plan for these sub-considerations: + - **How do you detect an outage and decide to fail over?** Define the failover conditions and the signals (alerts, health checks) that trigger them. + - **How do you scale up the Workers?** Bring up the secondary-region fleet, ideally with tested automation, and scale down the primary region's fleet so Workers run in only one region at a time. + - **Do you need to enforce single-region processing?** The Cold pattern relies on the operator to keep Workers in one region. 
To have Temporal enforce single-region processing instead, use the [Active / Passive (Hot)](#active-hot) pattern. + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':30,'curve':'basis'}}}%% +flowchart LR + FS1["Detect
outage"] + FS3["Failover
DBs / queues"] + FS4["Scale down
Primary region
Workers"] + FS5["Scale up
Secondary region
Workers"] + FS6["Confirm
Workflows run
normally"] + FS1 --> FS3 --> FS4 --> FS5 --> FS6 +``` + +- **Use the Namespace Endpoint.** + - Connect Workers through the [Namespace Endpoint](/cloud/namespaces#access-namespaces), which always connects to the Namespace in its active region and automatically fails over to the new region. + - **Rationale:** If a Temporal Cloud incident requires the Namespace to fail over while the rest of the primary region is healthy, the Workers in the primary region can still connect through the Namespace Endpoint and process Workflows. If the Workers use the Regional Endpoint for the primary region, they will not reliably connect to the Namespace during a Temporal Cloud incident in the primary region. + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef endpoint fill:transparent,stroke:#c2c8d2,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + NEW["Worker"]:::worker + NEEP["Namespace Endpoint"]:::endpoint + NEW --> NEEP + subgraph NEPRIM["Primary"] + NEPNS["Namespace"]:::ns + end + subgraph NESEC["Secondary"] + NESNS["Namespace"]:::ns + end + NEEP -->|"normal operation"| NEPNS + NEEP -.->|"after failover"| NESNS + class NEPRIM,NESEC region +``` + +- **Set up cross-region private connectivity.** + - If you use private connectivity, give the primary region's Workers a network route to the VPC Endpoint in the other region, so they can reach the active replica after a Namespace-only failover. If you can't provide that cross-region route, use the [Active / Passive (Hot)](#active-hot) pattern instead, where each region's Workers connect to their local replica. 
+ - For the full setup of Regional Endpoints, VPC Endpoints, and cross-region routing, see [Connectivity for High Availability](/cloud/high-availability/ha-connectivity). + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef endpoint fill:transparent,stroke:#c2c8d2,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + subgraph CRCPRIM["Primary Region"] + CRCPW["Worker"]:::worker + CRCPEP["VPC Endpoint"]:::endpoint + CRCPNS["Namespace"]:::ns + end + subgraph CRCSEC["Secondary Region"] + CRCSEP["VPC Endpoint"]:::endpoint + CRCSNS["Replica"]:::ns + end + CRCPW -->|"normal operation"| CRCPEP + CRCPEP --> CRCPNS + CRCSEP --> CRCSNS + CRCPW -.->|"after a Namespace failover"| CRCSEP + class CRCPRIM,CRCSEC region +``` + +- **Route Workers to the active region's Codec Server.** Two common approaches: + - Put DNS or a load balancer in front of the Codec Server address, and update it on failover to point at the new region's instance. + - Pass each Worker the Codec Server address for its own region as configuration, so a Worker always uses the service local to it. This is common in Kubernetes or with service discovery. 
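The second approach, passing each Worker the Codec Server address for its own region, might look like the following minimal sketch. The `WORKER_REGION` variable name, the region keys, and the `example.internal` URLs are all assumptions made for illustration; none of them come from Temporal.

```python
import os
from typing import Mapping

# Hypothetical region names and Codec Server URLs; replace with your own.
CODEC_SERVERS = {
    "primary": "https://codec.primary.example.internal",
    "secondary": "https://codec.secondary.example.internal",
}


def codec_server_for(region: str) -> str:
    """Return the Codec Server address local to the given region."""
    try:
        return CODEC_SERVERS[region]
    except KeyError:
        raise ValueError(f"no Codec Server configured for region {region!r}") from None


def local_codec_server(env: Mapping[str, str] = os.environ) -> str:
    """Resolve the address from WORKER_REGION (an assumed variable name)."""
    region = env.get("WORKER_REGION")
    if region is None:
        raise ValueError("WORKER_REGION is not set")
    return codec_server_for(region)
```

Failing loudly on a missing or unknown region is a deliberate choice here: a Worker that silently falls back to the other region's Codec Server would reintroduce the cross-region dependency this approach is meant to avoid.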
+ +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + CSW["Worker"]:::worker + subgraph CSPRIM["Primary"] + CSPC["Codec Server"]:::ext + end + subgraph CSSEC["Secondary"] + CSSC["Codec Server"]:::ext + end + CSW -->|"normal operation"| CSPC + CSW -.->|"after failover"| CSSC + class CSPRIM,CSSEC region +``` + +- **Route Workers to the active region's proxy.** Two common approaches: + - Put DNS or a load balancer in front of the proxy address, and update it on failover to point at the new region's instance. + - Pass each Worker the proxy address for its own region as configuration, so a Worker always uses the service local to it. This is common in Kubernetes or with service discovery. + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef endpoint fill:transparent,stroke:#c2c8d2,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + PXW["Worker"]:::worker + subgraph PXPRIM["Primary"] + PXPP["Proxy"]:::endpoint + end + subgraph PXSEC["Secondary"] + PXSP["Proxy"]:::endpoint + end + PXW -->|"normal operation"| PXPP + PXW -.->|"after failover"| PXSP + class PXPRIM,PXSEC region +``` + +Active / Cold Pattern: **Component behavior** + +- **Workers** — run only in the primary region; brought up in the secondary region during a failover. 
+- **Workflow starters and Clients** — run with the Workers; brought up in the secondary region during a failover. +- **Codec Servers and proxies** — run alongside the active Workers; scaled up in the secondary region as part of a failover. +- **Databases and queues** — single-region-active; fail over to the secondary region alongside the Workers. + +### Active / Passive (Hot) {/* #active-hot */} + +_Also known as "Active / Hot Standby" or "Active / Hot"._ + +Active / Hot Pattern: **Normal operation** + +- **Workers run in both regions.** A full Worker fleet runs in each region. The primary region's Workers are active and process all Workflows; the secondary region's Workers stay connected and warm, but on standby, doing no work. +- **Workflows process in only one region at a time.** The Namespace has a single active replica, so even though Workers run in both regions, Workflows execute only in the active (primary) region. +- **Forwarding is disabled for Worker polls.** Each fleet connects to its local replica through a [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) or [VPC Endpoint](/cloud/high-availability/ha-connectivity) with forwarding off, so polls that reach the passive replica are not sent to the active region. The standby fleet does no work and adds no cross-region overhead. +- **The Namespace replicates to the secondary region.** A Namespace with High Availability keeps an active replica in the primary region and a passive replica in the secondary region, continuously replicating Workflow state so the standby is ready to take over. 
+ +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph DHNPRI["Primary"] + DHNWP["Workers
Active"]:::worker + DHNNS["Namespace"]:::ns + DHNDB[("DB / queue")]:::ext + DHNCODEC["Codec Server"]:::ext + DHNWP <-->|Workflows| DHNNS + DHNWP <--> DHNDB + DHNWP <--> DHNCODEC + end + subgraph DHNSEC["Secondary"] + DHNR["Replica"]:::ns + DHNWP2["Workers
Standby"]:::worker + DHNDB2[("DB / queue
Standby")]:::ext + DHNCODEC2["Codec Server"]:::ext + DHNR <-->|"Connected"| DHNWP2 + DHNWP2 <--> DHNDB2 + DHNWP2 <--> DHNCODEC2 + end + DHNNS -. replicates .-> DHNR + class DHNPRI,DHNSEC region +``` + +Active / Hot Pattern: **On failover** + +- **The Namespace and Workers fail over together, automatically.** When the primary region fails, Temporal Cloud promotes the secondary replica to active, and the secondary region's standby Workers — already connected and warm — begin processing immediately. +- **No cold start and no DNS wait.** Because a full Worker fleet was already running in the secondary region, there's nothing to start or scale up before processing resumes. This pattern achieves the lowest recovery time of the three. +- **Promote your databases and queues, if needed.** If your Workflows depend on external data, make the secondary region's copy active so the now-active Workers can read and write it. + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph DHFPRI["Primary (outage)"] + DHFWP["Workers
Unavailable"]:::down + DHFNS["Namespace"]:::down + DHFWP ~~~ DHFNS + end + subgraph DHFSEC["Secondary"] + DHFN["Namespace
(Active)"]:::ns + DHFSP["Workers
Active"]:::worker + DHFDB2[("DB / queue
Promoted")]:::ext + DHFCODEC["Codec Server"]:::ext + DHFN <-->|Workflows| DHFSP + DHFSP <--> DHFDB2 + DHFSP <--> DHFCODEC + end + DHFNS -->|"Failover"| DHFN + class DHFPRI regiondown + class DHFSEC region +``` + +Active / Hot Pattern: **Benefits** + +- **Easy to reason about.** + - Only one region is active at a time, so traffic routing and interactions with systems (such as databases and queues) are simpler to understand, and the pattern pairs naturally with other active / passive systems. Active / Active, by contrast, requires deciding how Workers reach an active database: either a local active database in each region, or a single active / passive database that some Workers must reach cross-region. +- **Lowest overall recovery time of the three patterns.** + - The secondary-region Workers are already connected and warm, so failover involves no cold start. +- **Low latency during normal operation.** + - Tasks are processed only in the active region, with no cross-region forwarding. + +Active / Hot Pattern: **Tradeoffs** + +- Highest overall architecture cost: a full standby Worker fleet runs in the secondary region at all times, even during steady state. + +Active / Hot Pattern: **Recommendations and important constraints** + +- **Use Regional or VPC Endpoints and disable forwarding.** + - Connect each Worker fleet through its region's [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) (or VPC Endpoint) and [disable forwarding](/cloud/high-availability/enable#change-forwarding-behavior) for Worker polls. Using the Namespace Endpoint by mistake routes the standby Workers to the active region and defeats the pattern. 
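The endpoint rule for the two Active / Passive variants can be sketched as a small selection function. This is an illustrative helper only: it does not call a Temporal SDK, and the `Pattern`, `Endpoints`, and `worker_target` names, as well as any addresses passed in, are hypothetical. Disabling forwarding for Worker polls is a separate Namespace setting that this code does not perform.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping


class Pattern(Enum):
    ACTIVE_PASSIVE_COLD = "cold"
    ACTIVE_PASSIVE_HOT = "hot"


@dataclass(frozen=True)
class Endpoints:
    """Placeholder addresses; real values come from your own Namespace."""

    namespace_endpoint: str
    regional_endpoints: Mapping[str, str]  # region name -> Regional or VPC Endpoint


def worker_target(pattern: Pattern, worker_region: str, endpoints: Endpoints) -> str:
    """Pick the address a Worker fleet should dial for the given pattern."""
    if pattern is Pattern.ACTIVE_PASSIVE_COLD:
        # Cold: follow the active replica wherever it currently is.
        return endpoints.namespace_endpoint
    # Hot: stay pinned to the local replica so that standby Workers
    # are never routed to the active region.
    return endpoints.regional_endpoints[worker_region]
```

Deriving the target from the pattern in one place makes the mistake called out above, a standby fleet dialing the Namespace Endpoint, harder to introduce through per-fleet configuration drift.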
+ +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef endpoint fill:transparent,stroke:#c2c8d2,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + subgraph HEPRIM["Primary"] + HEPW["Workers
Active"]:::worker + HEPEP["Regional / VPC Endpoint"]:::endpoint + HEPNS["Namespace"]:::ns + HEPW --> HEPEP + HEPEP --> HEPNS + end + subgraph HESEC["Secondary"] + HESW["Workers
Standby"]:::worker + HESEP["Regional / VPC Endpoint"]:::endpoint + HESNS["Replica"]:::ns + HESW --> HESEP + HESEP --> HESNS + end + HEPNS -. replicates .-> HESNS + class HEPRIM,HESEC region +``` + +Active / Hot Pattern: **Component behavior** + +- **Workers** — run in both regions; only the active region processes Workflows. +- **Workflow starters and Clients** — run in both regions alongside the Workers. +- **Codec Servers and proxies** — run in both regions continuously, not just after a failover. +- **Databases and queues** — typically single-region-active; fail over alongside the active Workers. + +### Active / Active {/* #active-active */} + +Active / Active Pattern: **Normal operation** + +- **Workers run and process in both regions at once.** A full Worker fleet runs in each region, and both fleets process Workflows concurrently. +- **The Namespace still has a single active replica.** A Temporal Cloud Namespace is not "active/active" in the database sense — one region holds the active replica and the other holds a passive replica. Forwarding is left enabled (the default). +- **The passive region forwards polls to the active replica.** Because the passive replica transparently forwards requests to and from the active region, a Worker fleet in either region can process Workflows. The secondary fleet's polls cross regions to reach the active replica, which adds some latency. +- **Roughly half the fleet runs in each region.** Total capacity is split across the two regions during steady state, with no dedicated standby fleet. 
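The even split described above implies some simple capacity arithmetic. The following sketch is illustrative only: the function names are hypothetical, and it assumes throughput scales linearly with Worker count and that the total fleet size needed for the full workload is known.

```python
import math


def steady_state_per_region(total_workers: int, regions: int = 2) -> int:
    """Workers each region runs when the fleet is split evenly (rounded up)."""
    return math.ceil(total_workers / regions)


def scale_up_after_failover(total_workers: int, regions: int = 2) -> int:
    """Extra Workers one surviving region must add to carry the full workload alone."""
    return total_workers - steady_state_per_region(total_workers, regions)


total = 40  # assumed fleet size that handles the full workload
print(steady_state_per_region(total))   # 20 Workers per region in steady state
print(scale_up_after_failover(total))   # 20 more Workers needed in the surviving region
```

Under these assumptions, the surviving region has to roughly double its fleet to return to full throughput, which is why autoscaling limits and regional quotas are worth checking ahead of time.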
+ +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph DANPRI["Primary"] + DANWP["Workers
Active"]:::worker + DANNS["Namespace"]:::ns + DANDB[("DB / queue")]:::ext + DANCODEC["Codec Server"]:::ext + DANWP <-->|Workflows| DANNS + DANWP <--> DANDB + DANWP <--> DANCODEC + end + subgraph DANSEC["Secondary"] + DANR["Replica"]:::ns + DANWP2["Workers
Active"]:::worker + DANDB2[("DB / queue")]:::ext + DANCODEC2["Codec Server"]:::ext + DANWP2 <-->|Workflows| DANR + DANWP2 <--> DANDB2 + DANWP2 <--> DANCODEC2 + end + DANNS -. replicates .-> DANR + DANR ==>|"forwards polls"| DANNS + class DANPRI,DANSEC region +``` + +Active / Active Pattern: **On failover** + +- **The surviving region keeps processing.** When one region fails, the other region's Workers are already active and processing, so Workflows continue running with no cold-start gap. +- **The Namespace fails over to the surviving region.** Temporal Cloud promotes the surviving region's replica to active; its local Workers then process against it without forwarding polls across regions. +- **Scale up capacity in the surviving region.** Each region normally runs only about half the fleet, so add capacity in the surviving region to handle the full workload at full throughput. +- **Promote your databases and queues, if needed.** If your Workflows depend on external data, make the surviving region's copy active so the Workers there can read and write it. + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef workerhollow fill:transparent,stroke:#7C3AED,stroke-width:1px,stroke-dasharray:4 3; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph DAFPRI["Primary (outage)"] + DAFWP["Workers
Unavailable"]:::down + DAFNS["Namespace"]:::down + DAFWP ~~~ DAFNS + end + subgraph DAFSEC["Secondary"] + DAFN["Namespace
(Active)"]:::ns + DAFSP["Workers
Active,
scaled up as needed"]:::worker + DAFDB2[("DB / queue")]:::ext + DAFCODEC["Codec Server"]:::ext + DAFN <-->|Workflows| DAFSP + DAFSP <--> DAFDB2 + DAFSP <--> DAFCODEC + end + DAFNS -->|"Failover"| DAFN + class DAFPRI regiondown + class DAFSEC region +``` + +Active / Active Pattern: **Benefits** + +- **Low overall recovery time.** + - The surviving region keeps processing while capacity scales up. +- **Moderate overall architecture cost.** + - Roughly half the fleet runs in each region during steady state, with no dedicated standby fleet. + +Active / Active Pattern: **Tradeoffs** + +- The secondary region pays cross-region latency, because its polls are forwarded to the active replica. This can be a problem for latency-sensitive Workflows. +- Synchronizing external systems is harder, because Workers are active in both regions at once. + +Active / Active Pattern: **Recommendations and important constraints** + +- **Keep forwarding enabled.** + - Leave forwarding on (the default) so the secondary-region Workers' polls reach the active replica. Do not set `disablePassivePollerForwarding`. + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + subgraph AAPRIM["Primary"] + AAPW["Workers"]:::worker + AAPNS["Namespace
(Active)"]:::ns + AAPW <-->|Workflows| AAPNS + end + subgraph AASEC["Secondary"] + AASW["Workers"]:::worker + AASR["Replica"]:::ns + AASW <-->|Workflows| AASR + end + AASR -.->|"forwards polls"| AAPNS + class AAPRIM,AASEC region +``` + +Active / Active Pattern: **Component behavior** + +- **Workers** — run and process in both regions; the secondary region's polls are forwarded to the active replica. +- **Workflow starters and Clients** — run in both regions. +- **Codec Servers and proxies** — run in both regions continuously. +- **Databases and queues** — accessed from both regions; cross-region consistency must be designed for. + +### Dual Active (Multi-Active) {/* #dual-active */} + +Dual Active Pattern: **Normal operation** + +Beyond the three main patterns, some architectures need low-latency or region-bound data in *each* region at once. This can be achieved with **two Namespaces whose active and passive regions overlap**: each region holds one Namespace's active replica and the other Namespace's passive replica. + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart TD + classDef appA fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef appB fill:#0EA5E922,stroke:#0EA5E9,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + subgraph DUNR2["Region 2"] + DUNWPB["App B Workers"]:::appB + DUNNBa["Namespace B"]:::appB + DUNNAp["Namespace A
Replica"]:::appA + DUNWPB <-->|Workflows| DUNNBa + end + subgraph DUNR1["Region 1"] + DUNWPA["App A Workers"]:::appA + DUNNAa["Namespace A"]:::appA + DUNNBp["Namespace B
Replica"]:::appB + DUNWPA <-->|Workflows| DUNNAa + end + DUNNAa -. replicates .-> DUNNAp + DUNNBa -. replicates .-> DUNNBp + linkStyle 0 stroke:#0EA5E9,stroke-width:1.5px + linkStyle 1 stroke:#7C3AED,stroke-width:1.5px + linkStyle 2 stroke:#7C3AED,stroke-width:1.5px + linkStyle 3 stroke:#0EA5E9,stroke-width:1.5px + class DUNR1,DUNR2 region +``` + +Dual Active Pattern: **On failover (Region 1 outage)** + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart TD + classDef appA fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef appAhollow fill:transparent,stroke:#7C3AED,stroke-width:1px,stroke-dasharray:4 3; + classDef appB fill:#0EA5E922,stroke:#0EA5E9,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + subgraph DUFR2["Region 2"] + DUFWPB["App B Workers"]:::appB + DUFNBa["Namespace B"]:::appB + DUFNAa2["Namespace A
(Active)"]:::appA + DUFWPA2["App A Workers
Hot or Cold Start"]:::appAhollow + DUFWPB <-->|Workflows| DUFNBa + DUFWPA2 <-->|Workflows| DUFNAa2 + end + subgraph DUFR1["Region 1 (outage)"] + DUFWPA["App A Workers
Unavailable"]:::down + DUFNAa["Namespace A"]:::down + DUFNBp["Namespace B
Replica"]:::down + DUFWPA ~~~ DUFNAa + end + DUFNAa -->|"Failover"| DUFNAa2 + linkStyle 0 stroke:#0EA5E9,stroke-width:1.5px + linkStyle 1 stroke:#7C3AED,stroke-width:1.5px + linkStyle 3 stroke:#7C3AED,stroke-width:1.5px + class DUFR1 regiondown + class DUFR2 region +``` + +Each Namespace serves low-latency requests or a regionally-bound database in its own active region, and fails over to the other region during an outage. The same idea extends across more than two regions. Each Namespace fails over independently, following the Active / Passive sequence. + +Workloads on Temporal rarely need this. It pays off only when a workload is *both* extremely latency-sensitive across several same-continent regions *and* needs multi-region disaster recovery, an uncommon combination. + +Dual Active Pattern: **Benefits** + +- **Low-latency, region-bound data in each region.** + - Served from each region's active Namespace during normal operation. +- **Independent failover.** + - Each Namespace fails over independently, like Active / Passive. + +Dual Active Pattern: **Tradeoffs** + +- Highest overall architecture cost and operational complexity: two Worker fleets and two Namespaces. +- Rarely justified. Temporal recommends treating each Namespace as an **independent Active / Passive deployment**, with its own Worker pools and failover procedures, rather than coupling them. + +Dual Active Pattern: **Component behavior** + +- **Workers** — one fleet per application, each active in its Namespace's region. +- **Workflow starters and Clients** — run with each application's Workers. +- **Codec Servers and proxies** — run in both regions, for both Namespaces. +- **Databases and queues** — region-bound per application; each fails over with its Namespace. + +## The rest of the architecture {/* #rest-of-architecture */} + +The Worker deployment pattern sets the approach; the supporting pieces follow it. 
+ +- **Workflow starters and Clients.** Deploy these with the same regional pattern as the Workers, since a starter or Client often shares the same in-region dependencies (databases, queues, upstream services) and should fail over alongside them. Point Clients at the Namespace Endpoint so they follow the active region automatically with no configuration change on failover, and use a [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) only when a Client must be pinned to a region. +- **Codec Servers and proxies.** Anything in the connection path between Workers and Temporal Cloud must be reachable from every region where Workers connect. In Active / Passive (Cold), scale them up in the secondary region as part of a failover; in the Active / Passive (Hot) and Active / Active patterns, run them in both regions at all times. +- **Databases and queues.** These remain the application's responsibility, and the right approach depends on the Worker deployment pattern: a single-region-active datastore pairs naturally with Active / Passive, while running Workers active in both regions raises consistency questions that must be designed for. Detailed guidance is out of scope for this page. + +## Serverless Workers failover {/* #serverless-workers-failover */} + +In every pattern above, the Worker fleet is something you run, so failing it over — a cold start, a standby fleet, or a second active region — is the application's responsibility. [Serverless Workers](/develop/typescript/workers/serverless-workers) move that responsibility to Temporal Cloud. + +Instead of long-lived Workers that poll a Task Queue, Serverless Workers invert the model: Temporal Cloud pushes Task invocations to a customer-owned compute function (AWS Lambda today). Because Temporal Cloud is the component that starts the Workers, it can also start them in the secondary region after a failover, with no action from you. 
+ +- **One Worker Deployment spans both regions.** You register a compute function per region under a single Build ID, so the deployment is ready to run in either region. +- **Failover is automatic.** When the Namespace fails over, Temporal Cloud invokes the function in the new active region, so you don't have to detect the outage and bring up a fleet. +- **The whole system fails over hands-off.** Both the Namespace and the Workers move automatically, lowering overall recovery time relative to patterns that need a manual Worker-failover step, such as a cold start or a scale-up. + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + subgraph SWPRI["Primary (outage)"] + SWPNS["Namespace"]:::down + end + subgraph SWSEC["Secondary"] + SWSNS["Namespace
(Active)"]:::ns + SWSL["Serverless Workers
(Lambda)"]:::worker + SWSNS -->|"Temporal Cloud
starts Workers"| SWSL + end + SWPNS -->|"Failover"| SWSNS + class SWPRI regiondown + class SWSEC region +``` + +On failover, Temporal Cloud promotes the secondary replica to active and invokes the Worker function there — no fleet to bring up and nothing for you to do. The Worker failover is hands-off. + +## Related {/* #related */} + +To add a replica and turn on High Availability features, see [Enable and manage High Availability](/cloud/high-availability/enable). + +To choose between the Namespace Endpoint and Regional Endpoints and to set up private connectivity, see [Connectivity for High Availability](/cloud/high-availability/ha-connectivity). + +To stop forwarding Worker polls to the active region for the Active / Passive (Hot) pattern, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). + +To trigger and manage failovers, see [Failovers](/cloud/high-availability/failovers). + +To understand the recovery objectives each pattern is measured against, see [RPO and RTO](/cloud/rpo-rto). diff --git a/docs/cloud/high-availability/enable.mdx b/docs/cloud/high-availability/enable.mdx index 97cb2c0c65..7b161c8e1c 100644 --- a/docs/cloud/high-availability/enable.mdx +++ b/docs/cloud/high-availability/enable.mdx @@ -137,6 +137,8 @@ Client APIs (Start, Signal, Cancel, Terminate, Query, and the equivalent Activit Same-region replicas are not affected by this setting. +To deploy Worker fleets in both regions that stay on standby in the passive region until failover, see [Active / Passive (Hot)](/cloud/high-availability/deployment-patterns#active-hot). + :::info To see which endpoints route to which replica, see [How requests reach the replica](/cloud/high-availability/ha-connectivity#how-requests-reach-the-replica). @@ -152,10 +154,10 @@ Use the [`temporal cloud namespace ha update`](/cli/command-reference/cloud/name ```bash temporal cloud namespace ha update \ --namespace . 
\ - --disable-passive-poller-forwarding true + --passive-poller-forwarding disabled ``` -Set the flag to `false` to re-enable forwarding. +Set the flag to `enabled` to re-enable forwarding. ### Set the forwarding behavior with the Cloud Ops API {/* #set-forwarding-curl */} diff --git a/docs/cloud/high-availability/failovers/manage.mdx b/docs/cloud/high-availability/failovers/manage.mdx index 42bf59eb05..36fad55930 100644 --- a/docs/cloud/high-availability/failovers/manage.mdx +++ b/docs/cloud/high-availability/failovers/manage.mdx @@ -178,6 +178,8 @@ the replica, the DNS redirection orchestrated by Temporal ensures that your exis Namespace without interruption. Temporal Cloud forwards their requests from the passive replica to the active region and the responses back, so Workers keep running through a failover. +To choose where your Worker fleets run across regions, see [Deployment patterns for High Availability](/cloud/high-availability/deployment-patterns). + To route Workers to the passive region's replica, see [How requests reach the replica](/cloud/high-availability/ha-connectivity#how-requests-reach-the-replica). To stop forwarding Worker polls to the active region, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). diff --git a/docs/cloud/high-availability/ha-connectivity.mdx b/docs/cloud/high-availability/ha-connectivity.mdx index 0dd79c3262..58fd7779ee 100644 --- a/docs/cloud/high-availability/ha-connectivity.mdx +++ b/docs/cloud/high-availability/ha-connectivity.mdx @@ -93,6 +93,10 @@ To learn what forwarding does, see [Request forwarding](/cloud/high-availability To stop forwarding Worker polls on a Namespace, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). +To run Worker fleets in both regions that rely on this forwarding, see [Active / Active](/cloud/high-availability/deployment-patterns#active-active). 
+ +To keep passive-region Workers on standby until failover by disabling this forwarding, see [Active / Passive (Hot)](/cloud/high-availability/deployment-patterns#active-hot). + ## How to use PrivateLink with High Availability features :::tip diff --git a/docs/cloud/high-availability/index.mdx b/docs/cloud/high-availability/index.mdx index aec627a21e..0a90c6b300 100644 --- a/docs/cloud/high-availability/index.mdx +++ b/docs/cloud/high-availability/index.mdx @@ -105,6 +105,10 @@ To route Workers to the passive region's replica, see [How requests reach the re To disable passive region replica forwarding, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). +To run Worker fleets in both regions that rely on this forwarding, see [Active / Active](/cloud/high-availability/deployment-patterns#active-active). + +To keep passive-region Workers on standby until failover by disabling this forwarding, see [Active / Passive (Hot)](/cloud/high-availability/deployment-patterns#active-hot). + ## Service levels and recovery objectives Namespaces using High Availability have a 99.99% [uptime SLA](/cloud/sla) with sub-1-minute [RPO](/cloud/rpo-rto) and 20-minute [RTO](/cloud/rpo-rto). For detailed information: diff --git a/sidebars.js b/sidebars.js index e825104eb0..d3e4167a91 100644 --- a/sidebars.js +++ b/sidebars.js @@ -1205,6 +1205,7 @@ module.exports = { }, items: [ 'cloud/high-availability/enable', + 'cloud/high-availability/deployment-patterns', 'cloud/high-availability/monitoring', { type: 'category',