From ee3347519b20cb486ddfdadcea7f575716909d52 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Wed, 10 Jun 2026 11:21:01 -0700 Subject: [PATCH 1/9] Disable forwarding setting for HA --- docs/cloud/high-availability/enable.mdx | 123 +++++++++++++++++- .../high-availability/failovers/index.mdx | 21 ++- .../high-availability/failovers/manage.mdx | 62 +++------ .../high-availability/ha-connectivity.mdx | 18 +++ docs/cloud/high-availability/index.mdx | 16 ++- docs/cloud/rto-rpo.mdx | 28 ++-- 6 files changed, 193 insertions(+), 75 deletions(-) diff --git a/docs/cloud/high-availability/enable.mdx b/docs/cloud/high-availability/enable.mdx index fdb33c5cc5..dc5e7a8d61 100644 --- a/docs/cloud/high-availability/enable.mdx +++ b/docs/cloud/high-availability/enable.mdx @@ -1,9 +1,9 @@ --- id: enable -title: Enable High Availability -sidebar_label: Enable High Availability +title: Enable and manage High Availability +sidebar_label: Enable and manage High Availability slug: /cloud/high-availability/enable -description: Get started with HA features +description: Add a replica to a Namespace to enable High Availability, then manage forwarding behavior and automatic failover settings. --- import { ToolTipTerm } from '@site/src/components'; @@ -127,6 +127,123 @@ Follow these steps to change the replica location: You will receive an email alert once your Namespace is ready for use. +## Change the forwarding behavior {/* #change-forwarding-behavior */} + +Requests that reach the passive replica can be [forwarded](/cloud/high-availability/#request-forwarding) to the active region, and responses sent back to the Worker or Client. The `disablePassivePollerForwarding` Namespace setting controls this behavior for Worker poll traffic. + +With `disablePassivePollerForwarding` enabled, Worker polls that reach a passive replica are not forwarded, and these Workers do not execute Workflows or Activities. 
Workers connected to such a passive replica receive a `NamespaceNotActive` error on poll requests. These Workers stay connected and will start executing Workflows and Activities if the replica becomes active. + +Client APIs (Start, Signal, Cancel, Terminate, Query, and the equivalent Activity APIs) are forwarded to the active region regardless of this setting, with responses sent back to the Client. + +Same-region replicas are not affected by this setting. + +:::info + +To see which endpoints route to which replica, see [How requests reach the replica](/cloud/high-availability/ha-connectivity#how-requests-reach-the-replica). + +::: + +`disablePassivePollerForwarding` can be set through the [Cloud Ops API](/ops), the [cloud-api SDK](https://github.com/temporalio/cloud-sdk-go), or the [`temporal cloud` CLI extension](/cli/cloud). Use one of the recipes below. + +### Set the forwarding behavior with the `temporal cloud` CLI {/* #set-forwarding-cli */} + +Use the [`temporal cloud namespace ha update`](/cli/command-reference/cloud/namespace#ha-update) command: + +```bash +temporal cloud namespace ha update \ + --namespace . \ + --disable-passive-poller-forwarding true +``` + +Set the flag to `false` to re-enable forwarding. + +### Set the forwarding behavior with the Cloud Ops API {/* #set-forwarding-curl */} + +This recipe uses `curl` against the Cloud Ops API. Because `UpdateNamespace` replaces the entire Namespace spec, it fetches the current spec, merges in the new value, and posts it back with the current `resourceVersion`. The `jq` filter preserves any other High Availability fields you have set (such as `disableManagedFailover`). + +Set the Cloud Ops API key and Namespace ID. The Namespace ID is in `.` format (the full identifier shown in the Cloud Web UI): + +```bash +export TEMPORAL_CLOUD_OPS_API_KEY='' +export NS='.' 
+``` + +Fetch the current spec: + +```bash +curl -sS "https://saas-api.tmprl.cloud/cloud/namespaces/$NS" \ + -H "Authorization: Bearer $TEMPORAL_CLOUD_OPS_API_KEY" > /tmp/ns.json +``` + +Build the update payload. Set the value to `true` to disable forwarding, or `false` to restore the default: + +```bash +jq --arg rv "$(jq -r '.namespace.resourceVersion' /tmp/ns.json)" '{ + spec: (.namespace.spec + | .highAvailability = ((.highAvailability // {}) + {disablePassivePollerForwarding: true})), + resourceVersion: $rv +}' /tmp/ns.json > /tmp/ns-update.json +``` + +Post the update. The response contains an `asyncOperation` ID; the change is complete when `GET /cloud/operations/` reports a terminal state. + +```bash +curl -sS -X POST "https://saas-api.tmprl.cloud/cloud/namespaces/$NS" \ + -H "Authorization: Bearer $TEMPORAL_CLOUD_OPS_API_KEY" \ + -H "Content-Type: application/json" \ + -d @/tmp/ns-update.json +``` + +Verify the current value: + +```bash +curl -sS "https://saas-api.tmprl.cloud/cloud/namespaces/$NS" \ + -H "Authorization: Bearer $TEMPORAL_CLOUD_OPS_API_KEY" \ + | jq '.namespace.spec.highAvailability.disablePassivePollerForwarding' +``` + +A result of `null` means the field has never been set, which is equivalent to `false` — proto3 JSON omits default-`false` values from responses. + +## Enable or disable automatic failovers {/* #automatic-failovers */} + +When a Temporal Cloud Namespace has a replica in a different region or cloud, Temporal Cloud automatically fails over the Namespace to its replica in the event of an outage. _This is the recommended and default option._ + +If you prefer to disable automatic failovers and handle your own failovers, follow these instructions: + +:::warning Disabling automatic failovers voids Temporal's RTO + +With automatic failovers disabled, Temporal Cloud cannot fail your Namespace over to its replica during an outage. 
You take responsibility for detecting outages and [triggering a failover](/cloud/high-availability/failovers/manage#trigger-failover) yourself. Temporal's [20-minute RTO](/cloud/rpo-rto) does not apply while this setting is disabled. + +::: + + + + + +1. Navigate to the Namespace detail page in Temporal Cloud. +1. Choose the "Disable Temporal-initiated failovers" option. + + + + + +To disable automatic failovers, run the following command in your terminal: + +``` +tcld namespace update-high-availability \ + --namespace . \ + --disable-auto-failover=true +``` + +If using API key authentication with the `--api-key` flag, you must add it directly after the tcld command and before +`namespace update-high-availability`. + + + + + +To restore the default behavior, unselect the option in the Web UI or change `true` to `false` in the CLI command. + ## Disable High Availability (remove a replica) {/* #disable */} To disable High Availability features on a Namespace, remove the replica from that Namespace. Removing a replica diff --git a/docs/cloud/high-availability/failovers/index.mdx b/docs/cloud/high-availability/failovers/index.mdx index 70d6f98cf4..e4af4c6421 100644 --- a/docs/cloud/high-availability/failovers/index.mdx +++ b/docs/cloud/high-availability/failovers/index.mdx @@ -24,19 +24,18 @@ When a Namespace with [High Availability](/cloud/high-availability) is disrupted over the Namespace from the primary to the replica. This lets in-flight Workflow Executions continue, new Workflow Executions start, and closed Workflow Executions be inspected, all with minimal interruptions or data loss. -Returning control from the replica to the primary is called a failback. After a Temporal-managed +Returning control from the replica to the primary is called a failback. After an automatic failover, Temporal automatically fails back to the original region once it is healthy, unless you -[opt out](/cloud/high-availability/failovers/manage#after-a-temporal-managed-failover). 
See +[opt out](/cloud/high-availability/failovers/manage#after-an-automatic-failover). See [Failbacks](/cloud/high-availability/failovers/manage#failbacks) for details. ## Automatic failover Temporal Cloud offers managed outage detection and failover to all Namespaces that use High Availability. -Temporal-managed failovers, also known as "automatic failovers," keep your Namespace available without manual -intervention. Temporal aims to both detect the outage and complete a failover in minutes from when the outage began, +These automatic failovers keep your Namespace available without manual intervention. Temporal aims to both detect the outage and complete a failover in minutes from when the outage began, according to the stated [Recovery Time Objective (RTO)](/cloud/rpo-rto). -After a Temporal-managed failover, the Namespace will have a replica in its original region. Once the original region is +After an automatic failover, the Namespace will have a replica in its original region. Once the original region is healthy again, Temporal Cloud automatically performs a [failback](/cloud/high-availability/failovers/manage#failbacks), moving the Namespace back to its original region. @@ -45,8 +44,8 @@ moving the Namespace back to its original region. title="On failover, the replica becomes active and the Namespace endpoint directs access to it." /> -To opt out of Temporal-managed failovers and their RTO, you can -[disable automated failovers](/cloud/high-availability/failovers/manage#disabling-temporal-initiated). +To opt out of automatic failovers and their RTO, you can +[disable automatic failovers](/cloud/high-availability/enable#automatic-failovers). ### Conditions that trigger an automatic failover @@ -65,7 +64,7 @@ regional outage. :::info -The following list gives a general idea of the conditions that trigger a Temporal-managed failover. This is not an +The following list gives a general idea of the conditions that trigger an automatic failover. 
This is not an exhaustive list, and it may change over time. ::: @@ -81,7 +80,7 @@ exhaustive list, and it may change over time. You can also [manually trigger a failover](/cloud/high-availability/failovers/manage#trigger-failover) based on your own monitoring or for failover testing. -Most Namespaces with High Availability are well-served by Temporal-managed failovers. The cases where a manual failover +Most Namespaces with High Availability are well-served by automatic failovers. The cases where a manual failover is warranted are: - **Testing failover or migrating to a new region.** A manual failover is the standard way to exercise your failover @@ -89,7 +88,7 @@ is warranted are: - **An outage that affects only your systems.** If an outage is contained to your application, Workers, or other infrastructure, and Temporal Cloud is not affected, Temporal will not initiate a failover on your behalf. Detect the outage with your own monitoring and trigger a failover yourself. -- **Failing over more aggressively during a regional outage.** Even with Temporal-managed failovers enabled, you can +- **Failing over more aggressively during a regional outage.** Even with automatic failovers enabled, you can trigger a failover yourself if you detect a regional outage before Temporal does. Whichever failover happens first takes effect, and the later one is a no-op. A user-triggered failover does not conflict with Temporal's automatic failover. @@ -124,7 +123,7 @@ The failover process is the same whether it is triggered automatically by Tempor After any failover, whether triggered by you or by Temporal, an event appears in both the [Temporal Cloud Web UI](https://cloud.temporal.io/namespaces) (on the Namespace detail page) and in your audit logs. The audit log entry uses the `"operation": "FailoverNamespace"` event. Temporal Cloud [notifies you via email](/cloud/notifications#admin-notifications) whenever a failover occurs. 
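If audit log entries are exported as one JSON object per line, failover events could be picked out with `jq`. The following is a minimal sketch, not a documented workflow: the temporary file, the `note` field, and the second sample entry are hypothetical, and the real audit log export may be shaped differently. Only the `operation` key and the `FailoverNamespace` value come from the text above.

```bash
# Hypothetical sample log: one JSON object per line. Only the "operation"
# key and the "FailoverNamespace" value come from the text above; the
# "note" field and the second entry are made up for illustration.
audit_file="$(mktemp)"
cat > "$audit_file" <<'EOF'
{"operation": "FailoverNamespace", "note": "hypothetical failover entry"}
{"operation": "UpdateNamespace", "note": "hypothetical unrelated entry"}
EOF

# Print only the failover events, one compact JSON object per line.
jq -c 'select(.operation == "FailoverNamespace")' "$audit_file"
```

With the sample above, only the first entry is printed. Against a real export, the same `select` filter would apply as long as each entry carries a top-level `operation` field, which is an assumption here.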
-After a Temporal-managed failover, Temporal automatically fails back to the original region once the region is healthy, unless you [opt out](/cloud/high-availability/failovers/manage#after-a-temporal-managed-failover). After a user-triggered failover, the Namespace stays in the replica region until a user triggers another failover. See [failback options](/cloud/high-availability/failovers/manage#failbacks) for details. +After an automatic failover, Temporal automatically fails back to the original region once the region is healthy, unless you [opt out](/cloud/high-availability/failovers/manage#after-an-automatic-failover). After a user-triggered failover, the Namespace stays in the replica region until a user triggers another failover. See [failback options](/cloud/high-availability/failovers/manage#failbacks) for details. ## Split-brain scenario diff --git a/docs/cloud/high-availability/failovers/manage.mdx b/docs/cloud/high-availability/failovers/manage.mdx index f204a3d5de..76da3a7766 100644 --- a/docs/cloud/high-availability/failovers/manage.mdx +++ b/docs/cloud/high-availability/failovers/manage.mdx @@ -131,82 +131,54 @@ on-call team is automatically paged to intervene and force the failover to compl Failback behavior depends on whether the failover was automatic or manually triggered. -### After a Temporal-managed failover +### After an automatic failover {/* #after-an-automatic-failover */} -After a Temporal-managed failover, Temporal Cloud automatically fails back to the original region once the region is +After an automatic failover, Temporal Cloud automatically fails back to the original region once the region is healthy. No action is required from you. Follow [Temporal's status page](https://status.temporal.io) for updates on the original region's health. 
If you prefer to manage failback yourself, you have two options: - **Opt out of automatic failback (manage failback manually):** After the automatic failover has completed, - [disable Temporal-managed failovers](#disabling-temporal-initiated) on the Namespace to prevent Temporal from - automatically failing back. When you're ready to return to the original region, - [trigger a failover](#trigger-failover) to that region and then re-enable Temporal-managed failovers. + [disable automatic failovers](/cloud/high-availability/enable#automatic-failovers) on the Namespace to prevent + Temporal from automatically failing back. When you're ready to return to the original region, + [trigger a failover](#trigger-failover) to that region and then re-enable automatic failovers. - **Stay on the new region permanently ("fail forward"):** After the automatic failover has completed, [trigger a failover](#trigger-failover) to the region that is already active. This tells Temporal that you want to - treat the new region as your primary for as long as it's healthy. Temporal-managed automatic failovers remain enabled, + treat the new region as your primary for as long as it's healthy. Automatic failovers remain enabled, so Temporal will still protect you if the new region has an outage. ### After a user-triggered failover -If you triggered a failover yourself during an outage (instead of relying on a Temporal-managed failover), Temporal will +If you triggered a failover yourself during an outage (instead of relying on an automatic failover), Temporal will _not_ automatically fail back for you. You must [trigger a failover](#trigger-failover) back to the original region when it is healthy. Monitor [Temporal's status page](https://status.temporal.io) for updates on region health. -Automatic failback is only available after Temporal-managed (automatic) failovers. +Automatic failback is only available when the most recent failover was automatic. 
### How to check whether your Namespace will be automatically failed back If you are not sure whether your Namespace will be automatically failed back, check the list of failovers in the Temporal Cloud Web UI on your Namespace's detail page: -- If the most recent failover was **Temporal-triggered**, then Temporal will automatically fail back the Namespace when +- If the most recent failover was **automatic**, then Temporal will fail the Namespace back when the original region is healthy. - If the most recent failover was **user-triggered**, then the Namespace will _not_ be automatically failed back. You must trigger the failback yourself. -## Disable Temporal-initiated failovers {/* #disabling-temporal-initiated */} - -When you add a replica to a Namespace, Temporal Cloud automatically fails over the Namespace to its replica in the event -of an outage. _This is the recommended and default option._ - -If you prefer to disable Temporal-initiated failovers and handle your own failovers, follow these instructions: - - - - - -1. Navigate to the Namespace detail page in Temporal Cloud. -1. Choose the "Disable Temporal-initiated failovers" option. - - - - - -To disable Temporal-initiated failovers, run the following command in your terminal: - -``` -tcld namespace update-high-availability \ - --namespace . \ - --disable-auto-failover=true -``` - -If using API key authentication with the `--api-key` flag, you must add it directly after the tcld command and before -`namespace update-high-availability`. - - - - - -To restore the default behavior, unselect the option in the Web UI or change `true` to `false` in the CLI command. - ## Workers and failovers {/* #worker */} Enabling High Availability for Namespaces does not require specific Worker configuration. When a Namespace fails over to the replica, the DNS redirection orchestrated by Temporal ensures that your existing Workers continue to poll the -Namespace without interruption. +Namespace without interruption. 
Temporal Cloud forwards their requests from the passive replica to the active region and +the responses back, so Workers keep running through a failover. + +To route Workers to the passive region's replica, see [How requests reach the replica](/cloud/high-availability/ha-connectivity#how-requests-reach-the-replica). + +To stop forwarding Worker polls to the active region, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). + +To disable automatic failovers, see [Enable or disable automatic failovers](/cloud/high-availability/enable#automatic-failovers). When a Namespace fails over to a replica in a different region, Workers will be communicating cross-region. diff --git a/docs/cloud/high-availability/ha-connectivity.mdx b/docs/cloud/high-availability/ha-connectivity.mdx index 33928d6ace..0dd79c3262 100644 --- a/docs/cloud/high-availability/ha-connectivity.mdx +++ b/docs/cloud/high-availability/ha-connectivity.mdx @@ -75,6 +75,24 @@ Consider a Namespace replicated across `us-east-1` (initially active) and `us-we The Namespace Endpoint moves with the active region via an updated CNAME — no Client changes required. The Regional Endpoints do not change their targets on failover: each continues to route to the replica that lives in its region. +## How requests reach the replica {/* #how-requests-reach-the-replica */} + +A request can reach the passive replica in three ways: + +- **Through the passive region's Regional Endpoint.** A [Regional Endpoint](#regional-endpoint) is pinned to its region, so the Regional Endpoint of the region that currently holds the passive replica connects to the passive replica. +- **Through a PrivateLink or Private Service Connect endpoint in the passive region.** A VPC Endpoint or PSC endpoint in the passive region routes to the passive replica. +- **Through the Namespace Endpoint during a failover.** When a Namespace fails over, two things happen in parallel: + 1. 
The replica becomes active, and the former active becomes a replica. + 2. The Namespace Endpoint changes to point at the replica's region, and the Worker re-resolves the Namespace Endpoint to connect to the new active region via DNS. + + If #1 completes before #2, a Worker that was connected to the former active before the failover will stay connected to it even after it becomes the replica. The Worker will change to point at the new active when the DNS changes propagate and the Client re-resolves DNS (typically 30 seconds, though up to 5 minutes, bounded by Temporal Cloud's maximum connection lifetime). + +By default, Temporal Cloud transparently forwards any request that reaches the passive replica to the active region, and the response back. + +To learn what forwarding does, see [Request forwarding](/cloud/high-availability/#request-forwarding). + +To stop forwarding Worker polls on a Namespace, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). + ## How to use PrivateLink with High Availability features :::tip diff --git a/docs/cloud/high-availability/index.mdx b/docs/cloud/high-availability/index.mdx index 50d60b5707..69b2bfb115 100644 --- a/docs/cloud/high-availability/index.mdx +++ b/docs/cloud/high-availability/index.mdx @@ -3,7 +3,7 @@ id: index title: High Availability sidebar_label: High Availability slug: /cloud/high-availability -description: Temporal Cloud's Namespace with High Availability features offers automated failover, synchronized data, and replication for workloads requiring disaster-tolerant deployment and 99.99% uptime. +description: Temporal Cloud's Namespace with High Availability features offers automatic failover, synchronized data, and replication for workloads requiring disaster-tolerant deployment and 99.99% uptime. 
tags: - Temporal Cloud - Production @@ -59,7 +59,7 @@ High Availability features extend Temporal Cloud's replication across regions an - **Real-time replication** — Temporal replicates your Namespace across distant regions or cloud providers with no performance impact to your Workers or Workflows. - **Automatic failover with 20-minute RTO** — Temporal manages failover with a 20-minute [RTO](/cloud/rpo-rto). You can also [trigger failover](/cloud/high-availability/failovers) manually at any time, for example for testing. -- **Transparent DNS routing** — On failover, DNS reroutes your [Namespace Endpoint](/cloud/namespaces#access-namespaces) to the active region. Requests that reach the replica are forwarded to the active region automatically. +- **Transparent DNS routing** — On failover, DNS reroutes your [Namespace Endpoint](/cloud/namespaces#access-namespaces) to the active region. [Requests that reach the replica are forwarded to the active region automatically](#request-forwarding). - **Sub-1-minute RPO** — In a failover during an outage, the [Recovery Point Objective](/cloud/rpo-rto) is under one minute. - **Real-time lag monitoring** — Monitor your Namespace's replication lag in real time to understand your current RPO. - **Conflict resolution** — If the two regions are not fully in sync at the time of failover, Temporal's conflict resolution process reconciles discrepancies and ensures data integrity. @@ -93,6 +93,18 @@ Using physically separated regions improves the fault tolerance of your applicat ::: +## Request forwarding {/* #request-forwarding */} + +A Namespace with High Availability features replicates across two regions, with one replica active and one passive at any moment. The active replica accepts reads and writes; the passive replica receives replicated state asynchronously and stands ready for failover. 
+ +When a request reaches the passive replica — for example, through the passive region's Regional Endpoint — Temporal Cloud forwards the request transparently to the active replica and the response back to the Worker. This allows Workers and Clients to connect to the passive region during healthy times, and when an outage hits, start processing Workflows immediately after a failover. + +Forwarding adds a cross-region hop, so requests that travel through the passive replica complete with higher average latency than requests that reach the active replica directly. + +To route Workers to the passive region's replica, see [How requests reach the replica](/cloud/high-availability/ha-connectivity#how-requests-reach-the-replica). + +To disable passive region replica forwarding, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). + ## Service levels and recovery objectives Namespaces using High Availability have a 99.99% [uptime SLA](/cloud/sla) with sub-1-minute [RPO](/cloud/rpo-rto) and 20-minute [RTO](/cloud/rpo-rto). For detailed information: diff --git a/docs/cloud/rto-rpo.mdx b/docs/cloud/rto-rpo.mdx index 505381ef88..a228d38472 100644 --- a/docs/cloud/rto-rpo.mdx +++ b/docs/cloud/rto-rpo.mdx @@ -38,9 +38,9 @@ features the Namespace has enabled. ## RTO and RPO summary The following table summarizes the RTO and RPO targets for each type of outage. These targets apply to Namespaces that -have Temporal-initiated failovers enabled, which is the default. Temporal-initiated failovers are triggered by +have automatic failovers enabled, which is the default. Automatic failovers are triggered by Temporal's tooling and on-call engineers without user action. Users can always initiate a failover independently. In an -outage, a user-initiated failover will not cancel out or reverse a Temporal-initiated failover. +outage, a user-initiated failover will not cancel out or reverse an automatic failover. 
These targets are for unplanned cloud outages and do not apply to user-initiated failovers during healthy periods, such as DR drills. Read about [triggering a failover](/cloud/high-availability/failovers/manage#trigger-failover) to see how a Namespace failover @@ -55,7 +55,7 @@ performs during healthy periods. :::tip -Temporal highly recommends keeping Temporal-initiated failovers enabled. When Temporal-initiated failovers are +Temporal highly recommends keeping automatic failovers enabled. When automatic failovers are _disabled,_ Temporal Cloud cannot set an RPO and RTO for that Namespace, because it cannot control when or if the user will trigger a failover. @@ -112,7 +112,7 @@ that case, Namespaces in the region may be impacted, including those using When an AZ fails, Temporal may also trigger a failover on Namespaces that have High Availability enabled, as a precaution in case the outage scope expands. You can opt out of this behavior by -[disabling Temporal-managed failovers](/cloud/high-availability/failovers/manage#disabling-temporal-initiated) on the Namespace. +[disabling automatic failovers](/cloud/high-availability/enable#automatic-failovers) on the Namespace. ::: @@ -151,7 +151,7 @@ Namespaces in real-world incidents. #### RTO and RPO -When using Same-region Replication, Multi-region Replication, or Multi-cloud Replication for Temporal-managed failover: +When using Same-region Replication, Multi-region Replication, or Multi-cloud Replication for automatic failover: - **RTO under 20 minutes.** Temporal detects the disruption and fails the Namespace over to its replica cell. - **RPO under 1 minute.** Asynchronous replication keeps the replica close to the active cell. @@ -171,7 +171,7 @@ fail over and continue serving Workflows. Same-region Replication does not prote the replica resides in the same region. 
**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation only for Namespaces that have Multi-region Replication -or Multi-cloud Replication enabled with Temporal-managed failovers — in those cases, Temporal can mitigate the outage. +or Multi-cloud Replication enabled with automatic failovers — in those cases, Temporal can mitigate the outage. For Namespaces without these features, a Cloud Region outage is excluded from the SLA calculation, as it is beyond Temporal's control to mitigate. @@ -184,7 +184,7 @@ Temporal Cloud's regional failover kept customer Namespaces running. #### RTO and RPO -When using Multi-region Replication or Multi-cloud Replication for Temporal-managed failover: +When using Multi-region Replication or Multi-cloud Replication for automatic failover: - **RTO under 20 minutes.** Temporal detects the regional disruption and fails the Namespace over to its replica in another region. @@ -209,7 +209,7 @@ is potentially affected. entirely, so the Namespace can fail over even when an entire cloud provider goes down. **SLA inclusion:** Included in the [SLA](/cloud/sla) calculation only for Namespaces that have Multi-cloud Replication -enabled with Temporal-managed failovers — in those cases, Temporal can mitigate the outage. For Namespaces without this +enabled with automatic failovers — in those cases, Temporal can mitigate the outage. For Namespaces without this feature, a cloud-wide outage is excluded from the SLA calculation, as it is beyond Temporal's control to mitigate. Cloud-wide outages are the rarest category, but they @@ -218,7 +218,7 @@ keep Namespaces running through such events. #### RTO and RPO -When using Multi-cloud Replication for Temporal-managed failover: +When using Multi-cloud Replication for automatic failover: - **RTO under 20 minutes.** Temporal detects the cloud-wide disruption and fails the Namespace over to its replica in a different cloud provider. 
@@ -271,11 +271,11 @@ failing over any other regional dependencies your application relies on, such as To achieve the lowest possible recovery times, Temporal recommends that you: -- Keep Temporal-initiated failovers enabled on your Namespace (the default) +- Keep automatic failovers enabled on your Namespace (the default) - Invest in a process to detect outages and trigger a manual failover. -You can trigger manual failovers on your Namespaces even if Temporal-initiated failovers are enabled. There are several -benefits to combining a manual failover process with Temporal-initiated failovers: +You can trigger manual failovers on your Namespaces even if automatic failovers are enabled. There are several +benefits to combining a manual failover process with automatic failovers: - You can detect outages that Temporal doesn't. In the cloud, regional outages don't affect all services equally. It's possible that Temporal--and the services it depends on--are unaffected by the outage, even while your Workers or other @@ -294,7 +294,7 @@ benefits to combining a manual failover process with Temporal-initiated failover that fails over more eagerly than Temporal does. For example, you could trigger a manual failover at the first sign of a possible disruption, before knowing whether there's a true regional outage. -- Even if you have robust tooling to detect an outage and trigger a failover, leaving Temporal-initiated failovers +- Even if you have robust tooling to detect an outage and trigger a failover, leaving automatic failovers enabled provides a "safety net" in case your automation misses an outage. It also gives Temporal leeway to preemptively fail over your Namespace if we detect that it may be disrupted soon, e.g., by a rolling failure that has impacted other Namespaces but not yours, yet. @@ -310,7 +310,7 @@ and apply in different situations. | How is it measured? | The achieved recovery time is measured in terms of minutes per outage. 
| The achieved service error rate is measured in terms of error rate per month. | | How is the calculation performed? | The achieved recovery time in a given outage is the total time between when a disruption to a Namespace began and when the Namespace was restored to full functionality, either after a failover to a healthy region or after the outage has been mitigated. | Temporal measures the percentage of requests to Temporal Cloud that fail, and applies a [formula](/cloud/sla) to get the final percentage for the month. | | Do partial degradations count? | Most outages contain periods of **partial degradation** where some percentage of Namespace operations fail while the rest complete as normal. When they disrupt a Namespace, periods of partial degradation count in the calculation of the recovery time. | Partial degradations only partially count for the service error rate calculation. A 5-minute window with a 10% error rate would count less than a 5-minute window with a 100% error rate. | -| What is excluded? | For partial degradations, what counts as a disruption to a Namespace is subject to Temporal's expert judgment, but a good rule of thumb is a service error rate >=10%. | We exclude outages that are out of Temporal's control to mitigate, e.g., a failure of the underlying cloud provider infrastructure that affects a Namespace without High Availability and Temporal-initiated failovers enabled. If a Namespace has the relevant High Availability feature and has Temporal-initiated failovers enabled, then Temporal can act to mitigate the outage and it does usually count against the SLA. Full exclusions on the [SLA page](/cloud/sla). | +| What is excluded? | For partial degradations, what counts as a disruption to a Namespace is subject to Temporal's expert judgment, but a good rule of thumb is a service error rate >=10%. 
| We exclude outages that are out of Temporal's control to mitigate, e.g., a failure of the underlying cloud provider infrastructure that affects a Namespace without High Availability and automatic failovers enabled. If a Namespace has the relevant High Availability feature and has automatic failovers enabled, then Temporal can act to mitigate the outage and it does usually count against the SLA. Full exclusions on the [SLA page](/cloud/sla). | The following examples illustrate the RTO and SLA calculations for different types of outages in a regional outage. These hypothetical Namespaces are based on actual Temporal Cloud performance in a From 508d477fb7438e2141c9f64d45d49ba55f191011 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Thu, 11 Jun 2026 10:43:34 -0700 Subject: [PATCH 2/9] First draft of deployment models page --- .../high-availability/deployment-models.mdx | 215 ++++++++++++++++++ docs/cloud/high-availability/enable.mdx | 6 +- .../high-availability/failovers/manage.mdx | 2 + .../high-availability/ha-connectivity.mdx | 4 + docs/cloud/high-availability/index.mdx | 4 + sidebars.js | 1 + 6 files changed, 230 insertions(+), 2 deletions(-) create mode 100644 docs/cloud/high-availability/deployment-models.mdx diff --git a/docs/cloud/high-availability/deployment-models.mdx b/docs/cloud/high-availability/deployment-models.mdx new file mode 100644 index 0000000000..0df455dd80 --- /dev/null +++ b/docs/cloud/high-availability/deployment-models.mdx @@ -0,0 +1,215 @@ +--- +id: deployment-models +title: Deployment models for High Availability +sidebar_label: Deployment models for High Availability +slug: /cloud/high-availability/deployment-models +description: Choose a Worker deployment model — Active / Passive, Active / Active, or Dual Active — for a Namespace with Temporal Cloud High Availability features, and understand how the rest of your architecture fails over with it. 
+tags: + - Temporal Cloud + - High Availability +keywords: + - high availability + - failover + - worker deployment + - active passive + - active active + - hot standby + - temporal cloud +--- + +A Namespace with [High Availability features](/cloud/high-availability) fails over the Temporal Service for you, but it can't move the rest of your architecture. +On failover, Temporal Cloud promotes the replica to active and reroutes your [Namespace Endpoint](/cloud/namespaces#access-namespaces) — but your Workers, Workflow starters, Codec Servers, and the external systems your Workflows depend on each need their own region story. + +The biggest factor in your real-world [Recovery Time](/cloud/rpo-rto) is your **Worker deployment model**: where your Worker fleets run and which region processes Workflows at any given moment. +This page describes the common models and how the rest of your architecture fits around each one. + +A useful property to keep in mind: **Workers don't need to run in the same region as the active replica.** A Worker fleet in one region can poll a Namespace that is active in another, because the Namespace Endpoint routes to whichever region is currently active. You decide where compute runs; Temporal Cloud routes to the active replica. + +## Terminology {/* #terminology */} + +This page uses two terms for the regions of an HA Namespace: + +- **Primary region** — the region you prefer to run active in during normal operation. +- **Secondary region** — the other region in the Namespace. It holds the passive replica during normal operation and becomes active after a failover. + +A Temporal Cloud HA Namespace has exactly **one active replica at a time**, in one region. +The other region holds a passive replica that receives replicated state asynchronously. +This is true for every model below — including the ones that present an "active/active" face to the rest of your stack. 
+ +## What needs a failover story {/* #what-needs-a-failover-story */} + +Beyond the Namespace itself, these components live in your environment and have to be planned for: + +- **Workers** — execute your Workflows and Activities. Covered in depth below. +- **Workflow starters and Clients** — start and signal Workflows. Point these at the Namespace Endpoint so they follow the active region automatically. +- **Codec Servers** — encode and decode payloads for the Web UI and CLI. These sit alongside the region a Worker or operator connects through. +- **Proxies between your Workers and Temporal Cloud** — any forward proxy or mTLS terminator in the connection path. +- **External databases and queues** — the systems your Activities read and write. + +Codec Servers and proxies sit in the request path between your Workers and Temporal Cloud, so they must be available wherever your Workers connect from. +The rule of thumb for every model: a Codec Server and any proxy must be running in each region where Workers actively connect, and must scale with the active region. +The [Worker deployment patterns](#worker-deployment-patterns) below note when each piece needs to be running ahead of time versus scaled up after a failover. + +## Worker deployment patterns {/* #worker-deployment-patterns */} + +The models trade off three things: **Recovery Time** after an outage, **steady-state cost**, and **operational complexity**. +None is one-size-fits-all. Start with Active / Passive and move toward the others only when your Recovery Time or latency requirements call for it. + +### Active / Passive (recommended) {/* #active-passive */} + +Workers are active in only one region at a time. This is the most common model and the recommended starting point. + +It assumes the rest of your stack is also single-region-active at any moment: traffic routing, databases, and queues are all active in one region and fail over to the secondary region together with your Workers. 
+You don't have to reason about two regions mutating the same external state at once. + +Codec Servers and proxies follow the active region. They run in whichever region currently holds your active Workers and scale up in the secondary region as part of a failover. + +Active / Passive comes in two flavors that trade cost against Recovery Time. + +#### Active / Cold (most common) {/* #active-cold */} + +Workers run **only in the primary region**. The secondary region runs the passive replica but none of your Workers. + +```mermaid +graph LR + subgraph primary["Primary region"] + W1["Worker fleet"] + A1["Active replica"] + end + subgraph secondary["Secondary region"] + W2["Worker fleet<br/>(brought up after failover)"] + A2["Passive replica"] + end + W1 --> A1 + A1 -. replicates .-> A2 + W2 -.-> A2 +``` + +On failover, the Namespace is ready in the secondary region immediately, but your Workers start there from nothing — a "cold" start. +Recovery Time includes container or VM startup, image pulls, and application warm-up before throughput returns to normal. +Because Workers need them, your Codec Server and any Worker proxies also have to be scaled up in the secondary region after the failover. + +This is the simplest model to operate — in steady state it looks like a single-region deployment — and the cheapest, since you pay for one Worker fleet. The trade-off is the highest Recovery Time of the models here, so invest in tested automation to bring up the secondary-region fleet quickly. + +#### Active / Hot {/* #active-hot */} + +Workers are deployed in **both regions**, but only the active region processes Workflows. The secondary-region Workers stay connected and warm, yet idle. + +```mermaid +graph LR + subgraph primary["Primary region"] + W1["Worker fleet<br/>(processing)"] + A1["Active replica"] + end + subgraph secondary["Secondary region"] + W2["Worker fleet<br/>(connected, idle)"] + A2["Passive replica"] + end + W1 --> A1 + W2 --> A2 + A1 -. replicates .-> A2 +``` + +You achieve this by disabling forwarding for Worker polls and connecting each fleet to its local replica through a [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) or [VPC Endpoint](/cloud/high-availability/ha-connectivity). +With forwarding disabled, polls that reach the passive replica are not sent to the active region, so the idle fleet does no work and adds no cross-region overhead. + +This makes failover a breeze: the Namespace failover and the Worker "failover" happen together and automatically, with no DNS wait and no cold start. The previously idle fleet begins processing the instant the secondary region becomes active, so this model achieves the lowest Recovery Time. + +The trade-off is cost: you pay for idle Worker capacity in the secondary region at all times. For the same reason, your Codec Servers and proxies must run in **both regions continuously**, not just after a failover. + +Active / Hot is recommended when you need the lowest possible Recovery Time and can absorb the cost of idle capacity. + +:::tip Disabling forwarding + +To stop forwarding Worker polls to the active region, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). + +::: + +### Active / Active {/* #active-active */} + +In this model, Workers run in **both regions and process Workflows at the same time**, with forwarding left enabled (the default). + +A Temporal Cloud Namespace is not "active/active" in the database sense — it still has a single active replica in one region. +But because the passive replica transparently forwards requests to and from the active region, the Namespace fits into a broader active/active architecture: a Worker fleet in either region can process Workflows, and the secondary fleet's polls are forwarded across regions to the active replica.
+ +```mermaid +graph LR + subgraph primary["Primary region"] + W1["Worker fleet"] + A1["Active replica"] + end + subgraph secondary["Secondary region"] + W2["Worker fleet"] + A2["Passive replica"] + end + W1 --> A1 + W2 --> A2 + A2 -- "forwards polls" --> A1 + A1 -. replicates .-> A2 +``` + +This is a practical way to get a low Recovery Time while balancing cost. You can run roughly half your fleet in each region, then add capacity to the surviving region during an outage to reach full throughput. +Unlike Active / Cold, Workflows keep processing in the surviving region while you scale up, so there is no cold-start gap. + +The trade-offs are integration and latency: + +- **Synchronizing with external systems is harder.** With Workers active in both regions, external systems such as databases and queues are trickier to keep consistent than in a single-region-active model. +- **The secondary region pays cross-region latency.** Polls from the secondary-region fleet are forwarded to the active replica, so that region sees higher latency. This can be a problem for latency-sensitive Workflows. + +As with the other multi-region models, Codec Servers and proxies must run in both regions at all times. + +### Dual Active (Multi-Active) {/* #dual-active */} + +Some architectures need low-latency or region-bound data in *each* region at once. You can achieve this with **two Namespaces whose active and passive regions overlap**: each region holds one Namespace's active replica and the other Namespace's passive replica. + +```mermaid +graph LR + subgraph r1["Region 1"] + WA["App A Workers"] + A1a["Namespace A<br/>active"] + B1p["Namespace B<br/>passive"] + end + subgraph r2["Region 2"] + WB["App B Workers"] + B2a["Namespace B<br/>active"] + A2p["Namespace A<br/>passive"] + end + WA --> A1a + WB --> B2a + A1a -. replicates .-> A2p + B2a -. replicates .-> B1p +``` + +Each Namespace serves low-latency requests or a regionally-bound database in its own active region, and fails over to the other region during an outage. You can extend the same idea across more than two regions. + +Workloads on Temporal rarely need this. It only pays off when a workload is *both* extremely latency-sensitive across several same-continent regions *and* needs multi-region disaster recovery, which is an uncommon combination. +We recommend treating each Namespace as an **independent Active / Passive deployment**, with its own Worker pools and failover procedures, rather than coupling them. + +## Choose a deployment model {/* #choose */} + +| Model | Recovery Time | Steady-state cost | Best when | +| --- | --- | --- | --- | +| **Active / Cold** | Highest (cold start in secondary) | Lowest (one fleet) | You're adopting HA and want the simplest operating model. | +| **Active / Hot** | Lowest (warm, no DNS wait) | Higher (idle fleet) | You need the lowest Recovery Time and your data plane is pinned to one region at a time. | +| **Active / Active** | Low (surviving region keeps processing) | Higher (two live fleets) | You want low Recovery Time at balanced cost and can tolerate cross-region latency on the secondary region. | +| **Dual Active** | Low (per Namespace) | Highest (two fleets, two Namespaces) | You truly need low-latency, region-bound data in each region. Rare. | + +## The rest of your architecture {/* #rest-of-architecture */} + +The Worker model sets the pattern; these supporting pieces follow it. + +- **Workflow starters and Clients.** Point Clients at the Namespace Endpoint so they follow the active region automatically with no configuration change on failover. Use a [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) only when you deliberately need to pin a Client to a region.
+- **Codec Servers and proxies.** Anything in the connection path between your Workers and Temporal Cloud must be reachable from every region where Workers connect. In Active / Cold, scale them up in the secondary region as part of a failover; in the hot and active/active models, run them in both regions at all times. +- **External databases and queues.** These remain your responsibility and the right approach depends on your Worker model: a single-region-active datastore pairs naturally with Active / Passive, while running Workers active in both regions raises consistency questions you must design for. Detailed guidance is out of scope for this page. + +## Related {/* #related */} + +To add a replica and turn on High Availability features, see [Enable and manage High Availability](/cloud/high-availability/enable). + +To choose between the Namespace Endpoint and Regional Endpoints and to set up private connectivity, see [Connectivity for High Availability](/cloud/high-availability/ha-connectivity). + +To stop forwarding Worker polls to the active region for the Active / Hot model, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). + +To trigger and manage failovers, see [Failovers](/cloud/high-availability/failovers). + +To understand the recovery objectives each model is measured against, see [RPO and RTO](/cloud/rpo-rto). diff --git a/docs/cloud/high-availability/enable.mdx b/docs/cloud/high-availability/enable.mdx index dc5e7a8d61..f403f3a72f 100644 --- a/docs/cloud/high-availability/enable.mdx +++ b/docs/cloud/high-availability/enable.mdx @@ -137,6 +137,8 @@ Client APIs (Start, Signal, Cancel, Terminate, Query, and the equivalent Activit Same-region replicas are not affected by this setting. +To deploy Worker fleets in both regions that stay idle in the passive region until failover, see [Active / Hot](/cloud/high-availability/deployment-models#active-hot). 
+ :::info To see which endpoints route to which replica, see [How requests reach the replica](/cloud/high-availability/ha-connectivity#how-requests-reach-the-replica). @@ -152,10 +154,10 @@ Use the [`temporal cloud namespace ha update`](/cli/command-reference/cloud/name ```bash temporal cloud namespace ha update \ --namespace . \ - --disable-passive-poller-forwarding true + --passive-poller-forwarding disable ``` -Set the flag to `false` to re-enable forwarding. +Set the flag to `enable` to re-enable forwarding. ### Set the forwarding behavior with the Cloud Ops API {/* #set-forwarding-curl */} diff --git a/docs/cloud/high-availability/failovers/manage.mdx b/docs/cloud/high-availability/failovers/manage.mdx index 76da3a7766..7d126ea673 100644 --- a/docs/cloud/high-availability/failovers/manage.mdx +++ b/docs/cloud/high-availability/failovers/manage.mdx @@ -174,6 +174,8 @@ the replica, the DNS redirection orchestrated by Temporal ensures that your exis Namespace without interruption. Temporal Cloud forwards their requests from the passive replica to the active region and the responses back, so Workers keep running through a failover. +To choose where your Worker fleets run across regions, see [Deployment models for High Availability](/cloud/high-availability/deployment-models). + To route Workers to the passive region's replica, see [How requests reach the replica](/cloud/high-availability/ha-connectivity#how-requests-reach-the-replica). To stop forwarding Worker polls to the active region, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). 
diff --git a/docs/cloud/high-availability/ha-connectivity.mdx b/docs/cloud/high-availability/ha-connectivity.mdx index 0dd79c3262..49571f106c 100644 --- a/docs/cloud/high-availability/ha-connectivity.mdx +++ b/docs/cloud/high-availability/ha-connectivity.mdx @@ -93,6 +93,10 @@ To learn what forwarding does, see [Request forwarding](/cloud/high-availability To stop forwarding Worker polls on a Namespace, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). +To run Worker fleets in both regions that rely on this forwarding, see [Active / Active](/cloud/high-availability/deployment-models#active-active). + +To keep passive-region Workers idle until failover by disabling this forwarding, see [Active / Hot](/cloud/high-availability/deployment-models#active-hot). + ## How to use PrivateLink with High Availability features :::tip diff --git a/docs/cloud/high-availability/index.mdx b/docs/cloud/high-availability/index.mdx index 69b2bfb115..22471a6ea0 100644 --- a/docs/cloud/high-availability/index.mdx +++ b/docs/cloud/high-availability/index.mdx @@ -105,6 +105,10 @@ To route Workers to the passive region's replica, see [How requests reach the re To disable passive region replica forwarding, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). +To run Worker fleets in both regions that rely on this forwarding, see [Active / Active](/cloud/high-availability/deployment-models#active-active). + +To keep passive-region Workers idle until failover by disabling this forwarding, see [Active / Hot](/cloud/high-availability/deployment-models#active-hot). + ## Service levels and recovery objectives Namespaces using High Availability have a 99.99% [uptime SLA](/cloud/sla) with sub-1-minute [RPO](/cloud/rpo-rto) and 20-minute [RTO](/cloud/rpo-rto). 
For detailed information: diff --git a/sidebars.js b/sidebars.js index bf7a032fcd..4230868041 100644 --- a/sidebars.js +++ b/sidebars.js @@ -1001,6 +1001,7 @@ module.exports = { }, items: [ 'cloud/high-availability/enable', + 'cloud/high-availability/deployment-models', 'cloud/high-availability/monitoring', { type: 'category', From d1a61cc5d2888008195524d31dd38e30ab77bde3 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Thu, 11 Jun 2026 11:02:47 -0700 Subject: [PATCH 3/9] edits to deployment models --- .../high-availability/deployment-models.mdx | 24 ++++++++++--------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/docs/cloud/high-availability/deployment-models.mdx b/docs/cloud/high-availability/deployment-models.mdx index 0df455dd80..215f0db9ab 100644 --- a/docs/cloud/high-availability/deployment-models.mdx +++ b/docs/cloud/high-availability/deployment-models.mdx @@ -18,23 +18,25 @@ keywords: --- A Namespace with [High Availability features](/cloud/high-availability) fails over the Temporal Service for you, but it can't move the rest of your architecture. -On failover, Temporal Cloud promotes the replica to active and reroutes your [Namespace Endpoint](/cloud/namespaces#access-namespaces) — but your Workers, Workflow starters, Codec Servers, and the external systems your Workflows depend on each need their own region story. +On failover, Temporal Cloud promotes the replica to active and reroutes your [Namespace Endpoint](/cloud/namespaces#access-namespaces) — but your Workers, Workflow starters, Codec Servers, and the external systems your Workflows depend on each need their own failover story. -The biggest factor in your real-world [Recovery Time](/cloud/rpo-rto) is your **Worker deployment model**: where your Worker fleets run and which region processes Workflows at any given moment. -This page describes the common models and how the rest of your architecture fits around each one. 
- -A useful property to keep in mind: **Workers don't need to run in the same region as the active replica.** A Worker fleet in one region can poll a Namespace that is active in another, because the Namespace Endpoint routes to whichever region is currently active. You decide where compute runs; Temporal Cloud routes to the active replica. +A critical piece of the [Recovery Time](/cloud/rpo-rto) achieved in a real-world outage is the **Worker deployment model**: where Worker fleets run and which region (or regions) processes Workflows at any given moment. +This page describes common patterns for deploying Workers and how the rest of your architecture fits into an overall High Availability strategy. ## Terminology {/* #terminology */} -This page uses two terms for the regions of an HA Namespace: +This page uses two terms for the regions of a Namespace with High Availability: + +- **Primary region** — the region where the Namespace is active during normal operation, also called the "preferred region." +- **Secondary region** — the region the Namespace fails over to. It holds a replica and is passive during normal operation. + +:::info -- **Primary region** — the region you prefer to run active in during normal operation. -- **Secondary region** — the other region in the Namespace. It holds the passive replica during normal operation and becomes active after a failover. +**Namespaces are always active / passive, but can support an Active / Active pattern.** A Temporal Cloud Namespace with High Availability has exactly one active region at a time. The other region holds a replica that passively receives replicated state. However, Temporal Cloud Namespaces can still fit into a broader "Active / Active" strategy, as described below. + +::: -A Temporal Cloud HA Namespace has exactly **one active replica at a time**, in one region. -The other region holds a passive replica that receives replicated state asynchronously. 
-This is true for every model below — including the ones that present an "active/active" face to the rest of your stack. +A useful property to keep in mind: **Workers don't need to run in the same region as the active replica.** A Worker fleet in one region can poll a Namespace that is active in another. ## What needs a failover story {/* #what-needs-a-failover-story */} From 3f98c9012061cbafec0529f4565408f4563397f2 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Thu, 11 Jun 2026 11:46:38 -0700 Subject: [PATCH 4/9] more edits to deployment models --- .../high-availability/deployment-models.mdx | 409 +++++++++++++----- 1 file changed, 308 insertions(+), 101 deletions(-) diff --git a/docs/cloud/high-availability/deployment-models.mdx b/docs/cloud/high-availability/deployment-models.mdx index 215f0db9ab..a65d6b173d 100644 --- a/docs/cloud/high-availability/deployment-models.mdx +++ b/docs/cloud/high-availability/deployment-models.mdx @@ -3,7 +3,7 @@ id: deployment-models title: Deployment models for High Availability sidebar_label: Deployment models for High Availability slug: /cloud/high-availability/deployment-models -description: Choose a Worker deployment model — Active / Passive, Active / Active, or Dual Active — for a Namespace with Temporal Cloud High Availability features, and understand how the rest of your architecture fails over with it. +description: Choose a Worker deployment model — Active / Passive, Active / Active, or Dual Active — for a Namespace with Temporal Cloud High Availability features, and understand how the rest of the architecture fails over with it. tags: - Temporal Cloud - High Availability @@ -17,11 +17,11 @@ keywords: - temporal cloud --- -A Namespace with [High Availability features](/cloud/high-availability) fails over the Temporal Service for you, but it can't move the rest of your architecture. 
-On failover, Temporal Cloud promotes the replica to active and reroutes your [Namespace Endpoint](/cloud/namespaces#access-namespaces) — but your Workers, Workflow starters, Codec Servers, and the external systems your Workflows depend on each need their own failover story. +A Namespace with [High Availability features](/cloud/high-availability) fails over the Temporal Service automatically, but it does not move the rest of the architecture. +On failover, Temporal Cloud promotes the replica to active and reroutes the [Namespace Endpoint](/cloud/namespaces#access-namespaces). Workers, Workflow starters, Codec Servers, and the external systems that Workflows depend on each need their own failover story. A critical piece of the [Recovery Time](/cloud/rpo-rto) achieved in a real-world outage is the **Worker deployment model**: where Worker fleets run and which region (or regions) processes Workflows at any given moment. -This page describes common patterns for deploying Workers and how the rest of your architecture fits into an overall High Availability strategy. +This page describes common patterns for deploying Workers and how the rest of the architecture fits into an overall High Availability strategy. ## Terminology {/* #terminology */} @@ -40,86 +40,192 @@ A useful property to keep in mind: **Workers don't need to run in the same regio ## What needs a failover story {/* #what-needs-a-failover-story */} -Beyond the Namespace itself, these components live in your environment and have to be planned for: +Beyond the Namespace itself, these components live in the application environment and must be planned for: -- **Workers** — execute your Workflows and Activities. Covered in depth below. -- **Workflow starters and Clients** — start and signal Workflows. Point these at the Namespace Endpoint so they follow the active region automatically. -- **Codec Servers** — encode and decode payloads for the Web UI and CLI. 
These sit alongside the region a Worker or operator connects through. -- **Proxies between your Workers and Temporal Cloud** — any forward proxy or mTLS terminator in the connection path. -- **External databases and queues** — the systems your Activities read and write. +- **Workers** — execute Workflows and Activities. +- **Workflow starters and Clients** — start and signal Workflows. +- **Codec Servers** — encode and decode payloads for Workers, the Web UI, and the CLI. +- **Proxies between Workers and Temporal Cloud** — any forward proxy or mTLS terminator in the connection path. +- **External databases and queues** — the systems that Activities read and write. -Codec Servers and proxies sit in the request path between your Workers and Temporal Cloud, so they must be available wherever your Workers connect from. -The rule of thumb for every model: a Codec Server and any proxy must be running in each region where Workers actively connect, and must scale with the active region. +Some systems must be active wherever Workers are running (for example, Codec Servers), while others might follow a different failover sequence (for example, external databases). The [Worker deployment patterns](#worker-deployment-patterns) below note when each piece needs to be running ahead of time versus scaled up after a failover. ## Worker deployment patterns {/* #worker-deployment-patterns */} The models trade off three things: **Recovery Time** after an outage, **steady-state cost**, and **operational complexity**. -None is one-size-fits-all. Start with Active / Passive and move toward the others only when your Recovery Time or latency requirements call for it. +None is one-size-fits-all. Start with Active / Passive and move toward the others only when Recovery Time or latency requirements call for it. 
-### Active / Passive (recommended) {/* #active-passive */} +The diagrams below use a shared visual language: + +- A green border marks the **active** Temporal Cloud replica and the Workers processing against it. +- A muted dashed border marks the **passive** replica; a gold dashed border marks **idle** standby Workers. +- A purple fill marks application-owned systems (Workers, databases, queues). +- A red tint marks the region that is **down** during a failover. -Workers are active in only one region at a time. This is the most common model and the recommended starting point. +### Active / Passive (recommended) {/* #active-passive */} -It assumes the rest of your stack is also single-region-active at any moment: traffic routing, databases, and queues are all active in one region and fail over to the secondary region together with your Workers. -You don't have to reason about two regions mutating the same external state at once. +In an Active / Passive model, Workers are active in only one region at a time. This is the most common model and the recommended starting point. -Codec Servers and proxies follow the active region. They run in whichever region currently holds your active Workers and scale up in the secondary region as part of a failover. +It assumes the surrounding stack is also single-region-active at any moment: traffic routing, databases, and queues are all active in one region and fail over to the secondary region together with the Workers. There is no need to reason about two regions mutating the same external state at once. -Active / Passive comes in two flavors that trade cost against Recovery Time. +Active / Passive has two flavors that trade cost against Recovery Time. #### Active / Cold (most common) {/* #active-cold */} -Workers run **only in the primary region**. The secondary region runs the passive replica but none of your Workers. +Workers run **only in the primary region**. 
The secondary region holds the passive replica but runs none of the application's Workers. + +**Steady state** ```mermaid -graph LR - subgraph primary["Primary region"] - W1["Worker fleet"] - A1["Active replica"] - end - subgraph secondary["Secondary region"] - W2["Worker fleet<br/>(brought up after failover)"] - A2["Passive replica"] - end - W1 --> A1 - A1 -. replicates .-> A2 - W2 -.-> A2 +%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +flowchart LR + classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; + classDef tpassive fill:#59FDA012,stroke:#7C8FB1,stroke-width:1px,stroke-dasharray:4 3; + classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; + classDef wnew fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px,stroke-dasharray:5 3; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + subgraph P["PRIMARY REGION"] + direction TB + W1["Workers<br/>processing"]:::wactive + DB1["External DB / queue"]:::ext + R1["Namespace replica<br/>ACTIVE"]:::tcloud + end + subgraph S["SECONDARY REGION"] + direction TB + R2["Namespace replica<br/>passive"]:::tpassive + end + W1 --> R1 + W1 <--> DB1 + R1 -. replicates .-> R2 ``` -On failover, the Namespace is ready in the secondary region immediately, but your Workers start there from nothing — a "cold" start. +**Failover** + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +flowchart LR + classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; + classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; + classDef wnew fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px,stroke-dasharray:5 3; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef down fill:#ED360E22,stroke:#ED360E,stroke-width:1px,stroke-dasharray:2 2; + subgraph P["PRIMARY REGION — OUTAGE"] + direction TB + W1["Workers<br/>unavailable"]:::down + DB1["External DB / queue"]:::down + R1["Namespace replica"]:::down + end + subgraph S["SECONDARY REGION"] + direction TB + W2["Workers<br/>cold start"]:::wnew + DB2["External DB / queue<br/>promoted"]:::ext + R2["Namespace replica<br/>ACTIVE"]:::tcloud + end + W2 --> R2 + W2 <--> DB2 + style P fill:#ED360E14,stroke:#ED360E +``` + +On failover, the Namespace is active in the secondary region immediately, but the Workers there start from nothing, a "cold" start. Recovery Time includes container or VM startup, image pulls, and application warm-up before throughput returns to normal. -Because Workers need them, your Codec Server and any Worker proxies also have to be scaled up in the secondary region after the failover. -This is the simplest model to operate — in steady state it looks like a single-region deployment — and the cheapest, since you pay for one Worker fleet. The trade-off is the highest Recovery Time of the models here, so invest in tested automation to bring up the secondary-region fleet quickly. +**Benefits** + +- Simplest model to operate; in steady state it resembles a single-region deployment. +- Lowest steady-state cost: a single Worker fleet. + +**Tradeoffs** + +- Highest Recovery Time of the models here, gated by Worker startup in the secondary region. +- Depends on tested automation to bring up the secondary-region fleet quickly. + +**Component behavior** + +- **Workers** — run only in the primary region; brought up in the secondary region during a failover. +- **Workflow starters and Clients** — run with the Workers; brought up in the secondary region during a failover. +- **Codec Servers and proxies** — run alongside the active Workers; scaled up in the secondary region as part of a failover. +- **External databases and queues** — single-region-active; fail over to the secondary region alongside the Workers. #### Active / Hot {/* #active-hot */} Workers are deployed in **both regions**, but only the active region processes Workflows. The secondary-region Workers stay connected and warm, yet idle.
+This is achieved by disabling forwarding for Worker polls and connecting each fleet to its local replica through a [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) or [VPC Endpoint](/cloud/high-availability/ha-connectivity). +With forwarding disabled, polls that reach the passive replica are not sent to the active region, so the idle fleet does no work and adds no cross-region overhead. + +**Steady state** + ```mermaid -graph LR - subgraph primary["Primary region"] - W1["Worker fleet
(processing)"] - A1["Active replica"] - end - subgraph secondary["Secondary region"] - W2["Worker fleet
(connected, idle)"] - A2["Passive replica"] - end - W1 --> A1 - W2 --> A2 - A1 -. replicates .-> A2 +%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +flowchart LR + classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; + classDef tpassive fill:#59FDA012,stroke:#7C8FB1,stroke-width:1px,stroke-dasharray:4 3; + classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; + classDef widle fill:#7C3AED18,stroke:#FECB2F,stroke-width:1px,stroke-dasharray:4 3; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + subgraph P["PRIMARY REGION"] + direction TB + W1["Workers
processing"]:::wactive + DB1["External DB / queue"]:::ext + R1["Namespace replica
ACTIVE"]:::tcloud + end + subgraph S["SECONDARY REGION"] + direction TB + W2["Workers
connected, idle"]:::widle + DB2["External DB / queue
standby"]:::ext + R2["Namespace replica
passive"]:::tpassive + end + W1 --> R1 + W1 <--> DB1 + W2 -. idle .-> R2 + W2 <--> DB2 + R1 -. replicates .-> R2 ``` -You achieve this by disabling forwarding for Worker polls and connecting each fleet to its local replica through a [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) or [VPC Endpoint](/cloud/high-availability/ha-connectivity). -With forwarding disabled, polls that reach the passive replica are not sent to the active region, so the idle fleet does no work and adds no cross-region overhead. +**Failover** + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +flowchart LR + classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; + classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef down fill:#ED360E22,stroke:#ED360E,stroke-width:1px,stroke-dasharray:2 2; + subgraph P["PRIMARY REGION — OUTAGE"] + direction TB + W1["Workers"]:::down + R1["Namespace replica"]:::down + end + subgraph S["SECONDARY REGION"] + direction TB + W2["Workers
now processing"]:::wactive + DB2["External DB / queue
promoted"]:::ext + R2["Namespace replica
ACTIVE"]:::tcloud + end + W2 --> R2 + W2 <--> DB2 + style P fill:#ED360E14,stroke:#ED360E +``` + +Failover is near-instant: the Namespace failover and the Worker "failover" happen together and automatically, with no DNS wait and no cold start. The previously idle fleet begins processing the moment the secondary region becomes active, so this model achieves the lowest Recovery Time. + +**Benefits** -This makes failover a breeze: the Namespace failover and the Worker "failover" happen together and automatically, with no DNS wait and no cold start. The previously idle fleet begins processing the instant the secondary region becomes active, so this model achieves the lowest Recovery Time. +- Lowest Recovery Time: the secondary-region Workers are already connected and warm. +- Low steady-state latency: Tasks are processed only in the active region, with no cross-region forwarding. -The trade-off is cost: you pay for idle Worker capacity in the secondary region at all times. For the same reason, your Codec Servers and proxies must run in **both regions continuously**, not just after a failover. +**Tradeoffs** -Active / Hot is recommended when you need the lowest possible Recovery Time and can absorb the cost of idle capacity. +- Highest steady-state cost: idle Worker capacity runs in the secondary region at all times. +- Requires Regional Endpoints or VPC Endpoints and the `disablePassivePollerForwarding` setting. Using the Namespace Endpoint by mistake routes the standby Workers to the active region and defeats the pattern. + +**Component behavior** + +- **Workers** — run in both regions; only the active region processes Workflows. +- **Workflow starters and Clients** — run in both regions alongside the Workers. +- **Codec Servers and proxies** — run in both regions continuously, not just after a failover. +- **External databases and queues** — typically single-region-active; fail over alongside the active Workers. 
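The poll-handling rules behind this pattern can be summarized as a small decision function. The sketch below is illustrative only and is not Temporal Cloud's implementation: `PollOutcome` and `route_worker_poll` are hypothetical names, and the model ignores same-region replicas, which the forwarding setting does not affect.

```python
from enum import Enum


class PollOutcome(Enum):
    SERVED_LOCALLY = "served by the local active replica"
    FORWARDED = "forwarded to the active region"
    REJECTED = "rejected with NamespaceNotActive"


def route_worker_poll(replica_is_active: bool, forwarding_disabled: bool) -> PollOutcome:
    """Model what happens to a Worker poll that reaches a given replica.

    Hypothetical helper for illustration; it mirrors the documented effect of
    the disablePassivePollerForwarding setting on cross-region replicas.
    """
    if replica_is_active:
        # Polls that reach the active replica are served there.
        return PollOutcome.SERVED_LOCALLY
    if forwarding_disabled:
        # Hot standby: the fleet stays connected but receives no work.
        return PollOutcome.REJECTED
    # Default behavior: the passive replica forwards the poll cross-region.
    return PollOutcome.FORWARDED
```

After a failover, the same standby fleet polls a replica that is now active, so the model returns `SERVED_LOCALLY` without any reconnect step, which is the property this pattern relies on.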
:::tip Disabling forwarding @@ -131,78 +237,179 @@ To stop forwarding Worker polls to the active region, see [Change the forwarding In this model, Workers run in **both regions and process Workflows at the same time**, with forwarding left enabled (the default). -A Temporal Cloud Namespace is not "active/active" in the database sense — it still has a single active replica in one region. -But because the passive replica transparently forwards requests to and from the active region, the Namespace fits into a broader active/active architecture: a Worker fleet in either region can process Workflows, and the secondary fleet's polls are forwarded across regions to the active replica. +A Temporal Cloud Namespace is not "active/active" in the database sense; it still has a single active replica in one region. +Because the passive replica transparently forwards requests to and from the active region, a Worker fleet in either region can process Workflows. The secondary fleet's polls are forwarded across regions to the active replica. + +**Steady state** ```mermaid -graph LR - subgraph primary["Primary region"] - W1["Worker fleet"] - A1["Active replica"] - end - subgraph secondary["Secondary region"] - W2["Worker fleet"] - A2["Passive replica"] - end - W1 --> A1 - W2 --> A2 - A2 -- "forwards polls" --> A1 - A1 -. replicates .-> A2 +%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +flowchart LR + classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; + classDef tpassive fill:#59FDA012,stroke:#7C8FB1,stroke-width:1px,stroke-dasharray:4 3; + classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + subgraph P["PRIMARY REGION"] + direction TB + W1["Workers
processing"]:::wactive + DB1["External DB / queue"]:::ext + R1["Namespace replica
ACTIVE"]:::tcloud + end + subgraph S["SECONDARY REGION"] + direction TB + W2["Workers
processing"]:::wactive + DB2["External DB / queue"]:::ext + R2["Namespace replica
passive"]:::tpassive + end + W1 --> R1 + W1 <--> DB1 + W2 --> R2 + R2 ==>|forwards polls| R1 + W2 <--> DB2 + R1 -. replicates .-> R2 ``` -This is a practical way to get a low Recovery Time while balancing cost. You can run roughly half your fleet in each region, then add capacity to the surviving region during an outage to reach full throughput. -Unlike Active / Cold, Workflows keep processing in the surviving region while you scale up, so there is no cold-start gap. +**Failover** -The trade-offs are integration and latency: +```mermaid +%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +flowchart LR + classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; + classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; + classDef wnew fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px,stroke-dasharray:5 3; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef down fill:#ED360E22,stroke:#ED360E,stroke-width:1px,stroke-dasharray:2 2; + subgraph P["PRIMARY REGION — OUTAGE"] + direction TB + W1["Workers"]:::down + R1["Namespace replica"]:::down + end + subgraph S["SECONDARY REGION"] + direction TB + W2["Workers
keep processing"]:::wactive + W3["Workers
scaled up"]:::wnew + DB2["External DB / queue"]:::ext + R2["Namespace replica
ACTIVE"]:::tcloud + end + W2 --> R2 + W3 --> R2 + W2 <--> DB2 + style P fill:#ED360E14,stroke:#ED360E +``` + +This is a practical way to reach a low Recovery Time at balanced cost. Roughly half the fleet runs in each region, and capacity is added to the surviving region during an outage to reach full throughput. +Unlike Active / Cold, Workflows keep processing in the surviving region while capacity scales up, so there is no cold-start gap. + +**Benefits** + +- Low Recovery Time: the surviving region keeps processing while capacity scales up. +- Balanced cost: roughly half the fleet runs in each region during normal operation. -- **Synchronizing with external systems is harder.** With Workers active in both regions, external systems such as databases and queues are trickier to keep consistent than in a single-region-active model. -- **The secondary region pays cross-region latency.** Polls from the secondary-region fleet are forwarded to the active replica, so that region sees higher latency. This can be a problem for latency-sensitive Workflows. +**Tradeoffs** -As with the other multi-region models, Codec Servers and proxies must run in both regions at all times. +- The secondary region pays cross-region latency, because its polls are forwarded to the active replica. This can be a problem for latency-sensitive Workflows. +- Synchronizing external systems is harder, because Workers are active in both regions at once. + +**Component behavior** + +- **Workers** — run and process in both regions; the secondary region's polls are forwarded to the active replica. +- **Workflow starters and Clients** — run in both regions. +- **Codec Servers and proxies** — run in both regions continuously. +- **External databases and queues** — accessed from both regions; cross-region consistency must be designed for. ### Dual Active (Multi-Active) {/* #dual-active */} -Some architectures need low-latency or region-bound data in *each* region at once. 
You can achieve this with **two Namespaces whose active and passive regions overlap**: each region holds one Namespace's active replica and the other Namespace's passive replica. +Some architectures need low-latency or region-bound data in *each* region at once. This can be achieved with **two Namespaces whose active and passive regions overlap**: each region holds one Namespace's active replica and the other Namespace's passive replica. + +**Steady state** ```mermaid -graph LR - subgraph r1["Region 1"] - WA["App A Workers"] - A1a["Namespace A
active"] - B1p["Namespace B
passive"] - end - subgraph r2["Region 2"] - WB["App B Workers"] - B2a["Namespace B
active"] - A2p["Namespace A
passive"] - end - WA --> A1a - WB --> B2a - A1a -. replicates .-> A2p - B2a -. replicates .-> B1p +%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +flowchart LR + classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; + classDef tpassive fill:#59FDA012,stroke:#7C8FB1,stroke-width:1px,stroke-dasharray:4 3; + classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; + subgraph R1["REGION 1"] + direction TB + WA["App A Workers"]:::wactive + NAa["Namespace A
ACTIVE"]:::tcloud + NBp["Namespace B
passive"]:::tpassive + end + subgraph R2["REGION 2"] + direction TB + WB["App B Workers"]:::wactive + NBa["Namespace B
ACTIVE"]:::tcloud + NAp["Namespace A
passive"]:::tpassive + end + WA --> NAa + WB --> NBa + NAa -. replicates .-> NAp + NBa -. replicates .-> NBp +``` + +**Failover (Region 1 outage)** + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +flowchart LR + classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; + classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; + classDef wnew fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px,stroke-dasharray:5 3; + classDef down fill:#ED360E22,stroke:#ED360E,stroke-width:1px,stroke-dasharray:2 2; + subgraph R1["REGION 1 — OUTAGE"] + direction TB + WA["App A Workers"]:::down + NAa["Namespace A"]:::down + end + subgraph R2["REGION 2"] + direction TB + WB["App B Workers"]:::wactive + WA2["App A Workers
brought up"]:::wnew + NBa["Namespace B
ACTIVE"]:::tcloud + NAa2["Namespace A
ACTIVE"]:::tcloud + end + WB --> NBa + WA2 --> NAa2 + style R1 fill:#ED360E14,stroke:#ED360E ``` -Each Namespace serves low-latency requests or a regionally-bound database in its own active region, and fails over to the other region during an outage. You can extend the same idea across more than two regions. +Each Namespace serves low-latency requests or a regionally-bound database in its own active region, and fails over to the other region during an outage. The same idea extends across more than two regions. Each Namespace fails over independently, following the Active / Passive sequence. + +Workloads on Temporal rarely need this. It pays off only when a workload is *both* extremely latency-sensitive across several same-continent regions *and* needs multi-region disaster recovery, an uncommon combination. + +**Benefits** + +- Low-latency, region-bound data in each region during normal operation. +- Each Namespace fails over independently, like Active / Passive. + +**Tradeoffs** + +- Highest cost and operational complexity: two Worker fleets and two Namespaces. +- Rarely justified. Temporal recommends modeling each Namespace as an **independent Active / Passive deployment**, with its own Worker pools and failover procedures, rather than coupling them. + +**Component behavior** -Workloads on Temporal rarely need this. It only pays off when a workload is *both* extremely latency-sensitive across several same-continent regions *and* needs multi-region disaster recovery, which is an uncommon combination. -We recommend treating each Namespace as an **independent Active / Passive deployment**, with its own Worker pools and failover procedures, rather than coupling them. +- **Workers** — one fleet per application, each active in its Namespace's region. +- **Workflow starters and Clients** — run with each application's Workers. +- **Codec Servers and proxies** — run in both regions, for both Namespaces. 
+- **External databases and queues** — region-bound per application; each fails over with its Namespace. ## Choose a deployment model {/* #choose */} | Model | Recovery Time | Steady-state cost | Best when | | --- | --- | --- | --- | -| **Active / Cold** | Highest (cold start in secondary) | Lowest (one fleet) | You're adopting HA and want the simplest operating model. | -| **Active / Hot** | Lowest (warm, no DNS wait) | Higher (idle fleet) | You need the lowest Recovery Time and your data plane is pinned to one region at a time. | -| **Active / Active** | Low (surviving region keeps processing) | Higher (two live fleets) | You want low Recovery Time at balanced cost and can tolerate cross-region latency on the secondary region. | -| **Dual Active** | Low (per Namespace) | Highest (two fleets, two Namespaces) | You truly need low-latency, region-bound data in each region. Rare. | +| **Active / Cold** | Highest (cold start in secondary) | Lowest (one fleet) | Adopting High Availability with the simplest operating model. | +| **Active / Hot** | Lowest (warm, no DNS wait) | Higher (idle fleet) | The lowest Recovery Time is required and the data plane is pinned to one region at a time. | +| **Active / Active** | Low (surviving region keeps processing) | Higher (two live fleets) | Low Recovery Time at balanced cost, where the secondary region can tolerate cross-region latency. | +| **Dual Active** | Low (per Namespace) | Highest (two fleets, two Namespaces) | Low-latency, region-bound data is genuinely required in each region. Rare. | -## The rest of your architecture {/* #rest-of-architecture */} +## The rest of the architecture {/* #rest-of-architecture */} -The Worker model sets the pattern; these supporting pieces follow it. +The Worker model sets the pattern; the supporting pieces follow it. -- **Workflow starters and Clients.** Point Clients at the Namespace Endpoint so they follow the active region automatically with no configuration change on failover. 
Use a [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) only when you deliberately need to pin a Client to a region. -- **Codec Servers and proxies.** Anything in the connection path between your Workers and Temporal Cloud must be reachable from every region where Workers connect. In Active / Cold, scale them up in the secondary region as part of a failover; in the hot and active/active models, run them in both regions at all times. -- **External databases and queues.** These remain your responsibility and the right approach depends on your Worker model: a single-region-active datastore pairs naturally with Active / Passive, while running Workers active in both regions raises consistency questions you must design for. Detailed guidance is out of scope for this page. +- **Workflow starters and Clients.** Deploy these with the same regional pattern as the Workers, since a starter or Client often shares the same in-region dependencies (databases, queues, upstream services) and should fail over alongside them. Point Clients at the Namespace Endpoint so they follow the active region automatically with no configuration change on failover, and use a [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) only when a Client must be pinned to a region. +- **Codec Servers and proxies.** Anything in the connection path between Workers and Temporal Cloud must be reachable from every region where Workers connect. In Active / Cold, scale them up in the secondary region as part of a failover; in the hot and active/active models, run them in both regions at all times. +- **External databases and queues.** These remain the application's responsibility, and the right approach depends on the Worker model: a single-region-active datastore pairs naturally with Active / Passive, while running Workers active in both regions raises consistency questions that must be designed for. Detailed guidance is out of scope for this page. 
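The Codec Server and proxy placement rule above can be written as a lookup. This is a minimal sketch under stated assumptions: the model keys (`"active-cold"`, `"active-hot"`, `"active-active"`) and the function name are hypothetical labels chosen for illustration, not identifiers from Temporal Cloud.

```python
def codec_server_regions(model: str, active_region: str, all_regions: list[str]) -> list[str]:
    """Return the regions where Codec Servers and proxies should be running.

    Hypothetical helper: encodes the placement rule from the list above.
    """
    if model == "active-cold":
        # Cold standby: run them only alongside the active Workers and
        # scale them up in the secondary region as part of a failover.
        return [active_region]
    if model in ("active-hot", "active-active"):
        # Workers connect from every region, so these must run everywhere.
        return list(all_regions)
    raise ValueError(f"unknown deployment model: {model!r}")
```

A check like this could run in deployment automation to flag a region where Workers connect but no Codec Server or proxy is reachable.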
## Related {/* #related */} From 810fa7df0647e6c3c94f3e8069db535709b7db86 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Thu, 11 Jun 2026 13:08:41 -0700 Subject: [PATCH 5/9] more updates --- .../high-availability/deployment-models.mdx | 83 +++++++++++-------- 1 file changed, 48 insertions(+), 35 deletions(-) diff --git a/docs/cloud/high-availability/deployment-models.mdx b/docs/cloud/high-availability/deployment-models.mdx index a65d6b173d..b3574e2114 100644 --- a/docs/cloud/high-availability/deployment-models.mdx +++ b/docs/cloud/high-availability/deployment-models.mdx @@ -3,7 +3,7 @@ id: deployment-models title: Deployment models for High Availability sidebar_label: Deployment models for High Availability slug: /cloud/high-availability/deployment-models -description: Choose a Worker deployment model — Active / Passive, Active / Active, or Dual Active — for a Namespace with Temporal Cloud High Availability features, and understand how the rest of the architecture fails over with it. +description: Choose a Worker deployment model — Active / Passive (Cold), Active / Passive (Hot), or Active / Active — for a Namespace with Temporal Cloud High Availability features, and understand how the rest of the architecture fails over with it. tags: - Temporal Cloud - High Availability @@ -20,7 +20,7 @@ keywords: A Namespace with [High Availability features](/cloud/high-availability) fails over the Temporal Service automatically, but it does not move the rest of the architecture. On failover, Temporal Cloud promotes the replica to active and reroutes the [Namespace Endpoint](/cloud/namespaces#access-namespaces). Workers, Workflow starters, Codec Servers, and the external systems that Workflows depend on each need their own failover story. -A critical piece of the [Recovery Time](/cloud/rpo-rto) achieved in a real-world outage is the **Worker deployment model**: where Worker fleets run and which region (or regions) processes Workflows at any given moment. 
+A critical piece of the [recovery time](/cloud/rpo-rto) achieved in a real-world outage is the **Worker deployment model**: where Worker fleets run and which region (or regions) processes Workflows at any given moment. This page describes common patterns for deploying Workers and how the rest of the architecture fits into an overall High Availability strategy. ## Terminology {/* #terminology */} @@ -53,8 +53,14 @@ The [Worker deployment patterns](#worker-deployment-patterns) below note when ea ## Worker deployment patterns {/* #worker-deployment-patterns */} -The models trade off three things: **Recovery Time** after an outage, **steady-state cost**, and **operational complexity**. -None is one-size-fits-all. Start with Active / Passive and move toward the others only when Recovery Time or latency requirements call for it. +This page covers three main patterns — **Active / Passive (Cold)**, **Active / Passive (Hot)**, and **Active / Active** — plus a rarely needed **Dual Active** variant. +They trade off **recovery time** after an outage, **steady-state cost**, and **operational complexity**. + +:::tip Temporal's recommendation + +Start with **Active / Passive (Cold)**. It is the easiest to set up, the easiest to reason about, and the lowest cost. Move to another pattern only when business requirements warrant it. + +::: The diagrams below use a shared visual language: @@ -63,15 +69,9 @@ The diagrams below use a shared visual language: - A purple fill marks application-owned systems (Workers, databases, queues). - A red tint marks the region that is **down** during a failover. -### Active / Passive (recommended) {/* #active-passive */} - -In an Active / Passive model, Workers are active in only one region at a time. This is the most common model and the recommended starting point. 
+### Active / Passive (Cold) {/* #active-cold */} -It assumes the surrounding stack is also single-region-active at any moment: traffic routing, databases, and queues are all active in one region and fail over to the secondary region together with the Workers. There is no need to reason about two regions mutating the same external state at once. - -Active / Passive has two flavors that trade cost against Recovery Time. - -#### Active / Cold (most common) {/* #active-cold */} +_Also known as Active / Cold Standby, Active / Cold, or simply Active / Passive. Recommended and simplest._ Workers run **only in the primary region**. The secondary region holds the passive replica but runs none of the application's Workers. @@ -128,18 +128,27 @@ flowchart LR ``` On failover, the Namespace is active in the secondary region immediately, but the Workers there start from nothing, a "cold" start. -Recovery Time includes container or VM startup, image pulls, and application warm-up before throughput returns to normal. +Recovery time includes container or VM startup, image pulls, and application warm-up before throughput returns to normal. **Benefits** +- **Easy to reason about.** Only one region is active at a time, so traffic routing and interactions with external systems (such as databases and queues) are simpler to understand, and the model pairs naturally with other active / passive systems. Active / Active, by contrast, requires deciding how Workers reach an active database: either a local active database in each region, or a single active / passive database that some Workers must reach cross-region. - Simplest model to operate; in steady state it resembles a single-region deployment. - Lowest steady-state cost: a single Worker fleet. **Tradeoffs** -- Highest Recovery Time of the models here, gated by Worker startup in the secondary region. +- Highest recovery time of the models here, gated by Worker startup in the secondary region. 
- Depends on tested automation to bring up the secondary-region fleet quickly. +**Recommendations and important constraints** + +- **Use the Namespace Endpoint.** Connect Workers through the Namespace Endpoint rather than a Regional Endpoint. If an error affects only the Namespace and the primary region's Workers stay healthy, the Namespace Endpoint follows the failover and those Workers reach the new active region cross-region. With private connectivity, the Workers need a network route to the cross-region VPC Endpoint. The alternative is to fail over the Workers and all of their dependencies whenever the Namespace fails over, so that no request crosses regions. +- **Route Workers to the active region's Codec Server and proxy.** There are two common approaches: + 1. Put DNS or a load balancer in front of the Codec Server and proxy address, and update it on failover to point at the new region's instances. + 2. Pass each Worker the Codec Server and proxy address for its own region as configuration, so a Worker always uses the service local to it. This is common in Kubernetes or with service discovery. +- **Single-region processing is the operator's responsibility.** To run Workers in only one region at a time, scale them down in the primary region before scaling them up in the secondary region. To enforce single-region processing within Temporal, use the [Active / Passive (Hot)](#active-hot) pattern instead. + **Component behavior** - **Workers** — run only in the primary region; brought up in the secondary region during a failover. @@ -147,7 +156,9 @@ Recovery Time includes container or VM startup, image pulls, and application war - **Codec Servers and proxies** — run alongside the active Workers; scaled up in the secondary region as part of a failover. - **External databases and queues** — single-region-active; fail over to the secondary region alongside the Workers. 
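The ordering constraint for single-region processing (scale down the primary fleet before scaling up the secondary) can be captured in failover automation. The following is a rough sketch, not a complete runbook: `fail_over_workers` is a hypothetical function, and `scale_down` and `scale_up` are operator-supplied callbacks (for example, wrappers around an orchestrator's API) assumed here for illustration.

```python
from typing import Callable


def fail_over_workers(
    scale_down: Callable[[str], None],
    scale_up: Callable[[str], None],
    primary: str,
    secondary: str,
) -> list[str]:
    """Move a cold-standby Worker fleet from `primary` to `secondary`.

    The primary fleet is scaled down first so that Workers never process
    in both regions at once.
    """
    steps = []
    scale_down(primary)  # Stop processing in the old region first.
    steps.append(f"scaled down {primary}")
    scale_up(secondary)  # Only then start the fleet in the new region.
    steps.append(f"scaled up {secondary}")
    return steps
```

This sketch assumes the primary region's control plane is still reachable. Handling an unreachable primary (for example, with a timeout on `scale_down`) is a design decision left to the operator.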
-#### Active / Hot {/* #active-hot */} +### Active / Passive (Hot) {/* #active-hot */} + +_Also known as Active / Hot Standby or Active / Hot._ Workers are deployed in **both regions**, but only the active region processes Workflows. The secondary-region Workers stay connected and warm, yet idle. @@ -208,17 +219,21 @@ flowchart LR style P fill:#ED360E14,stroke:#ED360E ``` -Failover is near-instant: the Namespace failover and the Worker "failover" happen together and automatically, with no DNS wait and no cold start. The previously idle fleet begins processing the moment the secondary region becomes active, so this model achieves the lowest Recovery Time. +Failover is near-instant: the Namespace failover and the Worker "failover" happen together and automatically, with no DNS wait and no cold start. The previously idle fleet begins processing the moment the secondary region becomes active, so this model achieves the lowest recovery time. **Benefits** -- Lowest Recovery Time: the secondary-region Workers are already connected and warm. +- **Easy to reason about.** Only one region is active at a time, so traffic routing and interactions with external systems (such as databases and queues) are simpler to understand, and the model pairs naturally with other active / passive systems. Active / Active, by contrast, requires deciding how Workers reach an active database: either a local active database in each region, or a single active / passive database that some Workers must reach cross-region. +- Lowest recovery time: the secondary-region Workers are already connected and warm. - Low steady-state latency: Tasks are processed only in the active region, with no cross-region forwarding. **Tradeoffs** - Highest steady-state cost: idle Worker capacity runs in the secondary region at all times. -- Requires Regional Endpoints or VPC Endpoints and the `disablePassivePollerForwarding` setting. 
Using the Namespace Endpoint by mistake routes the standby Workers to the active region and defeats the pattern. + +**Recommendations and important constraints** + +- Connect each Worker fleet through its region's [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) (or VPC Endpoint) and [disable forwarding](/cloud/high-availability/enable#change-forwarding-behavior) for Worker polls. Using the Namespace Endpoint by mistake routes the standby Workers to the active region and defeats the pattern. **Component behavior** @@ -227,15 +242,9 @@ Failover is near-instant: the Namespace failover and the Worker "failover" happe - **Codec Servers and proxies** — run in both regions continuously, not just after a failover. - **External databases and queues** — typically single-region-active; fail over alongside the active Workers. -:::tip Disabling forwarding - -To stop forwarding Worker polls to the active region, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). - -::: - ### Active / Active {/* #active-active */} -In this model, Workers run in **both regions and process Workflows at the same time**, with forwarding left enabled (the default). +Workers run in **both regions and process Workflows at the same time**, with forwarding left enabled (the default). A Temporal Cloud Namespace is not "active/active" in the database sense; it still has a single active replica in one region. Because the passive replica transparently forwards requests to and from the active region, a Worker fleet in either region can process Workflows. The secondary fleet's polls are forwarded across regions to the active replica. @@ -297,12 +306,12 @@ flowchart LR style P fill:#ED360E14,stroke:#ED360E ``` -This is a practical way to reach a low Recovery Time at balanced cost. Roughly half the fleet runs in each region, and capacity is added to the surviving region during an outage to reach full throughput. 
-Unlike Active / Cold, Workflows keep processing in the surviving region while capacity scales up, so there is no cold-start gap. +This is a practical way to reach a low recovery time at balanced cost. Roughly half the fleet runs in each region, and capacity is added to the surviving region during an outage to reach full throughput. +Unlike Active / Passive (Cold), Workflows keep processing in the surviving region while capacity scales up, so there is no cold-start gap. **Benefits** -- Low Recovery Time: the surviving region keeps processing while capacity scales up. +- Low recovery time: the surviving region keeps processing while capacity scales up. - Balanced cost: roughly half the fleet runs in each region during normal operation. **Tradeoffs** @@ -310,6 +319,10 @@ Unlike Active / Cold, Workflows keep processing in the surviving region while ca - The secondary region pays cross-region latency, because its polls are forwarded to the active replica. This can be a problem for latency-sensitive Workflows. - Synchronizing external systems is harder, because Workers are active in both regions at once. +**Recommendations and important constraints** + +- Keep forwarding enabled (the default) so the secondary-region Workers' polls reach the active replica. Do not set `disablePassivePollerForwarding`. + **Component behavior** - **Workers** — run and process in both regions; the secondary region's polls are forwarded to the active replica. @@ -319,7 +332,7 @@ Unlike Active / Cold, Workflows keep processing in the surviving region while ca ### Dual Active (Multi-Active) {/* #dual-active */} -Some architectures need low-latency or region-bound data in *each* region at once. This can be achieved with **two Namespaces whose active and passive regions overlap**: each region holds one Namespace's active replica and the other Namespace's passive replica. +Beyond the three main patterns, some architectures need low-latency or region-bound data in *each* region at once. 
This can be achieved with **two Namespaces whose active and passive regions overlap**: each region holds one Namespace's active replica and the other Namespace's passive replica. **Steady state** @@ -396,11 +409,11 @@ Workloads on Temporal rarely need this. It pays off only when a workload is *bot ## Choose a deployment model {/* #choose */} -| Model | Recovery Time | Steady-state cost | Best when | +| Model | Recovery time | Steady-state cost | Best when | | --- | --- | --- | --- | -| **Active / Cold** | Highest (cold start in secondary) | Lowest (one fleet) | Adopting High Availability with the simplest operating model. | -| **Active / Hot** | Lowest (warm, no DNS wait) | Higher (idle fleet) | The lowest Recovery Time is required and the data plane is pinned to one region at a time. | -| **Active / Active** | Low (surviving region keeps processing) | Higher (two live fleets) | Low Recovery Time at balanced cost, where the secondary region can tolerate cross-region latency. | +| **Active / Passive (Cold)** | Highest (cold start in secondary) | Lowest (one fleet) | Adopting High Availability with the simplest operating model. | +| **Active / Passive (Hot)** | Lowest (warm, no DNS wait) | Higher (idle fleet) | The lowest recovery time is required and the data plane is pinned to one region at a time. | +| **Active / Active** | Low (surviving region keeps processing) | Higher (two live fleets) | Low recovery time at balanced cost, where the secondary region can tolerate cross-region latency. | | **Dual Active** | Low (per Namespace) | Highest (two fleets, two Namespaces) | Low-latency, region-bound data is genuinely required in each region. Rare. | ## The rest of the architecture {/* #rest-of-architecture */} @@ -408,7 +421,7 @@ Workloads on Temporal rarely need this. It pays off only when a workload is *bot The Worker model sets the pattern; the supporting pieces follow it. 
- **Workflow starters and Clients.** Deploy these with the same regional pattern as the Workers, since a starter or Client often shares the same in-region dependencies (databases, queues, upstream services) and should fail over alongside them. Point Clients at the Namespace Endpoint so they follow the active region automatically with no configuration change on failover, and use a [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) only when a Client must be pinned to a region. -- **Codec Servers and proxies.** Anything in the connection path between Workers and Temporal Cloud must be reachable from every region where Workers connect. In Active / Cold, scale them up in the secondary region as part of a failover; in the hot and active/active models, run them in both regions at all times. +- **Codec Servers and proxies.** Anything in the connection path between Workers and Temporal Cloud must be reachable from every region where Workers connect. In Active / Passive (Cold), scale them up in the secondary region as part of a failover; in the Active / Passive (Hot) and Active / Active models, run them in both regions at all times. - **External databases and queues.** These remain the application's responsibility, and the right approach depends on the Worker model: a single-region-active datastore pairs naturally with Active / Passive, while running Workers active in both regions raises consistency questions that must be designed for. Detailed guidance is out of scope for this page. ## Related {/* #related */} @@ -417,7 +430,7 @@ To add a replica and turn on High Availability features, see [Enable and manage To choose between the Namespace Endpoint and Regional Endpoints and to set up private connectivity, see [Connectivity for High Availability](/cloud/high-availability/ha-connectivity). 
-To stop forwarding Worker polls to the active region for the Active / Hot model, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). +To stop forwarding Worker polls to the active region for the Active / Passive (Hot) model, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). To trigger and manage failovers, see [Failovers](/cloud/high-availability/failovers). From 6722da6f4c6c62d7bc09b7c70c3a6376b6de13b8 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Thu, 11 Jun 2026 19:15:34 -0700 Subject: [PATCH 6/9] Add High Availability deployment patterns docs page Co-Authored-By: Claude Opus 4.8 --- ...ent-models.mdx => deployment-patterns.mdx} | 125 ++++++++++-------- docs/cloud/high-availability/enable.mdx | 2 +- .../high-availability/failovers/manage.mdx | 2 +- .../high-availability/ha-connectivity.mdx | 4 +- docs/cloud/high-availability/index.mdx | 4 +- sidebars.js | 2 +- 6 files changed, 78 insertions(+), 61 deletions(-) rename docs/cloud/high-availability/{deployment-models.mdx => deployment-patterns.mdx} (71%) diff --git a/docs/cloud/high-availability/deployment-models.mdx b/docs/cloud/high-availability/deployment-patterns.mdx similarity index 71% rename from docs/cloud/high-availability/deployment-models.mdx rename to docs/cloud/high-availability/deployment-patterns.mdx index b3574e2114..a43e47e264 100644 --- a/docs/cloud/high-availability/deployment-models.mdx +++ b/docs/cloud/high-availability/deployment-patterns.mdx @@ -1,9 +1,9 @@ --- -id: deployment-models -title: Deployment models for High Availability -sidebar_label: Deployment models for High Availability -slug: /cloud/high-availability/deployment-models -description: Choose a Worker deployment model — Active / Passive (Cold), Active / Passive (Hot), or Active / Active — for a Namespace with Temporal Cloud High Availability features, and understand how the rest of the architecture fails over with it. 
+id: deployment-patterns +title: Deployment patterns for High Availability +sidebar_label: Deployment patterns for High Availability +slug: /cloud/high-availability/deployment-patterns +description: Choose a Worker deployment pattern — Active / Passive (Cold), Active / Passive (Hot), or Active / Active — for a Namespace with Temporal Cloud High Availability features, and understand how the rest of the architecture fails over with it. tags: - Temporal Cloud - High Availability @@ -17,22 +17,35 @@ keywords: - temporal cloud --- -A Namespace with [High Availability features](/cloud/high-availability) fails over the Temporal Service automatically, but it does not move the rest of the architecture. -On failover, Temporal Cloud promotes the replica to active and reroutes the [Namespace Endpoint](/cloud/namespaces#access-namespaces). Workers, Workflow starters, Codec Servers, and the external systems that Workflows depend on each need their own failover story. +When an outage strikes, a Namespace with [High Availability](/cloud/high-availability) fails over to another region automatically, but it does not move the rest of the architecture. +Workers, Workflow starters, Codec Servers, databases, and the external systems that Workflows depend on each need their own failover story. -A critical piece of the [recovery time](/cloud/rpo-rto) achieved in a real-world outage is the **Worker deployment model**: where Worker fleets run and which region (or regions) processes Workflows at any given moment. -This page describes common patterns for deploying Workers and how the rest of the architecture fits into an overall High Availability strategy. +A critical piece of the [recovery time](/cloud/rpo-rto) achieved in a real-world outage is the **Worker deployment pattern**: where Worker fleets run and which region (or regions) processes Workflows at any given moment. 
+This page describes common patterns for deploying Workers and the rest of the architecture to achieve an overall High Availability strategy. ## Terminology {/* #terminology */} -This page uses two terms for the regions of a Namespace with High Availability: +This page presumes familiarity with [High Availability for Temporal Cloud Namespaces](/cloud/high-availability), including replicas, active and passive regions, replication, and failover. + +It uses two terms for the regions of a Namespace with High Availability: - **Primary region** — the region where the Namespace is active during normal operation, also called the "preferred region." - **Secondary region** — the region the Namespace fails over to. It holds a replica and is passive during normal operation. +It also uses these names for the Worker deployment patterns, each detailed in [Worker deployment patterns](#worker-deployment-patterns): + +- **Active / Passive** — Workflows process in one region at a time. It has two variants: + - **[Active / Cold](#active-cold)** — Workers run in one region at a time and Workflows process in one region, both the user's responsibility to enforce. After a failover, Workers start in the secondary region. + - **[Active / Hot](#active-hot)** — Workers run in both regions, but the system guarantees Workflows process in only the active region. +- **[Active / Active](#active-active)** — Workers run in both regions, and Workflows process in both regions at the same time. + :::info -**Namespaces are always active / passive, but can support an Active / Active pattern.** A Temporal Cloud Namespace with High Availability has exactly one active region at a time. The other region holds a replica that passively receives replicated state. However, Temporal Cloud Namespaces can still fit into a broader "Active / Active" strategy, as described below. 
+**Namespaces are always Active / Passive, but can support an Active / Active pattern.** + +A Temporal Cloud Namespace with High Availability has exactly one active region at a time. The other region holds a replica that passively receives replicated state. + +However, since both regions can serve requests and Worker polls, Temporal Cloud Namespaces can still fit into a broader "Active / Active" strategy, as described below. ::: @@ -56,11 +69,13 @@ The [Worker deployment patterns](#worker-deployment-patterns) below note when ea This page covers three main patterns — **Active / Passive (Cold)**, **Active / Passive (Hot)**, and **Active / Active** — plus a rarely needed **Dual Active** variant. They trade off **recovery time** after an outage, **steady-state cost**, and **operational complexity**. -:::tip Temporal's recommendation +Temporal recommends starting with **Active / Passive (Cold)** and shifting to another pattern when business requirements warrant it. -Start with **Active / Passive (Cold)**. It is the easiest to set up, the easiest to reason about, and the lowest cost. Move to another pattern only when business requirements warrant it. - -::: +| Pattern | Best for | Major benefits | Major tradeoffs | +| --- | --- | --- | --- | +| **[Active / Passive (Cold)](#active-cold)** | Easy initial deployment | Acts like a single region; no special setup required | Failing over Workers is the user's responsibility | +| **[Active / Passive (Hot)](#active-hot)** | Low RTO with strict single-region behavior | Fast Worker failover; guaranteed to act like a single region | More configuration and higher cost for the Worker fleet | +| **[Active / Active](#active-active)** | Low RTO with Workers active in multiple regions | Fast Worker failover; uses Worker fleet capacity (no idle standby) | Cross-region requests add Workflow latency | The diagrams below use a shared visual language: @@ -78,7 +93,7 @@ Workers run **only in the primary region**. 
The secondary region holds the passi **Steady state** ```mermaid -%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% flowchart LR classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; classDef tpassive fill:#59FDA012,stroke:#7C8FB1,stroke-width:1px,stroke-dasharray:4 3; @@ -89,21 +104,23 @@ flowchart LR direction TB W1["Workers
processing"]:::wactive DB1["External DB / queue"]:::ext - R1["Namespace replica
ACTIVE"]:::tcloud + R1["Namespace replica
ACTIVE"]:::tcloud end subgraph S["SECONDARY REGION"] direction TB - R2["Namespace replica
passive"]:::tpassive + DB2["External DB / queue"]:::ext + R2["Namespace replica
PASSIVE"]:::tpassive end W1 --> R1 W1 <--> DB1 + DB1 <-->|replication (if needed)| DB2 R1 -. replicates .-> R2 ``` **Failover** ```mermaid -%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% flowchart LR classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; @@ -119,8 +136,8 @@ flowchart LR subgraph S["SECONDARY REGION"] direction TB W2["Workers
cold start"]:::wnew - DB2["External DB / queue
promoted"]:::ext - R2["Namespace replica
ACTIVE"]:::tcloud + DB2["External DB / queue
PROMOTED (if needed)"]:::ext + R2["Namespace replica
ACTIVE"]:::tcloud end W2 --> R2 W2 <--> DB2 @@ -132,13 +149,13 @@ Recovery time includes container or VM startup, image pulls, and application war **Benefits** -- **Easy to reason about.** Only one region is active at a time, so traffic routing and interactions with external systems (such as databases and queues) are simpler to understand, and the model pairs naturally with other active / passive systems. Active / Active, by contrast, requires deciding how Workers reach an active database: either a local active database in each region, or a single active / passive database that some Workers must reach cross-region. -- Simplest model to operate; in steady state it resembles a single-region deployment. +- **Easy to reason about.** Only one region is active at a time, so traffic routing and interactions with external systems (such as databases and queues) are simpler to understand, and the pattern pairs naturally with other active / passive systems. Active / Active, by contrast, requires deciding how Workers reach an active database: either a local active database in each region, or a single active / passive database that some Workers must reach cross-region. +- Simplest pattern to operate; in steady state it resembles a single-region deployment. - Lowest steady-state cost: a single Worker fleet. **Tradeoffs** -- Highest recovery time of the models here, gated by Worker startup in the secondary region. +- Highest recovery time of the patterns here, gated by Worker startup in the secondary region. - Depends on tested automation to bring up the secondary-region fleet quickly. 
**Recommendations and important constraints** @@ -168,7 +185,7 @@ With forwarding disabled, polls that reach the passive replica are not sent to t **Steady state** ```mermaid -%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% flowchart LR classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; classDef tpassive fill:#59FDA012,stroke:#7C8FB1,stroke-width:1px,stroke-dasharray:4 3; @@ -179,13 +196,13 @@ flowchart LR direction TB W1["Workers
processing"]:::wactive DB1["External DB / queue"]:::ext - R1["Namespace replica
ACTIVE"]:::tcloud + R1["Namespace replica
ACTIVE"]:::tcloud end subgraph S["SECONDARY REGION"] direction TB W2["Workers
connected, idle"]:::widle DB2["External DB / queue
standby"]:::ext - R2["Namespace replica
passive"]:::tpassive + R2["Namespace replica
PASSIVE"]:::tpassive end W1 --> R1 W1 <--> DB1 @@ -197,7 +214,7 @@ flowchart LR **Failover** ```mermaid -%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% flowchart LR classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; @@ -211,19 +228,19 @@ flowchart LR subgraph S["SECONDARY REGION"] direction TB W2["Workers
now processing"]:::wactive - DB2["External DB / queue
promoted"]:::ext - R2["Namespace replica
ACTIVE"]:::tcloud + DB2["External DB / queue
PROMOTED (if needed)"]:::ext + R2["Namespace replica
ACTIVE"]:::tcloud end W2 --> R2 W2 <--> DB2 style P fill:#ED360E14,stroke:#ED360E ``` -Failover is near-instant: the Namespace failover and the Worker "failover" happen together and automatically, with no DNS wait and no cold start. The previously idle fleet begins processing the moment the secondary region becomes active, so this model achieves the lowest recovery time. +Failover is near-instant: the Namespace failover and the Worker "failover" happen together and automatically, with no DNS wait and no cold start. The previously idle fleet begins processing the moment the secondary region becomes active, so this pattern achieves the lowest recovery time. **Benefits** -- **Easy to reason about.** Only one region is active at a time, so traffic routing and interactions with external systems (such as databases and queues) are simpler to understand, and the model pairs naturally with other active / passive systems. Active / Active, by contrast, requires deciding how Workers reach an active database: either a local active database in each region, or a single active / passive database that some Workers must reach cross-region. +- **Easy to reason about.** Only one region is active at a time, so traffic routing and interactions with external systems (such as databases and queues) are simpler to understand, and the pattern pairs naturally with other active / passive systems. Active / Active, by contrast, requires deciding how Workers reach an active database: either a local active database in each region, or a single active / passive database that some Workers must reach cross-region. - Lowest recovery time: the secondary-region Workers are already connected and warm. - Low steady-state latency: Tasks are processed only in the active region, with no cross-region forwarding. 
@@ -252,7 +269,7 @@ Because the passive replica transparently forwards requests to and from the acti **Steady state** ```mermaid -%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% flowchart LR classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; classDef tpassive fill:#59FDA012,stroke:#7C8FB1,stroke-width:1px,stroke-dasharray:4 3; @@ -262,13 +279,13 @@ flowchart LR direction TB W1["Workers
processing"]:::wactive DB1["External DB / queue"]:::ext - R1["Namespace replica
ACTIVE"]:::tcloud + R1["Namespace replica
ACTIVE"]:::tcloud end subgraph S["SECONDARY REGION"] direction TB W2["Workers
processing"]:::wactive DB2["External DB / queue"]:::ext - R2["Namespace replica
passive"]:::tpassive + R2["Namespace replica
PASSIVE"]:::tpassive end W1 --> R1 W1 <--> DB1 @@ -281,7 +298,7 @@ flowchart LR **Failover** ```mermaid -%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% flowchart LR classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; @@ -298,7 +315,7 @@ flowchart LR W2["Workers
keep processing"]:::wactive W3["Workers
scaled up"]:::wnew DB2["External DB / queue"]:::ext - R2["Namespace replica
ACTIVE"]:::tcloud + R2["Namespace replica
ACTIVE"]:::tcloud end W2 --> R2 W3 --> R2 @@ -337,7 +354,7 @@ Beyond the three main patterns, some architectures need low-latency or region-bo **Steady state** ```mermaid -%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% flowchart LR classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; classDef tpassive fill:#59FDA012,stroke:#7C8FB1,stroke-width:1px,stroke-dasharray:4 3; @@ -345,14 +362,14 @@ flowchart LR subgraph R1["REGION 1"] direction TB WA["App A Workers"]:::wactive - NAa["Namespace A
ACTIVE"]:::tcloud - NBp["Namespace B
passive"]:::tpassive + NAa["Namespace A
ACTIVE"]:::tcloud + NBp["Namespace B
PASSIVE"]:::tpassive end subgraph R2["REGION 2"] direction TB WB["App B Workers"]:::wactive - NBa["Namespace B
ACTIVE"]:::tcloud - NAp["Namespace A
passive"]:::tpassive + NBa["Namespace B
ACTIVE"]:::tcloud + NAp["Namespace A
PASSIVE"]:::tpassive end WA --> NAa WB --> NBa @@ -363,7 +380,7 @@ flowchart LR **Failover (Region 1 outage)** ```mermaid -%%{init: {'themeVariables':{'fontFamily':'"Noto Sans Mono", ui-monospace, monospace'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% flowchart LR classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; @@ -378,8 +395,8 @@ flowchart LR direction TB WB["App B Workers"]:::wactive WA2["App A Workers
brought up"]:::wnew - NBa["Namespace B
ACTIVE"]:::tcloud - NAa2["Namespace A
ACTIVE"]:::tcloud + NBa["Namespace B
ACTIVE"]:::tcloud + NAa2["Namespace A
ACTIVE"]:::tcloud end WB --> NBa WA2 --> NAa2 @@ -398,7 +415,7 @@ Workloads on Temporal rarely need this. It pays off only when a workload is *bot **Tradeoffs** - Highest cost and operational complexity: two Worker fleets and two Namespaces. -- Rarely justified. Temporal recommends modeling each Namespace as an **independent Active / Passive deployment**, with its own Worker pools and failover procedures, rather than coupling them. +- Rarely justified. Temporal recommends treating each Namespace as an **independent Active / Passive deployment**, with its own Worker pools and failover procedures, rather than coupling them. **Component behavior** @@ -407,22 +424,22 @@ Workloads on Temporal rarely need this. It pays off only when a workload is *bot - **Codec Servers and proxies** — run in both regions, for both Namespaces. - **External databases and queues** — region-bound per application; each fails over with its Namespace. -## Choose a deployment model {/* #choose */} +## Choose a deployment pattern {/* #choose */} -| Model | Recovery time | Steady-state cost | Best when | +| Pattern | Recovery time | Steady-state cost | Best when | | --- | --- | --- | --- | -| **Active / Passive (Cold)** | Highest (cold start in secondary) | Lowest (one fleet) | Adopting High Availability with the simplest operating model. | +| **Active / Passive (Cold)** | Highest (cold start in secondary) | Lowest (one fleet) | Adopting High Availability with the simplest operations. | | **Active / Passive (Hot)** | Lowest (warm, no DNS wait) | Higher (idle fleet) | The lowest recovery time is required and the data plane is pinned to one region at a time. | | **Active / Active** | Low (surviving region keeps processing) | Higher (two live fleets) | Low recovery time at balanced cost, where the secondary region can tolerate cross-region latency. | | **Dual Active** | Low (per Namespace) | Highest (two fleets, two Namespaces) | Low-latency, region-bound data is genuinely required in each region. 
Rare. | ## The rest of the architecture {/* #rest-of-architecture */} -The Worker model sets the pattern; the supporting pieces follow it. +The Worker deployment pattern sets the approach; the supporting pieces follow it. - **Workflow starters and Clients.** Deploy these with the same regional pattern as the Workers, since a starter or Client often shares the same in-region dependencies (databases, queues, upstream services) and should fail over alongside them. Point Clients at the Namespace Endpoint so they follow the active region automatically with no configuration change on failover, and use a [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) only when a Client must be pinned to a region. -- **Codec Servers and proxies.** Anything in the connection path between Workers and Temporal Cloud must be reachable from every region where Workers connect. In Active / Passive (Cold), scale them up in the secondary region as part of a failover; in the Active / Passive (Hot) and Active / Active models, run them in both regions at all times. -- **External databases and queues.** These remain the application's responsibility, and the right approach depends on the Worker model: a single-region-active datastore pairs naturally with Active / Passive, while running Workers active in both regions raises consistency questions that must be designed for. Detailed guidance is out of scope for this page. +- **Codec Servers and proxies.** Anything in the connection path between Workers and Temporal Cloud must be reachable from every region where Workers connect. In Active / Passive (Cold), scale them up in the secondary region as part of a failover; in the Active / Passive (Hot) and Active / Active patterns, run them in both regions at all times. 
+- **External databases and queues.** These remain the application's responsibility, and the right approach depends on the Worker deployment pattern: a single-region-active datastore pairs naturally with Active / Passive, while running Workers active in both regions raises consistency questions that must be designed for. Detailed guidance is out of scope for this page. ## Related {/* #related */} @@ -430,8 +447,8 @@ To add a replica and turn on High Availability features, see [Enable and manage To choose between the Namespace Endpoint and Regional Endpoints and to set up private connectivity, see [Connectivity for High Availability](/cloud/high-availability/ha-connectivity). -To stop forwarding Worker polls to the active region for the Active / Passive (Hot) model, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). +To stop forwarding Worker polls to the active region for the Active / Passive (Hot) pattern, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). To trigger and manage failovers, see [Failovers](/cloud/high-availability/failovers). -To understand the recovery objectives each model is measured against, see [RPO and RTO](/cloud/rpo-rto). +To understand the recovery objectives each pattern is measured against, see [RPO and RTO](/cloud/rpo-rto). diff --git a/docs/cloud/high-availability/enable.mdx b/docs/cloud/high-availability/enable.mdx index fb504011cd..57535551a9 100644 --- a/docs/cloud/high-availability/enable.mdx +++ b/docs/cloud/high-availability/enable.mdx @@ -137,7 +137,7 @@ Client APIs (Start, Signal, Cancel, Terminate, Query, and the equivalent Activit Same-region replicas are not affected by this setting. -To deploy Worker fleets in both regions that stay idle in the passive region until failover, see [Active / Passive (Hot)](/cloud/high-availability/deployment-models#active-hot). 
+To deploy Worker fleets in both regions that stay idle in the passive region until failover, see [Active / Passive (Hot)](/cloud/high-availability/deployment-patterns#active-hot). :::info diff --git a/docs/cloud/high-availability/failovers/manage.mdx b/docs/cloud/high-availability/failovers/manage.mdx index 7d126ea673..c9fecc63ee 100644 --- a/docs/cloud/high-availability/failovers/manage.mdx +++ b/docs/cloud/high-availability/failovers/manage.mdx @@ -174,7 +174,7 @@ the replica, the DNS redirection orchestrated by Temporal ensures that your exis Namespace without interruption. Temporal Cloud forwards their requests from the passive replica to the active region and the responses back, so Workers keep running through a failover. -To choose where your Worker fleets run across regions, see [Deployment models for High Availability](/cloud/high-availability/deployment-models). +To choose where your Worker fleets run across regions, see [Deployment patterns for High Availability](/cloud/high-availability/deployment-patterns). To route Workers to the passive region's replica, see [How requests reach the replica](/cloud/high-availability/ha-connectivity#how-requests-reach-the-replica). diff --git a/docs/cloud/high-availability/ha-connectivity.mdx b/docs/cloud/high-availability/ha-connectivity.mdx index 0942b0a12a..291a9f37ee 100644 --- a/docs/cloud/high-availability/ha-connectivity.mdx +++ b/docs/cloud/high-availability/ha-connectivity.mdx @@ -93,9 +93,9 @@ To learn what forwarding does, see [Request forwarding](/cloud/high-availability To stop forwarding Worker polls on a Namespace, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). -To run Worker fleets in both regions that rely on this forwarding, see [Active / Active](/cloud/high-availability/deployment-models#active-active). 
+To run Worker fleets in both regions that rely on this forwarding, see [Active / Active](/cloud/high-availability/deployment-patterns#active-active). -To keep passive-region Workers idle until failover by disabling this forwarding, see [Active / Passive (Hot)](/cloud/high-availability/deployment-models#active-hot). +To keep passive-region Workers idle until failover by disabling this forwarding, see [Active / Passive (Hot)](/cloud/high-availability/deployment-patterns#active-hot). ## How to use PrivateLink with High Availability features diff --git a/docs/cloud/high-availability/index.mdx b/docs/cloud/high-availability/index.mdx index 919bac0082..0b6b1bb180 100644 --- a/docs/cloud/high-availability/index.mdx +++ b/docs/cloud/high-availability/index.mdx @@ -105,9 +105,9 @@ To route Workers to the passive region's replica, see [How requests reach the re To disable passive region replica forwarding, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior). -To run Worker fleets in both regions that rely on this forwarding, see [Active / Active](/cloud/high-availability/deployment-models#active-active). +To run Worker fleets in both regions that rely on this forwarding, see [Active / Active](/cloud/high-availability/deployment-patterns#active-active). -To keep passive-region Workers idle until failover by disabling this forwarding, see [Active / Passive (Hot)](/cloud/high-availability/deployment-models#active-hot). +To keep passive-region Workers idle until failover by disabling this forwarding, see [Active / Passive (Hot)](/cloud/high-availability/deployment-patterns#active-hot). 
## Service levels and recovery objectives diff --git a/sidebars.js b/sidebars.js index 697056b673..db44fad11b 100644 --- a/sidebars.js +++ b/sidebars.js @@ -1202,7 +1202,7 @@ module.exports = { }, items: [ 'cloud/high-availability/enable', - 'cloud/high-availability/deployment-models', + 'cloud/high-availability/deployment-patterns', 'cloud/high-availability/monitoring', { type: 'category', From 24c91518655d4144e11203d030246f3669cb8cf7 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Mon, 15 Jun 2026 13:57:02 -0700 Subject: [PATCH 7/9] Updates to worker deployment patterns --- .../high-availability/deployment-patterns.mdx | 736 ++++++++++++------ docs/cloud/high-availability/enable.mdx | 2 +- .../high-availability/ha-connectivity.mdx | 2 +- docs/cloud/high-availability/index.mdx | 2 +- 4 files changed, 510 insertions(+), 232 deletions(-) diff --git a/docs/cloud/high-availability/deployment-patterns.mdx b/docs/cloud/high-availability/deployment-patterns.mdx index a43e47e264..34a86f9b89 100644 --- a/docs/cloud/high-availability/deployment-patterns.mdx +++ b/docs/cloud/high-availability/deployment-patterns.mdx @@ -23,66 +23,262 @@ Workers, Workflow starters, Codec Servers, databases, and the external systems t A critical piece of the [recovery time](/cloud/rpo-rto) achieved in a real-world outage is the **Worker deployment pattern**: where Worker fleets run and which region (or regions) processes Workflows at any given moment. This page describes common patterns for deploying Workers and the rest of the architecture to achieve an overall High Availability strategy. -## Terminology {/* #terminology */} +## What needs a failover story {/* #what-needs-a-failover-story */} -This page presumes familiarity with [High Availability for Temporal Cloud Namespaces](/cloud/high-availability), including replicas, active and passive regions, replication, and failover. 
+Beyond the Namespace itself, these components live in the application environment and must be planned for: -It uses two terms for the regions of a Namespace with High Availability: +- **Workers** — execute Workflows and Activities. +- **Workflow starters and Clients** — start and signal Workflows. +- **Codec Servers** — encode and decode payloads for Workers, the Web UI, and the CLI. +- **Proxies between Workers and Temporal Cloud** — any forward proxy or mTLS terminator in the connection path. +- **External databases and queues** — the systems that Activities read and write. -- **Primary region** — the region where the Namespace is active during normal operation, also called the "preferred region." -- **Secondary region** — the region the Namespace fails over to. It holds a replica and is passive during normal operation. +Some systems must be active wherever Workers are running (for example, Codec Servers), while others might follow a different failover sequence (for example, external databases). +Because the right choice for each of these usually depends on where Workers run, **this page focuses on Worker deployment patterns**. -It also uses these names for the Worker deployment patterns, each detailed in [Worker deployment patterns](#worker-deployment-patterns): +## Worker deployment patterns {/* #worker-deployment-patterns */} + +This page covers three main patterns — **Active / Passive (Cold)**, **Active / Passive (Hot)**, and **Active / Active** — plus a rarely needed **Dual Active** variant. +They trade off **recovery time** after an outage, **cost during normal operation**, and **operational complexity**, and differ in where the Workers run and where Workflows process: - **Active / Passive** — Workflows process in one region at a time. It has two variants: - **[Active / Cold](#active-cold)** — Workers run in one region at a time and Workflows process in one region, both the user's responsibility to enforce. 
After a failover, Workers start in the secondary region. - **[Active / Hot](#active-hot)** — Workers run in both regions, but the system guarantees Workflows process in only the active region. - **[Active / Active](#active-active)** — Workers run in both regions, and Workflows process in both regions at the same time. -:::info +:::tip -**Namespaces are always Active / Passive, but can support an Active / Active pattern.** +To learn more about High Availability for Temporal Cloud Namespaces — including replicas, active and passive regions, replication, and failover — see [High Availability for Temporal Cloud Namespaces](/cloud/high-availability). -A Temporal Cloud Namespace with High Availability has exactly one active region at a time. The other region holds a replica that passively receives replicated state. +::: -However, since both regions can serve requests and Worker polls, Temporal Cloud Namespaces can still fit into a broader "Active / Active" strategy, as described below. +These patterns work across two cloud regions, which could be in the same cloud provider or different cloud providers: -::: +- **Primary region** — the region where the Namespace is active during normal operation, also called the "preferred region." +- **Secondary region** — the region the Namespace fails over to. It holds a replica and is passive during normal operation. -A useful property to keep in mind: **Workers don't need to run in the same region as the active replica.** A Worker fleet in one region can poll a Namespace that is active in another. 
+**Benefits and tradeoffs at a glance** -## What needs a failover story {/* #what-needs-a-failover-story */} +| Pattern | Best for | Major benefits | Major tradeoffs | +| --- | --- | --- | --- | +| **[Active / Passive (Cold)](#active-cold)** | Easy initial deployment | Acts like a single region; no special setup required | Failing over Workers is the user's responsibility | +| **[Active / Passive (Hot)](#active-hot)** | Low RTO with strict single-region behavior | Fast Worker failover; guaranteed to act like a single region | More configuration and higher cost for the Worker fleet | +| **[Active / Active](#active-active)** | Low RTO with Workers active in multiple regions | Fast Worker failover; uses Worker fleet capacity (no standby Workers) | Cross-region requests add Workflow latency | -Beyond the Namespace itself, these components live in the application environment and must be planned for: +**Active / Passive (Cold): normal operation and failover** -- **Workers** — execute Workflows and Activities. -- **Workflow starters and Clients** — start and signal Workflows. -- **Codec Servers** — encode and decode payloads for Workers, the Web UI, and the CLI. -- **Proxies between Workers and Temporal Cloud** — any forward proxy or mTLS terminator in the connection path. -- **External databases and queues** — the systems that Activities read and write. 
+```mermaid +--- +title: Normal operation +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef empty fill:transparent,stroke:#9aa4b2,stroke-width:1px,stroke-dasharray:4 3,color:#9aa4b2; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph CPRIM["Primary"] + subgraph CWP["Worker Pool"] + CW1["Worker"]:::worker + CW2["Worker"]:::worker + CW3["Worker"]:::worker + end + CNS["Namespace"]:::ns + CWP <-->|Workflows| CNS + end + subgraph CSEC["Secondary"] + CR["Replica"]:::ns + subgraph CWP2["Worker Pool"] + CE["      Empty      "]:::empty + end + CR ~~~ CWP2 + end + CNS --> CR + class CPRIM,CSEC region + class CWP,CWP2 pool +``` -Some systems must be active wherever Workers are running (for example, Codec Servers), while others might follow a different failover sequence (for example, external databases). -The [Worker deployment patterns](#worker-deployment-patterns) below note when each piece needs to be running ahead of time versus scaled up after a failover. 
+```mermaid +--- +title: After failover +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph FPRIM["Primary (outage)"] + subgraph FPP["Worker Pool"] + FPW1["Unavailable"]:::down + end + FPN["Namespace"]:::down + FPP ~~~ FPN + end + subgraph FSEC["Secondary"] + FSN["Namespace
(Active)"]:::ns + subgraph FSP["Worker Pool"] + FSW1["Worker<br/>Cold start"]:::worker + FSW2["Worker<br/>Cold start"]:::worker + FSW3["Worker
Cold start"]:::worker + end + FSN <-->|Workflows| FSP + end + FPN -->|"Failover"| FSN + class FPRIM regiondown + class FSEC region + class FPP,FSP pool +``` -## Worker deployment patterns {/* #worker-deployment-patterns */} +**Active / Passive (Hot): normal operation and failover** -This page covers three main patterns — **Active / Passive (Cold)**, **Active / Passive (Hot)**, and **Active / Active** — plus a rarely needed **Dual Active** variant. -They trade off **recovery time** after an outage, **steady-state cost**, and **operational complexity**. +```mermaid +--- +title: Normal operation +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph HPRIM["Primary"] + subgraph HWP["Worker Pool"] + HW1["Worker
Active"]:::worker + HW2["Worker<br/>Active"]:::worker + HW3["Worker<br/>Active"]:::worker + end + HNS["Namespace"]:::ns + HWP <-->|Workflows| HNS + end + subgraph HSEC["Secondary"] + HR["Replica"]:::ns + subgraph HWP2["Worker Pool"] + HS1["Worker<br/>Standby"]:::worker + HS2["Worker<br/>Standby"]:::worker + HS3["Worker
Standby"]:::worker + end + HR <-->|"Connected"| HWP2 + end + HNS --> HR + class HPRIM,HSEC region + class HWP,HWP2 pool +``` -Temporal recommends starting with **Active / Passive (Cold)** and shifting to another pattern when business requirements warrant it. +```mermaid +--- +title: After failover +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph HFPRIM["Primary (outage)"] + subgraph HFPP["Worker Pool"] + HFPW1["Unavailable"]:::down + end + HFPN["Namespace"]:::down + HFPP ~~~ HFPN + end + subgraph HFSEC["Secondary"] + HFSN["Namespace
(Active)"]:::ns + subgraph HFSP["Worker Pool"] + HFSW1["Worker<br/>Active"]:::worker + HFSW2["Worker<br/>Active"]:::worker + HFSW3["Worker
Active"]:::worker + end + HFSN <-->|Workflows| HFSP + end + HFPN -->|"Failover"| HFSN + class HFPRIM regiondown + class HFSEC region + class HFPP,HFSP pool +``` -| Pattern | Best for | Major benefits | Major tradeoffs | -| --- | --- | --- | --- | -| **[Active / Passive (Cold)](#active-cold)** | Easy initial deployment | Acts like a single region; no special setup required | Failing over Workers is the user's responsibility | -| **[Active / Passive (Hot)](#active-hot)** | Low RTO with strict single-region behavior | Fast Worker failover; guaranteed to act like a single region | More configuration and higher cost for the Worker fleet | -| **[Active / Active](#active-active)** | Low RTO with Workers active in multiple regions | Fast Worker failover; uses Worker fleet capacity (no idle standby) | Cross-region requests add Workflow latency | +**Active / Active: normal operation and failover** + +```mermaid +--- +title: Normal operation +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph APRIM["Primary"] + subgraph AWP["Worker Pool"] + AW1["Worker
Active"]:::worker + AW2["Worker<br/>Active"]:::worker + end + ANS["Namespace"]:::ns + AWP <-->|Workflows| ANS + end + subgraph ASEC["Secondary"] + AR["Replica"]:::ns + subgraph AWP2["Worker Pool"] + AS1["Worker<br/>Active"]:::worker + AS2["Worker
Active"]:::worker + end + AR <-->|Workflows| AWP2 + end + ANS --> AR + class APRIM,ASEC region + class AWP,AWP2 pool +``` + +```mermaid +--- +title: After failover +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef workerhollow fill:transparent,stroke:#7C3AED,stroke-width:1px,stroke-dasharray:4 3; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph AFPRIM["Primary (outage)"] + subgraph AFPP["Worker Pool"] + AFPW1["Unavailable"]:::down + end + AFPN["Namespace"]:::down + AFPP ~~~ AFPN + end + subgraph AFSEC["Secondary"] + AFSN["Namespace
(Active)"]:::ns + subgraph AFSP["Worker Pool"] + AFSW1["Worker<br/>Active"]:::worker + AFSW2["Worker<br/>Active"]:::worker + AFSW3["Worker<br/>Scaled up
(as needed)"]:::workerhollow + end + AFSN <-->|Workflows| AFSP + end + AFPN -->|"Failover"| AFSN + class AFPRIM regiondown + class AFSEC region + class AFPP,AFSP pool +``` + +:::info + +**Namespaces are always Active / Passive, but can support an Active / Active pattern.** + +A Temporal Cloud Namespace with High Availability has exactly one active region at a time. The other region holds a replica that passively receives replicated state. -The diagrams below use a shared visual language: +However, since both regions can serve requests and Worker polls, **Workers don't need to run in the same region as the active replica**, and Temporal Cloud Namespaces can still fit into a broader "Active / Active" strategy, as described below. -- A green border marks the **active** Temporal Cloud replica and the Workers processing against it. -- A muted dashed border marks the **passive** replica; a gold dashed border marks **idle** standby Workers. -- A purple fill marks application-owned systems (Workers, databases, queues). -- A red tint marks the region that is **down** during a failover. +::: ### Active / Passive (Cold) {/* #active-cold */} @@ -90,58 +286,79 @@ _Also known as Active / Cold Standby, Active / Cold, or simply Active / Passive. Workers run **only in the primary region**. The secondary region holds the passive replica but runs none of the application's Workers. 
-**Steady state** - ```mermaid -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +--- +title: Normal operation +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR - classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; - classDef tpassive fill:#59FDA012,stroke:#7C8FB1,stroke-width:1px,stroke-dasharray:4 3; - classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; - classDef wnew fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px,stroke-dasharray:5 3; - classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; - subgraph P["PRIMARY REGION"] - direction TB - W1["Workers
processing"]:::wactive - DB1["External DB / queue"]:::ext - R1["Namespace replica<br/>ACTIVE"]:::tcloud - end - subgraph S["SECONDARY REGION"] - direction TB - DB2["External DB / queue"]:::ext - R2["Namespace replica
PASSIVE"]:::tpassive - end - W1 --> R1 - W1 <--> DB1 - DB1 <-->|replication (if needed)| DB2 - R1 -. replicates .-> R2 + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef empty fill:transparent,stroke:#9aa4b2,stroke-width:1px,stroke-dasharray:4 3,color:#9aa4b2; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph DCNPRI["Primary"] + subgraph DCNWP["Worker Pool"] + DCNW1["Worker"]:::worker + DCNW2["Worker"]:::worker + DCNW3["Worker"]:::worker + end + DCNNS["Namespace"]:::ns + DCNDB[("External DB / queue")]:::ext + DCNWP <-->|Workflows| DCNNS + DCNWP <--> DCNDB + end + subgraph DCNSEC["Secondary"] + DCNR["Replica"]:::ns + DCNDB2[("External DB / queue")]:::ext + subgraph DCNWP2["Worker Pool"] + DCNE["      Empty      "]:::empty + end + DCNR ~~~ DCNWP2 + end + DCNNS -. 
replicates .-> DCNR + DCNDB <-.->|"replication (if needed)"| DCNDB2 + class DCNPRI,DCNSEC region + class DCNWP,DCNWP2 pool ``` -**Failover** - ```mermaid -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +--- +title: After failover +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR - classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; - classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; - classDef wnew fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px,stroke-dasharray:5 3; - classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; - classDef down fill:#ED360E22,stroke:#ED360E,stroke-width:1px,stroke-dasharray:2 2; - subgraph P["PRIMARY REGION — OUTAGE"] - direction TB - W1["Workers
unavailable"]:::down - DB1["External DB / queue"]:::down - R1["Namespace replica"]:::down - end - subgraph S["SECONDARY REGION"] - direction TB - W2["Workers<br/>cold start"]:::wnew - DB2["External DB / queue<br/>PROMOTED (if needed)"]:::ext - R2["Namespace replica
ACTIVE"]:::tcloud - end - W2 --> R2 - W2 <--> DB2 - style P fill:#ED360E14,stroke:#ED360E + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph DCFPRI["Primary (outage)"] + subgraph DCFWP["Worker Pool"] + DCFW1["Unavailable"]:::down + end + DCFNS["Namespace"]:::down + DCFDB[("External DB / queue")]:::down + DCFWP ~~~ DCFNS + end + subgraph DCFSEC["Secondary"] + DCFN["Namespace
(Active)"]:::ns + subgraph DCFSP["Worker Pool"] + DCFSW1["Worker<br/>Cold start"]:::worker + DCFSW2["Worker<br/>Cold start"]:::worker + DCFSW3["Worker<br/>Cold start"]:::worker + end + DCFDB2[("External DB / queue
Promoted (if needed)")]:::ext + DCFN <-->|Workflows| DCFSP + DCFSP <--> DCFDB2 + end + DCFNS -->|"Failover"| DCFN + class DCFPRI regiondown + class DCFSEC region + class DCFWP,DCFSP pool ``` On failover, the Namespace is active in the secondary region immediately, but the Workers there start from nothing, a "cold" start. @@ -150,8 +367,8 @@ Recovery time includes container or VM startup, image pulls, and application war **Benefits** - **Easy to reason about.** Only one region is active at a time, so traffic routing and interactions with external systems (such as databases and queues) are simpler to understand, and the pattern pairs naturally with other active / passive systems. Active / Active, by contrast, requires deciding how Workers reach an active database: either a local active database in each region, or a single active / passive database that some Workers must reach cross-region. -- Simplest pattern to operate; in steady state it resembles a single-region deployment. -- Lowest steady-state cost: a single Worker fleet. +- Simplest pattern to operate; during normal operation it resembles a single-region deployment. +- Lowest cost during normal operation: a single Worker fleet. **Tradeoffs** @@ -177,76 +394,97 @@ Recovery time includes container or VM startup, image pulls, and application war _Also known as Active / Hot Standby or Active / Hot._ -Workers are deployed in **both regions**, but only the active region processes Workflows. The secondary-region Workers stay connected and warm, yet idle. +Workers are deployed in **both regions**, but only the active region processes Workflows. The secondary-region Workers stay connected and warm, yet on standby. This is achieved by disabling forwarding for Worker polls and connecting each fleet to its local replica through a [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) or [VPC Endpoint](/cloud/high-availability/ha-connectivity). 
-With forwarding disabled, polls that reach the passive replica are not sent to the active region, so the idle fleet does no work and adds no cross-region overhead. - -**Steady state** +With forwarding disabled, polls that reach the passive replica are not sent to the active region, so the standby fleet does no work and adds no cross-region overhead. ```mermaid -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +--- +title: Normal operation +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR - classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; - classDef tpassive fill:#59FDA012,stroke:#7C8FB1,stroke-width:1px,stroke-dasharray:4 3; - classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; - classDef widle fill:#7C3AED18,stroke:#FECB2F,stroke-width:1px,stroke-dasharray:4 3; - classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; - subgraph P["PRIMARY REGION"] - direction TB - W1["Workers
processing"]:::wactive - DB1["External DB / queue"]:::ext - R1["Namespace replica<br/>ACTIVE"]:::tcloud - end - subgraph S["SECONDARY REGION"] - direction TB - W2["Workers<br/>connected, idle"]:::widle - DB2["External DB / queue<br/>standby"]:::ext - R2["Namespace replica
PASSIVE"]:::tpassive - end - W1 --> R1 - W1 <--> DB1 - W2 -. idle .-> R2 - W2 <--> DB2 - R1 -. replicates .-> R2 + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph DHNPRI["Primary"] + subgraph DHNWP["Worker Pool"] + DHNW1["Worker
Active"]:::worker + DHNW2["Worker<br/>Active"]:::worker + DHNW3["Worker<br/>Active"]:::worker + end + DHNNS["Namespace"]:::ns + DHNDB[("External DB / queue")]:::ext + DHNWP <-->|Workflows| DHNNS + DHNWP <--> DHNDB + end + subgraph DHNSEC["Secondary"] + DHNR["Replica"]:::ns + subgraph DHNWP2["Worker Pool"] + DHNSW1["Worker<br/>Standby"]:::worker + DHNSW2["Worker<br/>Standby"]:::worker + DHNSW3["Worker<br/>Standby"]:::worker + end + DHNDB2[("External DB / queue
Standby")]:::ext + DHNR <-->|"Connected"| DHNWP2 + DHNWP2 <--> DHNDB2 + end + DHNNS -. replicates .-> DHNR + class DHNPRI,DHNSEC region + class DHNWP,DHNWP2 pool ``` -**Failover** - ```mermaid -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +--- +title: After failover +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR - classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; - classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; - classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; - classDef down fill:#ED360E22,stroke:#ED360E,stroke-width:1px,stroke-dasharray:2 2; - subgraph P["PRIMARY REGION — OUTAGE"] - direction TB - W1["Workers"]:::down - R1["Namespace replica"]:::down - end - subgraph S["SECONDARY REGION"] - direction TB - W2["Workers
now processing"]:::wactive - DB2["External DB / queue<br/>PROMOTED (if needed)"]:::ext - R2["Namespace replica
ACTIVE"]:::tcloud - end - W2 --> R2 - W2 <--> DB2 - style P fill:#ED360E14,stroke:#ED360E + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph DHFPRI["Primary (outage)"] + subgraph DHFWP["Worker Pool"] + DHFW1["Unavailable"]:::down + end + DHFNS["Namespace"]:::down + DHFWP ~~~ DHFNS + end + subgraph DHFSEC["Secondary"] + DHFN["Namespace
(Active)"]:::ns + subgraph DHFSP["Worker Pool"] + DHFSW1["Worker<br/>Active"]:::worker + DHFSW2["Worker<br/>Active"]:::worker + DHFSW3["Worker<br/>Active"]:::worker + end + DHFDB2[("External DB / queue
Promoted (if needed)")]:::ext + DHFN <-->|Workflows| DHFSP + DHFSP <--> DHFDB2 + end + DHFNS -->|"Failover"| DHFN + class DHFPRI regiondown + class DHFSEC region + class DHFWP,DHFSP pool ``` -Failover is near-instant: the Namespace failover and the Worker "failover" happen together and automatically, with no DNS wait and no cold start. The previously idle fleet begins processing the moment the secondary region becomes active, so this pattern achieves the lowest recovery time. +Failover is near-instant: the Namespace failover and the Worker "failover" happen together and automatically, with no DNS wait and no cold start. The previously standby fleet begins processing the moment the secondary region becomes active, so this pattern achieves the lowest recovery time. **Benefits** - **Easy to reason about.** Only one region is active at a time, so traffic routing and interactions with external systems (such as databases and queues) are simpler to understand, and the pattern pairs naturally with other active / passive systems. Active / Active, by contrast, requires deciding how Workers reach an active database: either a local active database in each region, or a single active / passive database that some Workers must reach cross-region. - Lowest recovery time: the secondary-region Workers are already connected and warm. -- Low steady-state latency: Tasks are processed only in the active region, with no cross-region forwarding. +- Low latency during normal operation: Tasks are processed only in the active region, with no cross-region forwarding. **Tradeoffs** -- Highest steady-state cost: idle Worker capacity runs in the secondary region at all times. +- Highest cost during normal operation: standby Worker capacity runs in the secondary region at all times. 
**Recommendations and important constraints** @@ -266,61 +504,79 @@ Workers run in **both regions and process Workflows at the same time**, with for A Temporal Cloud Namespace is not "active/active" in the database sense; it still has a single active replica in one region. Because the passive replica transparently forwards requests to and from the active region, a Worker fleet in either region can process Workflows. The secondary fleet's polls are forwarded across regions to the active replica. -**Steady state** - ```mermaid -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +--- +title: Normal operation +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR - classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; - classDef tpassive fill:#59FDA012,stroke:#7C8FB1,stroke-width:1px,stroke-dasharray:4 3; - classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; - classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; - subgraph P["PRIMARY REGION"] - direction TB - W1["Workers
processing"]:::wactive - DB1["External DB / queue"]:::ext - R1["Namespace replica<br/>ACTIVE"]:::tcloud - end - subgraph S["SECONDARY REGION"] - direction TB - W2["Workers<br/>processing"]:::wactive - DB2["External DB / queue"]:::ext - R2["Namespace replica
PASSIVE"]:::tpassive - end - W1 --> R1 - W1 <--> DB1 - W2 --> R2 - R2 ==>|forwards polls| R1 - W2 <--> DB2 - R1 -. replicates .-> R2 + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph DANPRI["Primary"] + subgraph DANWP["Worker Pool"] + DANW1["Worker
Active"]:::worker + DANW2["Worker<br/>Active"]:::worker + end + DANNS["Namespace"]:::ns + DANDB[("External DB / queue")]:::ext + DANWP <-->|Workflows| DANNS + DANWP <--> DANDB + end + subgraph DANSEC["Secondary"] + DANR["Replica"]:::ns + subgraph DANWP2["Worker Pool"] + DANS1["Worker<br/>Active"]:::worker + DANS2["Worker
Active"]:::worker + end + DANDB2[("External DB / queue")]:::ext + DANWP2 <-->|Workflows| DANR + DANWP2 <--> DANDB2 + end + DANNS -. replicates .-> DANR + DANR ==>|"forwards polls"| DANNS + class DANPRI,DANSEC region + class DANWP,DANWP2 pool ``` -**Failover** - ```mermaid -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +--- +title: After failover +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR - classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; - classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; - classDef wnew fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px,stroke-dasharray:5 3; - classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; - classDef down fill:#ED360E22,stroke:#ED360E,stroke-width:1px,stroke-dasharray:2 2; - subgraph P["PRIMARY REGION — OUTAGE"] - direction TB - W1["Workers"]:::down - R1["Namespace replica"]:::down - end - subgraph S["SECONDARY REGION"] - direction TB - W2["Workers
keep processing"]:::wactive - W3["Workers
scaled up"]:::wnew - DB2["External DB / queue"]:::ext - R2["Namespace replica
ACTIVE"]:::tcloud - end - W2 --> R2 - W3 --> R2 - W2 <--> DB2 - style P fill:#ED360E14,stroke:#ED360E + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef workerhollow fill:transparent,stroke:#7C3AED,stroke-width:1px,stroke-dasharray:4 3; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph DAFPRI["Primary (outage)"] + subgraph DAFWP["Worker Pool"] + DAFW1["Unavailable"]:::down + end + DAFNS["Namespace"]:::down + DAFWP ~~~ DAFNS + end + subgraph DAFSEC["Secondary"] + DAFN["Namespace
(Active)"]:::ns + subgraph DAFSP["Worker Pool"] + DAFSW1["Worker
Active"]:::worker + DAFSW2["Worker
Active"]:::worker + DAFSW3["Worker
Scaled up
(as needed)"]:::workerhollow + end + DAFDB2[("External DB / queue")]:::ext + DAFN <-->|Workflows| DAFSP + DAFSP <--> DAFDB2 + end + DAFNS -->|"Failover"| DAFN + class DAFPRI regiondown + class DAFSEC region + class DAFWP,DAFSP pool ``` This is a practical way to reach a low recovery time at balanced cost. Roughly half the fleet runs in each region, and capacity is added to the surviving region during an outage to reach full throughput. @@ -351,56 +607,78 @@ Unlike Active / Passive (Cold), Workflows keep processing in the surviving regio Beyond the three main patterns, some architectures need low-latency or region-bound data in *each* region at once. This can be achieved with **two Namespaces whose active and passive regions overlap**: each region holds one Namespace's active replica and the other Namespace's passive replica. -**Steady state** - ```mermaid -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +--- +title: Normal operation +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR - classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; - classDef tpassive fill:#59FDA012,stroke:#7C8FB1,stroke-width:1px,stroke-dasharray:4 3; - classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; - subgraph R1["REGION 1"] - direction TB - WA["App A Workers"]:::wactive - NAa["Namespace A
ACTIVE"]:::tcloud - NBp["Namespace B
PASSIVE"]:::tpassive - end - subgraph R2["REGION 2"] - direction TB - WB["App B Workers"]:::wactive - NBa["Namespace B
ACTIVE"]:::tcloud - NAp["Namespace A
PASSIVE"]:::tpassive - end - WA --> NAa - WB --> NBa - NAa -. replicates .-> NAp - NBa -. replicates .-> NBp + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph DUNR1["Region 1"] + subgraph DUNWPA["App A Worker Pool"] + DUNA1["Worker
Active"]:::worker + DUNA2["Worker
Active"]:::worker + end + DUNNAa["Namespace A"]:::ns + DUNNBp["Namespace B
Replica"]:::ns + DUNWPA <-->|Workflows| DUNNAa + end + subgraph DUNR2["Region 2"] + subgraph DUNWPB["App B Worker Pool"] + DUNB1["Worker
Active"]:::worker + DUNB2["Worker
Active"]:::worker + end + DUNNBa["Namespace B"]:::ns + DUNNAp["Namespace A
Replica"]:::ns + DUNWPB <-->|Workflows| DUNNBa + end + DUNNAa -. replicates .-> DUNNAp + DUNNBa -. replicates .-> DUNNBp + class DUNR1,DUNR2 region + class DUNWPA,DUNWPB pool ``` -**Failover (Region 1 outage)** - ```mermaid -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':45,'rankSpacing':70,'curve':'basis'}}}%% +--- +title: After failover (Region 1 outage) +--- +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR - classDef tcloud fill:#59FDA024,stroke:#59FDA0,stroke-width:2px; - classDef wactive fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px; - classDef wnew fill:#7C3AED22,stroke:#59FDA0,stroke-width:2px,stroke-dasharray:5 3; - classDef down fill:#ED360E22,stroke:#ED360E,stroke-width:1px,stroke-dasharray:2 2; - subgraph R1["REGION 1 — OUTAGE"] - direction TB - WA["App A Workers"]:::down - NAa["Namespace A"]:::down - end - subgraph R2["REGION 2"] - direction TB - WB["App B Workers"]:::wactive - WA2["App A Workers
brought up"]:::wnew - NBa["Namespace B
ACTIVE"]:::tcloud - NAa2["Namespace A
ACTIVE"]:::tcloud - end - WB --> NBa - WA2 --> NAa2 - style R1 fill:#ED360E14,stroke:#ED360E + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef workerhollow fill:transparent,stroke:#7C3AED,stroke-width:1px,stroke-dasharray:4 3; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; + subgraph DUFR1["Region 1 (outage)"] + subgraph DUFWPA["App A Worker Pool"] + DUFA1["Unavailable"]:::down + end + DUFNAa["Namespace A"]:::down + DUFWPA ~~~ DUFNAa + end + subgraph DUFR2["Region 2"] + subgraph DUFWPB["App B Worker Pool"] + DUFB1["Worker
Active"]:::worker + DUFB2["Worker
Active"]:::worker + end + subgraph DUFWPA2["App A Worker Pool"] + DUFA2a["Worker
Brought up"]:::workerhollow + DUFA2b["Worker
Brought up"]:::workerhollow + end + DUFNBa["Namespace B
(Active)"]:::ns + DUFNAa2["Namespace A
(Active)"]:::ns + DUFWPB <-->|Workflows| DUFNBa + DUFWPA2 <-->|Workflows| DUFNAa2 + end + DUFNAa -->|"Failover"| DUFNAa2 + class DUFR1 regiondown + class DUFR2 region + class DUFWPA,DUFWPB,DUFWPA2 pool ``` Each Namespace serves low-latency requests or a regionally-bound database in its own active region, and fails over to the other region during an outage. The same idea extends across more than two regions. Each Namespace fails over independently, following the Active / Passive sequence. @@ -426,10 +704,10 @@ Workloads on Temporal rarely need this. It pays off only when a workload is *bot ## Choose a deployment pattern {/* #choose */} -| Pattern | Recovery time | Steady-state cost | Best when | +| Pattern | Recovery time | Normal-operation cost | Best when | | --- | --- | --- | --- | | **Active / Passive (Cold)** | Highest (cold start in secondary) | Lowest (one fleet) | Adopting High Availability with the simplest operations. | -| **Active / Passive (Hot)** | Lowest (warm, no DNS wait) | Higher (idle fleet) | The lowest recovery time is required and the data plane is pinned to one region at a time. | +| **Active / Passive (Hot)** | Lowest (warm, no DNS wait) | Higher (standby fleet) | The lowest recovery time is required and the data plane is pinned to one region at a time. | | **Active / Active** | Low (surviving region keeps processing) | Higher (two live fleets) | Low recovery time at balanced cost, where the secondary region can tolerate cross-region latency. | | **Dual Active** | Low (per Namespace) | Highest (two fleets, two Namespaces) | Low-latency, region-bound data is genuinely required in each region. Rare. 
| diff --git a/docs/cloud/high-availability/enable.mdx b/docs/cloud/high-availability/enable.mdx index 57535551a9..921c71d1d5 100644 --- a/docs/cloud/high-availability/enable.mdx +++ b/docs/cloud/high-availability/enable.mdx @@ -137,7 +137,7 @@ Client APIs (Start, Signal, Cancel, Terminate, Query, and the equivalent Activit Same-region replicas are not affected by this setting. -To deploy Worker fleets in both regions that stay idle in the passive region until failover, see [Active / Passive (Hot)](/cloud/high-availability/deployment-patterns#active-hot). +To deploy Worker fleets in both regions that stay on standby in the passive region until failover, see [Active / Passive (Hot)](/cloud/high-availability/deployment-patterns#active-hot). :::info diff --git a/docs/cloud/high-availability/ha-connectivity.mdx b/docs/cloud/high-availability/ha-connectivity.mdx index 291a9f37ee..58fd7779ee 100644 --- a/docs/cloud/high-availability/ha-connectivity.mdx +++ b/docs/cloud/high-availability/ha-connectivity.mdx @@ -95,7 +95,7 @@ To stop forwarding Worker polls on a Namespace, see [Change the forwarding behav To run Worker fleets in both regions that rely on this forwarding, see [Active / Active](/cloud/high-availability/deployment-patterns#active-active). -To keep passive-region Workers idle until failover by disabling this forwarding, see [Active / Passive (Hot)](/cloud/high-availability/deployment-patterns#active-hot). +To keep passive-region Workers on standby until failover by disabling this forwarding, see [Active / Passive (Hot)](/cloud/high-availability/deployment-patterns#active-hot). 
## How to use PrivateLink with High Availability features diff --git a/docs/cloud/high-availability/index.mdx b/docs/cloud/high-availability/index.mdx index 0b6b1bb180..b7f737061b 100644 --- a/docs/cloud/high-availability/index.mdx +++ b/docs/cloud/high-availability/index.mdx @@ -107,7 +107,7 @@ To disable passive region replica forwarding, see [Change the forwarding behavio To run Worker fleets in both regions that rely on this forwarding, see [Active / Active](/cloud/high-availability/deployment-patterns#active-active). -To keep passive-region Workers idle until failover by disabling this forwarding, see [Active / Passive (Hot)](/cloud/high-availability/deployment-patterns#active-hot). +To keep passive-region Workers on standby until failover by disabling this forwarding, see [Active / Passive (Hot)](/cloud/high-availability/deployment-patterns#active-hot). ## Service levels and recovery objectives From 420713a22179a6e00af8ddfd196251347c14a971 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Mon, 15 Jun 2026 15:32:34 -0700 Subject: [PATCH 8/9] updates --- .../high-availability/deployment-patterns.mdx | 468 +++++++++++------- 1 file changed, 284 insertions(+), 184 deletions(-) diff --git a/docs/cloud/high-availability/deployment-patterns.mdx b/docs/cloud/high-availability/deployment-patterns.mdx index 34a86f9b89..cba677ca3c 100644 --- a/docs/cloud/high-availability/deployment-patterns.mdx +++ b/docs/cloud/high-availability/deployment-patterns.mdx @@ -27,51 +27,63 @@ This page describes common patterns for deploying Workers and the rest of the ar Beyond the Namespace itself, these components live in the application environment and must be planned for: -- **Workers** — execute Workflows and Activities. +- **Workers** (the focus of this page) — execute Workflows and Activities. - **Workflow starters and Clients** — start and signal Workflows. - **Codec Servers** — encode and decode payloads for Workers, the Web UI, and the CLI. 
-- **Proxies between Workers and Temporal Cloud** — any forward proxy or mTLS terminator in the connection path. -- **External databases and queues** — the systems that Activities read and write. +- **Proxies between Workers and Temporal Cloud** — any forward proxy or mTLS terminator in the connection path between Workers / Starters / Clients → Namespace. +- **Databases and queues** — the systems that Activities read and write. -Some systems must be active wherever Workers are running (for example, Codec Servers), while others might follow a different failover sequence (for example, external databases). +Some systems must be active wherever Workers are running (for example, Codec Servers), while others might follow a different failover sequence (for example, databases). Because the right choice for each of these usually depends on where Workers run, **this page focuses on Worker deployment patterns**. +:::tip + +See [High Availability for Temporal Cloud Namespaces](/cloud/high-availability) to learn more about Namespace replicas, replication, and failover. + +::: + ## Worker deployment patterns {/* #worker-deployment-patterns */} This page covers three main patterns — **Active / Passive (Cold)**, **Active / Passive (Hot)**, and **Active / Active** — plus a rarely needed **Dual Active** variant. They trade off **recovery time** after an outage, **cost during normal operation**, and **operational complexity**, and differ in where the Workers run and where Workflows process: -- **Active / Passive** — Workflows process in one region at a time. It has two variants: - - **[Active / Cold](#active-cold)** — Workers run in one region at a time and Workflows process in one region, both the user's responsibility to enforce. After a failover, Workers start in the secondary region. - - **[Active / Hot](#active-hot)** — Workers run in both regions, but the system guarantees Workflows process in only the active region. 
-- **[Active / Active](#active-active)** — Workers run in both regions, and Workflows process in both regions at the same time. +- **Active / Passive** — Workflows process in one region at a time, the "active" region. The other region is "passive" and ready for failover. This pattern has two variants: + - **[Active / Passive (Cold)](#active-cold)** — a.k.a. Active / Cold — Workers run in only one region at a time. After a failover, Workers start in the secondary region. The region where Workers run == the region where Workflows process. To fail over, Workers need a "cold start" in the other region. + - **[Active / Passive (Hot)](#active-hot)** — a.k.a. Active / Hot — Workers run in **both regions** simultaneously, but Workflows still process in only one region at any given time. The other region's Workers are on "hot" standby. +- **[Active / Active](#active-active)** — Workflows process in both regions at the same time. Necessarily, Workers run in both regions at all times. -:::tip +:::info -To learn more about High Availability for Temporal Cloud Namespaces — including replicas, active and passive regions, replication, and failover — see [High Availability for Temporal Cloud Namespaces](/cloud/high-availability). +**Namespaces are always Active / Passive, but can support an Active / Active pattern.** + +A Temporal Cloud Namespace with High Availability has exactly one active region at a time. The other region holds a replica that passively receives replicated state. + +However, since both regions can serve requests and Worker polls, **Workers don't need to run in the same region as the active replica**, and Temporal Cloud Namespaces can still fit into a broader "Active / Active" strategy, as described below. ::: These patterns work across two cloud regions, which could be in the same cloud provider or different cloud providers: - **Primary region** — the region where the Namespace is active during normal operation, also called the "preferred region." 
-- **Secondary region** — the region the Namespace fails over to. It holds a replica and is passive during normal operation. +- **Secondary region** — the region the Namespace fails over to. It can be any [Temporal Cloud region](/cloud/regions) that supports replication from the primary region. + +:::tip -**Benefits and tradeoffs at a glance** +Multi-region Replication and Multi-cloud Replication generally use the same set of Worker deployment patterns, so this page will not distinguish between multi-region and multi-cloud. + +::: + +### Compare Worker deployment patterns at a glance (benefits and tradeoffs) {/* #compare-at-a-glance */} | Pattern | Best for | Major benefits | Major tradeoffs | | --- | --- | --- | --- | | **[Active / Passive (Cold)](#active-cold)** | Easy initial deployment | Acts like a single region; no special setup required | Failing over Workers is the user's responsibility | -| **[Active / Passive (Hot)](#active-hot)** | Low RTO with strict single-region behavior | Fast Worker failover; guaranteed to act like a single region | More configuration and higher cost for the Worker fleet | -| **[Active / Active](#active-active)** | Low RTO with Workers active in multiple regions | Fast Worker failover; uses Worker fleet capacity (no standby Workers) | Cross-region requests add Workflow latency | - -**Active / Passive (Cold): normal operation and failover** ```mermaid --- title: Normal operation --- -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef ns 
fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; @@ -103,7 +115,7 @@ flowchart LR --- title: After failover --- -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; @@ -133,13 +145,17 @@ flowchart LR class FPP,FSP pool ``` -**Active / Passive (Hot): normal operation and failover** +--- + +| Pattern | Best for | Major benefits | Major tradeoffs | +| --- | --- | --- | --- | +| **[Active / Passive (Hot)](#active-hot)** | Low RTO with strict single-region behavior | Fast Worker failover; guaranteed to act like a single region | More configuration and higher cost for the Worker fleet | ```mermaid --- title: Normal operation --- -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; @@ -172,7 +188,7 @@ flowchart LR --- title: After failover --- -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% 
+%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; @@ -202,13 +218,17 @@ flowchart LR class HFPP,HFSP pool ``` -**Active / Active: normal operation and failover** +--- + +| Pattern | Best for | Major benefits | Major tradeoffs | +| --- | --- | --- | --- | +| **[Active / Active](#active-active)** | Low RTO with Workers active in multiple regions | Fast Worker failover; uses Worker fleet capacity (no standby Workers) | Cross-region requests add Workflow latency | ```mermaid --- title: Normal operation --- -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; @@ -239,7 +259,7 @@ flowchart LR --- title: After failover --- -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef 
workerhollow fill:transparent,stroke:#7C3AED,stroke-width:1px,stroke-dasharray:4 3; @@ -270,64 +290,52 @@ flowchart LR class AFPP,AFSP pool ``` -:::info - -**Namespaces are always Active / Passive, but can support an Active / Active pattern.** - -A Temporal Cloud Namespace with High Availability has exactly one active region at a time. The other region holds a replica that passively receives replicated state. - -However, since both regions can serve requests and Worker polls, **Workers don't need to run in the same region as the active replica**, and Temporal Cloud Namespaces can still fit into a broader "Active / Active" strategy, as described below. - -::: - ### Active / Passive (Cold) {/* #active-cold */} -_Also known as Active / Cold Standby, Active / Cold, or simply Active / Passive. Recommended and simplest._ +_Also known as "Active / Cold Standby", "Active / Cold", or simply "Active / Passive"._ -Workers run **only in the primary region**. The secondary region holds the passive replica but runs none of the application's Workers. +Active / Cold Pattern: **Normal operation** + +- **Workers run in only one region.** A single Worker fleet runs in the primary region and processes all Workflows. No Workers run in the secondary region. +- **The Namespace replicates to the secondary region.** A Namespace with High Availability has an active replica in the primary region and a passive replica in the secondary region. Temporal Cloud continuously replicates Workflow state to the passive replica, so it stays ready to become active. +- **Your databases and queues replicate too, if needed.** Workers read and write systems such as databases and queues. If your Workflows depend on that data, replicate it to the secondary region so it's available after a failover. Workflows that don't touch external state may not need this. 
+- **Setup is minimal.** Turn on Replication for your Namespace (see [High Availability for Temporal Cloud Namespaces](/cloud/high-availability)) and enable replication on any databases or queues your Workflows use. At that point you're technically already running Active / Passive (Cold): the secondary region holds a ready replica, and failing over is a matter of bringing your Workers up there. ```mermaid ---- -title: Normal operation ---- -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; - classDef empty fill:transparent,stroke:#9aa4b2,stroke-width:1px,stroke-dasharray:4 3,color:#9aa4b2; classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; - classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; subgraph DCNPRI["Primary"] - subgraph DCNWP["Worker Pool"] - DCNW1["Worker"]:::worker - DCNW2["Worker"]:::worker - DCNW3["Worker"]:::worker - end + DCNWP["Workers"]:::worker + DCNCODEC["Codec Server"]:::ext DCNNS["Namespace"]:::ns - DCNDB[("External DB / queue")]:::ext + DCNDB[("DB / queue")]:::ext DCNWP <-->|Workflows| DCNNS DCNWP <--> DCNDB + DCNWP <--> DCNCODEC end subgraph DCNSEC["Secondary"] DCNR["Replica"]:::ns - DCNDB2[("External DB / queue")]:::ext - subgraph DCNWP2["Worker Pool"] - DCNE["      Empty      "]:::empty - end - DCNR ~~~ DCNWP2 + DCNDB2[("DB / queue")]:::ext + DCNR ~~~ DCNDB2 end DCNNS -. 
replicates .-> DCNR DCNDB <-.->|"replication (if needed)"| DCNDB2 class DCNPRI,DCNSEC region - class DCNWP,DCNWP2 pool ``` +Active / Cold Pattern: **On failover** + +- **The Namespace fails over automatically.** Temporal Cloud promotes the secondary region's replica to active. No action is needed to fail over the Namespace itself. +- **You bring the Workers up in the secondary region.** Because no Workers were running there, they start from nothing — a "cold" start. Starting and scaling that fleet is your responsibility, ideally through tested automation. Until the Workers are running, no Workflows make progress. +- **Promote your databases and queues, if needed.** If your Workflows depend on external data, make the secondary region's copy active so the new Workers can read and write it. +- **Recovery time is dominated by Worker startup.** After Temporal detects the outage and triggers failover, the Namespace is active almost immediately, but throughput returns to normal only after container or VM startup, image pulls, and application warm-up complete. 
+ ```mermaid ---- -title: After failover ---- -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; @@ -335,64 +343,165 @@ flowchart LR classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; - classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; subgraph DCFPRI["Primary (outage)"] - subgraph DCFWP["Worker Pool"] - DCFW1["Unavailable"]:::down - end DCFNS["Namespace"]:::down - DCFDB[("External DB / queue")]:::down - DCFWP ~~~ DCFNS + DCFDB[("DB / queue
Unavailable")]:::down + DCFNS ~~~ DCFDB end subgraph DCFSEC["Secondary"] DCFN["Namespace
(Active)"]:::ns - subgraph DCFSP["Worker Pool"] - DCFSW1["Worker
Cold start"]:::worker - DCFSW2["Worker
Cold start"]:::worker - DCFSW3["Worker
Cold start"]:::worker - end - DCFDB2[("External DB / queue
Promoted (if needed)")]:::ext + DCFSP["Workers
Cold start"]:::worker + DCFCODEC["Codec Server
Cold start"]:::ext + DCFDB2[("DB / queue
Promoted")]:::ext DCFN <-->|Workflows| DCFSP DCFSP <--> DCFDB2 + DCFSP <--> DCFCODEC end - DCFNS -->|"Failover"| DCFN + DCFPRI -->|"Failover"| DCFSEC class DCFPRI regiondown class DCFSEC region - class DCFWP,DCFSP pool ``` -On failover, the Namespace is active in the secondary region immediately, but the Workers there start from nothing, a "cold" start. -Recovery time includes container or VM startup, image pulls, and application warm-up before throughput returns to normal. +Active / Cold Pattern: **Benefits** -**Benefits** +- **Easy to reason about.** + - Only one region is active at a time, so traffic routing and interactions with systems (such as databases and queues) are simpler to understand, and the pattern pairs naturally with other active / passive systems. Active / Active, by contrast, requires deciding how Workers reach an active database: either a local active database in each region, or a single active / passive database that some Workers must reach cross-region. +- **Simple to operate.** + - During normal operation it resembles a single-region deployment. +- **Lowest overall architecture cost.** + - The size of the Worker fleet is simply the capacity needed to operate in one region. There are no standby Workers during steady state. -- **Easy to reason about.** Only one region is active at a time, so traffic routing and interactions with external systems (such as databases and queues) are simpler to understand, and the pattern pairs naturally with other active / passive systems. Active / Active, by contrast, requires deciding how Workers reach an active database: either a local active database in each region, or a single active / passive database that some Workers must reach cross-region. -- Simplest pattern to operate; during normal operation it resembles a single-region deployment. -- Lowest cost during normal operation: a single Worker fleet. 
+Active / Cold Pattern: **Tradeoffs** -**Tradeoffs** - -- Highest recovery time of the patterns here, gated by Worker startup in the secondary region. +- Highest overall recovery time of the three patterns, due to cold starting the Worker fleet after failover. - Depends on tested automation to bring up the secondary-region fleet quickly. -**Recommendations and important constraints** +Active / Cold Pattern: **Recommendations and important constraints** + +- **Failing over the Workers is the operator's responsibility.** The Namespace fails over automatically, but bringing up the Workers in the secondary region is up to you. Plan for these sub-considerations: + - **How do you detect an outage and decide to fail over?** Define the failover conditions and the signals (alerts, health checks) that trigger them. + - **How do you scale up the Workers?** Bring up the secondary-region fleet, ideally with tested automation, and scale down the primary region's fleet so Workers run in only one region at a time. + - **Do you need to enforce single-region processing?** The Cold pattern relies on the operator to keep Workers in one region. To have Temporal enforce single-region processing instead, use the [Active / Passive (Hot)](#active-hot) pattern. -- **Use the Namespace Endpoint.** Connect Workers through the Namespace Endpoint rather than a Regional Endpoint. If an error affects only the Namespace and the primary region's Workers stay healthy, the Namespace Endpoint follows the failover and those Workers reach the new active region cross-region. With private connectivity, the Workers need a network route to the cross-region VPC Endpoint. The alternative is to fail over the Workers and all of their dependencies whenever the Namespace fails over, so that no request crosses regions. -- **Route Workers to the active region's Codec Server and proxy.** There are two common approaches: - 1. 
Put DNS or a load balancer in front of the Codec Server and proxy address, and update it on failover to point at the new region's instances. - 2. Pass each Worker the Codec Server and proxy address for its own region as configuration, so a Worker always uses the service local to it. This is common in Kubernetes or with service discovery. -- **Single-region processing is the operator's responsibility.** To run Workers in only one region at a time, scale them down in the primary region before scaling them up in the secondary region. To enforce single-region processing within Temporal, use the [Active / Passive (Hot)](#active-hot) pattern instead. +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':30,'curve':'basis'}}}%% +flowchart LR + FS1["Detect
outage"] + FS3["Failover
DBs / queues"] + FS4["Scale down
Primary region
Workers"] + FS5["Scale up
Secondary region
Workers"] + FS6["Confirm
Workflows run
normally"] + FS1 --> FS3 --> FS4 --> FS5 --> FS6 +``` -**Component behavior** +- **Use the Namespace Endpoint.** + - Connect Workers through the Namespace Endpoint rather than a Regional Endpoint. The Namespace Endpoint always connects to the Namespace in its active region, and automatically follows the Namespace after a failover. + - **Rationale:** If an incident requires the Namespace to fail over while the rest of the primary region is healthy, the the Workers in the primary region will still connect to the Namespace and process Workflows. (During a Namespace incident, the Regional Endpoint for the primary region may not connect to the Namespace.) + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef endpoint fill:transparent,stroke:#c2c8d2,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + NEW["Worker"]:::worker + NEEP["Namespace Endpoint"]:::endpoint + NEW --> NEEP + subgraph NEPRIM["Primary"] + NEPNS["Namespace"]:::ns + end + subgraph NESEC["Secondary"] + NESNS["Namespace"]:::ns + end + NEEP -->|"normal operation"| NEPNS + NEEP -.->|"after failover"| NESNS + class NEPRIM,NESEC region +``` + +- **Set up cross-region private connectivity.** + - If you use private connectivity, give the primary region's Workers a network route to the VPC Endpoint in the other region, so they can reach the active replica after a Namespace-only failover. If you can't provide that cross-region route, use the [Active / Passive (Hot)](#active-hot) pattern instead, where each region's Workers connect to their local replica. 
+ - For the full setup of Regional Endpoints, VPC Endpoints, and cross-region routing, see [Connectivity for High Availability](/cloud/high-availability/ha-connectivity). + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef endpoint fill:transparent,stroke:#c2c8d2,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + subgraph CRCPRIM["Primary Region"] + CRCPW["Worker"]:::worker + CRCPEP["VPC Endpoint"]:::endpoint + CRCPNS["Namespace"]:::ns + end + subgraph CRCSEC["Secondary Region"] + CRCSEP["VPC Endpoint"]:::endpoint + CRCSNS["Replica"]:::ns + end + CRCPW -->|"normal operation"| CRCPEP + CRCPEP --> CRCPNS + CRCSEP --> CRCSNS + CRCPW -.->|"after a Namespace failover"| CRCSEP + class CRCPRIM,CRCSEC region +``` + +- **Route Workers to the active region's Codec Server.** Two common approaches: + - Put DNS or a load balancer in front of the Codec Server address, and update it on failover to point at the new region's instance. + - Pass each Worker the Codec Server address for its own region as configuration, so a Worker always uses the service local to it. This is common in Kubernetes or with service discovery. 
+ +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + CSW["Worker"]:::worker + subgraph CSPRIM["Primary"] + CSPC["Codec Server"]:::ext + end + subgraph CSSEC["Secondary"] + CSSC["Codec Server"]:::ext + end + CSW -->|"normal operation"| CSPC + CSW -.->|"after failover"| CSSC + class CSPRIM,CSSEC region +``` + +- **Route Workers to the active region's proxy.** Two common approaches: + - Put DNS or a load balancer in front of the proxy address, and update it on failover to point at the new region's instance. + - Pass each Worker the proxy address for its own region as configuration, so a Worker always uses the service local to it. This is common in Kubernetes or with service discovery. + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef endpoint fill:transparent,stroke:#c2c8d2,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + PXW["Worker"]:::worker + subgraph PXPRIM["Primary"] + PXPP["Proxy"]:::endpoint + end + subgraph PXSEC["Secondary"] + PXSP["Proxy"]:::endpoint + end + PXW -->|"normal operation"| PXPP + PXW -.->|"after failover"| PXSP + class PXPRIM,PXSEC region +``` + +Active / Cold Pattern: **Component behavior** - **Workers** — run only in the primary region; brought up in the secondary region during a failover. 
- **Workflow starters and Clients** — run with the Workers; brought up in the secondary region during a failover. - **Codec Servers and proxies** — run alongside the active Workers; scaled up in the secondary region as part of a failover. -- **External databases and queues** — single-region-active; fail over to the secondary region alongside the Workers. +- **Databases and queues** — single-region-active; fail over to the secondary region alongside the Workers. ### Active / Passive (Hot) {/* #active-hot */} -_Also known as Active / Hot Standby or Active / Hot._ +_Also known as "Active / Hot Standby" or "Active / Hot"._ + +Active / Hot Pattern: **Normal operation** Workers are deployed in **both regions**, but only the active region processes Workflows. The secondary-region Workers stay connected and warm, yet on standby. @@ -400,10 +509,7 @@ This is achieved by disabling forwarding for Worker polls and connecting each fl With forwarding disabled, polls that reach the passive replica are not sent to the active region, so the standby fleet does no work and adds no cross-region overhead. ```mermaid ---- -title: Normal operation ---- -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; @@ -411,37 +517,31 @@ flowchart LR classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; subgraph DHNPRI["Primary"] - subgraph DHNWP["Worker Pool"] - DHNW1["Worker
Active"]:::worker - DHNW2["Worker
Active"]:::worker - DHNW3["Worker
Active"]:::worker - end + DHNWP["Workers
Active"]:::worker DHNNS["Namespace"]:::ns - DHNDB[("External DB / queue")]:::ext + DHNDB[("DB / queue")]:::ext + DHNCODEC["Codec Server"]:::ext DHNWP <-->|Workflows| DHNNS DHNWP <--> DHNDB + DHNWP <--> DHNCODEC end subgraph DHNSEC["Secondary"] DHNR["Replica"]:::ns - subgraph DHNWP2["Worker Pool"] - DHNSW1["Worker
Standby"]:::worker - DHNSW2["Worker
Standby"]:::worker - DHNSW3["Worker
Standby"]:::worker - end - DHNDB2[("External DB / queue
Standby")]:::ext + DHNWP2["Workers
Standby"]:::worker + DHNDB2[("DB / queue
Standby")]:::ext + DHNCODEC2["Codec Server"]:::ext DHNR <-->|"Connected"| DHNWP2 DHNWP2 <--> DHNDB2 + DHNWP2 <--> DHNCODEC2 end DHNNS -. replicates .-> DHNR class DHNPRI,DHNSEC region - class DHNWP,DHNWP2 pool ``` +Active / Hot Pattern: **On failover** + ```mermaid ---- -title: After failover ---- -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; @@ -451,64 +551,62 @@ flowchart LR classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; subgraph DHFPRI["Primary (outage)"] - subgraph DHFWP["Worker Pool"] - DHFW1["Unavailable"]:::down - end + DHFWP["Workers
Unavailable"]:::down DHFNS["Namespace"]:::down DHFWP ~~~ DHFNS end subgraph DHFSEC["Secondary"] DHFN["Namespace
(Active)"]:::ns - subgraph DHFSP["Worker Pool"] - DHFSW1["Worker
Active"]:::worker - DHFSW2["Worker
Active"]:::worker - DHFSW3["Worker
Active"]:::worker - end - DHFDB2[("External DB / queue
Promoted (if needed)")]:::ext + DHFSP["Workers
Active"]:::worker + DHFDB2[("DB / queue
Promoted")]:::ext + DHFCODEC["Codec Server"]:::ext DHFN <-->|Workflows| DHFSP DHFSP <--> DHFDB2 + DHFSP <--> DHFCODEC end DHFNS -->|"Failover"| DHFN class DHFPRI regiondown class DHFSEC region - class DHFWP,DHFSP pool ``` Failover is near-instant: the Namespace failover and the Worker "failover" happen together and automatically, with no DNS wait and no cold start. The previously standby fleet begins processing the moment the secondary region becomes active, so this pattern achieves the lowest recovery time. -**Benefits** +Active / Hot Pattern: **Benefits** -- **Easy to reason about.** Only one region is active at a time, so traffic routing and interactions with external systems (such as databases and queues) are simpler to understand, and the pattern pairs naturally with other active / passive systems. Active / Active, by contrast, requires deciding how Workers reach an active database: either a local active database in each region, or a single active / passive database that some Workers must reach cross-region. -- Lowest recovery time: the secondary-region Workers are already connected and warm. -- Low latency during normal operation: Tasks are processed only in the active region, with no cross-region forwarding. +- **Easy to reason about.** + - Only one region is active at a time, so traffic routing and interactions with systems (such as databases and queues) are simpler to understand, and the pattern pairs naturally with other active / passive systems. Active / Active, by contrast, requires deciding how Workers reach an active database: either a local active database in each region, or a single active / passive database that some Workers must reach cross-region. +- **Lowest overall recovery time of the three patterns.** + - The secondary-region Workers are already connected and warm, so failover involves no cold start. +- **Low latency during normal operation.** + - Tasks are processed only in the active region, with no cross-region forwarding. 
-**Tradeoffs** +Active / Hot Pattern: **Tradeoffs** -- Highest cost during normal operation: standby Worker capacity runs in the secondary region at all times. +- Highest overall architecture cost: a full standby Worker fleet runs in the secondary region at all times, even during steady state. -**Recommendations and important constraints** +Active / Hot Pattern: **Recommendations and important constraints** -- Connect each Worker fleet through its region's [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) (or VPC Endpoint) and [disable forwarding](/cloud/high-availability/enable#change-forwarding-behavior) for Worker polls. Using the Namespace Endpoint by mistake routes the standby Workers to the active region and defeats the pattern. +- **Use Regional or VPC Endpoints and disable forwarding.** + - Connect each Worker fleet through its region's [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) (or VPC Endpoint) and [disable forwarding](/cloud/high-availability/enable#change-forwarding-behavior) for Worker polls. Using the Namespace Endpoint by mistake routes the standby Workers to the active region and defeats the pattern. -**Component behavior** +Active / Hot Pattern: **Component behavior** - **Workers** — run in both regions; only the active region processes Workflows. - **Workflow starters and Clients** — run in both regions alongside the Workers. - **Codec Servers and proxies** — run in both regions continuously, not just after a failover. -- **External databases and queues** — typically single-region-active; fail over alongside the active Workers. +- **Databases and queues** — typically single-region-active; fail over alongside the active Workers. ### Active / Active {/* #active-active */} +Active / Active Pattern: **Normal operation** + Workers run in **both regions and process Workflows at the same time**, with forwarding left enabled (the default). 
A Temporal Cloud Namespace is not "active/active" in the database sense; it still has a single active replica in one region. Because the passive replica transparently forwards requests to and from the active region, a Worker fleet in either region can process Workflows. The secondary fleet's polls are forwarded across regions to the active replica. ```mermaid ---- -title: Normal operation ---- -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; @@ -516,36 +614,32 @@ flowchart LR classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; subgraph DANPRI["Primary"] - subgraph DANWP["Worker Pool"] - DANW1["Worker
Active"]:::worker - DANW2["Worker
Active"]:::worker - end + DANWP["Workers
Active"]:::worker DANNS["Namespace"]:::ns - DANDB[("External DB / queue")]:::ext + DANDB[("DB / queue")]:::ext + DANCODEC["Codec Server"]:::ext DANWP <-->|Workflows| DANNS DANWP <--> DANDB + DANWP <--> DANCODEC end subgraph DANSEC["Secondary"] DANR["Replica"]:::ns - subgraph DANWP2["Worker Pool"] - DANS1["Worker
Active"]:::worker - DANS2["Worker
Active"]:::worker - end - DANDB2[("External DB / queue")]:::ext + DANWP2["Workers
Active"]:::worker + DANDB2[("DB / queue")]:::ext + DANCODEC2["Codec Server"]:::ext DANWP2 <-->|Workflows| DANR DANWP2 <--> DANDB2 + DANWP2 <--> DANCODEC2 end DANNS -. replicates .-> DANR DANR ==>|"forwards polls"| DANNS class DANPRI,DANSEC region - class DANWP,DANWP2 pool ``` +Active / Active Pattern: **On failover** + ```mermaid ---- -title: After failover ---- -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef workerhollow fill:transparent,stroke:#7C3AED,stroke-width:1px,stroke-dasharray:4 3; @@ -556,65 +650,63 @@ flowchart LR classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; subgraph DAFPRI["Primary (outage)"] - subgraph DAFWP["Worker Pool"] - DAFW1["Unavailable"]:::down - end + DAFWP["Workers
Unavailable"]:::down DAFNS["Namespace"]:::down DAFWP ~~~ DAFNS end subgraph DAFSEC["Secondary"] DAFN["Namespace
(Active)"]:::ns - subgraph DAFSP["Worker Pool"] - DAFSW1["Worker
Active"]:::worker - DAFSW2["Worker
Active"]:::worker - DAFSW3["Worker
Scaled up
(as needed)"]:::workerhollow - end - DAFDB2[("External DB / queue")]:::ext + DAFSP["Workers
Active,
scaled up as needed"]:::worker + DAFDB2[("DB / queue")]:::ext + DAFCODEC["Codec Server"]:::ext DAFN <-->|Workflows| DAFSP DAFSP <--> DAFDB2 + DAFSP <--> DAFCODEC end DAFNS -->|"Failover"| DAFN class DAFPRI regiondown class DAFSEC region - class DAFWP,DAFSP pool ``` This is a practical way to reach a low recovery time at balanced cost. Roughly half the fleet runs in each region, and capacity is added to the surviving region during an outage to reach full throughput. Unlike Active / Passive (Cold), Workflows keep processing in the surviving region while capacity scales up, so there is no cold-start gap. -**Benefits** +Active / Active Pattern: **Benefits** -- Low recovery time: the surviving region keeps processing while capacity scales up. -- Balanced cost: roughly half the fleet runs in each region during normal operation. +- **Low overall recovery time.** + - The surviving region keeps processing while capacity scales up. +- **Moderate overall architecture cost.** + - Roughly half the fleet runs in each region during steady state, with no dedicated standby fleet. -**Tradeoffs** +Active / Active Pattern: **Tradeoffs** - The secondary region pays cross-region latency, because its polls are forwarded to the active replica. This can be a problem for latency-sensitive Workflows. - Synchronizing external systems is harder, because Workers are active in both regions at once. -**Recommendations and important constraints** +Active / Active Pattern: **Recommendations and important constraints** -- Keep forwarding enabled (the default) so the secondary-region Workers' polls reach the active replica. Do not set `disablePassivePollerForwarding`. +- **Keep forwarding enabled.** + - Leave forwarding on (the default) so the secondary-region Workers' polls reach the active replica. Do not set `disablePassivePollerForwarding`. 
-**Component behavior** +Active / Active Pattern: **Component behavior** - **Workers** — run and process in both regions; the secondary region's polls are forwarded to the active replica. - **Workflow starters and Clients** — run in both regions. - **Codec Servers and proxies** — run in both regions continuously. -- **External databases and queues** — accessed from both regions; cross-region consistency must be designed for. +- **Databases and queues** — accessed from both regions; cross-region consistency must be designed for. ### Dual Active (Multi-Active) {/* #dual-active */} +Dual Active Pattern: **Normal operation** + Beyond the three main patterns, some architectures need low-latency or region-bound data in *each* region at once. This can be achieved with **two Namespaces whose active and passive regions overlap**: each region holds one Namespace's active replica and the other Namespace's passive replica. ```mermaid ---- -title: Normal operation ---- -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; subgraph DUNR1["Region 1"] @@ -624,7 +716,9 @@ flowchart LR end DUNNAa["Namespace A"]:::ns DUNNBp["Namespace B
Replica"]:::ns + DUNCODEC1["Codec Server"]:::ext DUNWPA <-->|Workflows| DUNNAa + DUNWPA <--> DUNCODEC1 end subgraph DUNR2["Region 2"] subgraph DUNWPB["App B Worker Pool"] @@ -633,7 +727,9 @@ flowchart LR end DUNNBa["Namespace B"]:::ns DUNNAp["Namespace A
Replica"]:::ns + DUNCODEC2["Codec Server"]:::ext DUNWPB <-->|Workflows| DUNNBa + DUNWPB <--> DUNCODEC2 end DUNNAa -. replicates .-> DUNNAp DUNNBa -. replicates .-> DUNNBp @@ -641,15 +737,15 @@ flowchart LR class DUNWPA,DUNWPB pool ``` +Dual Active Pattern: **On failover (Region 1 outage)** + ```mermaid ---- -title: After failover (Region 1 outage) ---- -%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef workerhollow fill:transparent,stroke:#7C3AED,stroke-width:1px,stroke-dasharray:4 3; classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; @@ -672,8 +768,10 @@ flowchart LR end DUFNBa["Namespace B
(Active)"]:::ns DUFNAa2["Namespace A
(Active)"]:::ns + DUFCODEC["Codec Server"]:::ext DUFWPB <-->|Workflows| DUFNBa DUFWPA2 <-->|Workflows| DUFNAa2 + DUFWPB <--> DUFCODEC end DUFNAa -->|"Failover"| DUFNAa2 class DUFR1 regiondown @@ -685,22 +783,24 @@ Each Namespace serves low-latency requests or a regionally-bound database in its Workloads on Temporal rarely need this. It pays off only when a workload is *both* extremely latency-sensitive across several same-continent regions *and* needs multi-region disaster recovery, an uncommon combination. -**Benefits** +Dual Active Pattern: **Benefits** -- Low-latency, region-bound data in each region during normal operation. -- Each Namespace fails over independently, like Active / Passive. +- **Low-latency, region-bound data in each region.** + - Served from each region's active Namespace during normal operation. +- **Independent failover.** + - Each Namespace fails over independently, like Active / Passive. -**Tradeoffs** +Dual Active Pattern: **Tradeoffs** -- Highest cost and operational complexity: two Worker fleets and two Namespaces. +- Highest overall architecture cost and operational complexity: two Worker fleets and two Namespaces. - Rarely justified. Temporal recommends treating each Namespace as an **independent Active / Passive deployment**, with its own Worker pools and failover procedures, rather than coupling them. -**Component behavior** +Dual Active Pattern: **Component behavior** - **Workers** — one fleet per application, each active in its Namespace's region. - **Workflow starters and Clients** — run with each application's Workers. - **Codec Servers and proxies** — run in both regions, for both Namespaces. -- **External databases and queues** — region-bound per application; each fails over with its Namespace. +- **Databases and queues** — region-bound per application; each fails over with its Namespace. 
## Choose a deployment pattern {/* #choose */} @@ -717,7 +817,7 @@ The Worker deployment pattern sets the approach; the supporting pieces follow it - **Workflow starters and Clients.** Deploy these with the same regional pattern as the Workers, since a starter or Client often shares the same in-region dependencies (databases, queues, upstream services) and should fail over alongside them. Point Clients at the Namespace Endpoint so they follow the active region automatically with no configuration change on failover, and use a [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) only when a Client must be pinned to a region. - **Codec Servers and proxies.** Anything in the connection path between Workers and Temporal Cloud must be reachable from every region where Workers connect. In Active / Passive (Cold), scale them up in the secondary region as part of a failover; in the Active / Passive (Hot) and Active / Active patterns, run them in both regions at all times. -- **External databases and queues.** These remain the application's responsibility, and the right approach depends on the Worker deployment pattern: a single-region-active datastore pairs naturally with Active / Passive, while running Workers active in both regions raises consistency questions that must be designed for. Detailed guidance is out of scope for this page. +- **Databases and queues.** These remain the application's responsibility, and the right approach depends on the Worker deployment pattern: a single-region-active datastore pairs naturally with Active / Passive, while running Workers active in both regions raises consistency questions that must be designed for. Detailed guidance is out of scope for this page. 
## Related {/* #related */} From d6b5d36a5bd22e23bf165a27f593cc67cc839000 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Tue, 16 Jun 2026 12:30:40 -0700 Subject: [PATCH 9/9] updates --- .../high-availability/deployment-patterns.mdx | 228 ++++++++++++------ 1 file changed, 153 insertions(+), 75 deletions(-) diff --git a/docs/cloud/high-availability/deployment-patterns.mdx b/docs/cloud/high-availability/deployment-patterns.mdx index cba677ca3c..5c3af8e5d1 100644 --- a/docs/cloud/high-availability/deployment-patterns.mdx +++ b/docs/cloud/high-availability/deployment-patterns.mdx @@ -33,6 +33,29 @@ Beyond the Namespace itself, these components live in the application environmen - **Proxies between Workers and Temporal Cloud** — any forward proxy or mTLS terminator in the connection path between Workers / Starters / Clients → Namespace. - **Databases and queues** — the systems that Activities read and write. +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef app fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef endpoint fill:transparent,stroke:#c2c8d2,stroke-width:1px; + classDef env fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + subgraph WFENV["Application environment"] + WFW["Workers"]:::app + WFCL["Workflow starters / Clients"]:::app + WFCODEC["Codec Server"]:::app + WFDB[("DB / queue")]:::app + WFPROXY["Proxy"]:::endpoint + end + WFNS["Namespace
(Temporal Cloud)"]:::ns + WFW <--> WFCODEC + WFW <--> WFDB + WFW --> WFPROXY + WFCL --> WFPROXY + WFPROXY --> WFNS + class WFENV env +``` + Some systems must be active wherever Workers are running (for example, Codec Servers), while others might follow a different failover sequence (for example, databases). Because the right choice for each of these usually depends on where Workers run, **this page focuses on Worker deployment patterns**. @@ -395,8 +418,8 @@ flowchart LR ``` - **Use the Namespace Endpoint.** - - Connect Workers through the Namespace Endpoint rather than a Regional Endpoint. The Namespace Endpoint always connects to the Namespace in its active region, and automatically follows the Namespace after a failover. - - **Rationale:** If an incident requires the Namespace to fail over while the rest of the primary region is healthy, the the Workers in the primary region will still connect to the Namespace and process Workflows. (During a Namespace incident, the Regional Endpoint for the primary region may not connect to the Namespace.) + - Connect Workers through the [Namespace Endpoint](/cloud/namespaces#access-namespaces), which always connects to the Namespace in its active region and automatically fails over to the new region. + - **Rationale:** If a Temporal Cloud incident requires the Namespace to fail over while the rest of the primary region is healthy, the Workers in the primary region can still connect through the Namespace Endpoint and process Workflows. If the Workers use the Regional Endpoint for the primary region, they will not reliably connect to the Namespace during a Temporal Cloud incident in the primary region. 
```mermaid %%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% @@ -503,10 +526,10 @@ _Also known as "Active / Hot Standby" or "Active / Hot"._ Active / Hot Pattern: **Normal operation** -Workers are deployed in **both regions**, but only the active region processes Workflows. The secondary-region Workers stay connected and warm, yet on standby. - -This is achieved by disabling forwarding for Worker polls and connecting each fleet to its local replica through a [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) or [VPC Endpoint](/cloud/high-availability/ha-connectivity). -With forwarding disabled, polls that reach the passive replica are not sent to the active region, so the standby fleet does no work and adds no cross-region overhead. +- **Workers run in both regions.** A full Worker fleet runs in each region. The primary region's Workers are active and process all Workflows; the secondary region's Workers stay connected and warm, but on standby, doing no work. +- **Workflows process in only one region at a time.** The Namespace has a single active replica, so even though Workers run in both regions, Workflows execute only in the active (primary) region. +- **Forwarding is disabled for Worker polls.** Each fleet connects to its local replica through a [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) or [VPC Endpoint](/cloud/high-availability/ha-connectivity) with forwarding off, so polls that reach the passive replica are not sent to the active region. The standby fleet does no work and adds no cross-region overhead. 
+- **The Namespace replicates to the secondary region.** A Namespace with High Availability keeps an active replica in the primary region and a passive replica in the secondary region, continuously replicating Workflow state so the standby is ready to take over. ```mermaid %%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% @@ -540,6 +563,10 @@ flowchart LR Active / Hot Pattern: **On failover** +- **The Namespace and Workers fail over together, automatically.** When the primary region fails, Temporal Cloud promotes the secondary replica to active, and the secondary region's standby Workers — already connected and warm — begin processing immediately. +- **No cold start and no DNS wait.** Because a full Worker fleet was already running in the secondary region, there's nothing to start or scale up before processing resumes. This pattern achieves the lowest recovery time of the three. +- **Promote your databases and queues, if needed.** If your Workflows depend on external data, make the secondary region's copy active so the now-active Workers can read and write it. + ```mermaid %%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR @@ -569,8 +596,6 @@ flowchart LR class DHFSEC region ``` -Failover is near-instant: the Namespace failover and the Worker "failover" happen together and automatically, with no DNS wait and no cold start. The previously standby fleet begins processing the moment the secondary region becomes active, so this pattern achieves the lowest recovery time. 
- Active / Hot Pattern: **Benefits** - **Easy to reason about.** @@ -589,6 +614,31 @@ Active / Hot Pattern: **Recommendations and important constraints** - **Use Regional or VPC Endpoints and disable forwarding.** - Connect each Worker fleet through its region's [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) (or VPC Endpoint) and [disable forwarding](/cloud/high-availability/enable#change-forwarding-behavior) for Worker polls. Using the Namespace Endpoint by mistake routes the standby Workers to the active region and defeats the pattern. +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef endpoint fill:transparent,stroke:#c2c8d2,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + subgraph HEPRIM["Primary"] + HEPW["Workers
Active"]:::worker + HEPEP["Regional / VPC Endpoint"]:::endpoint + HEPNS["Namespace"]:::ns + HEPW --> HEPEP + HEPEP --> HEPNS + end + subgraph HESEC["Secondary"] + HESW["Workers
Standby"]:::worker + HESEP["Regional / VPC Endpoint"]:::endpoint + HESNS["Replica"]:::ns + HESW --> HESEP + HESEP --> HESNS + end + HEPNS -. replicates .-> HESNS + class HEPRIM,HESEC region +``` + Active / Hot Pattern: **Component behavior** - **Workers** — run in both regions; only the active region processes Workflows. @@ -600,10 +650,10 @@ Active / Hot Pattern: **Component behavior** Active / Active Pattern: **Normal operation** -Workers run in **both regions and process Workflows at the same time**, with forwarding left enabled (the default). - -A Temporal Cloud Namespace is not "active/active" in the database sense; it still has a single active replica in one region. -Because the passive replica transparently forwards requests to and from the active region, a Worker fleet in either region can process Workflows. The secondary fleet's polls are forwarded across regions to the active replica. +- **Workers run and process in both regions at once.** A full Worker fleet runs in each region, and both fleets process Workflows concurrently. +- **The Namespace still has a single active replica.** A Temporal Cloud Namespace is not "active/active" in the database sense — one region holds the active replica and the other holds a passive replica. Forwarding is left enabled (the default). +- **The passive region forwards polls to the active replica.** Because the passive replica transparently forwards requests to and from the active region, a Worker fleet in either region can process Workflows. The secondary fleet's polls cross regions to reach the active replica, which adds some latency. +- **Roughly half the fleet runs in each region.** Total capacity is split across the two regions during steady state, with no dedicated standby fleet. 
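
The Active / Hot and Active / Active patterns differ mainly in one Namespace setting. As a rough sketch (not an official tool), the shell helper below maps a pattern to the value of `disablePassivePollerForwarding` and prints the matching `temporal cloud namespace ha update` command instead of running it. The pattern labels `hot` and `active-active` and both function names are hypothetical, chosen only for this example; the command and its `--disable-passive-poller-forwarding` flag are the ones described in [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior).

```bash
#!/usr/bin/env bash
# Sketch only: maps a deployment pattern to the forwarding setting the text recommends.
# The labels "hot" and "active-active" are hypothetical names used by this example.

# Print "true" if Worker poll forwarding should be disabled for the pattern, else "false".
disable_forwarding_for() {
  case "$1" in
    hot)           echo "true" ;;   # Active / Hot: standby polls must not be forwarded
    active-active) echo "false" ;;  # Active / Active: forwarding stays on (the default)
    *)             echo "unknown pattern: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) the CLI command that would apply the setting to a Namespace.
print_forwarding_command() {
  local ns="$1" pattern="$2" value
  value="$(disable_forwarding_for "$pattern")" || return 1
  printf 'temporal cloud namespace ha update --namespace %s --disable-passive-poller-forwarding %s\n' "$ns" "$value"
}
```

For example, `print_forwarding_command "$NS" hot` prints the command for the standby-fleet pattern so it can be reviewed before anyone runs it. Printing rather than executing keeps the sketch safe to try without Cloud credentials.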
```mermaid %%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% @@ -638,6 +688,11 @@ flowchart LR Active / Active Pattern: **On failover** +- **The surviving region keeps processing.** When one region fails, the other region's Workers are already active and processing, so Workflows continue running with no cold-start gap. +- **The Namespace fails over to the surviving region.** Temporal Cloud promotes the surviving region's replica to active; its local Workers then process against it without forwarding polls across regions. +- **Scale up capacity in the surviving region.** Each region normally runs only about half the fleet, so add capacity in the surviving region to handle the full workload at full throughput. +- **Promote your databases and queues, if needed.** If your Workflows depend on external data, make the surviving region's copy active so the Workers there can read and write it. + ```mermaid %%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% flowchart LR @@ -668,9 +723,6 @@ flowchart LR class DAFSEC region ``` -This is a practical way to reach a low recovery time at balanced cost. Roughly half the fleet runs in each region, and capacity is added to the surviving region during an outage to reach full throughput. -Unlike Active / Passive (Cold), Workflows keep processing in the surviving region while capacity scales up, so there is no cold-start gap. 
- Active / Active Pattern: **Benefits** - **Low overall recovery time.** @@ -688,6 +740,26 @@ Active / Active Pattern: **Recommendations and important constraints** - **Keep forwarding enabled.** - Leave forwarding on (the default) so the secondary-region Workers' polls reach the active replica. Do not set `disablePassivePollerForwarding`. +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + subgraph AAPRIM["Primary"] + AAPW["Workers"]:::worker + AAPNS["Namespace
(Active)"]:::ns + AAPW <-->|Workflows| AAPNS + end + subgraph AASEC["Secondary"] + AASW["Workers"]:::worker + AASR["Replica"]:::ns + AASW <-->|Workflows| AASR + end + AASR -.->|"forwards polls"| AAPNS + class AAPRIM,AASEC region +``` + Active / Active Pattern: **Component behavior** - **Workers** — run and process in both regions; the secondary region's polls are forwarded to the active replica. @@ -703,80 +775,62 @@ Beyond the three main patterns, some architectures need low-latency or region-bo ```mermaid %%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% -flowchart LR - classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; - classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; - classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; +flowchart TD + classDef appA fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef appB fill:#0EA5E922,stroke:#0EA5E9,stroke-width:1px; classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; - classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; - subgraph DUNR1["Region 1"] - subgraph DUNWPA["App A Worker Pool"] - DUNA1["Worker
Active"]:::worker - DUNA2["Worker
Active"]:::worker - end - DUNNAa["Namespace A"]:::ns - DUNNBp["Namespace B
Replica"]:::ns - DUNCODEC1["Codec Server"]:::ext - DUNWPA <-->|Workflows| DUNNAa - DUNWPA <--> DUNCODEC1 - end subgraph DUNR2["Region 2"] - subgraph DUNWPB["App B Worker Pool"] - DUNB1["Worker
Active"]:::worker - DUNB2["Worker
Active"]:::worker - end - DUNNBa["Namespace B"]:::ns - DUNNAp["Namespace A
Replica"]:::ns - DUNCODEC2["Codec Server"]:::ext + DUNWPB["App B Workers"]:::appB + DUNNBa["Namespace B"]:::appB + DUNNAp["Namespace A
Replica"]:::appA DUNWPB <-->|Workflows| DUNNBa - DUNWPB <--> DUNCODEC2 + end + subgraph DUNR1["Region 1"] + DUNWPA["App A Workers"]:::appA + DUNNAa["Namespace A"]:::appA + DUNNBp["Namespace B
Replica"]:::appB + DUNWPA <-->|Workflows| DUNNAa end DUNNAa -. replicates .-> DUNNAp DUNNBa -. replicates .-> DUNNBp + linkStyle 0 stroke:#0EA5E9,stroke-width:1.5px + linkStyle 1 stroke:#7C3AED,stroke-width:1.5px + linkStyle 2 stroke:#7C3AED,stroke-width:1.5px + linkStyle 3 stroke:#0EA5E9,stroke-width:1.5px class DUNR1,DUNR2 region - class DUNWPA,DUNWPB pool ``` Dual Active Pattern: **On failover (Region 1 outage)** ```mermaid %%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% -flowchart LR - classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; - classDef workerhollow fill:transparent,stroke:#7C3AED,stroke-width:1px,stroke-dasharray:4 3; - classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; - classDef ext fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; +flowchart TD + classDef appA fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef appAhollow fill:transparent,stroke:#7C3AED,stroke-width:1px,stroke-dasharray:4 3; + classDef appB fill:#0EA5E922,stroke:#0EA5E9,stroke-width:1px; classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; - classDef pool fill:transparent,stroke:#c2c8d2,stroke-width:1px; - subgraph DUFR1["Region 1 (outage)"] - subgraph DUFWPA["App A Worker Pool"] - DUFA1["Unavailable"]:::down - end - DUFNAa["Namespace A"]:::down - DUFWPA ~~~ DUFNAa - end subgraph DUFR2["Region 2"] - subgraph DUFWPB["App B Worker Pool"] - DUFB1["Worker
Active"]:::worker - DUFB2["Worker
Active"]:::worker - end - subgraph DUFWPA2["App A Worker Pool"] - DUFA2a["Worker
Brought up"]:::workerhollow - DUFA2b["Worker
Brought up"]:::workerhollow - end - DUFNBa["Namespace B
(Active)"]:::ns - DUFNAa2["Namespace A
(Active)"]:::ns - DUFCODEC["Codec Server"]:::ext + DUFWPB["App B Workers"]:::appB + DUFNBa["Namespace B"]:::appB + DUFNAa2["Namespace A
(Active)"]:::appA + DUFWPA2["App A Workers
Hot or Cold Start"]:::appAhollow DUFWPB <-->|Workflows| DUFNBa DUFWPA2 <-->|Workflows| DUFNAa2 - DUFWPB <--> DUFCODEC + end + subgraph DUFR1["Region 1 (outage)"] + DUFWPA["App A Workers
Unavailable"]:::down + DUFNAa["Namespace A"]:::down + DUFNBp["Namespace B
Replica"]:::down + DUFWPA ~~~ DUFNAa end DUFNAa -->|"Failover"| DUFNAa2 + linkStyle 0 stroke:#0EA5E9,stroke-width:1.5px + linkStyle 1 stroke:#7C3AED,stroke-width:1.5px + linkStyle 3 stroke:#7C3AED,stroke-width:1.5px class DUFR1 regiondown class DUFR2 region - class DUFWPA,DUFWPB,DUFWPA2 pool ``` Each Namespace serves low-latency requests or a regionally-bound database in its own active region, and fails over to the other region during an outage. The same idea extends across more than two regions. Each Namespace fails over independently, following the Active / Passive sequence. @@ -802,15 +856,6 @@ Dual Active Pattern: **Component behavior** - **Codec Servers and proxies** — run in both regions, for both Namespaces. - **Databases and queues** — region-bound per application; each fails over with its Namespace. -## Choose a deployment pattern {/* #choose */} - -| Pattern | Recovery time | Normal-operation cost | Best when | -| --- | --- | --- | --- | -| **Active / Passive (Cold)** | Highest (cold start in secondary) | Lowest (one fleet) | Adopting High Availability with the simplest operations. | -| **Active / Passive (Hot)** | Lowest (warm, no DNS wait) | Higher (standby fleet) | The lowest recovery time is required and the data plane is pinned to one region at a time. | -| **Active / Active** | Low (surviving region keeps processing) | Higher (two live fleets) | Low recovery time at balanced cost, where the secondary region can tolerate cross-region latency. | -| **Dual Active** | Low (per Namespace) | Highest (two fleets, two Namespaces) | Low-latency, region-bound data is genuinely required in each region. Rare. | - ## The rest of the architecture {/* #rest-of-architecture */} The Worker deployment pattern sets the approach; the supporting pieces follow it. 
@@ -819,6 +864,39 @@ The Worker deployment pattern sets the approach; the supporting pieces follow it - **Codec Servers and proxies.** Anything in the connection path between Workers and Temporal Cloud must be reachable from every region where Workers connect. In Active / Passive (Cold), scale them up in the secondary region as part of a failover; in the Active / Passive (Hot) and Active / Active patterns, run them in both regions at all times. - **Databases and queues.** These remain the application's responsibility, and the right approach depends on the Worker deployment pattern: a single-region-active datastore pairs naturally with Active / Passive, while running Workers active in both regions raises consistency questions that must be designed for. Detailed guidance is out of scope for this page. +## Serverless Workers failover {/* #serverless-workers-failover */} + +In every pattern above, the Worker fleet is something you run, so failing it over (through a cold start, a standby fleet, or a second active region) is the application's responsibility. [Serverless Workers](/develop/typescript/workers/serverless-workers) move that responsibility to Temporal Cloud. + +Instead of long-lived Workers that poll a Task Queue, Serverless Workers invert the model: Temporal Cloud pushes Task invocations to a customer-owned compute function (AWS Lambda today). Because Temporal Cloud is the component that starts the Workers, it can also start them in the secondary region after a failover, with no action from you. + +- **One Worker Deployment spans both regions.** You register a compute function per region under a single Build ID, so the deployment is ready to run in either region. +- **Failover is automatic.** When the Namespace fails over, Temporal Cloud invokes the function in the new active region, so there is no fleet for you to bring up after detecting an outage. 
+- **The whole system fails over hands-off.** Both the Namespace and the Workers move automatically, lowering overall recovery time by removing the manual Worker-failover step that the patterns above require. + +```mermaid +%%{init: {'themeVariables':{'fontFamily':'Inter, ui-sans-serif, system-ui, sans-serif','edgeLabelBackground':'transparent'},'flowchart':{'nodeSpacing':18,'rankSpacing':45,'curve':'basis','subGraphTitleMargin':{'top':6,'bottom':12}}}}%% +flowchart LR + classDef ns fill:#59FDA024,stroke:#59FDA0,stroke-width:1px; + classDef worker fill:#7C3AED22,stroke:#7C3AED,stroke-width:1px; + classDef down fill:#ED360E14,stroke:#ED360E,stroke-width:1px,stroke-dasharray:3 3,color:#ED360E; + classDef region fill:transparent,stroke:#9aa4b2,stroke-width:1.5px; + classDef regiondown fill:#ED360E0D,stroke:#ED360E,stroke-width:1.5px; + subgraph SWPRI["Primary (outage)"] + SWPNS["Namespace"]:::down + end + subgraph SWSEC["Secondary"] + SWSNS["Namespace
(Active)"]:::ns + SWSL["Serverless Workers
(Lambda)"]:::worker + SWSNS -->|"Temporal Cloud
starts Workers"| SWSL + end + SWPNS -->|"Failover"| SWSNS + class SWPRI regiondown + class SWSEC region +``` + +On failover, Temporal Cloud promotes the secondary replica to active and invokes the Worker function there — no fleet to bring up and nothing for you to do. The Worker failover is hands-off. + ## Related {/* #related */} To add a replica and turn on High Availability features, see [Enable and manage High Availability](/cloud/high-availability/enable).