
Conversation

@jaypipes

This commit lays the API groundwork for the serverless workers feature. In order for something other than the client/SDK to "manage" a Worker using a serverless compute platform like AWS Lambda, we need somewhere to store configuration settings such as the Lambda Function ARN and the ARN of the IAM Role that needs to be assumed when invoking the Lambda Function.

The PR adds a new field compute_provider to WorkerDeploymentInfo that contains a temporal.api.deployment.v1.ComputeProvider message. This message is a simple wrapper around a blob of bytes that contains configuration settings for a worker lifecycle controller.

The configuration settings for the worker lifecycle controller are specific to different compute providers. The first compute provider we are modeling is AWS Lambda, and the ComputeProviderAWSLambda message houses the ARN of the AWS Lambda Function to be invoked and an optional ARN of the IAM Role that should be assumed by a worker control plane controller when invoking the function.
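For orientation, here is a rough proto3 sketch of the shape described above. It is an illustration only: provider_details and invoke appear in the diff further down, but the assume_role_arn field name is an assumption, not taken from the PR.

message ComputeProvider {
  // Opaque, provider-specific configuration consumed by a worker lifecycle controller.
  temporal.api.common.v1.Payload provider_details = 1;
}

message ComputeProviderAWSLambda {
  // Qualified or unqualified ARN of the AWS Lambda Function to invoke.
  string invoke = 1;
  // Optional ARN of the IAM Role to assume when invoking the function
  // (assumed field name, for illustration only).
  string assume_role_arn = 2;
}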

@jaypipes jaypipes requested review from a team as code owners January 28, 2026 15:30
@CLAassistant

CLAassistant commented Jan 28, 2026

CLA assistant check
All committers have signed the CLA.

@jaypipes jaypipes force-pushed the jaypipes/compute-provider branch 2 times, most recently from 988feaa to 2efdaa8 Compare January 28, 2026 15:49
Adds a new field `compute_provider` to `WorkerDeploymentInfo` that
contains a `temporal.api.deployment.v1.ComputeProvider` message. This
message is a simple wrapper around a blob of bytes that contains
configuration settings for a worker lifecycle controller.

The configuration settings for the worker lifecycle controller are
specific to different compute providers. The first compute provider we
are modeling is AWS Lambda, and the `ComputeProviderAWSLambda` message
houses the ARN of the AWS Lambda Function to be invoked and an optional
ARN of the IAM Role that should be assumed by a worker control plane
controller when invoking the function.

Signed-off-by: Jay Pipes <jay.pipes@temporal.io>
@jaypipes jaypipes force-pushed the jaypipes/compute-provider branch from 2efdaa8 to 5a5db55 Compare January 28, 2026 17:10
// on a TaskQueue that has no active pollers in the last five minutes, a
// serverless worker lifecycle controller might need to invoke an AWS Lambda
// Function that itself ends up calling the SDK's worker.New() function.
message ComputeProvider {


can this have a field like provider_type or similar to identify who should interpret the details?

Member

Mentioned in adjacent comment at https://github.com/temporalio/api/pull/704/changes#r2737966940, but I think the config about which control plane this goes to is missing atm

Contributor

google.protobuf.Any is made for this sort of situation, but it doesn't handle the encryption and other payload metadata. Maybe we should say the Payload contains an Any, which contains the underlying config plus a type url?

Member
@cretz cretz Jan 29, 2026

Payloads for proto messages actually do put their type in the metadata just like Any, so they work for this use case, but I am going to be making a comment down there right now concerning issues with using proto here.
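For context on this point: a Payload produced from a proto message typically carries the fully qualified message name in its metadata, similar to Any's type_url. An illustrative example (not taken from the PR) of what such a Payload could look like for the AWS Lambda details:

metadata["encoding"]    = "binary/protobuf"
metadata["messageType"] = "temporal.api.deployment.v1.ComputeProviderAWSLambda"
data                    = <serialized ComputeProviderAWSLambda bytes>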


is this change intentional?

Author

no... it happened when I ran make buf-lint locally :(

Member
@cretz cretz left a comment

Looking good (and simple which is awesome). This cannot really be merged to main anytime soon until the implementation using it has basically been completely built due to concerns of this being mutated as we work on the impl.

// Function via its ARN. This will start the Worker that is housed within that
// AWS Lambda Function and the Worker will begin listening to Tasks on the
// WorkerDeployment's configured TaskQueue.
message ComputeProviderAWSLambda {
Member

Suggested change
message ComputeProviderAWSLambda {
message ComputeProviderDetailsAWSLambda {

Because this is not a "compute provider" as we've now defined it: a "compute provider" as we've now defined it is "compute provider details" + "scaling config" + "endpoint config".

// on a TaskQueue that has no active pollers in the last five minutes, a
// serverless worker lifecycle controller might need to invoke an AWS Lambda
// Function that itself ends up calling the SDK's worker.New() function.
message ComputeProvider {
Member

I think even as placeholders, we should go ahead and add fields that represent scaling algo config and control plane Nexus endpoint config (even if for many they will use a "system" one that we run)
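A sketch of what those placeholders could look like; the field and message names here (scaling, WorkerScalingConfig, control_plane_endpoint) are illustrative assumptions, not part of the PR:

message ComputeProvider {
  // Opaque, provider-specific details (e.g. the AWS Lambda configuration).
  temporal.api.common.v1.Payload provider_details = 1;
  // Placeholder: configuration for the scaling algorithm the worker
  // lifecycle controller should apply (hypothetical message type).
  WorkerScalingConfig scaling = 2;
  // Placeholder: Nexus endpoint of the control plane that interprets
  // provider_details; a "system" endpoint could apply when unset.
  string control_plane_endpoint = 3;
}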


// Contains the new worker compute provider configuration for the
// WorkerDeployment.
temporal.api.deployment.v1.ComputeProvider compute_provider = 20;
Member
@cretz cretz Jan 28, 2026

I assume the absence/null of this represents removing compute provider?

Comment on lines +922 to +923
// Creates or replaces the compute provider configuration for a Worker
// Deployment.
Member

Needs to be pretty clear/obvious here or somewhere that this completely changes how task queues operate and completely changes the worker paradigm.

I fear the triviality of the API/CLI names and such doesn't convey the significance of this "switch flip" that completely changes how these task queues will work.

Contributor

Does it? Can these task queues not be polled like normal? (as a "base load" below burstable compute) I know we're not going to focus on that case at first but will we disable it?

Member
@cretz cretz Jan 29, 2026

I think for the proper server-side worker lifecycle management and clearer understanding by users, a task queue needs to use server-owned workers or user-owned workers. Having a hybrid can cause unpredictability and shift expected numbers, known workers, etc. If a base load is needed, why can't the server control the lifetimes of those workers too (or at least know about them)? If that means that you just tell the server that these may exist, that can make sense maybe. But to have the server not even know about workers on a task queue where it expects to control workers is going to cause confusion and unpredictability IMO.

I'd go as far as saying from a server POV a worker should not be allowed to poll on a task queue unless it was a worker the server configured. This way you don't have silent processing workers out there the server can't account for in its scaling/versioning logic.


message ComputeProviderAWSLambda {
// The Qualified or Unqualified ARN of the AWS Lambda Function to be
// invoked.
string invoke = 1;
Contributor

maybe:

Suggested change
string invoke = 1;
string invoke_arn = 1;

without arn in the name I would have assumed this is an http url
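For reference, standard AWS Lambda ARN formats (not specific to this PR): an unqualified ARN names only the function, while a qualified ARN appends a version or alias suffix.

arn:aws:lambda:us-east-1:123456789012:function:my-worker         (unqualified)
arn:aws:lambda:us-east-1:123456789012:function:my-worker:42      (qualified, version)
arn:aws:lambda:us-east-1:123456789012:function:my-worker:PROD    (qualified, alias)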


// An individual ComputeProviderAWSLambda,
// ComputeProviderGoogleCloudRun, etc will be encrypted and
// serialized into compute_provider_config.
temporal.api.common.v1.Payload provider_details = 1;
Member
@cretz cretz Jan 29, 2026

I wanted to add a note here based on internal discussions. This can only really be a payload for custom providers, because people have "encrypt every payload" proxies and cloud side needing to read these prevents people from encrypting this.

But there is a problem using proto at all here IMO, see comment below at https://github.com/temporalio/api/pull/704/changes#r2741776135.

So we should have provider type w/ a oneof I think:

Suggested change
temporal.api.common.v1.Payload provider_details = 1;
// Passed to provider Nexus endpoint to identify the contents of the provider details
string provider_type = 1;
oneof provider_details {
  // Non-encoded provider details, used in situations where decoding cannot occur
  bytes provider_details_open = 2;
  // Encoded provider details
  temporal.api.common.v1.Payload provider_details_payload = 3;
}

Note how provider_details_open is bytes and not google.protobuf.Any here
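An illustrative value under that suggestion (field contents hypothetical): a built-in provider uses the open form so a cloud-side controller can read the details without a payload codec or proto registry, while custom providers can still use the Payload form.

provider_type: "aws-lambda"
provider_details_open: '{"invoke_arn": "arn:aws:lambda:us-east-1:123456789012:function:my-worker", "role_arn": "arn:aws:iam::123456789012:role/temporal-invoker"}'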

// Function via its ARN. This will start the Worker that is housed within that
// AWS Lambda Function and the Worker will begin listening to Tasks on the
// WorkerDeployment's configured TaskQueue.
message ComputeProviderAWSLambda {
Member

IMO these should not be protos, these should be JSON schema defined as part of a Nexus IDL (https://github.com/nexus-rpc/nexus-rpc-gen). People should not have to use proto to implement a compute provider accepting this (including us).
