feature: add compute provider to worker deployment #704
base: master
Conversation
Force-pushed 988feaa to 2efdaa8
Adds a new field `compute_provider` to `WorkerDeploymentInfo` that contains a `temporal.api.deployment.v1.ComputeProvider` message. This message is a simple wrapper around a blob of bytes that contains configuration settings for a worker lifecycle controller.

The configuration settings for the worker lifecycle controller are specific to different compute providers. The first compute provider we are modeling is AWS Lambda, and the `ComputeProviderAWSLambda` message houses the ARN of the AWS Lambda Function to be invoked and an optional ARN of the IAM Role that should be assumed by a worker control plane controller when invoking the function.

Signed-off-by: Jay Pipes <jay.pipes@temporal.io>
Force-pushed 2efdaa8 to 5a5db55
// on a TaskQueue that has no active pollers in the last five minutes, a
// serverless worker lifecycle controller might need to invoke an AWS Lambda
// Function that itself ends up calling the SDK's worker.New() function.
message ComputeProvider {
can this have a field like provider_type or similar to identify who should interpret the details?
Mentioned in adjacent comment at https://github.com/temporalio/api/pull/704/changes#r2737966940, but I think the config about which control plane this goes to is missing atm
google.protobuf.Any is made for this sort of situation, but it doesn't handle the encryption and other payload metadata. Maybe we should say the Payload contains an Any, which contains the underlying config plus a type url?
Payloads for proto messages actually do put their type in the metadata just like Any, so they work for this use case, but I am going to be making a comment down there right now concerning issues with using proto here.
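For illustration only, here is a minimal sketch of the plain-Any shape this thread is weighing against a Payload (the "Payload contains an Any" variant cannot be expressed directly in proto, since the Any would live inside the Payload's data bytes). Every name and field number below is a placeholder, not something in this PR:

syntax = "proto3";

package temporal.api.deployment.v1;

import "google/protobuf/any.proto";

// Sketch of the Any-based alternative discussed above; placeholder only.
message ComputeProvider {
  // Identifies which worker lifecycle controller should interpret the details.
  string provider_type = 1;
  // Provider-specific config packed as an Any, so the type URL travels with
  // the serialized bytes (e.g. a ComputeProviderAWSLambda message).
  google.protobuf.Any provider_details = 2;
}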
is this change intentional?
no... it happened when I ran make buf-lint locally :(
Looking good (and simple, which is awesome). This cannot really be merged to main anytime soon, though, until the implementation using it has basically been completely built, due to concerns about this being mutated as we work on the impl.
// Function via its ARN. This will start the Worker that is housed within that
// AWS Lambda Function and the Worker will begin listening to Tasks on the
// WorkerDeployment's configured TaskQueue.
message ComputeProviderAWSLambda {
Suggested change:
- message ComputeProviderAWSLambda {
+ message ComputeProviderDetailsAWSLambda {
Because this is not a "compute provider" as we've now defined it; "compute provider" now means "compute provider details" + "scaling config" + "endpoint config".
// on a TaskQueue that has no active pollers in the last five minutes, a
// serverless worker lifecycle controller might need to invoke an AWS Lambda
// Function that itself ends up calling the SDK's worker.New() function.
message ComputeProvider {
I think even as placeholders, we should go ahead and add fields that represent scaling algo config and control plane Nexus endpoint config (even if for many they will use a "system" one that we run)
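To make that concrete, here is a non-authoritative sketch of what such placeholders could look like; every name and field number below is a guess, not part of this PR:

syntax = "proto3";

package temporal.api.deployment.v1;

import "temporal/api/common/v1/message.proto";

// Placeholder sketch only; names and numbers are illustrative guesses.
message ComputeProvider {
  // Provider-specific details, as proposed in this PR.
  temporal.api.common.v1.Payload provider_details = 1;
  // Placeholder for scaling algorithm configuration.
  ComputeProviderScalingConfig scaling_config = 2;
  // Placeholder for the control plane Nexus endpoint that worker lifecycle
  // operations are routed to (many deployments might use a "system" endpoint).
  string control_plane_nexus_endpoint = 3;
}

// Hypothetical placeholder; the real scaling config is still to be designed.
message ComputeProviderScalingConfig {
  int32 min_workers = 1;
  int32 max_workers = 2;
}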
// Contains the new worker compute provider configuration for the
// WorkerDeployment.
temporal.api.deployment.v1.ComputeProvider compute_provider = 20;
I assume the absence/null of this represents removing compute provider?
// Creates or replaces the compute provider configuration for a Worker
// Deployment.
Needs to be pretty clear/obvious here or somewhere that this completely changes how task queues operate and completely changes the worker paradigm.
I fear the triviality of the API/CLI names and such doesn't convey the significance of this "switch flip" that completely changes how these task queues will work.
Does it? Can these task queues not be polled like normal? (as a "base load" below burstable compute) I know we're not going to focus on that case at first but will we disable it?
I think for the proper server-side worker lifecycle management and clearer understanding by users, a task queue needs to use server-owned workers or user-owned workers. Having a hybrid can cause unpredictability and shift expected numbers, known workers, etc. If a base load is needed, why can't the server control the lifetimes of those workers too (or at least know about them)? If that means that you just tell the server that these may exist, that can make sense maybe. But to have the server not even know about workers on a task queue where it expects to control workers is going to cause confusion and unpredictability IMO.
I'd go as far as saying from a server POV a worker should not be allowed to poll on a task queue unless it was a worker the server configured. This way you don't have silent processing workers out there the server can't account for in its scaling/versioning logic.
message ComputeProviderAWSLambda {
  // The Qualified or Unqualified ARN of the AWS Lambda Function to be
  // invoked.
  string invoke = 1;
maybe:
- string invoke = 1;
+ string invoke_arn = 1;
Without "arn" in the name I would have assumed this is an HTTP URL.
// An individual ComputeProviderAWSLambda,
// ComputeProviderGoogleCloudRun, etc will be encrypted and
// serialized into compute_provider_config.
temporal.api.common.v1.Payload provider_details = 1;
I wanted to add a note here based on internal discussions. This can only really be a payload for custom providers, because people have "encrypt every payload" proxies, and the cloud side needing to read these prevents people from encrypting this.
But there is a problem using proto at all here IMO, see comment below at https://github.com/temporalio/api/pull/704/changes#r2741776135.
So we should have provider type w/ a oneof I think:
- temporal.api.common.v1.Payload provider_details = 1;
+ // Passed to provider Nexus endpoint to identify the contents of the provider details
+ string provider_type = 1;
+ oneof provider_details {
+   // Non-encoded provider details, used in situations where decoding cannot occur
+   bytes provider_details_open = 2;
+   // Encoded provider details
+   temporal.api.common.v1.Payload provider_details_payload = 3;
+ }
Note how provider_details_open is bytes and not google.protobuf.Any here
// Function via its ARN. This will start the Worker that is housed within that
// AWS Lambda Function and the Worker will begin listening to Tasks on the
// WorkerDeployment's configured TaskQueue.
message ComputeProviderAWSLambda {
IMO these should not be protos; these should be JSON schema defined as part of a Nexus IDL (https://github.com/nexus-rpc/nexus-rpc-gen). People should not have to use proto to implement a compute provider accepting this (including us).
This commit lays the API groundwork for the serverless workers feature. In order for something other than the client/SDK to "manage" a Worker using a serverless compute platform like AWS Lambda, we need some place to store configuration settings like the Lambda Function ARN and IAM Role ARN that needs to be assumed when invoking the Lambda Function.
The PR adds a new field `compute_provider` to `WorkerDeploymentInfo` that contains a `temporal.api.deployment.v1.ComputeProvider` message. This message is a simple wrapper around a blob of bytes that contains configuration settings for a worker lifecycle controller. The configuration settings for the worker lifecycle controller are specific to different compute providers.

The first compute provider we are modeling is AWS Lambda, and the `ComputeProviderAWSLambda` message houses the ARN of the AWS Lambda Function to be invoked and an optional ARN of the IAM Role that should be assumed by a worker control plane controller when invoking the function.
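For readers skimming the thread, a rough reconstruction of the shapes being described, pieced together from the excerpts quoted above; anything not visible in those excerpts (notably the IAM Role field's name and number) is a guess and may not match the actual PR:

syntax = "proto3";

package temporal.api.deployment.v1;

import "temporal/api/common/v1/message.proto";

// Wrapper around the provider-specific config blob consumed by a worker
// lifecycle controller.
message ComputeProvider {
  // An individual ComputeProviderAWSLambda, ComputeProviderGoogleCloudRun,
  // etc. serialized (and possibly encrypted) into this payload.
  temporal.api.common.v1.Payload provider_details = 1;
}

// AWS Lambda-specific settings for a serverless worker lifecycle controller.
message ComputeProviderAWSLambda {
  // The Qualified or Unqualified ARN of the AWS Lambda Function to be invoked.
  string invoke = 1;
  // Guessed name: the optional ARN of the IAM Role that the worker control
  // plane controller should assume when invoking the function.
  string assume_role_arn = 2;
}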