add rfc for compute driver lifecycle extensions #1346
Signed-off-by: Patrick Riel <priel@nvidia.com>
drew left a comment
This proposal looks good to me. I think the phases look correct, and it cleanly opens up compute drivers for extension.
| ## Motivation | ||
| The gateway's compute subsystem owns sandbox lifecycle and delegates platform-specific work to a compute driver through the `ComputeDriver` trait (`CreateSandbox`, `DeleteSandbox`, `WatchSandboxes`, `Reconcile`). Some drivers implement that trait directly in-process; others are subprocesses that the gateway reaches through a gRPC client that implements the same trait. Either way the boundary is intentionally narrow: it only covers what the driver itself must do. |
nit: `Reconcile` isn't part of the compute driver; it is implemented by the gateway.
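To illustrate the boundary under discussion, here is a minimal Rust sketch. It assumes the trait carries only `CreateSandbox`, `DeleteSandbox`, and `WatchSandboxes`, with reconcile living on the gateway side as the nit suggests. The method signatures, `DriverError`, `InMemoryDriver`, and the free function `reconcile` are all hypothetical and are not taken from the RFC or the codebase.

```rust
use std::collections::BTreeSet;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum DriverError {
    NotFound(String),
}

/// Narrow driver boundary: only platform-specific work.
/// Names mirror the RFC's CreateSandbox/DeleteSandbox/WatchSandboxes;
/// the signatures are assumptions.
pub trait ComputeDriver {
    fn create_sandbox(&mut self, id: &str) -> Result<(), DriverError>;
    fn delete_sandbox(&mut self, id: &str) -> Result<(), DriverError>;
    /// Simplified to a snapshot; a real watch would stream events.
    fn watch_sandboxes(&self) -> Vec<String>;
}

/// In-memory stand-in for an in-process driver.
#[derive(Default)]
pub struct InMemoryDriver {
    sandboxes: BTreeSet<String>,
}

impl ComputeDriver for InMemoryDriver {
    fn create_sandbox(&mut self, id: &str) -> Result<(), DriverError> {
        self.sandboxes.insert(id.to_string());
        Ok(())
    }

    fn delete_sandbox(&mut self, id: &str) -> Result<(), DriverError> {
        if self.sandboxes.remove(id) {
            Ok(())
        } else {
            Err(DriverError::NotFound(id.to_string()))
        }
    }

    fn watch_sandboxes(&self) -> Vec<String> {
        self.sandboxes.iter().cloned().collect()
    }
}

/// Gateway-side reconcile, kept outside the trait: it compares the desired
/// set with what the driver reports and calls back through the narrow API.
pub fn reconcile(driver: &mut dyn ComputeDriver, desired: &[&str]) -> Result<(), DriverError> {
    let observed = driver.watch_sandboxes();
    for id in desired {
        if !observed.iter().any(|o| o.as_str() == *id) {
            driver.create_sandbox(id)?;
        }
    }
    for id in &observed {
        if !desired.contains(&id.as_str()) {
            driver.delete_sandbox(id)?;
        }
    }
    Ok(())
}
```

Because `reconcile` takes `&mut dyn ComputeDriver`, the same gateway code would work for an in-process driver or a gRPC client that implements the trait.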
| ## Open questions | ||
| - Should `reconcile` be authoritative (mutate external resources to match state) or advisory (log and alert), or extension-by-extension? Lean: extension decides via `ReconcileOutcome`, default advisory. | ||
| - Should out-of-tree extensions be a first-class supported model (a downstream builds its own gateway binary with extra extensions linked) or strictly internal? Lean first-class, with best-effort API stability. |
I'd start internal, and then open it up for extension later.
| - Should `reconcile` be authoritative (mutate external resources to match state) or advisory (log and alert), or extension-by-extension? Lean: extension decides via `ReconcileOutcome`, default advisory. | ||
| - Should out-of-tree extensions be a first-class supported model (a downstream builds its own gateway binary with extra extensions linked) or strictly internal? Lean first-class, with best-effort API stability. | ||
| - Should we add a `dry_run` phase for previewing planned mutations without committing? Useful for the policy advisor; out of scope for v1. | ||
| - Should extensions be able to attach their own tracing spans and OCSF events beyond what the framework supplies? Lean yes; `LifecycleContext` should carry a span that extensions extend. |
+1 to enabling this. I can see this being very useful for adding additional observability signals.
| 2. Wire the extension chain into the compute subsystem's create, delete, and reconcile paths. Empty and single-element chains are the default and exhibit no behavior change. | ||
| 3. Extend the gateway sandbox-state store with a `(sandbox_id, extension_name) → ExtensionState` namespace and a startup reconcile pass over it. | ||
| 4. Add the `compute.extensions` field to the gateway configuration file and the matching environment variable. Validate names against the compiled-in registry at startup. | ||
| 5. Ship a reference no-op extension in `crates/openshell-extension-example` as a template for downstream authors. |
Do we want this in the main tree, or perhaps in examples/openshell-extension-example?
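Wherever the example crate ends up living, steps 2, 4, and 5 above could look roughly like the following Rust sketch: a no-op extension, a compiled-in registry that validates configured names at startup, and a chain that runs on the create path. `LifecycleExtension`, `NoopExtension`, `registry_lookup`, `build_chain`, and `run_create` are hypothetical names chosen for illustration; the RFC does not define these signatures.

```rust
/// Hypothetical extension trait; default methods make every phase a no-op.
pub trait LifecycleExtension {
    fn name(&self) -> &'static str;

    fn on_create(&self, _sandbox_id: &str) -> Result<(), String> {
        Ok(())
    }

    fn on_delete(&self, _sandbox_id: &str) -> Result<(), String> {
        Ok(())
    }
}

/// Reference no-op extension, usable as a template.
pub struct NoopExtension;

impl LifecycleExtension for NoopExtension {
    fn name(&self) -> &'static str {
        "noop"
    }
}

/// Compiled-in registry: maps a configured name to an extension instance.
fn registry_lookup(name: &str) -> Option<Box<dyn LifecycleExtension>> {
    match name {
        "noop" => Some(Box::new(NoopExtension)),
        _ => None,
    }
}

/// Startup validation: fail fast on the first name missing from the registry.
pub fn build_chain(configured: &[&str]) -> Result<Vec<Box<dyn LifecycleExtension>>, String> {
    configured
        .iter()
        .map(|name| {
            registry_lookup(name).ok_or_else(|| format!("unknown extension: {}", name))
        })
        .collect()
}

/// Create path: run each extension in configured order, stopping on error.
/// An empty chain does nothing, so existing behavior is unchanged.
pub fn run_create(chain: &[Box<dyn LifecycleExtension>], sandbox_id: &str) -> Result<(), String> {
    for ext in chain {
        ext.on_create(sandbox_id)?;
    }
    Ok(())
}
```

Collecting into `Result<Vec<_>, String>` means a single unknown name rejects the whole configuration, which matches validating at startup rather than at first use.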
| [compute] | ||
| extensions = ["fleet-labels", "workload-identity"] | ||
| ``` | ||
How would individual extensions be configured? Maybe something like:
[compute.extension."fleet-labels"]
labels = { fleet = "prod", region = "us-west" }
[compute.extension."workload-identity"]
provider = "aws"
role_arn = "arn:aws:iam::123456789012:role/openshell-sandbox"
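If per-extension tables like these were adopted, the gateway would probably want to check that every `[compute.extension."<name>"]` section refers to an extension that is actually enabled in `compute.extensions`. A rough Rust sketch of that check follows. `ComputeConfig`, `orphan_sections`, and `settings_for` are hypothetical, and settings are flattened to string key/value pairs for brevity, so nested values such as `labels` are not modeled. TOML parsing itself is omitted.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory form of the `[compute]` section after parsing.
pub struct ComputeConfig {
    /// Ordered list from `compute.extensions`.
    pub extensions: Vec<String>,
    /// Extension name -> flat key/value settings from `compute.extension.<name>`.
    pub extension: BTreeMap<String, BTreeMap<String, String>>,
}

impl ComputeConfig {
    /// Names of per-extension sections whose extension is not enabled.
    /// These are likely typos and worth rejecting (or warning on) at startup.
    pub fn orphan_sections(&self) -> Vec<&str> {
        self.extension
            .keys()
            .filter(|k| !self.extensions.contains(*k))
            .map(|k| k.as_str())
            .collect()
    }

    /// Settings for one extension, if a section was provided for it.
    pub fn settings_for(&self, name: &str) -> Option<&BTreeMap<String, String>> {
        self.extension.get(name)
    }
}
```

Keeping the settings opaque at this layer would let each extension parse and validate its own section, so the gateway only has to validate names.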
| ## Open questions | ||
| - Should `reconcile` be authoritative (mutate external resources to match state) or advisory (log and alert), or extension-by-extension? Lean: extension decides via `ReconcileOutcome`, default advisory. |
Advisory seems right to start.
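One way to read "extension decides via `ReconcileOutcome`, default advisory" is a trait whose default `reconcile` only reports drift, which an extension can override to repair it. The Rust sketch below assumes that reading. The variants of `ReconcileOutcome`, the `Reconcile` trait, and the `LabelChecker` and `LabelFixer` types are hypothetical and are not the RFC's definitions.

```rust
/// Hypothetical outcome type; the RFC names `ReconcileOutcome` but these
/// variants are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ReconcileOutcome {
    InSync,
    /// Drift was found and reported, nothing was mutated.
    Advisory { drift: String },
    /// Drift was found and external state was changed to match.
    Repaired { action: String },
}

pub trait Reconcile {
    /// Returns a description of the drift, or None when in sync.
    fn detect_drift(&self, sandbox_id: &str) -> Option<String>;

    /// Default is advisory: report drift without touching anything.
    fn reconcile(&mut self, sandbox_id: &str) -> ReconcileOutcome {
        match self.detect_drift(sandbox_id) {
            None => ReconcileOutcome::InSync,
            Some(drift) => ReconcileOutcome::Advisory { drift },
        }
    }
}

fn label_drift(missing: bool, sandbox_id: &str) -> Option<String> {
    if missing {
        Some(format!("sandbox {} is missing its label", sandbox_id))
    } else {
        None
    }
}

/// Keeps the advisory default.
pub struct LabelChecker {
    pub missing_label: bool,
}

impl Reconcile for LabelChecker {
    fn detect_drift(&self, sandbox_id: &str) -> Option<String> {
        label_drift(self.missing_label, sandbox_id)
    }
}

/// Opts in to authoritative behavior by overriding `reconcile`.
pub struct LabelFixer {
    pub missing_label: bool,
}

impl Reconcile for LabelFixer {
    fn detect_drift(&self, sandbox_id: &str) -> Option<String> {
        label_drift(self.missing_label, sandbox_id)
    }

    fn reconcile(&mut self, sandbox_id: &str) -> ReconcileOutcome {
        match self.detect_drift(sandbox_id) {
            None => ReconcileOutcome::InSync,
            Some(_) => {
                // Stand-in for mutating the external resource.
                self.missing_label = false;
                ReconcileOutcome::Repaired {
                    action: format!("relabeled {}", sandbox_id),
                }
            }
        }
    }
}
```

Starting advisory-only would amount to shipping just the default method; the override path could be opened up later without changing the outcome type.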
Summary
Related Issue
Changes
Testing
mise run pre-commit passes
Checklist