Plan agent connectivity to the backend API from outside the cluster #206

## Background

The SmartEM Agent runs on Windows EPU workstations alongside microscopes, outside the k8s cluster that hosts smartem-decisions. It needs to reach the backend API to:

- Read and write acquisition data over REST (CRUD on sessions, gridsquares, foilholes, etc.)
- Receive ML recommendations via SSE
- Authenticate with Keycloak using the `SmartEM_Agent` confidential client (client-credentials grant, per smartem-decisions#284)
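
As a rough illustration of the client-credentials step, the sketch below builds (but does not send) the token request the agent would make. The token URL, realm name and secret are placeholders, not values from this issue, and sending the secret in the form body rather than via HTTP Basic auth is an assumption; Keycloak accepts either.

```python
from __future__ import annotations

from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(token_url: str, client_id: str, client_secret: str) -> Request:
    """Build an OAuth2 client-credentials token request (not sent here)."""
    body = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        }
    ).encode("ascii")
    return Request(
        token_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def bearer_header(access_token: str) -> dict[str, str]:
    """Header the agent would attach to REST and SSE calls to the backend."""
    return {"Authorization": f"Bearer {access_token}"}


if __name__ == "__main__":
    # Hypothetical URL and secret, for illustration only.
    req = build_token_request(
        "https://keycloak.example.org/realms/example/protocol/openid-connect/token",
        "SmartEM_Agent",
        "change-me",
    )
    print(req.get_method(), req.full_url)
```

Whichever route is chosen for the backend, the agent also needs to reach the Keycloak token endpoint from the workstation network.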

## Current state

- **Development (k3s)**: the agent reaches the backend via the `smartem-http-api-service` NodePort 30080 (`http://<node-ip>:30080`). Works only because the dev cluster lives on the same network as the developer machine.
- **Staging / production**: no defined story. The frontend k8s manifests landing in #205 cover browser traffic only: the SPA pod's own nginx reverse-proxies `/api/` to `smartem-http-api-service` from inside the pod, so the existing frontend ingress is **not** a route the agent can use from outside the cluster.

## What needs deciding

A deployment-friendly story for the agent's outside-cluster connectivity to the backend in non-dev environments. Sketch of the option space:

**Option A — Separate backend ingress**

- New `k8s/environments/{staging,production}/smartem-http-api-ingress.yaml` routing a dedicated host (e.g. `smartem-api-staging.diamond.ac.uk` / `smartem-api.diamond.ac.uk`) to `smartem-http-api-service`.
- Pros: clean separation, agent connects to a stable, well-named host; independent failure domain from the frontend; sizing matches workload (SSE + bulk REST, not browser navigations).
- Cons: extra TLS cert, extra DNS record, extra ingress rule.
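
To make Option A concrete, here is a minimal sketch of what such an Ingress could contain, written as a Python dict and dumped as JSON (which `kubectl apply -f` accepts as well as YAML). The Ingress name, the TLS secret name and the service port are assumptions for illustration; only the hostname and the service name come from this issue. It also assumes the backend serves its API from `/`.

```python
import json


def backend_ingress(host: str, tls_secret: str, service_port: int) -> dict:
    """Build a networking.k8s.io/v1 Ingress routing one host to the backend service."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        # Name is an assumption, derived from the proposed manifest filename.
        "metadata": {"name": "smartem-http-api-ingress"},
        "spec": {
            "tls": [{"hosts": [host], "secretName": tls_secret}],
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {
                                        "name": "smartem-http-api-service",
                                        "port": {"number": service_port},
                                    }
                                },
                            }
                        ]
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Secret name and port 8000 are hypothetical placeholders.
    manifest = backend_ingress("smartem-api-staging.diamond.ac.uk", "smartem-api-tls", 8000)
    print(json.dumps(manifest, indent=2))
```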

**Option B — Agent traffic via the frontend ingress**

- Reuse `smartem-staging.diamond.ac.uk` / `smartem.diamond.ac.uk`. Either (i) keep the SPA pod's nginx in the path, or (ii) add a second backend rule alongside `/` so the cluster ingress controller proxies `/api/` to `smartem-http-api-service` directly.
- Pros: one hostname, one cert, one ingress rule (variant ii also one fewer hop).
- Cons:
  - Variant (i): couples agent traffic to the SPA pod's nginx, intertwining failure modes; the SPA pod is sized for browser traffic, not N concurrent SSE streams.
  - Variant (ii): mixes user-facing and machine-facing traffic on the same name; same-origin is irrelevant for the agent (not browser-based).
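
For variant (ii), the behaviour to rely on is that Kubernetes Ingress `Prefix` paths match element by element and the longest matching path wins. The sketch below imitates that rule in plain Python so the intended split is easy to see. It is an illustration of the matching rule, not ingress controller code, and `smartem-frontend-service` is a hypothetical name for the SPA service.

```python
from __future__ import annotations


def _elements(path: str) -> list[str]:
    """Split a URL path into its non-empty '/'-separated elements."""
    return [part for part in path.split("/") if part]


def matches_prefix(rule_path: str, request_path: str) -> bool:
    """Element-wise prefix match, as for Ingress pathType: Prefix."""
    rule = _elements(rule_path)
    return _elements(request_path)[: len(rule)] == rule


# (path, service) pairs for variant (ii); the frontend service name is assumed.
RULES = [
    ("/", "smartem-frontend-service"),
    ("/api/", "smartem-http-api-service"),
]


def route(request_path: str, rules: list[tuple[str, str]] = RULES) -> str | None:
    """Return the service for the longest matching rule, or None if none match."""
    candidates = [
        (len(_elements(path)), service)
        for path, service in rules
        if matches_prefix(path, request_path)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]
```

With these rules, `/api/sessions` goes straight to the backend while `/apiary` still goes to the frontend, because matching is per path element rather than per character.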

**Option C — LoadBalancer on the backend service**

- Set `spec.type: LoadBalancer` on `smartem-http-api-service` in staging/production (or put a MetalLB / on-prem LB in front).
- Pros: simple, no ingress controller involved.
- Cons: on-prem LB scarcity; no TLS termination by default; one LB IP per service.

**Option D — Other**

For example, a service mesh, a per-microscope tunnel, or a relay that the agent connects through. Probably not warranted for the current shape of the workload, but worth a brief mention.

## Constraints to factor in

- **Auth**: agent uses `SmartEM_Agent` client-credentials against the DLS Keycloak realm. The backend already accepts tokens with `azp: SmartEM_Agent` (added to `KEYCLOAK_ALLOWED_AZP` in #205). No CORS concerns since the agent isn't a browser.
- **TLS**: agent traffic should be TLS-terminated at the ingress in non-dev environments. The agent does not need to live on the DLS internal network if a properly secured public ingress is exposed.
- **SSE**: the agent subscribes to ML recommendations via long-lived SSE streams. Whichever route is chosen must support that (ingress controller timeouts, response buffering off, keep-alives).
- **Scale**: multiple agents per facility, each holding at least one SSE connection plus periodic REST traffic.
- **Locality**: on-prem at DLS, the agent and cluster will share the DLS network, so the path can be much shorter than a public ingress. Worth deciding whether to design for a single deployment shape or two (DLS-internal vs federated facility).
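
To illustrate why the SSE constraint matters for the route, here is a minimal sketch of the client-side parsing the agent would need. Events only reach the agent if each line, including `:` comment lines used as keep-alives, is delivered promptly; a proxy that buffers the response or closes idle connections breaks this. The event name `recommendation` and the payload shape are hypothetical, not taken from the backend.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict[str, str]]:
    """Yield {'event', 'data'} dicts from an iterable of SSE text lines."""
    event_type = "message"  # default event type in the SSE format
    data: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates an event; dispatch only if data was seen.
            if data:
                yield {"event": event_type, "data": "\n".join(data)}
            event_type, data = "message", []
            continue
        if line.startswith(":"):
            # Comment line, commonly sent as a keep-alive; carries no event.
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one optional leading space is stripped
        if field == "event":
            event_type = value
        elif field == "data":
            data.append(value)
        # 'id' and 'retry' fields are ignored in this sketch.


if __name__ == "__main__":
    stream = [
        ": keep-alive\n",
        "\n",
        "event: recommendation\n",  # hypothetical event name
        'data: {"gridsquare": 1}\n',  # hypothetical payload
        "\n",
    ]
    for event in iter_sse_events(stream):
        print(event)
```

In practice the lines would come from a streaming HTTP response held open for the lifetime of the session, which is what the ingress timeout and buffering settings have to allow.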

## Related

- smartem-decisions#284 — agent auth strategy (closed: Keycloak client-credentials with `SmartEM_Agent`)
- smartem-devtools#205 — frontend k8s deploy (adds the frontend ingress; explicitly defers this agent connectivity story)
- smartem-devtools#181 — broader k8s modernisation (Gateway API, Ingress, ClusterIP); overlapping scope, this issue is the narrower agent-specific slice
- smartem-devtools#179 — staging/production manifests vs on-prem reality (where this lands in practice)

## Out of scope

- Implementing the choice. This issue is to decide; a follow-up tracks the manifest additions.
- The frontend's connectivity to the backend (already solved via SPA-pod-internal nginx proxy in #205).
