From da4071241d0399c76cae35a0b02c2dd6d58f2cd6 Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Tue, 9 Jun 2026 13:40:27 +0000 Subject: [PATCH 1/3] feat(coder-templates/firewalled): add landjail-firewalled Claude Code template Add a new "firewalled" workspace template: the claude-code template with the Coder Boundary agent firewall enabled. Claude Code runs inside a landjail (Landlock LSM) process-level network egress jail that denies all HTTP(S) egress except an allowlist (the in-boundary AI Gateway and in-cluster GitLab). Every denied request is audit-logged to coderd with owner, workspace, agent, URL, and template attribution. Wiring (claude-code module 4.7.3): enable_boundary=true, use_boundary_directly=true (standalone boundary binary; the coder boundary subcommand needs a logged-in CLI session the agent lacks), and a pre_install_script that writes ~/.config/coder_boundary/config.yaml with the allowlist and jail_type=landjail before Claude Code launches. Validated live on dev.usgov.coderdemo.io: build succeeds, the process tree shows agentapi -> boundary -> claude, allow/deny enforced (gateway 200, gitlab 302, example.com 403, github.com 403), and coderd emits boundary_request audit lines for Claude Code's own blocked telemetry egress. Generated by Coder Agents. 
--- coder-templates/firewalled/README.md | 238 ++++++++++++ coder-templates/firewalled/main.tf | 527 +++++++++++++++++++++++++++ 2 files changed, 765 insertions(+) create mode 100644 coder-templates/firewalled/README.md create mode 100644 coder-templates/firewalled/main.tf diff --git a/coder-templates/firewalled/README.md b/coder-templates/firewalled/README.md new file mode 100644 index 0000000..7506ed5 --- /dev/null +++ b/coder-templates/firewalled/README.md @@ -0,0 +1,238 @@ +# Firewalled Claude Code on Coder Agents (GovCloud demo template) + +Coder workspace template that runs **Claude Code as a Coder Agent** inside a +Kubernetes pod on the EKS cluster, wired through the **Coder AI Gateway (AI +Bridge)** and wrapped in the **Coder Boundary agent firewall**. The workspace +never holds a raw Anthropic API key: every request is +proxied through Coder using the workspace owner's session token and routed to +the configured provider (Anthropic-direct primary, Bedrock secondary) +in-boundary. + +This is the `claude-code` template with the agent firewall turned on. Claude +Code runs inside a process-level network egress jail (`landjail`, Landlock +LSM) that denies all HTTP(S) egress except an allowlist. The agent can reach +the in-boundary AI Gateway and the in-cluster GitLab; every other destination +is denied and audit-logged. This is the data-exfiltration / DLP guardrail +story for the AOI. + +Launching the template as a **Coder Task** opens the Claude Code chat UI and +seeds the agent with the task prompt. + +- `main.tf`: the template (providers `coder` + `kubernetes`). +- Workspace image: `codercom/enterprise-base:ubuntu-noble-20260601`, pulled + from the ECR mirror. + +## Agent firewall (Coder Boundary) + +The `module "claude_code"` block sets `enable_boundary = true` and +`use_boundary_directly = true`, so the module installs the standalone +`boundary` binary and launches `boundary -- claude`. 
The allowlist and jail +type are read from `~/.config/coder_boundary/config.yaml`, written by the +module `pre_install_script` before Claude Code starts: + +```yaml +allowlist: + - "domain=dev.usgov.coderdemo.io" # AI Gateway egress (REQUIRED) + - "domain=gitlab.usgov.coderdemo.io" # in-cluster GitLab SCM +jail_type: landjail +log_dir: /tmp/boundary_logs +log_level: warn +``` + +Why `use_boundary_directly = true`: the default `coder boundary` subcommand +verifies the deployment license via an authenticated client, but the agent +carries only an agent token (no user session), so the subcommand errors with +"not logged in". The standalone binary (MIT) has no license/login dependency. +landjail needs no added pod capabilities; the AL2023 node kernel (6.18) is +well past the Landlock 6.7 floor and `landlock` is in the node LSM stack. + +### Verify allow vs deny in a workspace terminal + +```bash +# Allowed: the AI Gateway host returns 200 +boundary -- curl -sS -o /dev/null -w '%{http_code}\n' \ + https://dev.usgov.coderdemo.io/api/v2/buildinfo + +# Denied: anything off the allowlist is blocked (boundary returns 403) +boundary -- curl -sS -o /dev/null -w '%{http_code}\n' https://example.com +``` + +Claude Code itself keeps working because its `ANTHROPIC_BASE_URL` points at +the allowlisted gateway host. To roll back to an un-firewalled workspace, use +the `claude-code` template instead (or set `enable_boundary = false`). 
+ +## What's inside + +| Piece | Resource | Notes | +|---|---|---| +| Agent | `coder_agent.main` | startup script, metadata, `display_apps` (VS Code Desktop, web terminal, SSH) | +| Claude Code | `module.claude_code` (`registry.coder.com/coder/claude-code/coder` **4.7.3**) | `enable_aibridge = true`, bundles AgentAPI + Claude Code web app, outputs `task_app_id` | +| Coder Task | `coder_ai_task.claude_code` | binds the Task UI to the Claude Code app; only created in a Task context | +| Browser IDE | `module.code_server` (`code-server` 1.3.1) | extra `coder_app` tile | +| Compute | `kubernetes_pod_v1.workspace` + `kubernetes_persistent_volume_claim_v1.home` | sizing from `cpu` / `memory` / `disk_size` parameters | +| AI auth | `coder_env.anthropic_auth_token` | exports `ANTHROPIC_AUTH_TOKEN` = session token | + +Parameters: `cpu`, `memory`, `disk_size`, and `ai_prompt` (fallback prompt for +non-Task builds). + +## AI Gateway wiring (end to end) + +1. The `claude_code` module is configured with `enable_aibridge = true`. On the + agent it sets: + - `ANTHROPIC_BASE_URL = /api/v2/aibridge/anthropic` + - `CLAUDE_API_KEY = ` + + With `CODER_ACCESS_URL=https://dev.usgov.coderdemo.io` the base URL resolves + to `https://dev.usgov.coderdemo.io/api/v2/aibridge/anthropic`. +2. This template additionally exports `ANTHROPIC_AUTH_TOKEN` (the same session + token) to match the AI Gateway client contract in `deploy/CONVENTIONS.md`. +3. Claude Code calls `ANTHROPIC_BASE_URL`. The Coder AI Gateway authenticates + the session token, applies governance/audit, and forwards the request to the + active provider: + - **Anthropic-direct** (primary): egress via the NAT gateway. + - **Bedrock** (secondary): IRSA on the `coder/coder` service account, model + `us-gov.anthropic.claude-sonnet-4-5-20250929-v1:0`, in-region only. + +No Anthropic key is stored in the workspace; the session token is the only +credential and it is scoped to the workspace owner. 
+ +### Model selection + +Model is left at the module default on purpose, because the requested model +name must match whichever provider the Gateway has live: + +- Anthropic-direct: an Anthropic id, e.g. `claude-sonnet-4-5-20250929`. +- Bedrock (GovCloud): the inference profile + `us-gov.anthropic.claude-sonnet-4-5-20250929-v1:0`. + +Pin one by uncommenting `model = "..."` in the module block once the live +provider is confirmed. Bedrock Claude access was still gated at authoring time +(see `STATUS.md`), so the safe default is to let Claude Code/Gateway negotiate. + +### Why module 4.7.3 and `enable_aibridge` (not `enable_ai_gateway`) + +Verified against the Coder registry: + +- `deploy/CONVENTIONS.md` and `versions.lock.yaml` pin the claude-code module + to **4.7.3**. +- In **4.7.x the input is `enable_aibridge`**. The `enable_ai_gateway` rename + (and an `ANTHROPIC_AUTH_TOKEN` the module sets itself) only appear in the + **5.x** line. +- The 5.x refactor **removed** the bundled AgentAPI integration and the + `task_app_id` output, which `coder_ai_task` requires. Staying on 4.7.3 is what + makes the Coder Tasks wiring in this template work. + +If the project later moves to claude-code 5.x, switch `enable_aibridge` → +`enable_ai_gateway`, drop the explicit `coder_env.anthropic_auth_token`, and add +a standalone `agentapi` module to supply `task_app_id` for `coder_ai_task`. + +## Cluster prerequisites + +The platform layer (Coder server + ingress + namespaces) is out of scope for +this directory. Before pushing/using the template, ensure: + +1. **Coder server** 2.34.0 with the AI Governance add-on license and the AI + Gateway providers configured (Anthropic-direct + Bedrock). See + `deploy/coder/`. +2. **Wildcard access URL** set so subdomain apps work + (`CODER_WILDCARD_ACCESS_URL=*.usgov.coderdemo.io`). The Claude Code web app + and code-server use `subdomain = true`. +3. 
**Workspaces namespace** exists: + + ```bash + kubectl create namespace coder-workspaces + ``` + +4. **Provisioner RBAC**: the Coder provisioner (service account `coder` in the + `coder` namespace) must be able to manage pods/PVCs in `coder-workspaces`. + Example (apply with the platform layer, not from this directory): + + ```yaml + apiVersion: rbac.authorization.k8s.io/v1 + kind: Role + metadata: + name: coder-workspace-provisioner + namespace: coder-workspaces + rules: + - apiGroups: [""] + resources: ["pods", "persistentvolumeclaims"] + verbs: ["create", "get", "list", "watch", "update", "patch", "delete"] + - apiGroups: [""] + resources: ["pods/exec", "pods/log"] + verbs: ["get", "create"] + - apiGroups: [""] + resources: ["events"] + verbs: ["get", "list", "watch"] + --- + apiVersion: rbac.authorization.k8s.io/v1 + kind: RoleBinding + metadata: + name: coder-workspace-provisioner + namespace: coder-workspaces + roleRef: + apiGroup: rbac.authorization.k8s.io + kind: Role + name: coder-workspace-provisioner + subjects: + - kind: ServiceAccount + name: coder + namespace: coder + ``` + +5. **Image pull**: the EKS node IAM role needs ECR read + (`ecr:GetAuthorizationToken`, `ecr:BatchGetImage`, + `ecr:GetDownloadUrlForLayer`) for + `430737322961.dkr.ecr.us-gov-west-1.amazonaws.com`. With that on the node + role, no `imagePullSecret` is required on the pod. The image must already be + mirrored into ECR (`scripts/mirror-images.sh`). + +## Pushing the template + +From the repo root: + +```bash +# First time: create the template. +coder templates push firewalled \ + --directory coder-templates/firewalled \ + --variable namespace=coder-workspaces + +# Subsequent updates push a new version. 
+coder templates push firewalled \ + --directory coder-templates/firewalled +``` + +Override the image or namespace at push time if needed: + +```bash +coder templates push firewalled \ + --directory coder-templates/firewalled \ + --variable namespace=coder-workspaces \ + --variable workspace_image=430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/docker-hub/codercom/enterprise-base:ubuntu-noble-20260601 +``` + +Template variables: + +| Variable | Default | Purpose | +|---|---|---| +| `namespace` | `coder-workspaces` | namespace for workspace pods | +| `workspace_image` | ECR-mirrored `enterprise-base` | workspace container image | +| `use_kubeconfig` | `false` | use a host kubeconfig instead of in-cluster config | + +## Using it + +- **As a workspace**: create a workspace from the template, open VS Code / + terminal / code-server, and run `claude` in the workspace. +- **As a Task**: create a Coder Task from this template and enter a prompt. + Coder injects the prompt via `data.coder_task.me.prompt`, the + `coder_ai_task` resource binds the Task UI to the Claude Code app, and the + agent reports status back to the Coder UI through AgentAPI. 
+ +## Verification status + +| Item | Source | Status | +|---|---|---| +| claude-code 4.7.3 inputs (`enable_aibridge`, `workdir`, `ai_prompt`, `report_tasks`, `subdomain`) and `task_app_id` output | module `main.tf` / `README.md` at tag `release/coder/claude-code/v4.7.3` | verified | +| `coder_ai_task.app_id` + `data.coder_task` (`enabled`, `prompt`) | `coder/terraform-provider-coder` docs; first shipped in provider **v2.13.0** | verified | +| Workspace image tag | Docker Hub `codercom/enterprise-base` | verified (`ubuntu-noble-20260601`) | +| `code-server` 1.3.1 | registry tag `release/coder/code-server/v1.3.1` | verified (latest is 1.5.0) | +| Live AI Gateway routing / Bedrock model access | runtime cluster | NOT verified here (no live infra access; Bedrock Claude access gated per `STATUS.md`) | diff --git a/coder-templates/firewalled/main.tf b/coder-templates/firewalled/main.tf new file mode 100644 index 0000000..4a19f44 --- /dev/null +++ b/coder-templates/firewalled/main.tf @@ -0,0 +1,527 @@ +# ============================================================================= +# Firewalled Claude Code on Coder Agents, GovCloud demo workspace template +# ============================================================================= +# Identical to the claude-code template, with the Coder Boundary agent +# firewall enabled. Claude Code runs inside a process-level network egress +# jail (landjail / Landlock LSM) that enforces an HTTP(S) allowlist. The +# agent can reach the in-boundary AI Gateway and the in-cluster GitLab, and +# every other egress is denied and audit-logged. This is the data-exfil / +# DLP guardrail story for the AOI. +# +# Boundary wiring (claude-code module 4.7.3 inputs): +# - enable_boundary = true wraps Claude Code with the firewall. +# - use_boundary_directly = true installs the standalone boundary +# binary (MIT) instead of the `coder boundary` subcommand. 
The subcommand +# path needs a logged-in coder CLI session (license check); the agent has +# only an agent token, so the standalone binary is the reliable path. +# - The module adds no --allow / --jail-type flags, so the allowlist and +# jail type come from ~/.config/coder_boundary/config.yaml, written by +# pre_install_script below before Claude Code launches. +# +# Allowlist (config.yaml): dev.usgov.coderdemo.io (AI Gateway egress, +# REQUIRED or Claude Code breaks) and gitlab.usgov.coderdemo.io (SCM). +# jail_type landjail needs no added capabilities (AL2023 kernel 6.18 +# exceeds the Landlock 6.7 floor; landlock is in the node LSM stack). +# +# Runs Claude Code as a Coder Agent inside a Kubernetes pod on the EKS +# cluster. Claude Code is wired through the Coder AI Gateway (AI Bridge) +# so the workspace never holds a raw Anthropic key: requests are proxied +# through Coder using the workspace owner's session token and routed to +# the configured provider (Anthropic-direct primary / Bedrock secondary) +# in-boundary. +# +# Launching this template as a Coder Task surfaces the Claude Code chat UI +# (via the bundled AgentAPI app) and seeds the agent with the task prompt. +# +# VERSION / INPUT NAMING, verified against the Coder registry: +# - claude-code module is pinned to 4.7.3 (the version in +# deploy/CONVENTIONS.md / versions.lock.yaml). +# - In 4.7.3 the AI Gateway input is named `enable_aibridge` (NOT +# `enable_ai_gateway`). The `enable_ai_gateway` rename landed in the +# 5.x line, which also REMOVED the bundled AgentAPI integration and +# the `task_app_id` output that `coder_ai_task` depends on. Staying on +# 4.7.3 is what makes the Coder Tasks wiring below possible. +# - `enable_aibridge = true` makes the module set, on the agent: +# ANTHROPIC_BASE_URL = /api/v2/aibridge/anthropic +# CLAUDE_API_KEY = +# With CODER_ACCESS_URL=https://dev.usgov.coderdemo.io the base URL +# resolves to https://dev.usgov.coderdemo.io/api/v2/aibridge/anthropic. 
+# - We additionally export ANTHROPIC_AUTH_TOKEN (session token) to match +# the AI Gateway client contract in deploy/CONVENTIONS.md. +# +# See README.md for the end-to-end AI Gateway wiring and cluster +# prerequisites (namespace + provisioner RBAC). +# ============================================================================= + +terraform { + required_providers { + coder = { + source = "coder/coder" + # `data.coder_task` and `coder_ai_task.app_id` require provider >= 2.13.0. + version = ">= 2.13.0" + } + kubernetes = { + source = "hashicorp/kubernetes" + version = ">= 2.23" + } + } +} + +# ----------------------------------------------------------------------------- +# Providers +# ----------------------------------------------------------------------------- + +provider "coder" {} + +variable "use_kubeconfig" { + type = bool + description = "Use a host kubeconfig instead of in-cluster config. Leave false when the Coder provisioner runs inside the cluster." + default = false +} + +variable "namespace" { + type = string + description = "Kubernetes namespace that hosts workspace pods. The platform layer must create this namespace and grant the provisioner RBAC (see README)." + default = "coder-workspaces" +} + +# Workspace container image (ECR mirror). +# +# Upstream ref : docker.io/codercom/enterprise-base:ubuntu-noble-20260601 +# ECR mirror : per deploy/CONVENTIONS.md the docker.io -> ECR mapping is +# docker.io/: -> /docker-hub/: +# +# codercom/enterprise-base is Coder's maintained Kubernetes workspace base +# image: runs as user `coder` (uid 1000), ships git/curl/sudo, and is the +# canonical base for Coder's official Kubernetes template. Claude Code and +# AgentAPI install as standalone binaries into $HOME/.local/bin, so no +# Node.js/npm is required in the base image. +variable "workspace_image" { + type = string + description = "Fully-qualified workspace image. Defaults to the ECR-mirrored codercom/enterprise-base." 
+ default = "430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/docker-hub/codercom/enterprise-base:ubuntu-noble-20260601" +} + +provider "kubernetes" { + config_path = var.use_kubeconfig ? "~/.kube/config" : null +} + +data "coder_provisioner" "me" {} +data "coder_workspace" "me" {} +data "coder_workspace_owner" "me" {} + +# Populated when the workspace is created as a Coder Task. `enabled` is +# false for a normal workspace build, and `prompt` carries the task prompt. +data "coder_task" "me" {} + +# ----------------------------------------------------------------------------- +# Git external auth: in-cluster GitLab (in-boundary) +# ----------------------------------------------------------------------------- +# Every workspace authenticates git against the in-cluster GitLab through +# Coder's external-auth provider `gitlab` (configured on the Coder server, see +# deploy/coder/values.yaml CODER_EXTERNAL_AUTH_0_*). Declaring this data source +# makes the workspace REQUIRE a GitLab login: the dashboard surfaces a "Login +# with GitLab" control and the agent only reports the auth as satisfied once +# the owner has completed the OAuth flow. The Coder agent's git credential +# helper then injects the short-lived OAuth token for any clone/fetch/push to +# gitlab.usgov.coderdemo.io. No PATs or SSH keys live in the workspace, and no +# auth path leaves the GovCloud boundary. +# +# id MUST match CODER_EXTERNAL_AUTH_0_ID on the Coder server ("gitlab"). +data "coder_external_auth" "gitlab" { + id = "gitlab" +} + +# ----------------------------------------------------------------------------- +# Parameters: sizing and the AI task prompt +# ----------------------------------------------------------------------------- + +data "coder_parameter" "cpu" { + name = "cpu" + display_name = "CPU Cores" + description = "CPU limit for the workspace pod." 
+ type = "number" + default = "4" + mutable = true + icon = "/icon/memory.svg" + + option { + name = "2 Cores" + value = "2" + } + option { + name = "4 Cores" + value = "4" + } + option { + name = "8 Cores" + value = "8" + } +} + +data "coder_parameter" "memory" { + name = "memory" + display_name = "Memory (GB)" + description = "Memory limit for the workspace pod." + type = "number" + default = "8" + mutable = true + icon = "/icon/memory.svg" + + option { + name = "4 GB" + value = "4" + } + option { + name = "8 GB" + value = "8" + } + option { + name = "16 GB" + value = "16" + } +} + +data "coder_parameter" "disk_size" { + name = "disk_size" + display_name = "Disk Size (GB)" + description = "Persistent /home/coder volume size. Cannot be changed after creation." + type = "number" + default = "20" + mutable = false + icon = "/icon/database.svg" + + option { + name = "10 GB" + value = "10" + } + option { + name = "20 GB" + value = "20" + } + option { + name = "50 GB" + value = "50" + } +} + +# Fallback prompt for non-Task workspace builds. When the workspace is +# launched as a Coder Task, data.coder_task.me.prompt takes precedence. +data "coder_parameter" "ai_prompt" { + name = "ai_prompt" + display_name = "Initial AI Prompt" + description = "Seed prompt for Claude Code. Ignored when launched as a Coder Task (the Task prompt is used instead)." + type = "string" + default = "" + mutable = true + icon = "/icon/claude.svg" +} + +locals { + # Prefer the Coder Task prompt; fall back to the parameter for plain builds. + effective_prompt = data.coder_task.me.prompt != "" ? data.coder_task.me.prompt : data.coder_parameter.ai_prompt.value + + # For documentation/readme parity. The claude-code module derives the + # same value internally from data.coder_workspace.me.access_url. 
+ ai_gateway_anthropic_url = "${data.coder_workspace.me.access_url}/api/v2/aibridge/anthropic" +} + +# ----------------------------------------------------------------------------- +# Agent +# ----------------------------------------------------------------------------- + +resource "coder_agent" "main" { + arch = data.coder_provisioner.me.arch + os = "linux" + + # Claude Code + AgentAPI are installed by the claude-code module's own + # coder_script (native binaries into $HOME/.local/bin). This startup + # script only normalizes PATH and signals readiness. + startup_script = <<-EOT + #!/bin/bash + set -e + touch ~/.bashrc + grep -qF '$HOME/.local/bin' ~/.profile 2>/dev/null || \ + echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.profile + echo "=== Workspace ready ===" + EOT + + env = { + EDITOR = "code" + VISUAL = "code" + + # No docker socket in the pod; opt out of devcontainer auto-detection + # so the dashboard does not hang polling `docker ps`. + CODER_AGENT_DEVCONTAINERS_ENABLE = "false" + } + + metadata { + display_name = "CPU Usage" + key = "cpu_usage" + script = "coder stat cpu" + interval = 10 + timeout = 1 + } + + metadata { + display_name = "Memory Usage" + key = "mem_usage" + script = "coder stat mem" + interval = 10 + timeout = 1 + } + + metadata { + display_name = "Disk Usage" + key = "disk_usage" + script = "coder stat disk --path /home/coder" + interval = 60 + timeout = 1 + } + + display_apps { + vscode = true + vscode_insiders = false + web_terminal = true + ssh_helper = true + port_forwarding_helper = true + } +} + +# ----------------------------------------------------------------------------- +# AI Gateway client auth +# ----------------------------------------------------------------------------- +# The claude-code module (enable_aibridge = true) already sets +# ANTHROPIC_BASE_URL and CLAUDE_API_KEY. 
We additionally export +# ANTHROPIC_AUTH_TOKEN with the workspace owner's session token to match +# the AI Gateway client contract documented in deploy/CONVENTIONS.md. Both +# carry the same session token, so there is no conflict; no raw Anthropic +# API key is ever placed in the workspace. +resource "coder_env" "anthropic_auth_token" { + agent_id = coder_agent.main.id + name = "ANTHROPIC_AUTH_TOKEN" + value = data.coder_workspace_owner.me.session_token +} + +# ----------------------------------------------------------------------------- +# Claude Code (Coder registry module) + Coder Task +# ----------------------------------------------------------------------------- + +module "claude_code" { + source = "registry.coder.com/coder/claude-code/coder" + version = "4.7.3" + agent_id = coder_agent.main.id + + # Required by the module: directory Claude Code runs in. Pre-created and + # trust-accepted by the module. + workdir = "/home/coder" + + # Route Claude Code through the Coder AI Gateway (AI Bridge) instead of + # talking to api.anthropic.com directly. Sets ANTHROPIC_BASE_URL + + # CLAUDE_API_KEY (session token) on the agent. Mutually exclusive with + # claude_api_key / claude_code_oauth_token. + enable_aibridge = true + + # --------------------------------------------------------------------------- + # Coder Boundary agent firewall (this is the "firewalled" variant) + # --------------------------------------------------------------------------- + # Wrap Claude Code in a process-level network egress jail. The module + # launches boundary as a wrapper around the claude process, denying all + # egress except the allowlist below. landjail uses the Landlock LSM and + # needs no added pod capabilities. + enable_boundary = true + + # Install the standalone boundary binary (MIT) rather than using the + # `coder boundary` subcommand. 
The subcommand verifies the deployment + # license via an authenticated client; the agent only carries an agent + # token (no user session), so the subcommand path errors with "not logged + # in". The standalone binary has no license/login dependency. + use_boundary_directly = true + boundary_version = "latest" + + # The 4.7.3 module passes no --allow / --jail-type flags to boundary, so + # this config file is the ONLY source of the allowlist and jail type. It + # must exist before Claude Code starts, so it is written in + # pre_install_script (runs before the start script that launches boundary). + # Allowing dev.usgov.coderdemo.io is REQUIRED: it is the AI Gateway egress + # that Claude Code depends on. Everything not listed is denied + audited. + pre_install_script = <<-EOT + #!/bin/bash + set -e + mkdir -p "$HOME/.config/coder_boundary" /tmp/boundary_logs + cfg="$HOME/.config/coder_boundary/config.yaml" + { + echo 'allowlist:' + echo ' - "domain=dev.usgov.coderdemo.io"' + echo ' - "domain=gitlab.usgov.coderdemo.io"' + echo 'jail_type: landjail' + echo 'log_dir: /tmp/boundary_logs' + echo 'log_level: warn' + } > "$cfg" + echo "[firewalled] wrote boundary config:" + cat "$cfg" + EOT + + # Coder Tasks: seed the agent and report task status to the Coder UI via + # AgentAPI. Empty string for plain builds -> Claude Code starts idle. + ai_prompt = local.effective_prompt + report_tasks = true + + # Serve the Claude Code web app on a subdomain. Requires the wildcard + # access URL (*.usgov.coderdemo.io) configured on the Coder server. + subdomain = true + + # Model selection is intentionally left at the module default. With the + # AI Gateway, the requested model name must match the active provider: + # - Anthropic-direct (primary): an Anthropic model id, e.g. + # "claude-sonnet-4-5-20250929". + # - Bedrock (secondary): the GovCloud inference profile, e.g. + # "us-gov.anthropic.claude-sonnet-4-5-20250929-v1:0". 
+ # Pin one explicitly only after confirming which provider is live: + # model = "claude-sonnet-4-5-20250929" +} + +# Marks this workspace build as a Coder AI Task and binds the Task UI to the +# Claude Code AgentAPI app. Only created in a Task context so normal +# workspace builds are unaffected. +resource "coder_ai_task" "claude_code" { + count = data.coder_task.me.enabled ? data.coder_workspace.me.start_count : 0 + app_id = module.claude_code.task_app_id +} + +# code-server: VS Code in the browser (an additional coder_app tile). +module "code_server" { + count = data.coder_workspace.me.start_count + source = "registry.coder.com/coder/code-server/coder" + version = "1.3.1" + agent_id = coder_agent.main.id + folder = "/home/coder" + subdomain = true + order = 1 +} + +# ----------------------------------------------------------------------------- +# Kubernetes resources +# ----------------------------------------------------------------------------- + +resource "kubernetes_persistent_volume_claim_v1" "home" { + metadata { + name = "coder-${data.coder_workspace.me.id}-home" + namespace = var.namespace + labels = { + "app.kubernetes.io/name" = "coder-workspace" + "app.kubernetes.io/instance" = "coder-${data.coder_workspace.me.id}" + "app.kubernetes.io/part-of" = "coder" + } + } + wait_until_bound = false + spec { + access_modes = ["ReadWriteOnce"] + resources { + requests = { + storage = "${data.coder_parameter.disk_size.value}Gi" + } + } + } + + lifecycle { + ignore_changes = all + } +} + +resource "kubernetes_pod_v1" "workspace" { + count = data.coder_workspace.me.start_count + + metadata { + name = "coder-${data.coder_workspace.me.id}" + namespace = var.namespace + labels = { + "app.kubernetes.io/name" = "coder-workspace" + "app.kubernetes.io/instance" = "coder-${data.coder_workspace.me.id}" + "app.kubernetes.io/part-of" = "coder" + } + } + + spec { + # enterprise-base runs as the `coder` user (uid/gid 1000). 
+ security_context { + run_as_user = 1000 + fs_group = 1000 + } + + container { + name = "dev" + image = var.workspace_image + image_pull_policy = "IfNotPresent" + command = ["sh", "-c", coder_agent.main.init_script] + + security_context { + run_as_user = 1000 + # enterprise-base grants the coder user passwordless sudo. The + # claude-code/agentapi module installs the agentapi binary to + # /usr/local/bin via sudo, which requires privilege escalation. + # Disabling it sets the kernel no_new_privs flag and breaks that + # install (and the Coder Tasks chat UI it powers). + allow_privilege_escalation = true + } + + env { + name = "CODER_AGENT_TOKEN" + value = coder_agent.main.token + } + + env { + name = "CODER_AGENT_URL" + value = data.coder_workspace.me.access_url + } + + resources { + requests = { + "cpu" = "500m" + "memory" = "${max(2, floor(data.coder_parameter.memory.value / 2))}Gi" + } + limits = { + "cpu" = "${data.coder_parameter.cpu.value}" + "memory" = "${data.coder_parameter.memory.value}Gi" + } + } + + volume_mount { + mount_path = "/home/coder" + name = "home" + read_only = false + } + } + + volume { + name = "home" + persistent_volume_claim { + claim_name = kubernetes_persistent_volume_claim_v1.home.metadata[0].name + } + } + + affinity { + pod_anti_affinity { + preferred_during_scheduling_ignored_during_execution { + weight = 1 + pod_affinity_term { + topology_key = "kubernetes.io/hostname" + label_selector { + match_expressions { + key = "app.kubernetes.io/name" + operator = "In" + values = ["coder-workspace"] + } + } + } + } + } + } + } + + # The agent token is baked into init_script; ignore_changes keeps a + # running pod intact across template re-applies / prebuild claims. 
+ lifecycle { + ignore_changes = all + } +} From b44344706988a3f758c8167cc14d2a879d43656c Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Tue, 9 Jun 2026 13:46:41 +0000 Subject: [PATCH 2/3] docs(aoi): add AOI gap remediation plan and task briefs Add the AOI gap remediation plan (firewall + authenticated MCP, with the firewall section updated to as-built and validated) and three execution-ready briefs so the remaining tasks can be run in parallel: - brief-github-auth-mcp.md: stand up an authenticated MCP (GitHub hosted MCP via PAT/OAuth), including the 200/202-vs-204 client gate to check first and an in-boundary datastore-mcp fallback. - brief-observability-audit-readiness.md: verify the boundary and AI Gateway Grafana dashboards and the Coder audit log show live demo data; confirms the boundary forwarded-batch metric name from source. - brief-template-golden-path-e2e.md: WS-25 per-template build + connectivity matrix, including the GitLab external-auth gate and the admin REST create-for-authenticated-owner workaround. Generated by Coder Agents. --- aoi/brief-github-auth-mcp.md | 303 +++++++++++++++++++++ aoi/brief-observability-audit-readiness.md | 261 ++++++++++++++++++ aoi/brief-template-golden-path-e2e.md | 215 +++++++++++++++ aoi/plan-firewall-and-auth-mcp.md | 284 +++++++++++++++++++ 4 files changed, 1063 insertions(+) create mode 100644 aoi/brief-github-auth-mcp.md create mode 100644 aoi/brief-observability-audit-readiness.md create mode 100644 aoi/brief-template-golden-path-e2e.md create mode 100644 aoi/plan-firewall-and-auth-mcp.md diff --git a/aoi/brief-github-auth-mcp.md b/aoi/brief-github-auth-mcp.md new file mode 100644 index 0000000..5a2d268 --- /dev/null +++ b/aoi/brief-github-auth-mcp.md @@ -0,0 +1,303 @@ +# Brief: Authenticated MCP Server in Coder Agents (GitHub hosted MCP) + +## 1. 
Objective and demo narrative + +Stand up an authenticated MCP server in Coder Agents on +`https://dev.usgov.coderdemo.io` that demonstrates real authentication plus +need-to-know. The approved backend is GitHub's hosted MCP +(`https://api.githubcopilot.com/mcp/`), accessed read-only with a fine-scoped +GitHub token. Narrative: "Coder Agents reaching an authenticated internal +service. The agent can only call tools the credential is allowed to call, and +each user sees only what their identity can access." Attribution (WS-23) is out +of scope. The single highest risk is a client/server protocol mismatch on +`notifications/initialized` (the 204 gate, see section 3), so verify the gate +before committing the demo to GitHub. + +## 2. Prerequisites + +- Admin Coder session token in `$TOKEN` and `CODER_URL=https://dev.usgov.coderdemo.io`. + Environment and admin token setup is documented elsewhere; assume it is ready. +- A fine-scoped GitHub Personal Access Token (PAT) from the user. Use a throwaway + demo org/repo to keep blast radius small. +- Recommended PAT scopes: + - Fine-grained, read-only: Contents Read, Metadata Read, Issues Read, + Pull Requests Read; optional Actions Read; org Members Read; Email Read. + - Classic alternative: `read:user`, `user:email`, `read:org`, `repo`, paired + with the `X-MCP-Readonly: true` header as defense in depth. +- For Path B only: ability to create a GitHub OAuth App in the chosen org. 
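A small guard can confirm the first prerequisite before any request is sent. This is a minimal sketch, not part of Coder's tooling: `check_prereqs` is a hypothetical helper name, and it only checks that `TOKEN` and `CODER_URL` are non-empty, not that the token is valid or carries admin rights.

```sh
# Hypothetical helper (not part of Coder): fail fast if the variables this
# brief relies on are unset or empty.
check_prereqs() {
  missing=0
  for name in TOKEN CODER_URL; do
    # Indirect lookup of the variable whose name is held in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  # 0 when both variables are set, 1 otherwise.
  return "$missing"
}
```

Running `check_prereqs || exit 1` at the top of a session stops early instead of sending unauthenticated requests.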
+ +Field reference (verified against `codersdk/mcp.go`, +`CreateMCPServerConfigRequest`): `display_name` (required), `slug` (required), +`description`, `icon_url`, `transport` (required, oneof `streamable_http` `sse`), +`url` (required, url), `auth_type` (required, oneof `none` `oauth2` `api_key` +`custom_headers` `user_oidc`), `oauth2_client_id`, `oauth2_client_secret`, +`oauth2_auth_url`, `oauth2_token_url`, `oauth2_scopes`, `api_key_header`, +`api_key_value`, `custom_headers` (map of string to string), `tool_allow_list`, +`tool_deny_list`, `availability` (required, oneof `force_on` `default_on` +`default_off`), `enabled`, `model_intent`, `allow_in_plan_mode`, +`forward_coder_headers`. The POST returns HTTP 201 with the created object +including `id`. + +## 3. THE GATE: 204 vs 202 (verify FIRST) + +Coder's MCP client is `mark3labs/mcp-go` v0.38.0, which accepts only HTTP 200 or +202 on the `notifications/initialized` POST. GitLab's MCP returned 204 and was +dropped (CODAGT-570). GitHub's status on that notification is unverified, so this +gate decides whether GitHub MCP is usable as-is. + +Most authoritative procedure (register, then read coderd logs): + +1. Mint the PAT (section 2). +2. Register the GitHub MCP in Coder with `api_key` + the PAT (section 4 body). +3. Trigger a connection: open a Coder Agents chat with the server enabled, or + list servers, so coderd attempts to connect. +4. Watch coderd logs for a connection-failure line mentioning status 204: + +```sh +kubectl -n coder logs deploy/coder --tail=400 | \ + grep -iE "skipping MCP server.*connection failure|status 204|notifications/initialized" +``` + +Optional direct probe (confirms GitHub's behavior independent of Coder). 
Read +the status line on the `notifications/initialized` POST: + +```sh +# 1) initialize (capture the Mcp-Session-Id response header if present) +curl -sS -D - -o /dev/null -X POST https://api.githubcopilot.com/mcp/ \ + -H "Authorization: Bearer " \ + -H "Content-Type: application/json" \ + -H "Accept: application/json, text/event-stream" \ + -H "X-MCP-Readonly: true" \ + --data '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"gate-check","version":"0.0.1"}}}' + +# 2) notifications/initialized (echo back Mcp-Session-Id from step 1 if returned) +curl -sS -D - -o /dev/null -X POST https://api.githubcopilot.com/mcp/ \ + -H "Authorization: Bearer " \ + -H "Content-Type: application/json" \ + -H "Accept: application/json, text/event-stream" \ + -H "X-MCP-Readonly: true" \ + -H "Mcp-Session-Id: " \ + --data '{"jsonrpc":"2.0","method":"notifications/initialized","params":{}}' +``` + +Pass/fail decision: + +- Status 200 or 202, and no "skipping MCP server" line: PASS. Proceed with Path A + (or Path B for the per-user headline). +- Status 204, or coderd logs the connection-failure/204 line: FAIL. GitHub MCP is + unusable as-is. Switch to Fallback C (in-boundary datastore MCP), which we + control and can make return 202. + +## 4. Path A (recommended, fastest): api_key + PAT + +Simplest and genuinely authenticated; it is also the same registration that +clears the gate. Caveat: one PAT is one shared identity, so per-user need-to-know +requires either one server per demoed user (per-user PATs) or Path B. + +Exact JSON body. 
`api_key_value` is set verbatim, so it MUST include the +`Bearer ` prefix: + +```json +{ + "display_name": "GitHub (Internal Service)", + "slug": "github", + "description": "Read-only GitHub access via GitHub hosted MCP.", + "transport": "streamable_http", + "url": "https://api.githubcopilot.com/mcp/", + "auth_type": "api_key", + "api_key_header": "Authorization", + "api_key_value": "Bearer ", + "tool_allow_list": [ + "get_me", + "search_repositories", + "get_repository", + "search_code", + "list_issues", + "get_issue", + "list_pull_requests", + "get_pull_request" + ], + "availability": "default_off", + "enabled": true +} +``` + +Register: + +```sh +curl -sS -X POST "$CODER_URL/api/experimental/mcp/servers" \ + -H "Coder-Session-Token: $TOKEN" -H "Content-Type: application/json" \ + --data @path/to/body.json +``` + +X-MCP-Readonly header approach (important). The `api_key` auth type sends exactly +ONE header (`api_key_header`/`api_key_value`). It cannot also send a second static +header such as `X-MCP-Readonly: true`. Per `codersdk/mcp.go`, sending multiple +static headers requires `auth_type: custom_headers` with a `custom_headers` map. +To send both the bearer token and the read-only header, use this body instead: + +```json +{ + "display_name": "GitHub (Internal Service)", + "slug": "github", + "description": "Read-only GitHub access via GitHub hosted MCP.", + "transport": "streamable_http", + "url": "https://api.githubcopilot.com/mcp/", + "auth_type": "custom_headers", + "custom_headers": { + "Authorization": "Bearer ", + "X-MCP-Readonly": "true" + }, + "tool_allow_list": [ + "get_me", + "search_repositories", + "get_repository", + "search_code", + "list_issues", + "get_issue", + "list_pull_requests", + "get_pull_request" + ], + "availability": "default_off", + "enabled": true +} +``` + +Recommendation: use the `custom_headers` body if you want `X-MCP-Readonly: true` +as defense in depth (preferred). 
Use the `api_key` body only if a single header is +acceptable and the PAT scopes alone enforce read-only. Keep `availability` +`default_off` and `enabled` true so the server exists but users opt in per chat. + +## 5. Path B (per-user RBAC headline): manual oauth2 + GitHub OAuth App + +Best per-user need-to-know story: each user clicks Connect once, Coder stores a +per-user GitHub token, and each user sees only what their GitHub identity allows. +GitHub advertises no DCR `registration_endpoint`, so oauth2 MUST be manual +(pre-registered GitHub OAuth App). For manual oauth2, supply ALL of +`oauth2_client_id`, `oauth2_auth_url`, and `oauth2_token_url`, otherwise Coder +attempts auto-DCR (which fails for GitHub). + +Callback sequencing problem: the OAuth App callback must be +`https://dev.usgov.coderdemo.io/api/experimental/mcp/servers/{id}/oauth2/callback`, +but `{id}` does not exist until the Coder MCP row is created. Resolve in this +order: + +1. Create the Coder MCP row first with placeholder oauth2 values so Coder mints + the `{id}` (returned in the 201 response): + +```json +{ + "display_name": "GitHub (Per-User)", + "slug": "github-oauth", + "transport": "streamable_http", + "url": "https://api.githubcopilot.com/mcp/", + "auth_type": "oauth2", + "oauth2_client_id": "placeholder", + "oauth2_client_secret": "placeholder", + "oauth2_auth_url": "https://github.com/login/oauth/authorize", + "oauth2_token_url": "https://github.com/login/oauth/access_token", + "oauth2_scopes": "read:user user:email read:org repo", + "tool_allow_list": ["get_me", "search_repositories", "get_repository", "list_issues", "get_issue"], + "availability": "default_off", + "enabled": false +} +``` + +2. Create (or edit) the GitHub OAuth App and set its Authorization callback URL to + `https://dev.usgov.coderdemo.io/api/experimental/mcp/servers/{id}/oauth2/callback` + using the `{id}` from step 1. +3. 
Patch the Coder row with the real client id/secret and enable it: + +```sh +curl -sS -X PATCH "$CODER_URL/api/experimental/mcp/servers/{id}" \ + -H "Coder-Session-Token: $TOKEN" -H "Content-Type: application/json" \ + --data '{"oauth2_client_id":"","oauth2_client_secret":"","enabled":true}' +``` + +4. Each user opens the connect URL + (`$CODER_URL/api/experimental/mcp/servers/{id}/oauth2/connect`) from the chat UI, + authorizes once, and Coder stores their per-user token. Note: oauth2 does not + carry the `X-MCP-Readonly` header; enforce read-only via scopes and + `tool_allow_list`. + +## 6. Fallback C (in-boundary, clean optics): authenticated datastore MCP + +If the gate fails or egress optics must stay inside the GovCloud boundary, add +auth to the existing datastore MCP (`deploy/datastore-mcp`). It currently runs as +`auth_type: none` at +`http://datastore-mcp.coder-demo-mcp.svc.cluster.local:8000/mcp` and is reached +in-cluster. Because we own the code, we control the `notifications/initialized` +response and can guarantee the 202 gate passes. Ranked options: + +1. Manual `oauth2` via Keycloak: real per-user auth, in-boundary, best optics. The + MCP server must validate the access token (issuer, audience, expiry) and map + the subject to authorized rows. Supply Keycloak `oauth2_auth_url`, + `oauth2_token_url`, `oauth2_client_id`, `oauth2_client_secret`, `oauth2_scopes`, + and set the Keycloak client callback to the Coder + `/oauth2/callback` URL for that server `{id}` (same sequencing as Path B). +2. `user_oidc`: Coder forwards the user's OIDC token to the MCP server, which must + verify the audience and enforce per-user access. Less setup than full oauth2, + still per-user. +3. `api_key`: shared static credential, simplest, but a single shared identity (no + per-user need-to-know). 
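Whichever option is chosen, the fallback server still has to clear the section 3 gate on `notifications/initialized`. As a rough sketch of that pass/fail rule: the `gate_verdict` helper name is hypothetical, and treating every status other than 200 or 202 as a failure is an assumption that follows from the client accepting only those two codes.

```sh
# Hypothetical helper: classify the HTTP status observed on the
# notifications/initialized POST according to the section 3 rule.
gate_verdict() {
  case "$1" in
    200|202) echo "PASS" ;;  # the only statuses the MCP client accepts per section 3
    *)       echo "FAIL" ;;  # includes 204, the status that got GitLab's MCP dropped
  esac
}
```

For example, `gate_verdict 204` prints `FAIL`, which is the signal to switch to this fallback.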
+ +Implementation note: the current datastore server does not validate the inbound +Authorization header (see `server/main.go`), so options 1 and 2 require adding +token verification before they are a true auth demo. Option 3 only requires Coder +to send the header and the server to check it. + +## 7. Verification + +- Connected: re-run the section 3 log grep and confirm NO "skipping MCP server" + line for the slug. Optionally `GET $CODER_URL/api/experimental/mcp/servers` and + confirm the row is present with `enabled: true`. +- Visible to the model: open a Coder Agents chat, enable the server (it is + `default_off`), and confirm the tools appear in the chat tools listing / + model picker as `github__` (datastore tools appear as `datastore__`, + same `slug__tool` convention). +- Smoke test (read-only): ask the agent to call a read-only tool, for example + `github__get_me` ("who am I authenticated as?") or + `github__search_repositories` against the throwaway demo org. Confirm it returns + real data and that a write-style tool is absent because it is not in + `tool_allow_list`. + +## 8. Rollback + +- Disable (keep the row): PATCH `enabled:false`. + +```sh +curl -sS -X PATCH "$CODER_URL/api/experimental/mcp/servers/{id}" \ + -H "Coder-Session-Token: $TOKEN" -H "Content-Type: application/json" \ + --data '{"enabled":false}' +``` + +- Delete (remove the row): DELETE returns HTTP 204. + +```sh +curl -sS -X DELETE "$CODER_URL/api/experimental/mcp/servers/{id}" \ + -H "Coder-Session-Token: $TOKEN" +``` + +- Revoke the PAT or the GitHub OAuth App in GitHub after the demo. For Path B, + users can also disconnect their token via + `DELETE $CODER_URL/api/experimental/mcp/servers/{id}/oauth2/disconnect`. + +## 9. Risks and open questions + +- 204 gate (highest risk): if GitHub returns 204 on `notifications/initialized`, + GitHub MCP is unusable as-is and the demo must use Fallback C. Verify before + committing. 
+- Egress / optics: GitHub MCP egresses to public GitHub, so packets and tokens + leave the GovCloud boundary even though the narrative says "internal service." + Mitigate with read-only tools, `X-MCP-Readonly: true`, a scoped PAT, and a + throwaway org/repo. If optics must stay in-boundary, make Fallback C primary. +- Shared vs per-user identity: Path A (api_key) is one shared identity. The + per-user need-to-know headline needs Path B (oauth2) or one server per user. +- The MCP servers config is a live, DB-resident object, not in git, so the row + must be recreated by hand if the database is reset. +- Open: which GitHub org/repos for the PAT or OAuth App? Is calling `github.com` + acceptable for demo optics, or must the authenticated MCP stay in-boundary + (then Fallback C is primary)? Auth headline preference: per-user RBAC (oauth2) + or fastest-authenticated (api_key)? + +Generated by Coder Agents. diff --git a/aoi/brief-observability-audit-readiness.md b/aoi/brief-observability-audit-readiness.md new file mode 100644 index 0000000..e149f68 --- /dev/null +++ b/aoi/brief-observability-audit-readiness.md @@ -0,0 +1,261 @@ +# Brief: Observability and Audit Readiness for the Thursday Demo + +Execution-ready verification brief. Read-only. Another agent will execute it. + +Authoritative context (verified this session): + +- Deployment: https://dev.usgov.coderdemo.io, Coder v2.34.1 enterprise, GovCloud + EKS, namespace `coder`. AI Governance add-on entitled (AI Bridge + Boundary). +- Coder Boundary (Agent Firewall) is enabled on a "firewalled" template. A live + jailed workspace `austenplatform/firewall-test` is running. coderd now emits + structured `boundary_request` audit lines (msg=boundary_request), visible via + `kubectl -n coder logs deploy/coder`. Source: + `/home/coder/coder/coderd/agentapi/boundary_logs.go`. 
+- Observability assets base path (this is where the files actually live; the + repo-relative form `deploy/observability/*` is used below): + `/home/coder/demoenv-workspace/usgov-phase2/deploy/observability/`. +- Dashboards present: `dashboards-boundary.yaml` (uid `agent-firewall`), + `dashboards-aibridge.yaml` (uid `ai-gateway`), `dashboards-coder.yaml`. + Datasources: `loki` (Loki), `prometheus` (Prometheus), `aibridge-postgres` + (Coder RDS Postgres, read-only role `grafana_ro`). + +## 1. Objective + +Confirm that the audit and observability surfaces show live data for the +Thursday demo flow: + +1. Agent Firewall egress allow/deny (Boundary), via the `agent-firewall` + Grafana dashboard backed by Loki `boundary_request` events. +2. AI Gateway usage (AI Bridge): providers, interceptions, tokens, and cost, + via the `ai-gateway` dashboard backed by the `aibridge-postgres` datasource. +3. Coder audit log: template pushes, workspace builds, and governance changes + (MCP/spend limits), via the Coder UI `/audit` and API `/api/v2/audit`. + +The deliverable for the executing agent is a pass/fail check against each +surface, plus the one concrete fix in section 7. + +## 2. Boundary (Agent Firewall) dashboard verification + +Dashboard: `dashboards-boundary.yaml`, uid `agent-firewall`, title +"Agent Firewall". Row "Coder Agent Firewall" holds the audit panels; row +"Agent Firewall Operations" holds Prometheus and proxy-log panels. + +### 2a. Confirm Loki ingests coderd logs + +Promtail scrapes all namespaces with no namespace filter (see +`promtail.yaml`, it maps `__meta_kubernetes_namespace` to label `namespace`), +so coderd logs in namespace `coder` are ingested. The audit panels select +`{namespace=~`(coder|coder-workspaces)`}`, which covers coderd. + +Verify ingestion (Grafana Explore, datasource Loki, or LogCLI): + +``` +{namespace=~"(coder|coder-workspaces)"} |= "boundary_request" | logfmt | decision=~"deny|allow" +``` + +Expect non-empty results. 
Boundary is jailing Claude Code in +`firewall-test`, which produces continuous deny events (for example +`api.anthropic.com` and `raw.githubusercontent.com`), and allowed events for +gateway traffic to `dev.usgov.coderdemo.io`. + +### 2b. Panels to check (exact panel titles and queries) + +- "Egress Audit (allow / deny)" (Loki, uid `loki`): + +``` +sum by (decision) (count_over_time({namespace=~`(coder|coder-workspaces)`} |= `boundary_request` | logfmt | decision=~`deny|allow` | owner=~`$owner` | domain=~`$domain` | template_id=~`$template_id` | template_version_id=~`$template_version_id` [$__range])) +``` + +- "Top Allowed Domains" and "Top Denied Domains" (Loki) parse the domain from + `http_url` with `regexp` and `topk(20, sum by (domain) (...))`. +- "Most recent allowed requests" and "Most recent denied requests" (Loki) use + `decision=`allow`` / `decision=`deny`` and `line_format` over fields + `event_time`, `http_method`, `domain`, `path`, `owner`, `workspace_name`, + `template_id`, `template_version_id`. + +Dashboard variables (`domain`, `owner`, `template_id`, `template_version_id`) +are textbox type, default empty. Empty regex matches all, so the allow/deny +panels populate with no variables set. Leave them blank for the demo unless +filtering to `austenplatform`. + +Field dependency to confirm on a real line: the `line_format` and the +domain `topk` panels assume the live `boundary_request` line contains +`owner`, `workspace_name`, and a parseable `http_url`. The emitter in +`boundary_logs.go` writes `decision`, `workspace_id`, `template_id`, +`template_version_id`, `http_method`, `http_url`, `event_time`, and +`matched_rule` (allow only); `owner`/`workspace_name`/`agent_name` are added by +the parent logger. 
Inspect one real line and confirm those fields are present: + +``` +kubectl -n coder logs deploy/coder --since=15m | grep boundary_request | head -3 +``` + +If `owner` or `workspace_name` are absent, the allow/deny counts still work +(missing label matches the empty regex), but the recent-request tables show +blank owner/workspace columns. Record this as an observation, not a blocker. + +### 2c. Generate fresh allow/deny events on demand + +From a workspace terminal on the firewalled template: + +- Deny: `boundary --proxy-port 8091 -- curl https://example.com` +- Allow: `curl https://dev.usgov.coderdemo.io` + +The firewalled template's Claude Code already emits continuous deny events, so +fresh generation is optional for the demo. + +## 3. Prometheus metric-name reconciliation + +Dashboard `dashboards-boundary.yaml` uses +`agent_boundary_log_proxy_batches_forwarded_total` in panels "Total Batches +Forwarded", "Active Firewall Agents", and "Forwarded Batches by Workspace". + +Source of truth (`/home/coder/coder/agent/boundarylogproxy/metrics.go`): + +``` +Namespace: "agent" +Subsystem: "boundary_log_proxy" +Name: "batches_forwarded_total" +``` + +Prometheus joins these as `agent_boundary_log_proxy_batches_forwarded_total`. +Therefore the dashboard name is correct, and the prefix-less spelling +`boundary_log_proxy_batches_forwarded_total` cited in two phase2 docs is wrong. + +Confirm the exported name against the live stack (any one): + +``` +# Prometheus label values +curl -s http:///api/v1/label/__name__/values | jq -r '.data[]' | grep -i boundary + +# coderd aggregated agent metrics (this metric is an agent metric aggregated by coderd) +kubectl -n coder exec deploy/coder -- wget -qO- http://localhost:2112/metrics | grep -i boundary +``` + +Expect `agent_boundary_log_proxy_batches_forwarded_total` (plus +`agent_boundary_log_proxy_batches_dropped_total` and +`agent_boundary_log_proxy_logs_dropped_total`). 
The metric carries labels +`username`, `workspace_name`, `agent_name` from the coderd aggregator, which +the "Forwarded Batches by Workspace" panel groups by (`workspace_name`, +`username`). + +If the live metric name turns out to differ from the source, prefer fixing the +dashboard to match the live name. Based on source, no dashboard change is +expected; the fix belongs in the docs (section 7). + +## 4. AI Bridge (AI Gateway) dashboard verification + +Dashboard: `dashboards-aibridge.yaml`, uid `ai-gateway`, title "AI Gateway". + +### 4a. Confirm the Postgres datasource is connected + +Datasource `aibridge-postgres` (`datasource-aibridge-postgres.yaml`) points to +`usgov-coderdemo-pg...rds.amazonaws.com:5432`, database `coder`, user +`grafana_ro`, password from env `${AIGOV_DB_PASSWORD}` (Secret +`aigov-grafana-db` in namespace `monitoring`). Verify in Grafana: +Connections, Data sources, "AI Gateway DB", Save & test, expect success. + +### 4b. Panels showing live data (Postgres) + +- "Total Interceptions": `SELECT count(*) AS value FROM aibridge_interceptions WHERE $__timeFilter(started_at)` +- "Active Sessions": `count(DISTINCT session_id)` over `aibridge_interceptions` +- "Unique Users": `count(DISTINCT initiator_id)` over `aibridge_interceptions` +- "Interceptions by Provider/Model/User", "Recent Interceptions", "Sessions". + +Usage and cost panels ("Input/Output/Cache/Total Tokens", "Estimated Cost", +"Tokens Over Time", "Estimated Cost Over Time", "Top Users by Usage & Cost", +"Token Usage Detail") read from `aibridge_token_usages` joined to +`ai_model_prices` (71 rows, includes `claude-sonnet-4-5`). Confirm whether +token rows exist; if the Anthropic key in use is a placeholder, these can be +zero by design. Because the gateway has been used this session, confirm live +token/cost data is present and call it out if still zero.
+ +Provider-health stats ("Configured Providers", "Provider Reload Status", +"Last Successful Reload", "Provider Inventory") come from Prometheus +`coder_aibridged_*`; the "AI Gateway Log Stream" and event-rate panels come +from Loki (namespace `coder`). Confirm each row renders without datasource +errors. + +## 5. Coder audit log verification + +UI: open `https://dev.usgov.coderdemo.io/audit` as an admin. API: + +``` +curl -sS -H "Coder-Session-Token: $CODER_SESSION_TOKEN" \ + "https://dev.usgov.coderdemo.io/api/v2/audit?limit=50" | jq '.audit_logs[] | {action, resource_type, time}' +``` + +Confirm the log records the demo-relevant actions: + +- Template pushes / new template versions (resource_type `template` or + `template_version`, action `create`/`write`), including the firewalled + template. +- Workspace builds (resource_type `workspace_build` / `workspace`). +- Governance changes for the demo: MCP server config and spend-limit changes + (filter the UI by the relevant resource type, or grep the API response for + the changed fields). Confirm at least one such entry exists; if none, perform + one change before the demo so it appears. + +Note the audit log (Postgres `audit_logs`) is distinct from the +`boundary_request` application logs in Loki. Both must be checked. + +## 6. Demo-day checklist (5 minutes) + +1. Grafana, dashboard "Agent Firewall": "Egress Audit (allow / deny)" shows + both allow and deny in the last 15m. If flat, run the deny/allow curls in + section 2c. +2. Same dashboard: "Top Denied Domains" lists `api.anthropic.com` / + `raw.githubusercontent.com`; "Most recent denied requests" table populated. +3. Same dashboard: "Total Batches Forwarded" stat is non-zero (Prometheus). +4. Grafana, dashboard "AI Gateway": "Total Interceptions", "Active Sessions", + "Unique Users" non-zero; "Interceptions by Provider" populated. If tokens + were generated, confirm "Estimated Cost" non-zero. +5. 
Coder UI `/audit`: a recent template push and a workspace build are visible. +6. Confirm no panel shows a red datasource error (loki, prometheus, + aibridge-postgres all healthy under Grafana, Connections, Data sources). + +## 7. Concrete fixes found (described only, do not edit) + +One fix, in docs (the dashboard is already correct): + +- File: `deploy/observability/../docs/architecture/agent-firewall-feasibility.md` + (absolute: `/home/coder/demoenv-workspace/usgov-phase2/docs/architecture/agent-firewall-feasibility.md`), + line 101. Replace `boundary_log_proxy_batches_forwarded_total` with + `agent_boundary_log_proxy_batches_forwarded_total`. +- File: + `/home/coder/demoenv-workspace/usgov-phase2/aoi/plan-firewall-and-auth-mcp.md`, + line 131. Same replacement: add the `agent_` prefix so the cited metric + matches the exported name and the dashboard. + +Stale-doc note (optional, lower priority): both +`deploy/observability/AI_GOVERNANCE_DASHBOARD.md` (around lines 138 to 144) and +the header comment of `deploy/observability/dashboards-boundary.yaml` (around +lines 25 to 27) state that `boundary_request` allow/deny events "are not +emitted in this stack yet". That is now false on Coder v2.34.1; coderd emits +them and the `agent-firewall` dashboard's allow/deny panels populate. If time +allows, update that prose to reflect that allow/deny audit is now live. Do not +change any panel JSON; the queries are correct. + +No dashboard JSON edits are required. + +## 8. Risks and open questions + +- Token/cost panels depend on real metered AI traffic. If the Anthropic key is + a placeholder, `aibridge_token_usages` may be empty and cost reads zero by + design. Confirm live token rows exist before relying on cost panels in the + demo. +- `boundary_request` line fields: confirm `owner` and `workspace_name` are on + the live line (section 2b). If absent, recent-request tables show blank + owner/workspace columns; allow/deny counts are unaffected. 
+- Log retention: Loki retention may drop older `boundary_request` lines. + Use a recent time range (last 15m to 1h) for the demo. +- Prometheus scrape of the aggregated agent metric: section 3 assumes coderd + exposes `agent_boundary_*` on its `/metrics`. If the live metric name differs + from source, fix the dashboard to match (not expected based on source). +- The datasource doc references Coder v2.34.0 while the live deployment is + v2.34.1. Cosmetic only; no action required. +- Access: if the executing agent lacks working Grafana/Prometheus/Loki or + kubectl access, treat sections 2 to 5 as steps to run once access is granted + rather than completed checks. + +Generated by Coder Agents. diff --git a/aoi/brief-template-golden-path-e2e.md b/aoi/brief-template-golden-path-e2e.md new file mode 100644 index 0000000..7a6fdd3 --- /dev/null +++ b/aoi/brief-template-golden-path-e2e.md @@ -0,0 +1,215 @@ +# WS-25 Brief: Template Golden-Path End-to-End Verification + +Execution-ready checklist. A parent agent runs this later. Read it in order. +All commands target the live GovCloud demo deployment. + +- Deployment: `https://dev.usgov.coderdemo.io` +- Coder version: v2.34.1 +- Primary org: `coder` (id `5de29a6d-8836-4643-a42b-2cb807c8e3e2`). Other orgs: `alpha`, `bravo`. +- Templates in repo: `/home/coder/demoenv-workspace/usgov-phase2/coder-templates/` + (`ai-agent-generic`, `claude-code`, `cpp-engineer`, `data-scientist`, + `java-engineer`, `platform-engineer`, `firewalled`). `claude-code-ci` is also + registered in Coder. + +Set these shell variables before running steps: + +```bash +CODER_URL="https://dev.usgov.coderdemo.io" +ADMIN_TOKEN="" +ORG_ID="5de29a6d-8836-4643-a42b-2cb807c8e3e2" +``` + +## 1. Objective + +Prove that each demo template builds to a healthy, connected workspace and +passes a basic connectivity check. The goal is to de-risk the live demo's +template flow so that, on demo day, every template starts cleanly and the +agent reports ready.
Success per template means: build job completes, +`latest_build.status` is `running`, the agent is `lifecycle_state=ready` and +`status=connected`, and the connectivity smoke test returns HTTP `200`. + +## 2. The GitLab external-auth gate (read before building anything) + +Every `claude-code`-derived template, and `platform-engineer`, declares: + +```hcl +data "coder_external_auth" "gitlab" { + id = "gitlab" +} +``` + +Declaring this data source without `optional = true` makes the workspace +REQUIRE that the workspace OWNER has completed the in-cluster GitLab OAuth +login before the build will proceed. There is NO device flow: `GET +/api/v2/external-auth/gitlab` returns `"device":false`. The login must happen +once, in a browser, at `https://dev.usgov.coderdemo.io/external-auth/gitlab`. + +Current state observed this session: + +- `admin` is NOT GitLab-authenticated. `GET /api/v2/external-auth/gitlab` + returns `authenticated:false`. An admin-initiated `coder create` against a + gitlab-gated template hangs on "Waiting for Git authentication". +- `austenplatform` IS authenticated (has running claude-code workspaces). + +The provisioner uses the OWNER's GitLab token at build time, not the +requester's token. That fact drives both remediation options below. + +### Remediation A (preferred for templates a human will demo) + +Have the demoing user complete the one-time browser OAuth login at +`https://dev.usgov.coderdemo.io/external-auth/gitlab` while logged in as that +user. After this, that user can `coder create` gitlab-gated templates +normally. Confirm with `GET /api/v2/external-auth/gitlab` returning +`authenticated:true` for that user's token. + +### Remediation B (workaround for automated verification) + +Create the workspace via REST for an owner who is ALREADY authenticated (for +example `austenplatform`). The admin token authorizes the request, but the +build uses the owner's GitLab token, so the gate is satisfied. 
+ +```bash +# Resolve the authenticated owner's user id. +curl -sS -H "Coder-Session-Token: $ADMIN_TOKEN" \ + "$CODER_URL/api/v2/users?q=austenplatform" + +OWNER_ID="" + +curl -sS -X POST \ + -H "Coder-Session-Token: $ADMIN_TOKEN" \ + -H "Content-Type: application/json" \ + "$CODER_URL/api/v2/users/$OWNER_ID/workspaces" \ + -d '{ + "template_id": "