diff --git a/aoi/brief-github-auth-mcp.md b/aoi/brief-github-auth-mcp.md new file mode 100644 index 0000000..5a2d268 --- /dev/null +++ b/aoi/brief-github-auth-mcp.md @@ -0,0 +1,303 @@ +# Brief: Authenticated MCP Server in Coder Agents (GitHub hosted MCP) + +## 1. Objective and demo narrative + +Stand up an authenticated MCP server in Coder Agents on +`https://dev.usgov.coderdemo.io` that demonstrates real authentication plus +need-to-know. The approved backend is GitHub's hosted MCP +(`https://api.githubcopilot.com/mcp/`), accessed read-only with a fine-scoped +GitHub token. Narrative: "Coder Agents reaching an authenticated internal +service. The agent can only call tools the credential is allowed to call, and +each user sees only what their identity can access." Attribution (WS-23) is out +of scope. The single highest risk is a client/server protocol mismatch on +`notifications/initialized` (the 204 gate, see section 3), so verify the gate +before committing the demo to GitHub. + +## 2. Prerequisites + +- Admin Coder session token in `$TOKEN` and `CODER_URL=https://dev.usgov.coderdemo.io`. + Environment and admin token setup is documented elsewhere; assume it is ready. +- A fine-scoped GitHub Personal Access Token (PAT) from the user. Use a throwaway + demo org/repo to keep blast radius small. +- Recommended PAT scopes: + - Fine-grained, read-only: Contents Read, Metadata Read, Issues Read, + Pull Requests Read; optional Actions Read; org Members Read; Email Read. + - Classic alternative: `read:user`, `user:email`, `read:org`, `repo`, paired + with the `X-MCP-Readonly: true` header as defense in depth. +- For Path B only: ability to create a GitHub OAuth App in the chosen org. 
+ +Field reference (verified against `codersdk/mcp.go`, +`CreateMCPServerConfigRequest`): `display_name` (required), `slug` (required), +`description`, `icon_url`, `transport` (required, oneof `streamable_http` `sse`), +`url` (required, url), `auth_type` (required, oneof `none` `oauth2` `api_key` +`custom_headers` `user_oidc`), `oauth2_client_id`, `oauth2_client_secret`, +`oauth2_auth_url`, `oauth2_token_url`, `oauth2_scopes`, `api_key_header`, +`api_key_value`, `custom_headers` (map of string to string), `tool_allow_list`, +`tool_deny_list`, `availability` (required, oneof `force_on` `default_on` +`default_off`), `enabled`, `model_intent`, `allow_in_plan_mode`, +`forward_coder_headers`. The POST returns HTTP 201 with the created object +including `id`. + +## 3. THE GATE: 204 vs 202 (verify FIRST) + +Coder's MCP client is `mark3labs/mcp-go` v0.38.0, which accepts only HTTP 200 or +202 on the `notifications/initialized` POST. GitLab's MCP returned 204 and was +dropped (CODAGT-570). GitHub's status on that notification is unverified, so this +gate decides whether GitHub MCP is usable as-is. + +Most authoritative procedure (register, then read coderd logs): + +1. Mint the PAT (section 2). +2. Register the GitHub MCP in Coder with `api_key` + the PAT (section 4 body). +3. Trigger a connection: open a Coder Agents chat with the server enabled, or + list servers, so coderd attempts to connect. +4. Watch coderd logs for a connection-failure line mentioning status 204: + +```sh +kubectl -n coder logs deploy/coder --tail=400 | \ + grep -iE "skipping MCP server.*connection failure|status 204|notifications/initialized" +``` + +Optional direct probe (confirms GitHub's behavior independent of Coder). 
Read +the status line on the `notifications/initialized` POST: + +```sh +# 1) initialize (capture the Mcp-Session-Id response header if present) +curl -sS -D - -o /dev/null -X POST https://api.githubcopilot.com/mcp/ \ + -H "Authorization: Bearer <PAT>" \ + -H "Content-Type: application/json" \ + -H "Accept: application/json, text/event-stream" \ + -H "X-MCP-Readonly: true" \ + --data '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"gate-check","version":"0.0.1"}}}' + +# 2) notifications/initialized (echo back Mcp-Session-Id from step 1 if returned) +curl -sS -D - -o /dev/null -X POST https://api.githubcopilot.com/mcp/ \ + -H "Authorization: Bearer <PAT>" \ + -H "Content-Type: application/json" \ + -H "Accept: application/json, text/event-stream" \ + -H "X-MCP-Readonly: true" \ + -H "Mcp-Session-Id: <session-id>" \ + --data '{"jsonrpc":"2.0","method":"notifications/initialized","params":{}}' +``` + +Pass/fail decision: + +- Status 200 or 202, and no "skipping MCP server" line: PASS. Proceed with Path A + (or Path B for the per-user headline). +- Status 204, or coderd logs the connection-failure/204 line: FAIL. GitHub MCP is + unusable as-is. Switch to Fallback C (in-boundary datastore MCP), which we + control and can make return 202. + +## 4. Path A (recommended, fastest): api_key + PAT + +Simplest and genuinely authenticated; it is also the same registration that +clears the gate. Caveat: one PAT is one shared identity, so per-user need-to-know +requires either one server per demoed user (per-user PATs) or Path B. + +Exact JSON body.
`api_key_value` is set verbatim, so it MUST include the +`Bearer ` prefix: + +```json +{ + "display_name": "GitHub (Internal Service)", + "slug": "github", + "description": "Read-only GitHub access via GitHub hosted MCP.", + "transport": "streamable_http", + "url": "https://api.githubcopilot.com/mcp/", + "auth_type": "api_key", + "api_key_header": "Authorization", + "api_key_value": "Bearer <PAT>", + "tool_allow_list": [ + "get_me", + "search_repositories", + "get_repository", + "search_code", + "list_issues", + "get_issue", + "list_pull_requests", + "get_pull_request" + ], + "availability": "default_off", + "enabled": true +} +``` + +Register: + +```sh +curl -sS -X POST "$CODER_URL/api/experimental/mcp/servers" \ + -H "Coder-Session-Token: $TOKEN" -H "Content-Type: application/json" \ + --data @path/to/body.json +``` + +X-MCP-Readonly header approach (important). The `api_key` auth type sends exactly +ONE header (`api_key_header`/`api_key_value`). It cannot also send a second static +header such as `X-MCP-Readonly: true`. Per `codersdk/mcp.go`, sending multiple +static headers requires `auth_type: custom_headers` with a `custom_headers` map. +To send both the bearer token and the read-only header, use this body instead: + +```json +{ + "display_name": "GitHub (Internal Service)", + "slug": "github", + "description": "Read-only GitHub access via GitHub hosted MCP.", + "transport": "streamable_http", + "url": "https://api.githubcopilot.com/mcp/", + "auth_type": "custom_headers", + "custom_headers": { + "Authorization": "Bearer <PAT>", + "X-MCP-Readonly": "true" + }, + "tool_allow_list": [ + "get_me", + "search_repositories", + "get_repository", + "search_code", + "list_issues", + "get_issue", + "list_pull_requests", + "get_pull_request" + ], + "availability": "default_off", + "enabled": true +} +``` + +Recommendation: use the `custom_headers` body if you want `X-MCP-Readonly: true` +as defense in depth (preferred).
Use the `api_key` body only if a single header is +acceptable and the PAT scopes alone enforce read-only. Keep `availability` +`default_off` and `enabled` true so the server exists but users opt in per chat. + +## 5. Path B (per-user RBAC headline): manual oauth2 + GitHub OAuth App + +Best per-user need-to-know story: each user clicks Connect once, Coder stores a +per-user GitHub token, and each user sees only what their GitHub identity allows. +GitHub advertises no DCR `registration_endpoint`, so oauth2 MUST be manual +(pre-registered GitHub OAuth App). For manual oauth2, supply ALL of +`oauth2_client_id`, `oauth2_auth_url`, and `oauth2_token_url`, otherwise Coder +attempts auto-DCR (which fails for GitHub). + +Callback sequencing problem: the OAuth App callback must be +`https://dev.usgov.coderdemo.io/api/experimental/mcp/servers/{id}/oauth2/callback`, +but `{id}` does not exist until the Coder MCP row is created. Resolve in this +order: + +1. Create the Coder MCP row first with placeholder oauth2 values so Coder mints + the `{id}` (returned in the 201 response): + +```json +{ + "display_name": "GitHub (Per-User)", + "slug": "github-oauth", + "transport": "streamable_http", + "url": "https://api.githubcopilot.com/mcp/", + "auth_type": "oauth2", + "oauth2_client_id": "placeholder", + "oauth2_client_secret": "placeholder", + "oauth2_auth_url": "https://github.com/login/oauth/authorize", + "oauth2_token_url": "https://github.com/login/oauth/access_token", + "oauth2_scopes": "read:user user:email read:org repo", + "tool_allow_list": ["get_me", "search_repositories", "get_repository", "list_issues", "get_issue"], + "availability": "default_off", + "enabled": false +} +``` + +2. Create (or edit) the GitHub OAuth App and set its Authorization callback URL to + `https://dev.usgov.coderdemo.io/api/experimental/mcp/servers/{id}/oauth2/callback` + using the `{id}` from step 1. +3. 
Patch the Coder row with the real client id/secret and enable it: + +```sh +curl -sS -X PATCH "$CODER_URL/api/experimental/mcp/servers/{id}" \ + -H "Coder-Session-Token: $TOKEN" -H "Content-Type: application/json" \ + --data '{"oauth2_client_id":"<client-id>","oauth2_client_secret":"<client-secret>","enabled":true}' +``` + +4. Each user opens the connect URL + (`$CODER_URL/api/experimental/mcp/servers/{id}/oauth2/connect`) from the chat UI, + authorizes once, and Coder stores their per-user token. Note: oauth2 does not + carry the `X-MCP-Readonly` header; enforce read-only via scopes and + `tool_allow_list`. + +## 6. Fallback C (in-boundary, clean optics): authenticated datastore MCP + +If the gate fails or egress optics must stay inside the GovCloud boundary, add +auth to the existing datastore MCP (`deploy/datastore-mcp`). It currently runs as +`auth_type: none` at +`http://datastore-mcp.coder-demo-mcp.svc.cluster.local:8000/mcp` and is reached +in-cluster. Because we own the code, we control the `notifications/initialized` +response and can guarantee the 202 gate passes. Ranked options: + +1. Manual `oauth2` via Keycloak: real per-user auth, in-boundary, best optics. The + MCP server must validate the access token (issuer, audience, expiry) and map + the subject to authorized rows. Supply Keycloak `oauth2_auth_url`, + `oauth2_token_url`, `oauth2_client_id`, `oauth2_client_secret`, `oauth2_scopes`, + and set the Keycloak client callback to the Coder + `/oauth2/callback` URL for that server `{id}` (same sequencing as Path B). +2. `user_oidc`: Coder forwards the user's OIDC token to the MCP server, which must + verify the audience and enforce per-user access. Less setup than full oauth2, + still per-user. +3. `api_key`: shared static credential, simplest, but a single shared identity (no + per-user need-to-know).
+ +Implementation note: the current datastore server does not validate the inbound +Authorization header (see `server/main.go`), so options 1 and 2 require adding +token verification before they are a true auth demo. Option 3 only requires Coder +to send the header and the server to check it. + +## 7. Verification + +- Connected: re-run the section 3 log grep and confirm NO "skipping MCP server" + line for the slug. Optionally `GET $CODER_URL/api/experimental/mcp/servers` and + confirm the row is present with `enabled: true`. +- Visible to the model: open a Coder Agents chat, enable the server (it is + `default_off`), and confirm the tools appear in the chat tools listing / + model picker as `github__<tool>` (datastore tools appear as `datastore__<tool>`, + same `slug__tool` convention). +- Smoke test (read-only): ask the agent to call a read-only tool, for example + `github__get_me` ("who am I authenticated as?") or + `github__search_repositories` against the throwaway demo org. Confirm it returns + real data and that a write-style tool is absent because it is not in + `tool_allow_list`. + +## 8. Rollback + +- Disable (keep the row): PATCH `enabled:false`. + +```sh +curl -sS -X PATCH "$CODER_URL/api/experimental/mcp/servers/{id}" \ + -H "Coder-Session-Token: $TOKEN" -H "Content-Type: application/json" \ + --data '{"enabled":false}' +``` + +- Delete (remove the row): DELETE returns HTTP 204. + +```sh +curl -sS -X DELETE "$CODER_URL/api/experimental/mcp/servers/{id}" \ + -H "Coder-Session-Token: $TOKEN" +``` + +- Revoke the PAT or the GitHub OAuth App in GitHub after the demo. For Path B, + users can also disconnect their token via + `DELETE $CODER_URL/api/experimental/mcp/servers/{id}/oauth2/disconnect`. + +## 9. Risks and open questions + +- 204 gate (highest risk): if GitHub returns 204 on `notifications/initialized`, + GitHub MCP is unusable as-is and the demo must use Fallback C. Verify before + committing.
+- Egress / optics: GitHub MCP egresses to public GitHub, so packets and tokens + leave the GovCloud boundary even though the narrative says "internal service." + Mitigate with read-only tools, `X-MCP-Readonly: true`, a scoped PAT, and a + throwaway org/repo. If optics must stay in-boundary, make Fallback C primary. +- Shared vs per-user identity: Path A (api_key) is one shared identity. The + per-user need-to-know headline needs Path B (oauth2) or one server per user. +- The MCP servers config is a live, DB-resident object, not in git, so the row + must be recreated by hand if the database is reset. +- Open: which GitHub org/repos for the PAT or OAuth App? Is calling `github.com` + acceptable for demo optics, or must the authenticated MCP stay in-boundary + (then Fallback C is primary)? Auth headline preference: per-user RBAC (oauth2) + or fastest-authenticated (api_key)? + +Generated by Coder Agents. diff --git a/aoi/brief-observability-audit-readiness.md b/aoi/brief-observability-audit-readiness.md new file mode 100644 index 0000000..e149f68 --- /dev/null +++ b/aoi/brief-observability-audit-readiness.md @@ -0,0 +1,261 @@ +# Brief: Observability and Audit Readiness for the Thursday Demo + +Execution-ready verification brief. Read-only. Another agent will execute it. + +Authoritative context (verified this session): + +- Deployment: https://dev.usgov.coderdemo.io, Coder v2.34.1 enterprise, GovCloud + EKS, namespace `coder`. AI Governance add-on entitled (AI Bridge + Boundary). +- Coder Boundary (Agent Firewall) is enabled on a "firewalled" template. A live + jailed workspace `austenplatform/firewall-test` is running. coderd now emits + structured `boundary_request` audit lines (msg=boundary_request), visible via + `kubectl -n coder logs deploy/coder`. Source: + `/home/coder/coder/coderd/agentapi/boundary_logs.go`. 
+- Observability assets base path (this is where the files actually live; the + repo-relative form `deploy/observability/*` is used below): + `/home/coder/demoenv-workspace/usgov-phase2/deploy/observability/`. +- Dashboards present: `dashboards-boundary.yaml` (uid `agent-firewall`), + `dashboards-aibridge.yaml` (uid `ai-gateway`), `dashboards-coder.yaml`. + Datasources: `loki` (Loki), `prometheus` (Prometheus), `aibridge-postgres` + (Coder RDS Postgres, read-only role `grafana_ro`). + +## 1. Objective + +Confirm that the audit and observability surfaces show live data for the +Thursday demo flow: + +1. Agent Firewall egress allow/deny (Boundary), via the `agent-firewall` + Grafana dashboard backed by Loki `boundary_request` events. +2. AI Gateway usage (AI Bridge): providers, interceptions, tokens, and cost, + via the `ai-gateway` dashboard backed by the `aibridge-postgres` datasource. +3. Coder audit log: template pushes, workspace builds, and governance changes + (MCP/spend limits), via the Coder UI `/audit` and API `/api/v2/audit`. + +The deliverable for the executing agent is a pass/fail check against each +surface, plus the one concrete fix in section 7. + +## 2. Boundary (Agent Firewall) dashboard verification + +Dashboard: `dashboards-boundary.yaml`, uid `agent-firewall`, title +"Agent Firewall". Row "Coder Agent Firewall" holds the audit panels; row +"Agent Firewall Operations" holds Prometheus and proxy-log panels. + +### 2a. Confirm Loki ingests coderd logs + +Promtail scrapes all namespaces with no namespace filter (see +`promtail.yaml`, which maps `__meta_kubernetes_namespace` to label `namespace`), +so coderd logs in namespace `coder` are ingested. The audit panels select +``{namespace=~`(coder|coder-workspaces)`}``, which covers coderd. + +Verify ingestion (Grafana Explore, datasource Loki, or LogCLI): + +``` +{namespace=~"(coder|coder-workspaces)"} |= "boundary_request" | logfmt | decision=~"deny|allow" +``` + +Expect non-empty results.
Boundary is jailing Claude Code in +`firewall-test`, which produces continuous deny events (for example +`api.anthropic.com` and `raw.githubusercontent.com`), and allowed events for +gateway traffic to `dev.usgov.coderdemo.io`. + +### 2b. Panels to check (exact panel titles and queries) + +- "Egress Audit (allow / deny)" (Loki, uid `loki`): + +``` +sum by (decision) (count_over_time({namespace=~`(coder|coder-workspaces)`} |= `boundary_request` | logfmt | decision=~`deny|allow` | owner=~`$owner` | domain=~`$domain` | template_id=~`$template_id` | template_version_id=~`$template_version_id` [$__range])) +``` + +- "Top Allowed Domains" and "Top Denied Domains" (Loki) parse the domain from + `http_url` with `regexp` and `topk(20, sum by (domain) (...))`. +- "Most recent allowed requests" and "Most recent denied requests" (Loki) use + `` decision=`allow` `` / `` decision=`deny` `` and `line_format` over fields + `event_time`, `http_method`, `domain`, `path`, `owner`, `workspace_name`, + `template_id`, `template_version_id`. + +Dashboard variables (`domain`, `owner`, `template_id`, `template_version_id`) +are textbox type, default empty. Empty regex matches all, so the allow/deny +panels populate with no variables set. Leave them blank for the demo unless +filtering to `austenplatform`. + +Field dependency to confirm on a real line: the `line_format` and the +domain `topk` panels assume the live `boundary_request` line contains +`owner`, `workspace_name`, and a parseable `http_url`. The emitter in +`boundary_logs.go` writes `decision`, `workspace_id`, `template_id`, +`template_version_id`, `http_method`, `http_url`, `event_time`, and +`matched_rule` (allow only); `owner`/`workspace_name`/`agent_name` are added by +the parent logger.
Inspect one real line and confirm those fields are present: + +``` +kubectl -n coder logs deploy/coder --since=15m | grep boundary_request | head -3 +``` + +If `owner` or `workspace_name` is absent, the allow/deny counts still work +(missing label matches the empty regex), but the recent-request tables show +blank owner/workspace columns. Record this as an observation, not a blocker. + +### 2c. Generate fresh allow/deny events on demand + +From a workspace terminal on the firewalled template: + +- Deny: `boundary --proxy-port 8091 -- curl https://example.com` +- Allow: `curl https://dev.usgov.coderdemo.io` + +The firewalled template's Claude Code already emits continuous deny events, so +fresh generation is optional for the demo. + +## 3. Prometheus metric-name reconciliation + +Dashboard `dashboards-boundary.yaml` uses +`agent_boundary_log_proxy_batches_forwarded_total` in panels "Total Batches +Forwarded", "Active Firewall Agents", and "Forwarded Batches by Workspace". + +Source of truth (`/home/coder/coder/agent/boundarylogproxy/metrics.go`): + +``` +Namespace: "agent" +Subsystem: "boundary_log_proxy" +Name: "batches_forwarded_total" +``` + +Prometheus joins these as `agent_boundary_log_proxy_batches_forwarded_total`. +Therefore the dashboard name is correct, and the prefix-less spelling +`boundary_log_proxy_batches_forwarded_total` cited in two phase2 docs is wrong. + +Confirm the exported name against the live stack (any one): + +``` +# Prometheus label values +curl -s http://<prometheus-host>/api/v1/label/__name__/values | jq -r '.data[]' | grep -i boundary + +# coderd aggregated agent metrics (this metric is an agent metric aggregated by coderd) +kubectl -n coder exec deploy/coder -- wget -qO- http://localhost:2112/metrics | grep -i boundary +``` + +Expect `agent_boundary_log_proxy_batches_forwarded_total` (plus +`agent_boundary_log_proxy_batches_dropped_total` and +`agent_boundary_log_proxy_logs_dropped_total`).
The metric carries labels +`username`, `workspace_name`, `agent_name` from the coderd aggregator, which +the "Forwarded Batches by Workspace" panel groups by (`workspace_name`, +`username`). + +If the live label name turns out to differ from the source, prefer fixing the +dashboard to match the live name. Based on source, no dashboard change is +expected; the fix belongs in the docs (section 7). + +## 4. AI Bridge (AI Gateway) dashboard verification + +Dashboard: `dashboards-aibridge.yaml`, uid `ai-gateway`, title "AI Gateway". + +### 4a. Confirm the Postgres datasource is connected + +Datasource `aibridge-postgres` (`datasource-aibridge-postgres.yaml`) points to +`usgov-coderdemo-pg...rds.amazonaws.com:5432`, database `coder`, user +`grafana_ro`, password from env `${AIGOV_DB_PASSWORD}` (Secret +`aigov-grafana-db` in namespace `monitoring`). Verify in Grafana: +Connections, Data sources, "AI Gateway DB", Save & test, expect success. + +### 4b. Panels showing live data (Postgres) + +- "Total Interceptions": `SELECT count(*) AS value FROM aibridge_interceptions WHERE $__timeFilter(started_at)` +- "Active Sessions": `count(DISTINCT session_id)` over `aibridge_interceptions` +- "Unique Users": `count(DISTINCT initiator_id)` over `aibridge_interceptions` +- "Interceptions by Provider/Model/User", "Recent Interceptions", "Sessions". + +Usage and cost panels ("Input/Output/Cache/Total Tokens", "Estimated Cost", +"Tokens Over Time", "Estimated Cost Over Time", "Top Users by Usage & Cost", +"Token Usage Detail") read from `aibridge_token_usages` joined to +`ai_model_prices` (71 rows, includes `claude-sonnet-4-5`). Confirm whether +token rows exist; if the Anthropic key in use is a placeholder, these can be +zero by design. Because the gateway has been used this session, confirm live +token/cost data is present and call it out if still zero. 
+ +Provider-health stats ("Configured Providers", "Provider Reload Status", +"Last Successful Reload", "Provider Inventory") come from Prometheus +`coder_aibridged_*`; the "AI Gateway Log Stream" and event-rate panels come +from Loki (namespace `coder`). Confirm each row renders without datasource +errors. + +## 5. Coder audit log verification + +UI: open `https://dev.usgov.coderdemo.io/audit` as an admin. API: + +``` +curl -sS -H "Coder-Session-Token: $CODER_SESSION_TOKEN" \ + "https://dev.usgov.coderdemo.io/api/v2/audit?limit=50" | jq '.audit_logs[] | {action, resource_type, time}' +``` + +Confirm the log records the demo-relevant actions: + +- Template pushes / new template versions (resource_type `template` or + `template_version`, action `create`/`write`), including the firewalled + template. +- Workspace builds (resource_type `workspace_build` / `workspace`). +- Governance changes for the demo: MCP server config and spend-limit changes + (filter the UI by the relevant resource type, or grep the API response for + the changed fields). Confirm at least one such entry exists; if none, perform + one change before the demo so it appears. + +Note the audit log (Postgres `audit_logs`) is distinct from the +`boundary_request` application logs in Loki. Both must be checked. + +## 6. Demo-day checklist (5 minutes) + +1. Grafana, dashboard "Agent Firewall": "Egress Audit (allow / deny)" shows + both allow and deny in the last 15m. If flat, run the deny/allow curls in + section 2c. +2. Same dashboard: "Top Denied Domains" lists `api.anthropic.com` / + `raw.githubusercontent.com`; "Most recent denied requests" table populated. +3. Same dashboard: "Total Batches Forwarded" stat is non-zero (Prometheus). +4. Grafana, dashboard "AI Gateway": "Total Interceptions", "Active Sessions", + "Unique Users" non-zero; "Interceptions by Provider" populated. If tokens + were generated, confirm "Estimated Cost" non-zero. +5. 
Coder UI `/audit`: a recent template push and a workspace build are visible. +6. Confirm no panel shows a red datasource error (loki, prometheus, + aibridge-postgres all healthy under Grafana, Connections, Data sources). + +## 7. Concrete fixes found (described only, do not edit) + +One fix, in docs (the dashboard is already correct): + +- File: `deploy/observability/../docs/architecture/agent-firewall-feasibility.md` + (absolute: `/home/coder/demoenv-workspace/usgov-phase2/docs/architecture/agent-firewall-feasibility.md`), + line 101. Replace `boundary_log_proxy_batches_forwarded_total` with + `agent_boundary_log_proxy_batches_forwarded_total`. +- File: + `/home/coder/demoenv-workspace/usgov-phase2/aoi/plan-firewall-and-auth-mcp.md`, + line 131. Same replacement: add the `agent_` prefix so the cited metric + matches the exported name and the dashboard. + +Stale-doc note (optional, lower priority): both +`deploy/observability/AI_GOVERNANCE_DASHBOARD.md` (around lines 138 to 144) and +the header comment of `deploy/observability/dashboards-boundary.yaml` (around +lines 25 to 27) state that `boundary_request` allow/deny events "are not +emitted in this stack yet". That is now false on Coder v2.34.1; coderd emits +them and the `agent-firewall` dashboard's allow/deny panels populate. If time +allows, update that prose to reflect that allow/deny audit is now live. Do not +change any panel JSON; the queries are correct. + +No dashboard JSON edits are required. + +## 8. Risks and open questions + +- Token/cost panels depend on real metered AI traffic. If the Anthropic key is + a placeholder, `aibridge_token_usages` may be empty and cost reads zero by + design. Confirm live token rows exist before relying on cost panels in the + demo. +- `boundary_request` line fields: confirm `owner` and `workspace_name` are on + the live line (section 2b). If absent, recent-request tables show blank + owner/workspace columns; allow/deny counts are unaffected. 
+- Log retention: Loki retention may drop older `boundary_request` lines. + Use a recent time range (last 15m to 1h) for the demo. +- Prometheus scrape of the aggregated agent metric: section 3 assumes coderd + exposes `agent_boundary_*` on its `/metrics`. If the live label name differs + from source, fix the dashboard to match (not expected based on source). +- The datasource doc references Coder v2.34.0 while the live deployment is + v2.34.1. Cosmetic only; no action required. +- Access: if the executing agent lacks working Grafana/Prometheus/Loki or + kubectl access, treat sections 2 to 5 as steps to run once access is granted + rather than completed checks. + +Generated by Coder Agents. diff --git a/aoi/brief-template-golden-path-e2e.md b/aoi/brief-template-golden-path-e2e.md new file mode 100644 index 0000000..7a6fdd3 --- /dev/null +++ b/aoi/brief-template-golden-path-e2e.md @@ -0,0 +1,215 @@ +# WS-25 Brief: Template Golden-Path End-to-End Verification + +Execution-ready checklist. A parent agent runs this later. Read it in order. +All commands target the live GovCloud demo deployment. + +- Deployment: `https://dev.usgov.coderdemo.io` +- Coder version: v2.34.1 +- Primary org: `coder` (id `5de29a6d-8836-4643-a42b-2cb807c8e3e2`). Other orgs: `alpha`, `bravo`. +- Templates in repo: `/home/coder/demoenv-workspace/usgov-phase2/coder-templates/` + (`ai-agent-generic`, `claude-code`, `cpp-engineer`, `data-scientist`, + `java-engineer`, `platform-engineer`, `firewalled`). `claude-code-ci` is also + registered in Coder. + +Set these shell variables before running steps: + +```bash +CODER_URL="https://dev.usgov.coderdemo.io" +ADMIN_TOKEN="" +ORG_ID="5de29a6d-8836-4643-a42b-2cb807c8e3e2" +``` + +## 1. Objective + +Prove that each demo template builds to a healthy, connected workspace and +passes a basic connectivity check. The goal is to de-risk the live demo's +template flow so that, on demo day, every template starts cleanly and the +agent reports ready. 
Success per template means: build job completes, +`latest_build.status` is `running`, the agent is `lifecycle_state=ready` and +`status=connected`, and the connectivity smoke test returns HTTP `200`. + +## 2. The GitLab external-auth gate (read before building anything) + +Every `claude-code`-derived template, and `platform-engineer`, declares: + +```hcl +data "coder_external_auth" "gitlab" { + id = "gitlab" +} +``` + +Declaring this data source without `optional = true` makes the workspace +REQUIRE that the workspace OWNER has completed the in-cluster GitLab OAuth +login before the build will proceed. There is NO device flow: `GET +/api/v2/external-auth/gitlab` returns `"device":false`. The login must happen +once, in a browser, at `https://dev.usgov.coderdemo.io/external-auth/gitlab`. + +Current state observed this session: + +- `admin` is NOT GitLab-authenticated. `GET /api/v2/external-auth/gitlab` + returns `authenticated:false`. An admin-initiated `coder create` against a + gitlab-gated template hangs on "Waiting for Git authentication". +- `austenplatform` IS authenticated (has running claude-code workspaces). + +The provisioner uses the OWNER's GitLab token at build time, not the +requester's token. That fact drives both remediation options below. + +### Remediation A (preferred for templates a human will demo) + +Have the demoing user complete the one-time browser OAuth login at +`https://dev.usgov.coderdemo.io/external-auth/gitlab` while logged in as that +user. After this, that user can `coder create` gitlab-gated templates +normally. Confirm with `GET /api/v2/external-auth/gitlab` returning +`authenticated:true` for that user's token. + +### Remediation B (workaround for automated verification) + +Create the workspace via REST for an owner who is ALREADY authenticated (for +example `austenplatform`). The admin token authorizes the request, but the +build uses the owner's GitLab token, so the gate is satisfied. 
+ +```bash +# Resolve the authenticated owner's user id. +curl -sS -H "Coder-Session-Token: $ADMIN_TOKEN" \ + "$CODER_URL/api/v2/users?q=austenplatform" + +OWNER_ID="" + +curl -sS -X POST \ + -H "Coder-Session-Token: $ADMIN_TOKEN" \ + -H "Content-Type: application/json" \ + "$CODER_URL/api/v2/users/$OWNER_ID/workspaces" \ + -d '{ + "template_id": "