A local playground for practicing AI-augmented observability, incident response, and custom MCP development with OpenCode. The whole stack runs on your laptop via Podman.
The workshop runs in two slots of two hours each. By the end you will have:
- Used AI agents with pre-built MCPs and skills on a real observability stack
- Customized a skill
- Built your own TypeScript MCP
- Posted an AI-generated incident ticket into a real GitHub Project
You will need:

- macOS (Apple Silicon or Intel)
- Podman 5+ with a running machine (`podman machine list` shows `Currently running`)
- podman-compose (`brew install podman-compose`)
- Node 24 via nvm (`nvm install 24`)
- Git, authenticated to push to `squer-solutions`
- OpenCode configured for cortecs.ai (set up in the prep session)
- For Slot 2 only: a GitHub PAT with `repo` + `project` scopes — create one at https://github.com/settings/tokens and export it:

  ```sh
  export GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxxxxxx
  ```
```sh
git clone git@github.com:squer-solutions/ai-in-devops-workshop.git
cd ai-in-devops-workshop
cp .env.example .env   # no edits needed for Slot 1
nvm use                # picks up .nvmrc → Node 24
npm install

cd compose
podman-compose pull    # ~2-3 GB, do this ahead of time
podman-compose up -d
sleep 20
curl http://localhost:8080/health
# → {"status":"ok",...}
open http://localhost:3000   # Grafana (anonymous editor)
```

Point OpenCode at `opencode/opencode.json` via your OpenCode settings.
If things are slow, run the lite profile instead:
```sh
podman-compose -f docker-compose.yml -f docker-compose.lite.yml up -d
```

In Slot 1 you will:

- Read `scenarios/scenario-a-checkout-slow.md`
- Run the `investigate-latency` skill against the injected incident
- Customize the skill (your facilitator will outline options)
In Slot 2 you will:

- `cd slot2-scaffold && npm install`
- Read `slot2-scaffold/README.md`
- Implement `create_incident_issue` and `add_issue_to_project`
- Plug your built MCP into OpenCode
- Use your `incident-commander` skill during the live drill
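Much of `create_incident_issue` boils down to one GitHub REST call (`POST /repos/{owner}/{repo}/issues`). A minimal sketch of the request construction, assuming you call the REST API with plain `fetch` — `buildIssueRequest` and the `"incident"` label are illustrative names, not part of the scaffold:

```typescript
// Hypothetical request builder for create_incident_issue (illustrative only).
function buildIssueRequest(repo: string, title: string, body: string, token: string) {
  return {
    // POST /repos/{owner}/{repo}/issues is the REST endpoint for creating issues
    url: `https://api.github.com/repos/${repo}/issues`,
    init: {
      method: "POST",
      headers: {
        authorization: `Bearer ${token}`,
        accept: "application/vnd.github+json",
        "content-type": "application/json",
      },
      body: JSON.stringify({ title, body, labels: ["incident"] }),
    },
  };
}

// Usage (needs the GITHUB_TOKEN from the prerequisites):
// const req = buildIssueRequest("owner/repo", "Checkout latency spike", report, process.env.GITHUB_TOKEN!);
// const issue = await (await fetch(req.url, req.init)).json();
```

Note that GitHub Projects (v2) has no REST surface, so `add_issue_to_project` will need the GraphQL API (the `addProjectV2ItemById` mutation) rather than a second REST call.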
```sh
./scenarios/inject.sh claims-api slow-db         # enable a mode
./scenarios/inject.sh claims-api                 # clear
./scenarios/inject.sh claims-worker queue-backup
```

Modes: `cpu-hog`, `slow-db`, `memory-leak`, `error-spike`, `queue-backup`, `db-conn-leak`.
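Under the hood, `inject.sh` presumably just POSTs the mode to the target service's `/chaos` endpoint (ports taken from this README's health check and worker notes). A sketch — `chaosRequest` is an illustrative helper, not part of the repo:

```typescript
// Map each target to its admin port: claims-api health-checks on 8080,
// and the worker exposes the same POST /chaos on 8081.
const CHAOS_PORTS = { "claims-api": 8080, "claims-worker": 8081 } as const;

function chaosRequest(target: keyof typeof CHAOS_PORTS, mode?: string) {
  return {
    url: `http://localhost:${CHAOS_PORTS[target]}/chaos`,
    method: "POST" as const,
    headers: { "content-type": "application/json" },
    // an empty body clears the current mode, mirroring inject.sh's clear form
    body: mode ? JSON.stringify({ mode }) : undefined,
  };
}

// Usage (only works with the stack running):
// const req = chaosRequest("claims-api", "slow-db");
// await fetch(req.url, req);
```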
To tear everything down:

```sh
cd compose
podman-compose down -v
```

```
app/              # claims-api + claims-worker (Node/TypeScript)
compose/          # podman-compose stack + observability configs
opencode/         # OpenCode MCP wiring
skills/           # pre-built Slot 1 skills
scenarios/        # chaos injector + scenario briefs
slot2-scaffold/   # starter TypeScript MCP for Slot 2
```
The sample domain is deliberately small so it's easy to hold in your head. Two TypeScript services, one Postgres DB, one Redis queue.
```
client ──POST /claims──▶ claims-api ──INSERT──▶ postgres (status=pending)
                              │
                              └──LPUSH──▶ redis (claims:queue)
                                              │
                                              └──BRPOP──▶ claims-worker ──UPDATE──▶ postgres (status=approved)
```
- A client posts a claim to `claims-api` (`POST /claims`).
- `claims-api` inserts a row into Postgres with `status='pending'` and enqueues a job on the Redis list `claims:queue`.
- `claims-worker` blocks on `BRPOP claims:queue`, dequeues a job, and updates the DB row to `status='approved'`.
- A follow-up `GET /claims/:id` returns the final row.
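The flow above can be sketched end-to-end with in-memory stand-ins — an array for the Redis list and a `Map` for the Postgres table. This is purely illustrative; the real services use `ioredis` and `pg`:

```typescript
type Status = "pending" | "approved";
interface Claim { id: number; customerId: string; amountCents: number; status: Status }

const db = new Map<number, Claim>();  // stand-in for the Postgres claims table
const queue: number[] = [];           // stand-in for the Redis list claims:queue
let nextId = 1;

// claims-api: POST /claims — insert a pending row, then enqueue the job
function postClaim(customerId: string, amountCents: number): Claim {
  const claim: Claim = { id: nextId++, customerId, amountCents, status: "pending" };
  db.set(claim.id, claim);
  queue.unshift(claim.id);            // LPUSH adds at the head of the list
  return claim;
}

// claims-worker: BRPOP claims:queue — take from the tail, mark approved
function workOnce(): boolean {
  const id = queue.pop();             // BRPOP removes from the tail
  if (id === undefined) return false;
  db.get(id)!.status = "approved";
  return true;
}

const { id } = postClaim("c1", 100);
workOnce();
console.log(db.get(id)!.status);      // → "approved"
```

The head/tail pairing is the point: LPUSH plus BRPOP gives FIFO ordering, so claims are approved in the order they arrive.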
Both services, claims-api and claims-worker, are instrumented via the OpenTelemetry Node SDK, which auto-patches `http`, `pg`, and `ioredis`, so HTTP, Postgres, and Redis spans come for free. Metrics and traces flow through the OTel collector to Prometheus and Tempo. Logs are JSON lines written to a shared volume; the collector's filelog receiver tails them and forwards them to Loki.
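This kind of setup is usually wired with a small bootstrap file loaded before the app's own imports. A minimal sketch, assuming the standard OTel packages and the collector's OTLP/HTTP port listed below — not the repo's actual file:

```typescript
// tracing.ts — hypothetical bootstrap; must run before the app's own imports
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces", // the collector's OTLP/HTTP port
  }),
  // the auto-instrumentations bundle patches http, pg, and ioredis (among others)
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```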
| Endpoint | Purpose |
|---|---|
| `GET /health` | Liveness — returns `{status: "ok"}`. Used by dashboards and readiness checks. |
| `POST /claims` | Body `{customerId, amountCents, description}`. Writes to Postgres, enqueues a Redis job, returns `{id, status: "pending"}`. |
| `GET /claims/:id` | Fetches a claim row. |
| `POST /chaos` | Facilitator-only. Body `{mode: "<chaos-mode>"}` turns a failure mode on; empty body clears it. The current mode lives in process memory. |
Every request emits a pino log line with `reqId`, URL, status, and `responseTime` — you'll see these in Loki.
Runs a `BRPOP` loop on `claims:queue`. For each job, it simulates a "fraud check" (a short computation) and marks the claim approved. Emits "processed" log lines with the claim ID.

Port 8081 exposes the same `POST /chaos` endpoint so the worker can be targeted independently (e.g. `queue-backup` mode only makes sense on the worker).
| Mode | What it does | What you see |
|---|---|---|
| `slow-db` | Adds an 800 ms sleep to every request handler | p95 latency doubles; request logs show `responseTime ≈ 800` |
| `error-spike` | ~30% of requests throw mid-handler | 5xx rate climbs; `chaos: random error-spike` lines appear in Loki |
| `cpu-hog` | Burns CPU on every event-loop tick | Container CPU rises; p95 drifts up under load |
| `memory-leak` | Allocates 10 MB/tick, never frees | Container memory climbs linearly; eventually OOMs |
| `queue-backup` | Worker drops jobs on the floor (target `claims-worker`) | `claims:queue` depth grows; claims stay `pending` in the DB |
| `db-conn-leak` | Leaks one pooled connection per request | Slow degradation → `too many connections` errors |
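The README pins down only a few implementation details: the mode lives in process memory, `slow-db` sleeps 800 ms, and `error-spike` throws on ~30% of requests. One way those could be wired — purely a sketch, not the repo's actual code:

```typescript
// Illustrative chaos hook; the real handlers may be structured differently.
let chaosMode: string | undefined;    // set by POST /chaos, lives in process memory

function setChaos(body: { mode?: string }) {
  chaosMode = body.mode;              // an empty body clears the mode
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function withChaos<T>(work: () => Promise<T>): Promise<T> {
  if (chaosMode === "slow-db") await sleep(800);            // +800 ms per request
  if (chaosMode === "error-spike" && Math.random() < 0.3) { // ~30% of requests
    throw new Error("chaos: random error-spike");
  }
  return work();
}
```

Keeping the mode in process memory is why restarting a container (or `podman-compose up -d --force-recreate`) also clears any injected chaos.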
Inject with:

```sh
./scenarios/inject.sh <claims-api|claims-worker> <mode>
./scenarios/inject.sh claims-api   # empty mode = clear
```

| Service | Role | URL |
|---|---|---|
| `otel-collector` | Single pipeline for metrics, traces, logs | `:4318` (OTLP/HTTP) |
| `prometheus` | Scrapes the collector's `:8889/metrics` endpoint | http://localhost:9090 |
| `loki` | Log storage, queried via LogQL | http://localhost:3100 |
| `tempo` | Trace storage | http://localhost:3200 |
| `grafana` | Dashboards + unified query UI | http://localhost:3000 (admin / workshop-grafana-admin) |
Dashboards are auto-provisioned under the "Workshop" folder: App Overview (HTTP signals + recent logs) and Services Health (CPU/memory + worker rate).
Dashboards are empty when no traffic is flowing. To see live numbers:
```sh
for i in $(seq 1 60); do
  curl -s -X POST http://localhost:8080/claims \
    -H "content-type: application/json" \
    -d '{"customerId":"c1","amountCents":100,"description":"warmup"}' > /dev/null
  sleep 0.5
done
```

The grafana MCP in `opencode/opencode.json` runs the official `docker.io/mcp/grafana` image. It needs to call Grafana's HTTP API to execute PromQL/LogQL/TraceQL. For the workshop we use HTTP basic auth with the admin account configured in compose:
```sh
GRAFANA_USERNAME=admin
GRAFANA_PASSWORD=workshop-grafana-admin
```
Both values are baked into opencode/opencode.json — there's nothing to export and no service-account token to generate. This is fine for a local workshop but don't reuse these creds anywhere real.
In production you would instead provision a Grafana service account and pass its token via `GRAFANA_SERVICE_ACCOUNT_TOKEN`. The MCP supports that out of the box; we just don't need it here.
Ask your facilitator first. For common setup pain points, they have a troubleshooting cheat sheet.