A field lab that teaches the actual skill of going from a vague founder DM ("approvals are killing us") to a buildable tech spec — with a real tools catalog, real tradeoffs, and an ADR for the biggest decision.
The lab focuses on pattern recognition: given a problem, which architectural shape, which patterns, which tools, and what skills do you need? It's the senior-engineer judgment call, not "fill out this template."
source/
guide.md Source guide (markdown)
Problem_to_Tech_Spec_Teaching_Guide.docx Generated from guide.md
app/
index.html Single-page lab (no backend, no build step)
styles.css All styling
content.js 8 stages, tech catalog, Stride case, per-stage evaluators
app.js Renderer, evaluator-driven feedback loop, modes, persistence
cd app
python3 -m http.server 5187
open http://127.0.0.1:5187No npm, no build, no dependencies. Just a static page.
Lab mode (coached). After each submission, an evaluator runs and produces hints if your answer isn't yet at the maturity bar — you retry until it is. Then the model answer + sensitivity scenarios reveal.
Assessment mode (graded). No hints. Submit each stage once and move on. At the end you get a graded report against the rubric.
Both modes share the same 8 stages.
| # | Stage | What you produce | New skill taught |
|---|---|---|---|
| 0 | Receive the brief | 5 clarifying questions | Resist designing — ask outcome/users/non-goals first |
| 1 | Problem brief | Measurable problem statement | Translate "feels slow" into numbers |
| 2 | Recognize the shape | Workflow / RAG / CRUD / search / realtime / batch / analytics / agent | Pattern recognition (senior-engineer skill) |
| 3 | Constraints → architectural responses | Map 8 constraints to 8 patterns | Constraint-driven design |
| 4 | Pick the stack, layer by layer | 4 layers × real tools (Trigger.dev / Postgres / Slack+email / Sentry) with rationale citing requirements | Real-world tool selection |
| 5 | Skills gap + spikes | Team skills inventory; 2 time-boxed spikes for highest-risk gaps | Honest self-assessment, de-risking |
| 6 | NFRs as measurable scenarios | Stimulus / environment / response / measure | Load-testable contracts |
| 7 | Tech spec + ADR | TL;DR + 5-field ADR for biggest decision | Capturing decisions for future-you |
Series-A B2B expense-management startup. The CEO DMs: "Approvals are killing us..."
You walk it up to: shape (workflow), constraints (audit / scale / cost / HITL), stack (Trigger.dev + Postgres events + Slack+email + Sentry), spikes (2 days for HITL signal, 1 day for delivery SLO), NFR (30s p95 delivery at 100/10min), TL;DR + ADR for the workflow-engine choice.
Series-Seed AI customer-support startup. The CEO DMs: "Customers complain that the bot gives wrong pricing answers..."
You walk the same ladder, but the answers are entirely different: shape (RAG), constraints (hallucination / freshness / latency / cost / citations / PII), stack (pgvector + OpenAI 3-large + Claude Sonnet + structural+hybrid+rerank chunking + LiteLLM + Langfuse), spikes (3 days for chunking lift, 2 days for nightly eval pipeline), NFR (≥98% on golden set + <3s p95 + $0.05/answer), TL;DR + ADR for the chunking/retrieval choice.
The AI-specific coaching catches anti-patterns that bite real teams: jumping to "let's swap LLMs" before diagnosing the failure mode (80% of RAG failures are retrieval, not generation), shipping naive 500-token chunks, skipping evals, picking Voyage embeddings for general-domain content where OpenAI 3-large is cheaper at parity.
Workflow shape (Stride): Trigger.dev · Inngest · Temporal · DIY Postgres+cron; Postgres-events · DynamoDB+Streams · Kafka; email-only · Slack+email pluggable · all-channels; logs-only · Sentry+Grafana/Datadog · Honeycomb/OTel.
RAG shape (Ember):
- Retrieval: pgvector · Pinecone · Qdrant · Weaviate
- Embedding: OpenAI text-embedding-3-large · Voyage voyage-3-large · Google text-embedding-005 · BGE-M3 self-hosted
- LLM: Claude Sonnet 4.6 · GPT-5.x · tiered routing · open-weight via vLLM
- Chunking: naive 500-token · structural · structural+hybrid+rerank
- Gateway: direct SDKs · LiteLLM · Portkey · OpenRouter
- Evals: eyeball · Langfuse · Braintrust · Arize Phoenix
Each option carries "fits when / fails when / cost / risk" — so picking is a thinking exercise, not a memory test.
The lab puts the onus on the learner. It scaffolds the structure of an answer (fields, dropdowns, drag-and-drop) but you bring the judgment. Empty text boxes are drop-off traps; per-layer pickers and bucketed prompts keep you moving while making you think.
On submit, the evaluator returns hints at two severity levels:
- MUST — blocks pass. You retry.
- TIP — soft suggestion. You can still pass but you'll know what to improve.
The reveal panels (Lab mode only) show three things:
- Model answer vs. your final attempt
- Sensitivity — what would change my picks if a constraint changed
- Wrong-but-common alternatives with the failure mode named
Persistence is in localStorage; reset wipes the session.