Agentic Engineering Workflow — A Curated Resource Guide

Backend & infrastructure focus. Agent/language-agnostic. Continuously maintained — links verified weekly.

0. If you only read 7 things

Building Effective Agents — Anthropic (Schluntz/Zhang). The "workflows vs agents" mental model that everything else builds on.
Effective context engineering for AI agents — Anthropic. The successor concept to prompt engineering; defines the actual job.
12-Factor Agents — Dex Horthy / HumanLayer. Heroku's 12-factor reframed for LLM systems; the canonical "agents are mostly software" doctrine.
How we built our multi-agent research system — Anthropic. The best single multi-agent case study, with concrete failure modes.
Don't Build Multi-Agents — Cognition. Read alongside #4 (the Anthropic multi-agent case study directly above): the productive disagreement at the heart of agent architecture in 2025/26.
A practical guide to building agents (PDF) — OpenAI. The complementary canonical guide from the other lab.
What We Learned From a Year of Building with LLMs — Yan/Bischof/Frye/Husain/Liu/Shankar. Tactical → operational → strategic; the field's distilled playbook.

1. Foundational design & "what is an agent"

Building Effective Agents — Anthropic. Workflows (orchestrator/router/parallel/eval-optimizer) vs true agents.
Effective context engineering for AI agents — Anthropic. Compaction, sub-agents, structured note-taking.
Writing effective tools for AI agents — Anthropic. Tools as contracts between deterministic and non-deterministic systems; eval-driven tool refinement.
Effective harnesses for long-running agents — Anthropic. Initializer + worker pattern for cross-context continuity (claude-progress.txt + git).
How Anthropic teams use Claude Code — Anthropic. Internal-team patterns across infra, security, data science.
Building agents with the Claude Agent SDK — Anthropic.
A practical guide to building agents (PDF) — OpenAI. Use cases, design foundations, multi-agent, guardrails.
12-Factor Agents — Dex Horthy. Coined "context engineering" in April 2025.
12-Factor Agents talk (YouTube) — Dex Horthy. ~30 min version of the README.
Don't Build Multi-Agents — Cognition. Context fragmentation as the core failure mode.
Agentic Engineering Patterns — Simon Willison. Living catalog of patterns for code-generating-and-executing agents.
The lethal trifecta for AI agents — Simon Willison. Private data + untrusted content + exfiltration channel = the agent threat model.
Simon Willison's ai-agents tag — Best running curation in the field.
Patterns for Building LLM-based Systems & Products — Eugene Yan. Evals/RAG/Cache/Guardrails/Defensive-UX.
Building A Generative AI Platform — Chip Huyen. Reference architecture: gateway → routing → cache → guardrails → telemetry.

2. Tool integration & MCP

Model Context Protocol — Specification — Latest spec; JSON-RPC 2.0; tools/resources/prompts primitives.
MCP — Architecture overview — Mental model.
MCP — Authorization — OAuth flow; required reading before exposing MCP servers.
The 2026 MCP Roadmap — Where the protocol is headed.
modelcontextprotocol/servers — Official reference implementations.
wong2/awesome-mcp-servers and appcypher/awesome-mcp-servers — The two best-maintained registries.
Anthropic — Tool use docs — Schema design, parallel calls, structured outputs.
Announcing the Agent2Agent Protocol (A2A) — Google. Agent-to-agent (vs agent-to-tool) protocol.
A2A specification — Now under Linux Foundation; gRPC support since v0.3.
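
As a rough illustration of the JSON-RPC 2.0 framing mentioned in the specification entry above, the sketch below builds an MCP `tools/call` request and unpacks a response using only the Python standard library. The tool name `get_pod_logs` and its arguments are hypothetical. A real client would normally use an MCP SDK and complete the initialization handshake first, which is omitted here.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids only need to be unique per session


def make_tool_call(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def parse_response(raw: str, expected_id: int) -> dict:
    """Return the `result` member, or raise if the server sent an error."""
    msg = json.loads(raw)
    if msg.get("jsonrpc") != "2.0" or msg.get("id") != expected_id:
        raise ValueError("not a matching JSON-RPC 2.0 response")
    if "error" in msg:
        raise RuntimeError(f"{msg['error']['code']}: {msg['error']['message']}")
    return msg["result"]


# `get_pod_logs` is a hypothetical tool name used only for illustration.
request = make_tool_call("get_pod_logs", {"namespace": "default", "pod": "api-0"})
wire = json.dumps(request)  # what would be sent over stdio or HTTP
```

Matching on `id` matters because a transport may interleave responses to several in-flight requests.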

3. Multi-agent orchestration frameworks

LangGraph overview — Stateful graph runtime; 1.0 shipped Oct 2025.
LangGraph — persistence & checkpointing — Threads, checkpointers, cross-thread memory.
LangGraph — human-in-the-loop — interrupt(), time-travel, approval gates.
LangGraph — multi-agent systems — Supervisor, swarm, hierarchical patterns.
OpenAI Agents SDK (Python) — Successor to Swarm.
openai/openai-agents-python — Source.
Microsoft Agent Framework — Successor to AutoGen + Semantic Kernel; .NET/Python.
AutoGen v0.4 — Asynchronous actor-model multi-agent runtime.
Google ADK (Agent Development Kit) docs — Open-source, multi-language (Python/TS/Go/Java).
google/adk-python — Source.
CrewAI docs — Role/task/crew abstraction; lighter than LangGraph.
Pydantic AI — Type-safe agents with DI; the pleasant Python option.
smolagents — Hugging Face. Minimalist code-acting agent library.
Mastra — TypeScript-first.
Inngest AgentKit — TS framework on top of Inngest's durable runtime.

4. Durable execution for agents

Temporal — Build resilient Agentic AI with Temporal — Why agent loops belong in workflow engines.
Temporal — Durable Execution meets AI — Tools as activities, signals for HITL, child workflows for sub-agents.
temporal-community/temporal-ai-agent — Reference implementation.
Inngest — Durable Execution: The Key to Harnessing AI Agents in Production — Step functions wrapping LLM calls.
Restate — AI agents — Lightweight durable execution; agents as virtual objects.
Hatchet — Durable Tasks — Postgres-backed task queue with agent-aware patterns (agentic loops, HITL).
DBOS — Durable Execution for Building Crashproof AI Agents — Postgres-as-runtime; smaller-team alternative to Temporal.
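
The entries above share one core idea: record each step's result so that a restarted agent loop replays completed steps instead of re-executing them. The sketch below shows that idea with an in-memory journal. `Journal`, `step`, and `fake_llm_call` are hypothetical names, not the API of any engine listed here. Real engines persist the journal and also handle retries, timers, and versioning.

```python
import json
from typing import Any, Callable


class Journal:
    """Hypothetical step journal; a real engine would persist this durably."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}

    def step(self, step_id: str, fn: Callable[[], Any]) -> Any:
        # Replay: a step that already completed returns its recorded result.
        if step_id in self._results:
            return json.loads(self._results[step_id])
        result = fn()
        # Store as JSON so results survive serialization, as they would on disk.
        self._results[step_id] = json.dumps(result)
        return result


calls = {"llm": 0}


def fake_llm_call() -> dict:
    """Stand-in for a paid, non-deterministic model call."""
    calls["llm"] += 1
    return {"plan": ["check pods", "read logs"]}


def agent_run(journal: Journal) -> list:
    plan = journal.step("plan", fake_llm_call)
    return [
        journal.step(f"tool:{i}", lambda a=action: f"done: {a}")
        for i, action in enumerate(plan["plan"])
    ]


journal = Journal()
first = agent_run(journal)   # executes every step
second = agent_run(journal)  # simulated restart: replays from the journal
```

Because the model call is journaled, the second run neither pays for it again nor risks getting a different plan than the one already acted on.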

5. Memory systems

Anthropic — Memory tool — First-party file-backed memory.
Letta docs — Production successor to MemGPT (the project was renamed); tiered memory (core/archival/recall).
MemGPT paper — LLMs as Operating Systems — The paged-memory paper that started the wave.
Mem0 docs — Drop-in memory layer with extraction/consolidation.
Zep / Graphiti — Bi-temporal knowledge graph for memory with fact validity windows.
getzep/graphiti — The temporal-graph engine standalone.
LangMem — Semantic / episodic / procedural primitives over LangGraph stores.
Generative Agents (Park et al.) — Reflection + episodic memory; still the best single read.

6. Sandboxing & code execution

E2B docs — Firecracker microVM sandboxes; the de-facto hosted choice.
Modal Sandboxes — gVisor + filesystem snapshots; good for batch fleets.
Daytona docs — OSS sandbox repurposed for agents; sub-200ms cold start claims.
Cloudflare — Containers for Agents — Per-agent containers tied to Durable Objects.
cloudflare/sandbox-sdk — Reference SDK for spawning sandboxes from Workers.
apple/container — Native macOS container runtime; useful for local agent dev.
hyperlight-dev/hyperlight — Microsoft's sub-millisecond WASM/VM micro-sandbox.
gVisor docs — User-space kernel; understand it before trusting "sandboxed" claims.

7. Inference & gateway infrastructure

vLLM docs — Highest-throughput OSS inference; PagedAttention + prefix caching.
SGLang — RadixAttention; great for tool-using agents that share prefixes.
Hugging Face TGI — Mature self-hosted with constrained decoding.
LiteLLM — 100+ provider proxy; OpenAI-shaped API; the boring-but-essential routing layer.
Portkey AI Gateway — OSS gateway with guardrails, caching, conditional routing.
OpenRouter docs — Hosted multi-provider routing.
Anthropic — Prompt caching — Cache-key design, 5-min default TTL, and Anthropic's cited figure of up to 85% latency reduction.
OpenAI — Latency optimization — TTFT vs total time; streaming patterns.

8. Evaluation — philosophy (read these first)

Your AI Product Needs Evals — Hamel Husain. The canonical "stop vibe-checking, start measuring."
A Field Guide to Rapidly Improving AI Products — Hamel Husain. Error-analysis loops, eval-driven iteration.
LLM Evals: Everything You Need to Know — Hamel Husain. FAQ from the Hamel/Shreya course.
Task-Specific LLM Evals That Do & Don't Work — Eugene Yan. Why ROUGE/BLEU/BERTScore mislead.
LLM-Evaluators a.k.a. LLM-as-Judge — Eugene Yan. Pairwise vs pointwise, position bias, calibration.
SPADE (Shankar et al.) — Auto-synthesized assertions from prompt deltas.
Who Validates the Validators? (EvalGen) — Shankar et al. Critical paper on grader drift.
Judging LLM-as-a-Judge (MT-Bench) — The original position/verbosity/self-preference bias paper.
Low-Hanging Fruit for RAG Search — Jason Liu. Retrieval-side instrumentation.
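
One common mitigation for the position bias mentioned in the LLM-as-judge entries above is to run a pairwise judge in both presentation orders and accept a verdict only when the two runs agree. The sketch below assumes a hypothetical `judge` callable that returns `"first"` or `"second"`. The two toy judges stand in for real model calls and are not taken from any of the linked posts.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def debiased_pairwise(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    forward = judge(prompt, a, b)    # a shown first
    backward = judge(prompt, b, a)   # b shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # verdict flipped with order: treat as position bias


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """Toy judge that is order-independent (but verbosity-biased)."""
    return "first" if len(first) >= len(second) else "second"
```

The swap only removes order effects. The `longer_wins` judge passes it cleanly while still being biased toward verbosity, which is why calibration against human labels remains necessary.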

9. Evaluation — frameworks & benchmarks

Inspect AI — UK AISI. OS. Best-in-class for agent evals; sandboxed tool use, MCP support, used by Anthropic/DeepMind.
UKGovernmentBEIS/inspect_ai — Source.
OpenAI Evals — OS. Original registry-of-evals framework.
Promptfoo — OS + SaaS. YAML matrix testing + red-team module; CI-friendly.
DeepEval — OS. Pytest-style assertions + 14 default metrics.
Ragas — OS. RAG-specific metrics standard.
LangSmith Evaluations — SaaS.
Braintrust — SaaS. Hill-climbing dev loop with strong DX.
Arize Phoenix — OS. OTel-native traces + evals.
Langfuse Evaluations — OS + SaaS.
Patronus AI — SaaS. Managed judge models (Lynx for hallucination).
Benchmarks: SWE-bench, SWE-bench Verified, GAIA, τ-bench, WebArena, OSWorld, MLE-bench, SWE-Lancer.

10. Observability & tracing

OpenTelemetry GenAI Semantic Conventions — Foundational. The vendor-neutral schema for LLM/agent spans. Build to this and swap backends.
OTel GenAI Metrics Spec — Standard metric names (gen_ai.client.token.usage, etc.).
OpenLLMetry — OS. OTel SDK + auto-instrumentation for LLM/vector/agent libs.
Langfuse — OS + SaaS. Self-hostable observability + evals + prompt mgmt.
Arize Phoenix — OS. OpenInference traces; runs locally.
LangSmith Tracing — SaaS. Framework-agnostic via SDK despite the name.
Helicone — OS + SaaS. Proxy-based logging — lowest-friction integration.
Honeycomb — Observability for AI Applications — Best engineering content on high-cardinality LLM tracing.
Datadog LLM Observability — SaaS. Strongest if already on Datadog.

11. Production testing patterns & cost/latency

Promptfoo in CI (GitHub Action) — Block PRs on eval regressions.
LangSmith — Online evaluations — Sampling prod traces back into eval datasets (the data flywheel).
Honeycomb — We shipped AI — Honest postmortem-style writing on shadow traffic + Query Assistant.
Langfuse — Cost tracking — Per-trace, per-user, per-prompt cost attribution.
Helicone — Caching dashboards — Per-route token spend + cache hit rates.

12. Security for agents

OWASP Top 10 for LLM Applications 2025 — Use as a checklist.
Embrace The Red — Johann Rehberger. The best running blog on real agent exploits.
Simon Willison — prompt injection tag — Ongoing curation of every notable incident.
The lethal trifecta — Simon Willison. The threat model in one essay.
CaMeL: Defeating Prompt Injections by Design — Google DeepMind. Capabilities-based dual-LLM design; strongest published defense pattern.
Anthropic Responsible Scaling Policy — Frontier-lab safety framework; a template for your own deployment gates.
NIST AI RMF + Generative AI Profile — Risk-management vocabulary auditors will use.
MITRE ATLAS — ATT&CK-style matrix for ML/agent threats.
Trail of Bits — Prompt injection to RCE in AI agents — Recent, concrete RCE chain.
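
The lethal-trifecta framing above can be turned into a coarse configuration check: flag any agent whose combined tool set covers all three legs, even when no single tool does. This is a minimal sketch, assuming a human reviewer assigns the capability flags. `ToolProfile` and the tool names are hypothetical, and a passing check is not evidence that an agent is safe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolProfile:
    """Hypothetical per-tool capability flags, assigned by a human reviewer."""
    name: str
    reads_private_data: bool = False
    ingests_untrusted_content: bool = False
    can_exfiltrate: bool = False


def has_lethal_trifecta(tools: list[ToolProfile]) -> bool:
    """True if the combined tool set covers all three legs."""
    return (
        any(t.reads_private_data for t in tools)
        and any(t.ingests_untrusted_content for t in tools)
        and any(t.can_exfiltrate for t in tools)
    )


# Illustrative tool sets; the names are made up for this example.
triage_agent = [
    ToolProfile("read_internal_wiki", reads_private_data=True),
    ToolProfile("read_ticket_body", ingests_untrusted_content=True),
]
risky_agent = triage_agent + [ToolProfile("send_webhook", can_exfiltrate=True)]
```

A check like this fits naturally in CI next to the tool allowlist, so adding a third leg to an agent requires an explicit review.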

13. Coding agent infrastructure (read for harness design even if not building one)

Best Practices for Claude Code — Anthropic. CLAUDE.md, tools, harness design.
Claude Agent SDK overview — Anthropic. Hooks (PreToolUse/PostToolUse/Stop/etc.), tool allowlists, custom tools as in-process MCP.
anthropics/claude-agent-sdk-python — Source.
Cognition — How Cognition uses Devin to build Devin — Internal dogfooding patterns.
Cognition — Multi-Agents: What's Actually Working — The pragmatic update to "Don't Build Multi-Agents."
Aider blog — Repo-map, edit formats; the leaderboard is one of the best practical evals.
Sourcegraph Amp — engineering posts — Long-form on tool design and oracle patterns.
openai/codex — Reference open-source coding-agent CLI.
Geoffrey Huntley — how to build a coding agent (workshop) — Free workshop on building one from scratch.
Geoffrey Huntley — Ralph Wiggum loop — The brute-force feedback-loop pattern essay.
Open SWE: An Open-Source Framework for Internal Coding Agents — LangChain. Open-source SWE-agent framework built on LangGraph; covers core architectural components — task manager, programmer agent, and sandboxed execution — for deploying internal coding agents at scale.

14. Backend-specific agent patterns (SRE, K8s, IaC)

k8sgpt — CNCF Sandbox. Read-only K8s diagnosis agent; canonical example.
HolmesGPT — Robusta. On-call investigation agent with runbooks-as-tools.
Datadog — Bits AI SRE — Datadog's autonomous incident-response agent design.
Honeycomb — Building an AI Agent for Observability — Hard-won lessons on agents over telemetry data.
HashiCorp — Terraform MCP server — Reference IaC tool surface.
Anthropic — How we built our multi-agent research system — Best multi-agent case study, period.

Worth following for ongoing signal

Anthropic Engineering — Engineering posts from the model-maker — agents, tooling, and operational patterns.
Cognition blog — Case studies on building Devin; home of "Don't Build Multi-Agents" and its pragmatic follow-up.
Simon Willison — The field's running curator — daily-ish takes on LLMs, agents, and prompt injection.
Embrace The Red — Johann Rehberger's red-team blog — concrete agent exploits and bypass techniques.
Cloudflare AI agents tag — Agent-runtime and infra posts from Cloudflare's edge platform.
LangChain blog — Framework updates, multi-agent patterns, and ecosystem news.
Hamel's blog — Applied LLM engineering and evals from a practitioner perspective.
