Backend & infrastructure focus. Agent/language-agnostic. Continuously maintained — links verified weekly.
- Building Effective Agents — Anthropic (Schluntz/Zhang). The "workflows vs agents" mental model that everything else builds on.
- Effective context engineering for AI agents — Anthropic. The successor concept to prompt engineering; defines the actual job.
- 12-Factor Agents — Dex Horthy / HumanLayer. Heroku's 12-factor reframed for LLM systems; the canonical "agents are mostly software" doctrine.
- How we built our multi-agent research system — Anthropic. The best single multi-agent case study, with concrete failure modes.
- Don't Build Multi-Agents — Cognition. Read alongside Anthropic's multi-agent research system post above; together they are the productive disagreement at the heart of agent architecture in 2025/26.
- A practical guide to building agents (PDF) — OpenAI. The complementary canonical guide from the other lab.
- What We Learned From a Year of Building with LLMs — Yan/Bischof/Frye/Husain/Liu/Shankar. Tactical → operational → strategic; the field's distilled playbook.
- Building Effective Agents — Anthropic. Workflows (orchestrator/router/parallel/eval-optimizer) vs true agents.
- Effective context engineering for AI agents — Anthropic. Compaction, sub-agents, structured note-taking.
- Writing effective tools for AI agents — Anthropic. Tools as contracts between deterministic and non-deterministic systems; eval-driven tool refinement.
- Effective harnesses for long-running agents — Anthropic. Initializer + worker pattern for cross-context continuity (claude-progress.txt + git).
- How Anthropic teams use Claude Code — Anthropic. Internal-team patterns across infra, security, data science.
- Building agents with the Claude Agent SDK — Anthropic.
- A practical guide to building agents (PDF) — OpenAI. Use cases, design foundations, multi-agent, guardrails.
- 12-Factor Agents — Dex Horthy. Coined "context engineering" in April 2025.
- 12-Factor Agents talk (YouTube) — Dex Horthy. ~30 min version of the README.
- Don't Build Multi-Agents — Cognition. Context fragmentation as the core failure mode.
- Agentic Engineering Patterns — Simon Willison. Living catalog of patterns for code-generating-and-executing agents.
- The lethal trifecta for AI agents — Simon Willison. Private data + untrusted content + exfiltration channel = the agent threat model.
- Simon Willison's ai-agents tag — Best running curation in the field.
- Patterns for Building LLM-based Systems & Products — Eugene Yan. Evals/RAG/Cache/Guardrails/Defensive-UX.
- Building A Generative AI Platform — Chip Huyen. Reference architecture: gateway → routing → cache → guardrails → telemetry.
- Model Context Protocol — Specification — Latest spec; JSON-RPC 2.0; tools/resources/prompts primitives.
- MCP — Architecture overview — Mental model.
- MCP — Authorization — OAuth flow; required reading before exposing MCP servers.
- The 2026 MCP Roadmap — Where the protocol is headed.
- modelcontextprotocol/servers — Official reference implementations.
- wong2/awesome-mcp-servers and appcypher/awesome-mcp-servers — Two best-maintained registries.
- Anthropic — Tool use docs — Schema design, parallel calls, structured outputs.
- Announcing the Agent2Agent Protocol (A2A) — Google. Agent-to-agent (vs agent-to-tool) protocol.
- A2A specification — Now under Linux Foundation; gRPC support since v0.3.
- LangGraph overview — Stateful graph runtime; 1.0 shipped Oct 2025.
- LangGraph — persistence & checkpointing — Threads, checkpointers, cross-thread memory.
- LangGraph — human-in-the-loop — interrupt(), time-travel, approval gates.
- LangGraph — multi-agent systems — Supervisor, swarm, hierarchical patterns.
- OpenAI Agents SDK (Python) — Successor to Swarm.
- openai/openai-agents-python — Source.
- Microsoft Agent Framework — Successor to AutoGen + Semantic Kernel; .NET/Python.
- AutoGen v0.4 — Asynchronous actor-model multi-agent runtime.
- Google ADK (Agent Development Kit) docs — Open-source, multi-language (Python/TS/Go/Java).
- google/adk-python — Source.
- CrewAI docs — Role/task/crew abstraction; lighter than LangGraph.
- Pydantic AI — Type-safe agents with DI; the pleasant Python option.
- smolagents — Hugging Face. Minimalist code-acting agent library.
- Mastra — TypeScript-first.
- Inngest AgentKit — TS framework on top of Inngest's durable runtime.
- Temporal — Build resilient Agentic AI with Temporal — Why agent loops belong in workflow engines.
- Temporal — Durable Execution meets AI — Tools as activities, signals for HITL, child workflows for sub-agents.
- temporal-community/temporal-ai-agent — Reference implementation.
- Inngest — Durable Execution: The Key to Harnessing AI Agents in Production — Step functions wrapping LLM calls.
- Restate — AI agents — Lightweight durable execution; agents as virtual objects.
- Hatchet — Durable Tasks — Postgres-backed task queue with agent-aware patterns (agentic loops, HITL).
- DBOS — Durable Execution for Building Crashproof AI Agents — Postgres-as-runtime; smaller-team alternative to Temporal.
- Anthropic — Memory tool — First-party file-backed memory.
- Letta docs — Production successor to MemGPT; tiered memory (core/archival/recall).
- MemGPT paper — LLMs as Operating Systems — The paged-memory paper that started the wave.
- Mem0 docs — Drop-in memory layer with extraction/consolidation.
- Zep / Graphiti — Bi-temporal knowledge graph for memory with fact validity windows.
- getzep/graphiti — The temporal-graph engine standalone.
- LangMem — Semantic / episodic / procedural primitives over LangGraph stores.
- Generative Agents (Park et al.) — Reflection + episodic memory; still the best single read.
- E2B docs — Firecracker microVM sandboxes; the de-facto hosted choice.
- Modal Sandboxes — gVisor + filesystem snapshots; good for batch fleets.
- Daytona docs — OSS sandbox repurposed for agents; claims sub-200ms cold starts.
- Cloudflare — Containers for Agents — Per-agent containers tied to Durable Objects.
- cloudflare/sandbox-sdk — Reference SDK for spawning sandboxes from Workers.
- apple/container — Native macOS container runtime; useful for local agent dev.
- hyperlight-dev/hyperlight — Microsoft's sub-millisecond WASM/VM micro-sandbox.
- gVisor docs — User-space kernel; understand it before trusting "sandboxed" claims.
- vLLM docs — Highest-throughput OSS inference; PagedAttention + prefix caching.
- SGLang — RadixAttention; great for tool-using agents that share prefixes.
- Hugging Face TGI — Mature self-hosted inference server with constrained decoding.
- LiteLLM — 100+ provider proxy; OpenAI-shaped API; the boring-but-essential routing layer.
- Portkey AI Gateway — OSS gateway with guardrails, caching, conditional routing.
- OpenRouter docs — Hosted multi-provider routing.
- Anthropic — Prompt caching — Cache-key design, 5-minute default TTL; cites up to 85% latency reduction.
- OpenAI — Latency optimization — TTFT vs total time; streaming patterns.
- Your AI Product Needs Evals — Hamel Husain. The canonical "stop vibe-checking, start measuring."
- A Field Guide to Rapidly Improving AI Products — Hamel Husain. Error-analysis loops, eval-driven iteration.
- LLM Evals: Everything You Need to Know — Hamel Husain. FAQ from the Hamel/Shreya course.
- Task-Specific LLM Evals That Do & Don't Work — Eugene Yan. Why ROUGE/BLEU/BERTScore mislead.
- LLM-Evaluators a.k.a. LLM-as-Judge — Eugene Yan. Pairwise vs pointwise, position bias, calibration.
- SPADE (Shankar et al.) — Auto-synthesized assertions from prompt deltas.
- Who Validates the Validators? (EvalGen) — Shankar et al. Critical paper on criteria drift when grading LLM outputs.
- Judging LLM-as-a-Judge (MT-Bench) — The original position/verbosity/self-preference bias paper.
- Low-Hanging Fruit for RAG Search — Jason Liu. Retrieval-side instrumentation.
- Inspect AI — UK AISI. OS. Best-in-class for agent evals; sandboxed tool use, MCP support, used by Anthropic/DeepMind.
- UKGovernmentBEIS/inspect_ai — Source.
- OpenAI Evals — OS. Original registry-of-evals framework.
- Promptfoo — OS + SaaS. YAML matrix testing + red-team module; CI-friendly.
- DeepEval — OS. Pytest-style assertions + 14 default metrics.
- Ragas — OS. RAG-specific metrics standard.
- LangSmith Evaluations — SaaS.
- Braintrust — SaaS. Hill-climbing dev loop with strong DX.
- Arize Phoenix — OS. OTel-native traces + evals.
- Langfuse Evaluations — OS + SaaS.
- Patronus AI — SaaS. Managed judge models (Lynx for hallucination).
- Benchmarks: SWE-bench, SWE-bench Verified, GAIA, τ-bench, WebArena, OSWorld, MLE-bench, SWE-Lancer.
- OpenTelemetry GenAI Semantic Conventions — Foundational. The vendor-neutral schema for LLM/agent spans. Build to this and swap backends.
- OTel GenAI Metrics Spec — Standard metric names (gen_ai.client.token.usage, etc.).
- OpenLLMetry — OS. OTel SDK + auto-instrumentation for LLM/vector/agent libs.
- Langfuse — OS + SaaS. Self-hostable observability + evals + prompt mgmt.
- Arize Phoenix — OS. OpenInference traces; runs locally.
- LangSmith Tracing — SaaS. Framework-agnostic via SDK despite the name.
- Helicone — OS + SaaS. Proxy-based logging — lowest-friction integration.
- Honeycomb — Observability for AI Applications — Best engineering content on high-cardinality LLM tracing.
- Datadog LLM Observability — SaaS. Strongest if already on Datadog.
- Promptfoo in CI (GitHub Action) — Block PRs on eval regressions.
- LangSmith — Online evaluations — Sampling prod traces back into eval datasets (the data flywheel).
- Honeycomb — We shipped AI — Honest postmortem-style writing on shadow traffic + Query Assistant.
- Langfuse — Cost tracking — Per-trace, per-user, per-prompt cost attribution.
- Helicone — Caching dashboards — Per-route token spend + cache hit rates.
- OWASP Top 10 for LLM Applications 2025 — Use as a checklist.
- Embrace The Red — Johann Rehberger. The best running blog on real agent exploits.
- Simon Willison — prompt injection tag — Ongoing curation of every notable incident.
- The lethal trifecta — Simon Willison. The threat model in one essay.
- CaMeL: Defeating Prompt Injections by Design — Google DeepMind. Capabilities-based dual-LLM design; strongest published defense pattern.
- Anthropic Responsible Scaling Policy — Frontier-lab safety framework; a template for your own deployment gates.
- NIST AI RMF + Generative AI Profile — Risk-management vocabulary auditors will use.
- MITRE ATLAS — ATT&CK-style matrix for ML/agent threats.
- Trail of Bits — Prompt injection to RCE in AI agents — Recent, concrete RCE chain.
- Best Practices for Claude Code — Anthropic. CLAUDE.md, tools, harness design.
- Claude Agent SDK overview — Anthropic. Hooks (PreToolUse/PostToolUse/Stop/etc.), tool allowlists, custom tools as in-process MCP.
- anthropics/claude-agent-sdk-python — Source.
- Cognition — How Cognition uses Devin to build Devin — Internal dogfooding patterns.
- Cognition — Multi-Agents: What's Actually Working — The pragmatic update to "Don't Build Multi-Agents."
- Aider blog — Repo-map, edit formats; the leaderboard is one of the best practical evals.
- Sourcegraph Amp — engineering posts — Long-form on tool design and oracle patterns.
- openai/codex — Reference open-source coding-agent CLI.
- Geoffrey Huntley — how to build a coding agent (workshop) — Free workshop on building one from scratch.
- Geoffrey Huntley — Ralph Wiggum loop — The brute-force feedback-loop pattern essay.
- Open SWE: An Open-Source Framework for Internal Coding Agents — LangChain. Open-source SWE-agent framework built on LangGraph; covers core architectural components — task manager, programmer agent, and sandboxed execution — for deploying internal coding agents at scale.
- k8sgpt — CNCF Sandbox. Read-only K8s diagnosis agent; canonical example.
- HolmesGPT — Robusta. On-call investigation agent with runbooks-as-tools.
- Datadog — Bits AI SRE — Datadog's autonomous incident-response agent design.
- Honeycomb — Building an AI Agent for Observability — Hard-won lessons on agents over telemetry data.
- HashiCorp — Terraform MCP server — Reference IaC tool surface.
- Anthropic — How we built our multi-agent research system — Best multi-agent case study, period.
- Anthropic Engineering — Engineering posts from the model-maker — agents, tooling, and operational patterns.
- Cognition blog — Case studies on building Devin; home of "Don't Build Multi-Agents" and its follow-up.
- Simon Willison — The field's running curator — daily-ish takes on LLMs, agents, and prompt injection.
- Embrace The Red — Johann Rehberger's red-team blog — concrete agent exploits and bypass techniques.
- Cloudflare AI agents tag — Agent-runtime and infra posts from Cloudflare's edge platform.
- LangChain blog — Framework updates, multi-agent patterns, and ecosystem news.
- Hamel's blog — Applied LLM engineering and evals from a practitioner perspective.
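
The "lethal trifecta" entries above describe the threat model as private data + untrusted content + an exfiltration channel. A minimal sketch of that rule as a pre-deployment check follows. The `AgentCapabilities` class, its field names, and `has_lethal_trifecta` are hypothetical illustrations, not taken from the essay or any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    """Hypothetical summary of what a single agent session can reach."""
    reads_private_data: bool      # e.g. inbox, internal docs, secrets
    sees_untrusted_content: bool  # e.g. web pages, inbound email, issue text
    can_send_data_out: bool       # e.g. HTTP requests, outbound email, rendered links


def has_lethal_trifecta(caps: AgentCapabilities) -> bool:
    """Return True only when all three risky capabilities are combined."""
    return (
        caps.reads_private_data
        and caps.sees_untrusted_content
        and caps.can_send_data_out
    )


def missing_legs(caps: AgentCapabilities) -> list[str]:
    """List the capabilities that are absent (removing any one breaks the trifecta)."""
    legs = {
        "private data": caps.reads_private_data,
        "untrusted content": caps.sees_untrusted_content,
        "exfiltration channel": caps.can_send_data_out,
    }
    return [name for name, present in legs.items() if not present]


if __name__ == "__main__":
    browsing_assistant = AgentCapabilities(
        reads_private_data=True,
        sees_untrusted_content=True,
        can_send_data_out=True,
    )
    offline_summarizer = AgentCapabilities(
        reads_private_data=True,
        sees_untrusted_content=True,
        can_send_data_out=False,
    )
    print(has_lethal_trifecta(browsing_assistant))
    print(has_lethal_trifecta(offline_summarizer), missing_legs(offline_summarizer))
```

A check like this only flags the combination; deciding which capability to remove for a given agent is still a design decision.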