A 12-week curriculum for building, breaking, and understanding the foundational systems that everything else depends on.
This is not a tutorial. It is a disciplined encounter with primary sources, hand-written code, and the historical constraints that shaped the systems we now take for granted. Each phase produces working code, a break-it report, and an article. Together they form an educational resource grounded in direct experience rather than borrowed abstractions.
Anyone can follow this curriculum. The rules are the same for everyone.
No generative AI in the build process. You may use AI to discuss concepts, clarify questions, or rubber-duck a problem. You do not use it to write code, generate boilerplate, autocomplete functions, or scaffold projects. The point is to force your own thinking through the resistance of the material. If you cannot write it, you do not understand it.
Read primary sources before building. Every phase has a required reading list of RFCs, specifications, papers, and reference implementations. Read them before you write your first line of code. Not skim. Read. Take notes by hand if it helps. The goal is to understand the design decisions and constraints that shaped the systems you are about to rebuild.
Understand the history. Every system was built under specific constraints: the hardware that existed, the networks that were available, the threat models that mattered, the assumptions that seemed reasonable at the time. Many of those constraints no longer hold. Understanding what has changed is where insight lives. Each phase includes a historical context section. Engage with it seriously.
Hand-code everything. No frameworks, no scaffolding tools, no starter templates. You write your Makefile or build script. You write your data structures. You write your tests. When you need a library (e.g., a crypto library for SHA-256, libpcap for packet capture), that is fine, but you understand its interface thoroughly before calling it, and you never use a library to avoid understanding the concept it implements.
Test rigorously. Write tests as you go, not after. When something breaks, resist the urge to fiddle until it works. Read the error, form a hypothesis, verify it, then fix it. That cycle is the actual learning.
Build, break, fix. Every phase ends with a break-it day. You spend that day attacking what you built: finding vulnerabilities, exploiting edge cases, gaming metrics, crashing systems. The break-it day is not optional. It is where the deepest learning concentrates. If you cannot break what you built, you either built something robust (unlikely in 2-3 weeks) or you do not yet understand where it is fragile (much more likely).
Commit with discipline. Every commit message explains what changed and why. Not "fixed bug" but "fixed off-by-one in page table index calculation: was mapping virtual pages to wrong physical frames because index started at 1 instead of 0." Your commit history should be readable as a narrative of your learning.
Write about what you learn. One substantial article per phase. The article is the argument. The code is the evidence. Together they form a complete resource.
This is a single repository. Each phase is a directory. Your reading notes, architectural decisions, break-it reports, and articles live alongside the code they describe.
systems-depth/
    README.md                  # This document
    CONTRIBUTORS.md            # Who is doing this, what phase they are on
    phase-1-storage/
        README.md              # Phase overview and what you learned
        docs/
            HISTORICAL-CONTEXT.md  # The era, constraints, and choices that shaped these systems
            READING-NOTES.md       # Your notes on primary sources, in your own words
            DECISIONS.md           # Architectural Decision Records
            BREAK-IT.md            # Break-it day findings
        src/
        tests/
        Makefile
        article.md             # The published writeup for this phase
    phase-2-networking/
        [same structure]
    phase-3-kernel/
        [same structure]
    phase-4-eval-harness/
        [same structure]
    phase-5-llm-agent/
        [same structure]
Architectural Decision Records (ADRs): For every non-trivial design choice, write a short record: what the decision was, what alternatives you considered, why you chose what you chose, and what tradeoffs you accepted.
## ADR-001: Hash function selection for content addressing
**Context:** Need a hash function for content-addressable storage keys.
**Decision:** Using SHA-256 via [library].
**Alternatives considered:** SHA-1 (collision vulnerability), BLAKE2 (faster but less ubiquitous), SHA-3 (newer but less ecosystem support).
**Rationale:** SHA-256 is the standard for content addressing in systems I want to
interoperate with (Git, Bitcoin, IPFS). Speed is not the bottleneck for this project.
**Tradeoffs accepted:** Slower than BLAKE2. Acceptable for this use case.
Reading notes: For every primary source you read, write a short summary in your own words: what the document says, what design constraints it reveals, what surprised you, and what you disagree with or find dated.
Fork the repository. Follow the rules. Do the reading. Write the code. Break what you build. Write about what you learned.
You do not need permission. You do not need to be at any particular skill level. You need to be willing to read primary sources, hand-write code, and be honest about what you do not understand.
If you complete a phase, open a PR that adds your name to CONTRIBUTORS.md with a link to your fork and a one-line note on what you learned. The goal is not uniformity. It is a growing collection of serious, independent engagements with the same foundational material.
Duration: 12 weeks, evenings and weekends. This is supplementary to whatever your main work is.
Governing rule: Every Monday, ask whether last week's curriculum work made you sharper at your actual work. Two detour weeks in a row means stop and ship something real.
Weeks 1-3
Every system that persists information makes choices about integrity, identity, and trust, usually implicitly. This phase makes those choices explicit by building storage from the ground up, then adding cryptographic integrity, then constructing the append-only chain structure that underpins provenance systems, blockchains, and version control.
Content-addressable storage and Merkle trees emerged from two distinct traditions that converged.
Ralph Merkle invented his tree structure in 1979 in the context of digital signatures. The constraint was computational: public key cryptography was expensive, and Merkle needed a way to authenticate many data blocks without signing each one individually. The tree structure allowed a single root hash to vouch for an arbitrarily large dataset. This was a clever response to the hardware limitations of the era.
Git, designed by Linus Torvalds in 2005, adopted content-addressable storage not for cryptographic reasons but for performance and correctness in distributed version control. The constraint was coordination: thousands of developers working on the Linux kernel needed to merge changes without a central authority. Content addressing made deduplication automatic and integrity verification trivial.
Bitcoin, in 2008, combined Merkle trees with hash chains and proof-of-work to solve a different problem entirely: consensus without trust. The constraint was adversarial: participants could not trust each other, so the system needed to make cheating economically irrational. Nakamoto's insight was that the data structure (hash chain) was the easy part. The hard part was consensus.
What has changed since these systems were designed:
Storage is effectively free. Merkle's original concern about efficiency still matters for verification, but the cost tradeoffs around what to store have shifted dramatically.
Computation is cheap enough that some of the original "too expensive" approaches to integrity and authenticity are now practical.
The trust problem has inverted. In 1979, the question was "how do I prove this data is authentic?" In 2026, the question is increasingly "how do I prove this data was produced by a human, and that the human's process is visible?" This is the space that provenance systems like Chain of Meaning occupy. The primitives are the same. The threat model is different.
Understanding this history matters because the data structures you are about to build were shaped by constraints that may no longer hold. Recognizing which constraints have expired and which persist is where engineering judgment lives.
Specifications and papers:
Merkle, R. "A Digital Signature Based on a Conventional Encryption Function" (1987). The original Merkle tree paper. Short and readable. Understand why it was invented (efficient digital signatures) before you use it for integrity verification.
Nakamoto, S. "Bitcoin: A Peer-to-Peer Electronic Cash System" (2008). Sections 2-4 and 7 only. Not for the currency. For the clearest practical description of how hash chains, Merkle trees, and proof-of-work compose into an integrity system. Pay attention to what is assumed vs. what is proven.
The Git book, Chapter 10: "Git Internals." Read the object model section carefully. Git is a content-addressable storage system with a DAG structure. Understand how it hashes objects, how trees reference blobs, and how commits chain.
Schneier, B. Relevant chapters from "Applied Cryptography" or Ferguson/Schneier "Cryptography Engineering" on hash functions, MACs, and digital signatures. You need to understand what SHA-256 guarantees and what it does not, what a collision means practically, and the difference between integrity, authenticity, and non-repudiation.
Reference implementations to study (read, not copy):
Git source code: hash-object, cat-file, and the object storage internals.
LevelDB or RocksDB design documents (LSM tree architecture, write-ahead log design).
IPFS content addressing specification.
Build a simple key-value store from scratch in C or Rust. No libraries for the core data structures.
Build:
- In-memory hash map with get/set/delete
- Write-ahead log for persistence (write operations to disk before confirming)
- Recovery on restart from the log
- Basic benchmarking: how many ops/sec, what happens under load
Key concepts to internalize:
- Hashing for lookup vs. hashing for integrity (different threat models)
- Write amplification and the cost of durability
- What "atomic" actually means when the power goes out mid-write
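The write-ahead discipline above fits in a page of C. This is a toy sketch, not the assignment: the log format (`SET <key> <value>` text lines) and the fixed-size pair table are invented for illustration. The part that matters is the ordering inside `kv_set`: log, fsync, then mutate.

```c
/* Minimal write-ahead log sketch. Hypothetical log format: "SET <key> <value>\n".
   The rule under test: the record reaches stable storage (fsync) BEFORE the
   in-memory state changes and the caller is told "ok". */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX 64
typedef struct { char k[32], v[32]; } Pair;
typedef struct { Pair p[MAX]; int n; FILE *log; } KV;

static void kv_apply(KV *kv, const char *k, const char *v) {
    for (int i = 0; i < kv->n; i++)
        if (!strcmp(kv->p[i].k, k)) { strcpy(kv->p[i].v, v); return; }
    strcpy(kv->p[kv->n].k, k); strcpy(kv->p[kv->n].v, v); kv->n++;
}

static int kv_set(KV *kv, const char *k, const char *v) {
    fprintf(kv->log, "SET %s %s\n", k, v);   /* 1. write the intent        */
    fflush(kv->log);
    if (fsync(fileno(kv->log)) != 0)         /* 2. durability happens HERE */
        return -1;
    kv_apply(kv, k, v);                      /* 3. only now mutate memory  */
    return 0;
}

static const char *kv_get(KV *kv, const char *k) {
    for (int i = 0; i < kv->n; i++)
        if (!strcmp(kv->p[i].k, k)) return kv->p[i].v;
    return NULL;
}

/* Recovery: rebuild in-memory state by replaying every logged op in order. */
static void kv_recover(KV *kv, const char *path) {
    kv->n = 0;
    FILE *f = fopen(path, "r");
    char op[8], k[32], v[32];
    if (f) {
        while (fscanf(f, "%7s %31s %31s", op, k, v) == 3)
            if (!strcmp(op, "SET")) kv_apply(kv, k, v);
        fclose(f);
    }
    kv->log = fopen(path, "a");
}
```

Note that a crash between steps 1 and 3 is fine: replay reconstructs the state. A crash that reorders 1 and 3 is the bug this discipline exists to prevent.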
Rebuild storage as content-addressable. Implement the Merkle tree yourself.
Build:
- SHA-256 hashing of content to produce keys (implement the addressing scheme, use a crypto library for the hash itself)
- Merkle tree over stored entries: leaf nodes are content hashes, interior nodes are hashes of children
- Root hash as a single integrity witness for the entire store
- Verification: given a leaf, produce the proof path to the root
- Insert new content and update the tree
Key concepts to internalize:
- What a Merkle proof actually guarantees and what it does not
- The difference between integrity (nobody tampered) and authenticity (I know who wrote it)
- Why content-addressable storage makes deduplication trivial
- Where the trust anchor actually lives (who holds the root hash?)
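A sketch of the tree construction and proof check. To stay self-contained it uses 64-bit FNV-1a as a stand-in for SHA-256; FNV has no collision resistance, so this illustrates the tree shape and the proof mechanics, not the security. The fixed `lvl[128]` buffer is a toy limit.

```c
/* Merkle tree sketch: compute a root over leaf hashes, emit a proof path
   for one leaf, and verify that path against the root. */
#include <stdint.h>
#include <string.h>

static uint64_t fnv1a(const void *data, size_t n) {       /* SHA-256 stand-in */
    const unsigned char *p = data;
    uint64_t h = 1469598103934665603ULL;
    while (n--) { h ^= *p++; h *= 1099511628211ULL; }
    return h;
}

static uint64_t hash_pair(uint64_t a, uint64_t b) {
    uint64_t buf[2] = { a, b };
    return fnv1a(buf, sizeof buf);
}

/* Compute the root over n leaf hashes; if `proof` is non-NULL, also collect
   the proof path for leaf `target`: one sibling hash per level, plus which
   side the current node sits on. An odd tail node is paired with itself. */
static uint64_t merkle_root(const uint64_t *leaves, size_t n, size_t target,
                            uint64_t *proof, int *sides, size_t *proof_len) {
    uint64_t lvl[128];
    memcpy(lvl, leaves, n * sizeof *leaves);
    size_t idx = target, len = 0;
    while (n > 1) {
        size_t m = 0;
        for (size_t i = 0; i < n; i += 2) {
            uint64_t a = lvl[i];
            uint64_t b = (i + 1 < n) ? lvl[i + 1] : lvl[i];
            if (proof && (i == idx || i + 1 == idx)) {
                sides[len] = (idx == i);          /* 1: sibling on the right */
                proof[len++] = (idx == i) ? b : a;
                idx = m;                          /* follow target up a level */
            }
            lvl[m++] = hash_pair(a, b);
        }
        n = m;
    }
    if (proof_len) *proof_len = len;
    return lvl[0];
}

/* Verify: rebuild the path from a leaf to the root using only the proof. */
static int merkle_verify(uint64_t leaf, const uint64_t *proof, const int *sides,
                         size_t len, uint64_t root) {
    uint64_t h = leaf;
    for (size_t i = 0; i < len; i++)
        h = sides[i] ? hash_pair(h, proof[i]) : hash_pair(proof[i], h);
    return h == root;
}
```

Notice what the proof contains: log2(n) sibling hashes, nothing else. That is the entire point of the structure, and it is why whoever holds the root hash holds the trust anchor.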
Extend into an append-only log with hash-chained entries.
Build:
- Each entry contains: content, timestamp, hash of previous entry, hash of self
- Append-only constraint enforced at the data structure level
- Verification: walk the chain and confirm integrity from any point back to genesis
- Simulate a "fork" (two entries claiming the same parent) and detect it
- Add a signature layer (sign each entry with a keypair) to move from integrity to authenticity
Key concepts to internalize:
- Append-only as a design constraint vs. append-only as an enforcement reality
- What happens when you need to redact something from an integrity-protected log
- The gap between "hash chain" and "blockchain" (consensus, not data structure, is the hard part)
- How this maps to authorship tracking: keystroke logs as append-only provenance chains
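The chain structure itself is small. A toy sketch (64-bit FNV-1a stands in for a real hash, fixed-size in-memory array stands in for durable storage): each entry commits to its content and to its parent's hash, so changing any entry breaks every link after it.

```c
/* Hash-chained append-only log sketch. */
#include <stdint.h>
#include <string.h>

static uint64_t fnv1a(const void *data, size_t n) {       /* toy hash stand-in */
    const unsigned char *p = data;
    uint64_t h = 1469598103934665603ULL;
    while (n--) { h ^= *p++; h *= 1099511628211ULL; }
    return h;
}

typedef struct {
    char content[64];
    uint64_t prev;   /* hash of previous entry (0 for genesis) */
    uint64_t self;   /* hash over content + prev               */
} Entry;

static uint64_t entry_hash(const Entry *e) {
    unsigned char buf[72];
    size_t n = strlen(e->content);
    memcpy(buf, e->content, n);
    memcpy(buf + n, &e->prev, sizeof e->prev);
    return fnv1a(buf, n + sizeof e->prev);
}

static void log_append(Entry *log, size_t *n, const char *content) {
    Entry *e = &log[*n];
    strncpy(e->content, content, sizeof e->content - 1);
    e->content[sizeof e->content - 1] = '\0';
    e->prev = (*n == 0) ? 0 : log[*n - 1].self;
    e->self = entry_hash(e);
    (*n)++;
}

/* Walk the chain, confirming both each entry's own hash and its parent link. */
static int log_verify(const Entry *log, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (log[i].self != entry_hash(&log[i])) return 0;
        if (log[i].prev != (i == 0 ? 0 : log[i - 1].self)) return 0;
    }
    return 1;
}
```

The instructive failure: tamper with an entry's content and recompute its own hash, and verification still fails, because the child's `prev` now points at a hash that no longer exists. That cascade is the integrity property.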
Attack your own system:
- Try to corrupt the log in ways that pass validation
- Try to forge an entry that appears legitimate
- Try to rewrite history without detection
- Try to cause a hash collision (you will not succeed, but understand exactly why)
- Find edge cases: what if two entries arrive simultaneously? What if the process crashes mid-append?
- Try to exploit your trust model: if you control the root hash, what can you get away with?
Document everything in BREAK-IT.md.
Weeks 4-6
Systems under real-world constraints. Packets get dropped, reordered, delayed, and forged. This is where you learn that state in transit is fundamentally harder than state at rest.
The Internet Protocol suite was designed in the 1970s and early 1980s under constraints that have mostly vanished, but whose engineering consequences persist in every packet you will ever touch.
RFC 791 (IP, 1981) and RFC 793 (TCP, 1981) were written for a network of a few hundred machines, mostly at universities and military installations, connected by unreliable links with wildly varying bandwidth. The designers faced several constraints that shaped the protocols profoundly.
Trust was ambient. The original internet assumed cooperative participants. There was no concept of a malicious node because the network was small and its users were known. IP has no authentication. TCP's sequence numbers were designed for reliability, not security. This ambient trust assumption is the root cause of entire categories of attack that emerged once the network grew beyond its original community.
Bandwidth was scarce and variable. TCP's flow control and congestion avoidance mechanisms were designed for a world where a fast link was 56 kbps. The sliding window, slow start, and congestion avoidance algorithms are elegant responses to scarcity that still govern how data moves across gigabit links today.
Memory was expensive. The IP fragmentation mechanism (which you will exploit on break-it day) exists because routers in 1981 could not afford to buffer large packets. Fragmentation allows a router to break a packet into smaller pieces that fit the next link's MTU. This was a reasonable engineering choice given the constraint. It has been a source of security vulnerabilities for 40 years.
End-to-end principle. Saltzer, Reed, and Clark's "End-to-End Arguments in System Design" (1984) formalized the philosophy that shaped TCP/IP: keep the network simple, push complexity to the endpoints. This was a deliberate rejection of the telephone network's model (smart network, dumb endpoints). It is arguably the most consequential architectural decision in the history of computing.
NAT (Network Address Translation, formalized in the 1990s) was a patch for IPv4 address exhaustion. It was never intended as a security mechanism, but it accidentally became one because it hides internal network structure. Understanding NAT as a historical accident rather than a deliberate design helps you reason about what it actually protects and what it does not.
What has changed:
The network is adversarial. The ambient trust assumption is completely false. Every packet you inspect may be crafted by someone trying to evade your firewall, exfiltrate data, or crash your system.
Bandwidth is abundant but latency still matters. The flow control mechanisms designed for scarcity now operate in a world of surplus, which changes the tradeoff space.
IPv6 eliminates fragmentation at the router level (only endpoints fragment), which closes some attack vectors but opens others.
Encryption is ubiquitous. Most traffic is now encrypted (TLS), which means a packet inspector sees headers but not payloads. This fundamentally changes what a firewall can and cannot do.
The constraints that created TCP/IP's specific engineering choices have largely expired, but the protocols persist because switching costs are enormous. Recognizing which design decisions were responses to bygone constraints, and which reflect genuinely timeless principles (like the end-to-end argument), is a skill that transfers far beyond networking.
RFCs (read these, not summaries of them):
RFC 791: Internet Protocol (IPv4). The actual specification of how IP packets are structured, fragmented, and reassembled. Read the header format byte by byte. When you parse packets in Week 4, you should be working from this document.
RFC 793: Transmission Control Protocol. The full TCP specification. Pay particular attention to the state machine (Section 3.2, 3.4, 3.5), the segment header format, and the connection establishment/teardown sequences. Your firewall's connection tracking will be an implementation of this state machine.
RFC 768: User Datagram Protocol. Short. Read it entirely. Understand what UDP does not guarantee and why that matters for firewall design.
RFC 2663: IP Network Address Translator (NAT) Terminology and Considerations. Understand what NAT does to the assumptions in the above protocols.
RFC 6973: Privacy Considerations for Internet Protocols. Read Sections 3-5. This contextualizes packet inspection as a dual-use capability.
Saltzer, J., Reed, D., Clark, D. "End-to-End Arguments in System Design" (1984). One of the most important papers in systems design. Read it.
Papers and reference material:
Mogul, J. and Deering, S. "Path MTU Discovery" (RFC 1191). Understand fragmentation and why it creates security edge cases.
Roesch, M. "Snort: Lightweight Intrusion Detection for Networks" (1999). Read for architecture.
Cheswick, W. and Bellovin, S. "Firewalls and Internet Security" (selected chapters). The foundational text on firewall design philosophy.
Reference implementations to study:
libpcap source and documentation.
Netfilter/iptables architecture documentation.
tcpdump source code.
Build a tool that captures and inspects network packets.
Build:
- Raw socket or libpcap-based packet capture
- Parse Ethernet, IP, TCP, and UDP headers from raw bytes, referencing the RFCs directly as you implement
- Display packet metadata: source/dest IP, ports, protocol, flags, payload size
- Filter by protocol, port, or IP
- Log to your append-only store from Phase 1 (first integration point)
Key concepts to internalize:
- The actual byte layout of a packet (not the abstraction, the bytes)
- How TCP state works: SYN, SYN-ACK, ACK and what each flag actually means
- The difference between what an application "sends" and what the wire carries
- Why packet inspection is both a security tool and a surveillance tool
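Parsing the fixed part of the IPv4 header straight from wire bytes looks like this. The byte offsets follow RFC 791's header layout; options parsing and checksum validation are omitted, and the struct fields shown are a chosen subset, not the full header.

```c
/* IPv4 header parse sketch: wire bytes (big-endian) to host-order fields. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint8_t  version, ihl;      /* ihl counts 32-bit words             */
    uint8_t  protocol;          /* 6 = TCP, 17 = UDP                   */
    uint16_t total_len;
    uint32_t src, dst;          /* host byte order after parsing       */
} Ipv4Header;

/* Returns 0 on success, -1 if the header is truncated or malformed. */
static int ipv4_parse(const uint8_t *p, size_t len, Ipv4Header *h) {
    if (len < 20) return -1;                  /* minimum IPv4 header   */
    h->version = p[0] >> 4;
    h->ihl     = p[0] & 0x0f;
    if (h->version != 4 || h->ihl < 5) return -1;
    if (len < (size_t)h->ihl * 4) return -1;  /* options past buffer   */
    h->total_len = (uint16_t)(p[2] << 8 | p[3]);
    h->protocol  = p[9];
    h->src = (uint32_t)p[12] << 24 | (uint32_t)p[13] << 16
           | (uint32_t)p[14] << 8  | p[15];
    h->dst = (uint32_t)p[16] << 24 | (uint32_t)p[17] << 16
           | (uint32_t)p[18] << 8  | p[19];
    return 0;
}
```

Every bounds check here corresponds to a malformed-packet attack you will attempt on break-it day; write the parser as if the bytes are hostile, because they are.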
Extend the inspector into a firewall with rule evaluation and connection tracking.
Build:
- Rule engine: allow/deny based on source, destination, port, protocol
- Connection tracking table: follow each TCP connection through RFC 793's state machine, and classify packets against it in the Netfilter style (NEW, ESTABLISHED, RELATED); note that those three labels are conntrack classifications, not RFC 793 states
- Stateful filtering: allow return traffic for established connections without explicit rules
- Handle concurrent connections (your first real concurrency challenge in the curriculum)
- Rule evaluation order matters: implement and understand why
Key concepts to internalize:
- The difference between stateless packet filtering and stateful inspection
- Why connection tracking is expensive and what happens when the table fills up
- How NAT works (and why it breaks certain assumptions)
- The tradeoff between security granularity and performance
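The heart of connection tracking is a transition function. A deliberately reduced sketch (the full RFC 793 diagram has eleven states; the state names and the `from_initiator` flag here are invented for this example): a firewall mostly needs enough of the machine to answer "does this packet belong to a connection I have seen open legitimately?"

```c
/* Reduced TCP connection-tracking state machine sketch. */
#include <stdint.h>

/* TCP flag bits as they appear in byte 13 of the TCP header. */
enum { FIN = 0x01, SYN = 0x02, RST = 0x04, ACK = 0x10 };

typedef enum { ST_NONE, ST_SYN_SENT, ST_SYN_RECV,
               ST_ESTABLISHED, ST_CLOSED } TcpState;

static TcpState tcp_next(TcpState s, uint8_t flags, int from_initiator) {
    if (flags & RST) return ST_CLOSED;
    switch (s) {
    case ST_NONE:        /* only a bare SYN from the initiator opens an entry */
        return (from_initiator && (flags & SYN) && !(flags & ACK))
             ? ST_SYN_SENT : ST_NONE;
    case ST_SYN_SENT:    /* expect SYN-ACK from the responder */
        return (!from_initiator && (flags & SYN) && (flags & ACK))
             ? ST_SYN_RECV : ST_SYN_SENT;
    case ST_SYN_RECV:    /* final ACK of the handshake */
        return (from_initiator && (flags & ACK)) ? ST_ESTABLISHED : ST_SYN_RECV;
    case ST_ESTABLISHED:
        return (flags & FIN) ? ST_CLOSED : ST_ESTABLISHED;
    default:
        return ST_CLOSED;
    }
}
```

Note what falls out of the `ST_NONE` case: a stray ACK with no prior SYN never creates an entry. That single line is why ACK scans fail against a stateful filter and succeed against a stateless one.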
Extend and harden.
Build:
- Rate limiting per source IP
- Logging of blocked traffic with reasons
- Handle fragmented packets per RFC 791's fragmentation specification
- Handle malformed packets gracefully (do not crash, log and drop)
- Basic reporting: what is being blocked, from where, how often
Key concepts to internalize:
- Why default-deny is fundamentally different from default-allow
- What "fail open" vs. "fail closed" means for a security boundary
- How real attackers probe firewalls (slow scans, fragmentation, protocol abuse)
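One common rate-limiting scheme, and a reasonable default for per-source limiting, is a token bucket. A sketch (the choice of token bucket over, say, sliding windows is mine, not prescribed by the curriculum); timestamps are caller-supplied so the logic stays deterministic and testable:

```c
/* Token-bucket rate limiter sketch: one bucket per source IP.
   `rate` tokens refill per second, capped at `burst`. */
typedef struct {
    double tokens;
    double last;     /* time of last refill, seconds (monotonic clock) */
} Bucket;

static int bucket_allow(Bucket *b, double now, double rate, double burst) {
    b->tokens += (now - b->last) * rate;     /* refill for elapsed time */
    if (b->tokens > burst) b->tokens = burst;
    b->last = now;
    if (b->tokens >= 1.0) { b->tokens -= 1.0; return 1; }  /* admit */
    return 0;                                              /* drop  */
}
```

The slow-drip evasion you will attempt on break-it day is exactly an attacker staying under `rate`; the bucket bounds burst volume, not patient trickle.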
Attack your own firewall:
- Craft packets that exploit edge cases in your state tracking
- Try fragmentation attacks (overlapping fragments, tiny fragments, fragment floods)
- Try to exhaust your connection table
- Try to sneak traffic through by manipulating TCP flags (ACK scans, FIN scans, XMAS scans)
- Try timing-based evasion (slow drip of data below your rate limit thresholds)
- Try to cause your firewall to fail open under load
Document everything in BREAK-IT.md.
Weeks 7-8
Machine-level respect. You stop taking abstractions for granted. Even a very small kernel changes how you reason about every system you build above it.
The x86 architecture you will be writing for carries nearly five decades of accumulated design decisions, many of which were responses to constraints that no longer exist but whose consequences are baked into every processor shipping today.
The original 8086 (1978) was a 16-bit processor designed to be backward-compatible with the 8080 while extending its capabilities. Intel's constraint was market: they needed existing 8080 customers to migrate. This backward-compatibility imperative has driven x86 design decisions ever since. When you set up your GDT and enter protected mode, you are performing a ritual that exists because Intel needed the 80286 (1982) to still run 8086 code.
Protected mode (80286, 1982) and virtual memory (80386, 1985) were Intel's response to the growing need for multitasking and memory protection. The 386 introduced paging, which gave each process its own virtual address space. The constraint was that existing real-mode software needed to keep working, so the processor boots in real mode and must be explicitly switched. This is why your bootloader has to perform the protected mode transition: a ceremony imposed by decisions made 40 years ago.
The interrupt model dates to the earliest microprocessors. An interrupt is the hardware's way of saying "something happened that cannot wait." The specific mechanism you will implement (IDT, interrupt vectors, PIC/APIC) evolved from the 8086's simple interrupt table to the APIC system designed for multiprocessor systems in the 1990s. The fundamental concept has not changed. The implementation complexity has increased dramatically.
Multics (1965) and Unix (1969) established the operating system concepts you will implement: process isolation, virtual memory, preemptive scheduling, system calls as the boundary between user and kernel space. Unix's specific genius was simplicity: where Multics was ambitious and complex, Unix was small and sharp. The xv6 teaching OS you will study as a reference is a direct descendant of this tradition.
What has changed:
Memory is vast. Early operating systems managed kilobytes. Your kernel will manage gigabytes. The algorithms are similar but the failure modes are different.
Multicore is standard. The original scheduling algorithms assumed a single processor. Modern systems must reason about cache coherence, memory ordering, and true parallelism, not just time-sliced concurrency. Your kernel will be single-core, but understanding that this simplification exists is important.
Hardware virtualization is native. Modern processors have hardware support for running multiple operating systems simultaneously (VT-x, AMD-V). QEMU can use these features, which is why your development cycle will be fast.
Security is a first-class concern. The original Unix trust model assumed trusted users. Modern kernels must defend against malicious user-space programs, side-channel attacks, and hardware vulnerabilities (Spectre, Meltdown). Your tiny kernel will not address these, but knowing they exist changes how you think about the boundary between user and kernel space.
Specifications:
Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3: System Programming Guide. Read: Chapter 2 (System Architecture Overview), Chapter 3 (Protected-Mode Memory Management), Chapter 6 (Interrupt and Exception Handling), and Chapter 9 (Processor Management and Initialization). Have this open while you code.
Multiboot Specification (v1 or v2). This is what GRUB expects from your kernel binary.
Textbooks and references:
Arpaci-Dusseau, R. and Arpaci-Dusseau, A. "Operating Systems: Three Easy Pieces." Read the chapters on: virtualization of memory (address spaces, paging, TLBs), concurrency (locks, condition variables), and scheduling. Free online.
xv6 source code and the xv6 book (Russ Cox, Frans Kaashoek, Robert Morris). Read as a reference for how interrupt handling, memory management, and scheduling work in a real (small) kernel. Do not copy.
Ritchie, D. and Thompson, K. "The UNIX Time-Sharing System" (1974). The original Unix paper. Short. Read it to understand the philosophical commitments that shaped operating system design for the next 50 years.
OSDev Wiki: Practical reference for x86_64 specifics (GDT layout, IDT setup, PIC/APIC programming, page table formats).
Get a minimal kernel running in QEMU.
Stack: x86_64, C, NASM or GAS assembly, GRUB bootloader, GNU toolchain or Clang/LLVM
Build:
- Boot via GRUB in QEMU
- Print to screen (VGA text mode) or serial console
- Set up Global Descriptor Table (GDT) and Interrupt Descriptor Table (IDT), referencing the Intel manual directly
- Handle timer interrupt (PIT or APIC timer) and keyboard interrupt (PS/2)
- Physical memory manager: bitmap or free-list based page allocator
- Basic virtual memory: set up page tables, map kernel memory, understand the 4-level paging structure
Key concepts to internalize:
- What actually happens between power-on and your first line of C
- What an interrupt really is (not the abstraction, the hardware event: CPU stops, saves state, jumps to your handler)
- Why memory management is the kernel's most consequential job
- The difference between physical and virtual addresses and why the mapping between them is the foundation of process isolation
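The bitmap allocator from the build list is worth seeing in miniature. A sketch with toy sizes (64 pages, one `uint64_t` of bitmap); a real kernel sizes the bitmap from the memory map the bootloader hands it, but the bit manipulation is the same:

```c
/* Bitmap physical page allocator sketch: one bit per 4 KiB frame, 1 = in use. */
#include <stdint.h>

#define PAGE_SIZE 4096
#define NPAGES    64          /* toy: one uint64_t covers 64 frames */

static uint64_t bitmap;

/* Return the physical address of a free frame, or -1 if none remain. */
static int64_t page_alloc(void) {
    for (int i = 0; i < NPAGES; i++)
        if (!(bitmap & (1ULL << i))) {
            bitmap |= 1ULL << i;
            return (int64_t)i * PAGE_SIZE;
        }
    return -1;                /* out of physical memory */
}

static void page_free(int64_t addr) {
    bitmap &= ~(1ULL << (addr / PAGE_SIZE));
}
```

The `-1` return is where your break-it day exhaustion test lands: what your kernel does at that moment, not the happy path, is the design decision that matters.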
Add the ability to run multiple tasks and manage them.
Build:
- Simple round-robin scheduler: two or more tasks, context switching between them
- Stack allocation per task
- Yield mechanism (cooperative at first, then preemptive via timer interrupt)
- Force a race condition on purpose: two tasks writing to shared state without synchronization. Observe the corruption.
- Implement a basic spinlock. Fix the race condition. Understand what the lock costs.
- Minimal shell: accept keyboard input, dispatch simple commands
Key concepts to internalize:
- What a context switch actually involves at the register level
- Why preemptive scheduling is fundamentally harder than cooperative
- What a race condition feels like from the inside
- Why concurrency bugs are so hard to reproduce and debug
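The spinlock itself is tiny. A sketch using C11 atomics, demonstrated here in user space with pthreads (in your kernel you would build the same test-and-set from an atomic exchange instruction, and you would have to think about interrupts, which this sketch ignores):

```c
/* Spinlock sketch: an atomic flag plus a busy-wait. The point of building
   one in the kernel is seeing exactly what the "cheap" primitive costs. */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter = 0;

static void spin_lock(void) {
    /* test-and-set loops until we observe the flag clear and set it */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;  /* spin */
}

static void spin_unlock(void) {
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        counter++;      /* read-modify-write: lost updates without the lock */
        spin_unlock();
    }
    return NULL;
}
```

Run the same experiment with the lock calls commented out and the final count comes up short by a different amount each run. That is what a race condition feels like from the inside.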
Crash your own kernel:
- Feed malformed input through the keyboard handler
- Try to trigger a double fault or triple fault
- Create scheduling scenarios that deadlock
- Try to corrupt memory from one task and observe what happens to the other
- Exhaust your physical memory allocator and see what breaks
- Try to break out of your virtual memory mapping
Document everything in BREAK-IT.md.
Weeks 9-10
Without rigor, everything in Phase 5 becomes vibes. The harness is what makes modern AI work measurable rather than impressionistic.
Evaluation of intelligent systems is a surprisingly old problem with a surprisingly poor track record.
The Turing Test (1950) framed evaluation as behavioral indistinguishability: can a machine fool a human judge? This set a precedent that shaped AI evaluation for decades: judge the output, not the process. The constraint was philosophical: Turing wanted to sidestep the question of whether machines "really" think. The consequence is that most AI evaluation still focuses on output quality while ignoring process integrity.
BLEU scores (Papineni et al., 2002) brought automated evaluation to machine translation. The constraint was scale: human evaluation was too expensive for rapid iteration. BLEU correlated well enough with human judgment to be useful, but it also created a well-documented pattern: systems optimized for BLEU scores that produced translations no human would accept. This is the first clear example of Goodhart's Law in AI evaluation: when a measure becomes a target, it ceases to be a good measure.
The HELM benchmark (2022) and the LLM-as-judge paradigm (2023) represent the current state of the art. HELM attempted comprehensive evaluation across many dimensions. LLM-as-judge outsourced evaluation to the models themselves, which is elegant but introduces new failure modes (position bias, verbosity bias, self-enhancement bias). The constraint is the same one that created BLEU: human evaluation does not scale.
What has changed:
The systems being evaluated are stochastic. Traditional software testing assumes deterministic behavior. LLMs produce different outputs for the same input depending on temperature, sampling, and context. This makes regression detection fundamentally harder.
The failure modes are semantic, not syntactic. A model that produces grammatically perfect, confidently stated, entirely fabricated information is not caught by any traditional testing methodology. The gap between syntactic correctness and semantic truth is the central evaluation challenge.
The models can evaluate themselves, sort of. LLM-as-judge works surprisingly well in many cases, but it introduces circular dependencies: you are using the thing you are trying to evaluate as part of the evaluation. Understanding where this works and where it breaks is critical.
The Turing Test framing persists. Most evaluation still asks "does the output look right?" rather than "was the process that produced the output trustworthy?" This is the gap that provenance-oriented thinking can address.
Papers:
Liang, P. et al. "Holistic Evaluation of Language Models" (HELM, 2022). Read for methodology. Pay attention to how they decompose evaluation into scenarios, metrics, and adaptations.
Zheng, L. et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (2023). Understand the methodology, agreement rates with human judgment, and failure modes.
Bowman, S. "Eight Things to Know about Large Language Models" (2023). Sober overview of what is actually known vs. assumed.
Papineni, K. et al. "BLEU: A Method for Automatic Evaluation of Machine Translation" (2002). Read this as a case study in what happens when a metric becomes a target.
Reference implementations to study (architecture, not code to copy):
OpenAI Evals framework.
EleutherAI lm-evaluation-harness.
Inspect AI (UK AISI).
Build a repeatable evaluation runner.
Build:
- Configuration format: define a suite of test cases (input prompt, expected behavior, scoring criteria)
- Runner: execute each test case against one or more models (API-backed is fine)
- Output capture: store full input, output, metadata, timing
- Scoring: implement at least two scoring methods (exact match, LLM-as-judge)
- Results storage: write results to your append-only store from Phase 1 (second integration point)
- Determinism: control temperature, seed where possible, flag non-deterministic runs explicitly
Key concepts to internalize:
- The difference between a test and an eval (tests have right answers, evals have judgment)
- Why reproducibility is hard with stochastic systems
- What "regression" means when your baseline is probabilistic
- The cost of evaluation (every eval run costs time and money, so design matters)
Extend the harness into something that tracks change over time.
Build:
- Regression detection: compare current run to baseline, flag degradation
- Model comparison: run the same suite against multiple models, produce comparison
- Visualization: even a simple terminal table showing pass/fail/score across runs
- Failure categorization: not just "wrong" but "wrong how" (hallucination, refusal, format error, reasoning error)
- Export: produce a clean report of findings
Key concepts to internalize:
- Why aggregate scores hide the most important failures
- The difference between "the model got better on average" and "the model stopped failing on the thing that matters"
- How evaluation design shapes what you can even notice
- The gap between what metrics capture and what human judgment catches
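The per-case comparison at the core of regression detection is simple; the discipline is refusing to stop at the aggregate. A sketch (the interface and the flat score arrays are invented for illustration):

```c
/* Per-case regression detection sketch: compare a baseline run to a current
   run case by case. The aggregate can stay flat, or improve, while
   individual cases regress; those are what the average hides. */
#include <stddef.h>

/* Returns the number of regressed cases and writes their indices to
   `flagged` (caller provides an array of at least n slots). */
static size_t regressions(const double *baseline, const double *current,
                          size_t n, size_t *flagged) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (current[i] < baseline[i])
            flagged[count++] = i;
    return count;
}
```

A suite where three cases pass before and three pass after looks unchanged in aggregate; if a different three pass, something important moved, and only a per-case diff notices.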
Game your own scoring:
- Find inputs that produce high scores on your metrics but are obviously wrong to a human
- Find inputs that produce low scores but are actually good answers
- Try to overfit a model's behavior to your eval without improving real quality
- Identify what your harness cannot measure
- Test whether your LLM-as-judge evaluator has the biases the papers describe
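One judge bias worth probing directly is position bias: many LLM judges prefer whichever answer is presented first. A hedged sketch of the probe, where `biased_judge` is a deliberately broken stand-in for a real LLM judge (the function names and return labels are illustrative):

```python
def position_bias_rate(judge_fn, pairs):
    """Present each (answer_a, answer_b) pair in both orders.
    A consistent judge picks the same underlying answer regardless of
    position, so its two verdicts should be opposite labels; count
    how often the verdict flips with the ordering instead."""
    flips = 0
    for a, b in pairs:
        first = judge_fn(a, b)    # expected: "first" or "second"
        second = judge_fn(b, a)
        if first == second:       # same label both times = order decided it
            flips += 1
    return flips / len(pairs)

# Stub judge with a deliberate first-position bias, for demonstration.
def biased_judge(x, y):
    return "first"
```

Run the same probe against your actual LLM-as-judge prompt; a non-trivial flip rate means your scores partly measure answer ordering, not answer quality.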
Document the gap between metric and judgment in BREAK-IT.md.
Weeks 11-12
You now have storage, networking, systems, and evaluation foundations. This phase builds on all of them. Without the prior phases, agent work degenerates into API glue.
The lineage of language models and autonomous agents stretches back further than most practitioners realize, and the recurring pattern is instructive: each generation confuses fluency with understanding, and each generation's evaluation tools are inadequate to distinguish them.
Shannon's "A Mathematical Theory of Communication" (1948) established that language has statistical structure that can be modeled. His n-gram experiments showed that higher-order statistical models produce increasingly convincing text. The insight that structure can emerge from statistics without semantics is the intellectual ancestor of everything you will build in this phase.
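Shannon's experiment is easy to reproduce. A minimal word-level n-gram sketch (the toy corpus and function names are illustrative; Shannon worked by hand with printed text):

```python
import random
from collections import defaultdict

def build_ngrams(words, n=2):
    """Count which word follows each (n-1)-word context, as in Shannon's
    word-approximation experiments. Pure statistics; no semantics anywhere."""
    model = defaultdict(list)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        model[context].append(words[i + n - 1])
    return model

def generate(model, start, length=10, seed=0):
    """Sample forward: pick a random recorded continuation for the
    current context until no continuation exists."""
    rng = random.Random(seed)
    out = list(start)
    for _ in range(length):
        choices = model.get(tuple(out[-len(start):]))
        if not choices:
            break
        out.append(rng.choice(choices))
    return " ".join(out)
```

Raising `n` makes the output locally more convincing while the model still understands nothing, which is the point Shannon's examples make.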
ELIZA (Weizenbaum, 1966) demonstrated that superficial pattern matching could produce conversation convincing enough that users attributed understanding to the system. Weizenbaum was alarmed by this. His constraint was minimal: simple substitution rules. The output was surprisingly persuasive. The gap between mechanism and perceived capability is the same gap you will encounter with modern agents.
The transformer architecture (Vaswani et al., 2017) made modern LLMs possible by solving the long-range dependency problem that limited earlier architectures. The key innovation, self-attention, allowed the model to relate any position in a sequence to any other position. The constraint that drove this was parallelizability: recurrent networks were sequential and therefore slow to train. The transformer traded sequential processing for parallel computation, which happened to scale beautifully on GPU hardware. This was partly by design and partly by luck.
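The mechanism itself is compact. A pure-Python sketch of scaled dot-product attention from Section 3.2.1 of the paper, simplified to a single head with no learned projections (which the full architecture adds): every query scores every key in one parallel step, which is the property that removed the sequential bottleneck of recurrent networks.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: for each query, score all keys,
    normalize with softmax, and return the weighted sum of values.
    Any position can attend to any other position directly."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

The division by `sqrt(d_k)` keeps dot products from saturating the softmax as dimensionality grows, a small detail the paper calls out explicitly.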
ReAct (Yao et al., 2023) formalized the pattern of interleaving reasoning traces with actions, creating the modern agent paradigm. The constraint was that pure chain-of-thought reasoning could not interact with the world. The solution was to let the model alternate between thinking and acting. This works, but it also creates a new failure mode: the model's self-reported reasoning may not correspond to what actually drove its actions.
What has changed:
Scale has created emergent capabilities that are not well understood. Models trained on enough data exhibit behaviors they were never explicitly trained for. This makes evaluation harder because you cannot predict the failure modes from the training process alone.
Tool use changes the failure surface. A model that only generates text can only fail by generating bad text. A model that calls tools can fail by calling the wrong tool, calling the right tool with wrong arguments, misinterpreting tool results, or taking irreversible actions based on stale data. The failure space is combinatorially larger.
The line between "reasoning" and "pattern matching" is not clear. This is not a new confusion (ELIZA demonstrated it in 1966), but the scale of modern models makes the confusion more consequential. Your agent will appear to reason. Whether it actually reasons is a question your eval harness from Phase 4 should help you investigate, not assume.
Papers:
Sennrich, R. et al. "Neural Machine Translation of Rare Words with Subword Units" (2016). The BPE paper. Read before implementing your tokenizer.
Vaswani, A. et al. "Attention Is All You Need" (2017). Read the abstract, introduction, and Section 3 (Model Architecture) at minimum.
Yao, S. et al. "ReAct: Synergizing Reasoning and Acting in Language Models" (2023). Read before building your agent loop.
Schick, T. et al. "Toolformer: Language Models Can Teach Themselves to Use Tools" (2023). Understand where tool use introduces new failure modes.
Shannon, C. "A Mathematical Theory of Communication" (1948). Not required in full, but reading Part 1 gives you the intellectual foundation for everything that follows.
Weizenbaum, J. "ELIZA: A Computer Program for the Study of Natural Language Communication Between Man and Machine" (1966). Short. Read it as a warning about confusing fluency with understanding.
Anthropic. "Model Card and Evaluations for Claude Models" (most recent). Read for calibration on how a frontier lab describes its own model's capabilities and limitations.
Reference implementations to study:
Karpathy, A. minBPE. Read the code before writing your own.
LangChain or similar agent framework source code. Do not use it. Read for architecture.
Build a small but complete pipeline from data to inference to evaluation.
Build:
- Tokenizer: implement BPE from scratch on a small corpus, referencing the Sennrich paper directly
- Data pipeline: curate a small dataset, clean it, format it for training or few-shot use
- Inference wrapper: structured API calls with retry, timeout, token counting, cost tracking. Hand-written, not a library.
- Evaluation: run your dataset through the model, score with your harness from Phase 4
- Comparison: test at least two models on the same dataset and analyze where they diverge
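The core of BPE fits on a page. A sketch of the merge loop following the pseudocode in Sennrich et al. (2016): the `</w>` end-of-word marker comes from the paper, and the naive `str.replace` here ignores symbol boundaries, a corner the reference implementation guards with a regex.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus. `vocab` maps a
    space-separated symbol sequence to its word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with a single merged symbol.
    Naive: str.replace can match across symbol boundaries in rare cases."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent pair, num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab
```

Running this on even a tiny corpus makes the non-neutrality of tokenization concrete: which subwords exist depends entirely on corpus frequencies, not on meaning.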
Key concepts to internalize:
- What tokenization actually does to meaning (why "tokenization is not neutral")
- The relationship between data quality and output quality
- Why inference infrastructure matters (latency, cost, reliability)
- Where models fail in ways that are invisible without structured evaluation
Extend the pipeline into an agent with explicit state, tools, and accountability.
Build:
- Agent loop: observe, decide, act, record. Implement the ReAct pattern from the paper, not from a framework.
- Tool use: give the agent 2-3 simple tools (file read, web fetch, calculator, or similar). Implement the tool interface yourself.
- Explicit state: the agent's working memory is a visible, inspectable data structure, not hidden context
- Decision logging: every decision the agent makes is recorded with reasoning, stored in your append-only log from Phase 1 (third integration point)
- Evaluation: run the agent through structured tasks using your harness from Phase 4 (fourth integration point)
- Constraint enforcement: the agent has boundaries it cannot cross; test that they hold
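A skeleton of that loop under illustrative assumptions: the decision format, the `TOOLS` registry, and the field names below are mine, not the ReAct paper's, and `decide_fn` stands in for the model call. The point of the sketch is the ordering: every decision is recorded before it is acted on.

```python
import time

# Hypothetical tool registry: name -> callable. The agent may only call
# tools registered here; that boundary is one of the constraints under test.
TOOLS = {
    # Toy calculator; never eval untrusted input in real code.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def run_agent(decide_fn, task, max_steps=5, log=None):
    """Minimal ReAct-style loop: observe, decide, act, record.
    decide_fn(observation) returns either
    {"thought": ..., "tool": name, "args": ...} or {"answer": ...}."""
    log = log if log is not None else []
    observation = task
    for step in range(max_steps):
        decision = decide_fn(observation)
        # Append-only record of what the agent saw and chose, before acting.
        log.append({"step": step, "observation": observation,
                    "decision": decision, "t": time.time()})
        if "answer" in decision:
            return decision["answer"], log
        tool = decision.get("tool")
        if tool not in TOOLS:  # constraint enforcement: unknown tool is refused
            observation = f"error: tool {tool!r} not permitted"
            continue
        try:
            observation = TOOLS[tool](decision["args"])
        except Exception as e:
            observation = f"error: {e}"  # tool failure becomes an observation
    return None, log  # ran out of steps without an answer
```

Note that tool failures and refused tools are fed back as observations rather than raised: the agent has to cope with them in-loop, which is exactly the behavior break-it day should probe.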
Key concepts to internalize:
- The difference between an agent that appears to reason and one that actually tracks its own state
- Where "tool use" introduces failure modes that pure text generation does not have
- Why decision logging changes agent behavior (observability as constraint)
- The difference between autonomy and pseudo-autonomy
Find the silent failures:
- Give the agent tasks where it appears to succeed but actually made unjustified assumptions
- Try to make the agent take actions its constraints should prevent
- Find cases where the decision logs do not surface the real reason for a failure
- Identify where the agent confabulates competence
- Test what happens when tools fail (timeout, bad data, contradictory results)
- Compare the agent's self-reported reasoning to what your harness reveals about its actual behavior
Document everything in BREAK-IT.md.
Phase 1's append-only store is used for logging in Phase 2, storing eval results in Phase 4, and recording agent decisions in Phase 5. Phase 4's eval harness is used to evaluate models in Phase 5. By the end, you have not built five isolated projects. You have built a connected system where each layer relies on the one before it.
One substantial article per phase. The article is the argument. The code is the evidence.
Structure for each article:
- What you set out to build and why
- The historical context: what constraints shaped the original systems, what has changed, and what that change means
- What primary sources you read and what they taught you
- What you built and how it works, with enough detail that someone could follow your reasoning
- What happened on break-it day: what attacks you tried, what survived, what did not
- The distinctive insight: what you see from your specific vantage point that a generic tutorial would miss
- How it connects to the larger question of where meaning, integrity, and trust degrade in systems
At the end of 12 weeks, you should be able to answer these questions with hard-won specificity, not textbook generality:
- Where does integrity actually break in an append-only system?
- What can a firewall not protect against, and why?
- What does a race condition feel like to debug?
- What is the gap between a metric and a judgment?
- Where does an agent pretend to reason?
And one more:
- Which constraints that shaped these systems have expired, and what does that make possible?
If you can answer those from direct experience, with code that proves it and writing that articulates it, the curriculum worked.