Summary
Each dogfood run regenerates all programs from scratch, re-encountering the same LLM mistakes. Successful programs (47.5% of the last run) are discarded after verification. This wastes both the generation cost and the testing value of known-good programs.
Proposal: Treat successful dogfood outputs as a growing regression corpus. New runs focus generation budget on finding new bugs rather than re-covering old ground.
Design
Corpus accumulation
After each dogfood run:
- Programs that compiled and produced correct output are added to a permanent corpus (e.g., dogfood_corpus/)
- Each entry includes: .spy source files, expected output, metadata (feature tags, complexity, generation date)
- Deduplication: skip programs that are structurally similar to existing corpus entries (same feature combination + complexity tier)
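A minimal sketch of the accumulation step, assuming a dogfood_corpus/ directory with one subfolder per entry; the helper names (dedup_key, add_to_corpus), the expected_output.txt filename, and the metadata field names are illustrative, not part of the existing tooling.

```python
# Sketch only: hypothetical corpus layout and dedup rule.
import hashlib
import json
import shutil
from pathlib import Path

CORPUS_DIR = Path("dogfood_corpus")

def dedup_key(feature_tags, complexity):
    # Same feature combination + complexity tier => structurally similar.
    canonical = ",".join(sorted(feature_tags)) + "|" + complexity
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def add_to_corpus(program_dir: Path, metadata: dict) -> bool:
    """Copy a verified-passing program into the corpus, skipping near-duplicates."""
    key = dedup_key(metadata["feature_tags"], metadata["complexity"])
    entry_dir = CORPUS_DIR / key
    if entry_dir.exists():
        return False  # an equivalent feature/complexity combination is already covered
    entry_dir.mkdir(parents=True)
    for src in program_dir.glob("*.spy"):
        shutil.copy(src, entry_dir / src.name)
    shutil.copy(program_dir / "expected_output.txt", entry_dir)
    (entry_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return True
```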
Run structure change
Current: Generate 200 new programs → compile all → triage all
Proposed: Run corpus (N existing) + generate M new programs → compile all → triage new only
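A rough sketch of the proposed run loop, assuming corpus entries are replayed first and only freshly generated programs enter triage; generate_programs, run_program, and triage are placeholders for the existing dogfood tooling, not real APIs.

```python
# Sketch only: proposed run structure (corpus replay + new generation).
def dogfood_run(corpus_entries, new_count, generate_programs, run_program, triage):
    # Replay known-good programs; any failure here is a compiler regression.
    regressions = [e for e in corpus_entries if not run_program(e).passed]

    # Spend the generation budget on new programs only.
    new_programs = generate_programs(new_count)
    results = [run_program(p) for p in new_programs]

    # Only newly generated programs need triage; corpus failures are
    # reported separately as regressions.
    triage(new_programs, results)
    return regressions, results
```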
Corpus as regression suite
The corpus doubles as a regression test suite:
- Run the full corpus on each compiler change (or a random sample for CI)
- If a previously-passing program breaks, it's a compiler regression — high signal
- Corpus programs can be promoted to proper test fixtures if they cover unique patterns
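As a sketch, a CI-friendly regression check could replay either the full corpus or a random sample and flag any previously-passing program that now fails; the compile_and_run helper, the entry fields, and the sampling knob are assumptions.

```python
# Sketch only: corpus regression check (full run locally, sampled run in CI).
import random

def corpus_regression_check(corpus_entries, compile_and_run, sample_size=None):
    entries = corpus_entries
    if sample_size is not None:
        entries = random.sample(corpus_entries, min(sample_size, len(corpus_entries)))

    broken = [e for e in entries if not compile_and_run(e).passed]
    for entry in broken:
        # A previously-passing program now fails: high-signal compiler regression.
        print(f"REGRESSION: {entry['id']} ({', '.join(entry['feature_tags'])})")
    return not broken
```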
Growth model
- Run 1: 0 corpus + 200 new → ~95 pass → corpus = 95
- Run 2: 95 corpus + 150 new → ~70 new pass → corpus = 165
- Run 5: ~300 corpus + 100 new → focus entirely on new patterns
- Eventually the corpus covers most feature combinations and new runs target edge cases
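The projection above follows a simple recurrence (corpus grows by roughly pass_rate × new programs each run), sketched here with the observed ~47.5% pass rate; the per-run generation counts are assumptions and real pass rates will drift as generation targets harder feature combinations.

```python
# Sketch only: corpus growth under a fixed pass-rate assumption.
def project_corpus(new_per_run, pass_rate=0.475):
    corpus = 0
    for n in new_per_run:
        corpus += round(n * pass_rate)
        print(f"+{n} new -> corpus = {corpus}")
    return corpus

project_corpus([200, 150, 120, 100, 100])  # corpus grows 95, 166, 223, 271, 319
```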
Pruning
- Remove corpus entries that duplicate existing test fixtures
- Remove entries whose feature combinations are fully covered by newer, more complex entries
- Keep corpus size bounded (e.g., 500 programs max)
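A sketch of how those three pruning rules could compose, assuming each entry's metadata exposes feature_tags and a dedup_key, and that overlap with existing test fixtures is tracked as a set of keys; all names here are hypothetical.

```python
# Sketch only: pruning pass, evaluated from newest entries to oldest.
def prune_corpus(entries, fixture_keys, max_size=500):
    """entries: list of metadata dicts, newest first.
    fixture_keys: dedup keys already covered by proper test fixtures."""
    kept, seen_features = [], set()
    for entry in entries:
        features = frozenset(entry["feature_tags"])
        if entry["dedup_key"] in fixture_keys:
            continue  # duplicates an existing test fixture
        if any(features <= newer for newer in seen_features):
            continue  # feature combination fully covered by a newer, more complex entry
        kept.append(entry)
        seen_features.add(features)
        if len(kept) >= max_size:
            break  # keep corpus size bounded
    return kept
```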
Impact
- Each run becomes progressively more valuable (diminishing-waste, not diminishing-returns)
- Corpus serves as a living regression suite for compiler changes
- Generation budget focuses on novel feature combinations
- Reduces triage burden — only new programs need investigation
Implementation
- Location: build_tools/ + new dogfood_corpus/ directory
- Metadata format: extend existing metadata.json with corpus fields
- Similarity detection: hash feature tags + complexity tier
- CI integration: optional dotnet test target or standalone script
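The corpus-specific additions to metadata.json might look roughly like the following; every field name below is hypothetical, since the existing schema isn't shown here.

```python
# Sketch only: hypothetical corpus fields merged into each program's metadata.json.
corpus_fields = {
    "corpus": {
        "added_in_run": "2026-03-10",   # dogfood run that produced the program
        "dedup_key": "a41f...",          # hash of feature tags + complexity tier
        "feature_tags": ["generics", "pattern-matching"],
        "complexity": "medium",
        "promoted_to_fixture": False,    # set when copied into the proper test suite
    }
}
```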
Discovered via
Dogfood analysis session 2026-03-10 — observed that 47.5% of generated programs succeed but are discarded after each run.