Explain Compiler — parse Clang/LLVM .opt.yaml optimization remark streams, normalize them into a stable schema, and drive summary, stats, diff, export, check, explain, evidence packs with optional source / IR / assembly context, Chapter 11 alignment pipelines (labels, packs, datasets, eval), and Chapter 12 CI feedback (stable JSON reports, semantic report-diff, policy gates, PR comments), plus digest and doctor for cache keys and masked config (Chapter 13 themes).
Companion tooling for Decode the Compiler: AI-Guided Explanations of C/C++ Optimization Logs for Real-World Performance.
The compiler already decided what to optimize, what to skip, and often why. Those decisions are recorded as YAML streams with tags such as !Missed, !Passed, and !Analysis. Treating that output as data — rather than scrolling thousands of lines by hand — is what makes performance work reproducible and teachable.
Clang can emit a machine-oriented record of optimization events tied to source locations. explncc:
- parses YAML document streams (not a single mapping),
- preserves remark kind from YAML tags,
- normalizes inconsistent
Argsintomessage,cost,threshold, and related fields without inventing data, - supports directory inputs (all
*.opt.yamlrecursively).
Chapter 10 introduces the Lossless Semantic Tree (LST) — a way to preserve and expose the compiler’s optimization reasoning without flattening it into prose or asking a model to guess from source alone.
| Word | Meaning in explncc |
|---|---|
| Lossless | No invented facts. Raw .opt.yaml stays authoritative; normalization keeps args_raw; evidence packs list gaps in missing_context. |
| Semantic | The compiler already decided pass, kind, costs, vectorization, and DebugLoc — tooling surfaces that evidence, not LLM inference from source text. |
| Tree | Structured evidence around one remark: primary node, related_records[] links to sibling remarks in the same function/log, optional context leaves (source, IR, asm). |
.opt.yaml remark (root evidence)
└── OptimizationRecord (normalized node)
└── EvidencePack (minimal semantic slice)
├── primary remark fields
├── related_records[] ← linked remarks, same function/log
└── optional context leaves
├── source_snippet (--include-source)
├── ir_snippet (--include-ir)
└── assembly_snippet (--include-asm)
Chapter extensions:
- Chapter 11 — alignment labels and
alignment-packadd classification nodes on the same tree (conservative teachers, eval rubrics). - Chapter 12 —
report/report-difftreat one.opt.yamlas a snapshot and sequences across commits as compiler-semantic history (decision drift, not source diff).
Prompt pipeline (Chapter 10 thesis):
.opt.yaml → normalized record → evidence pack → prompt template → optional explain
Model backends consume normalized records or packs, never raw YAML streams. Context attachment (--include-source, --include-ir, --include-asm) adds leaves when you have external artifacts; absent layers stay explicit.
| LST layer | explncc command |
|---|---|
| Compiler record | summary, stats, export, check, report |
| Normalized record | all commands |
| Evidence pack | explncc evidence |
| Context leaves | --include-source, --include-ir, --include-asm on evidence, alignment-pack, dataset |
| Semantic CI history | explncc report-diff |
Trust model (Chapters 10–12):
- Compiler YAML is authoritative.
- CI organizes and preserves evidence.
- Deterministic policy gates decide pass/fail.
- Models optionally assist triage (clearly labeled sections in
report).
See docs/chapter-10-notes.md for the teaching order and evidence-pack workflow.
python3.12 -m venv .venv
source .venv/bin/activate
make install-devmake examples
python -m explncc summary build/examples/ --limit 20
python -m explncc stats build/examples/vectorize_aliasing_fail/ --json
python -m explncc diff \
build/examples/inline_too_costly/before/before.opt.yaml \
build/examples/inline_too_costly/after/after.opt.yaml
python -m explncc explain build/examples/inline_miss_no_definition/main.opt.yaml --backend rule
python -m explncc export build/examples/ --format jsonl -o /tmp/out.jsonl
python -m explncc check build/examples/ --max-missed-inline 200Build deterministic evidence packs from normalized remarks — the bridge between raw .opt.yaml and downstream training or explanation.
# One pack per remark (JSONL for pipelines)
python -m explncc evidence build/examples/inline_miss_no_definition/main.opt.yaml \
--format jsonl -o /tmp/packs.jsonl
# Attach source window around DebugLoc (requires paths that resolve from --source-root)
python -m explncc evidence build/examples/vectorize_success/main.opt.yaml \
--include-source --source-root examples/vectorize_success \
--context-before 5 --context-after 8 \
--format markdown -o /tmp/pack.md
# Join external IR / assembly (Clang does not embed these in .opt.yaml)
python -m explncc evidence tests/fixtures/simd_vectorized.opt.yaml \
--include-ir --ir-file tests/fixtures/t.ll --ir-lines 50 \
--include-asm --asm-file tests/fixtures/t.s --asm-lines 60 \
--format jsonContext flags are shared with alignment-pack and dataset --focus alignment (--include-source, --include-ir, --include-asm). See src/explncc/context_snippets.py for snippet bounds and assembly mnemonic hints (movaps, vmovups, …) — conservative signals, not diagnoses.
These commands are deterministic: they do not train or call a model unless you plug the output into your own tooling.
# Heuristic slice: vectorization-related remarks (pass names, keywords, vector width field)
python -m explncc alignment build/examples/vectorize_success/ --limit 20
python -m explncc alignment build/examples/ --json | head -c 600
# Alignment evidence packs: compiler facts + labels + optional LST context leaves
python -m explncc alignment-pack examples/chapter11_alignment/ \
--format jsonl -o /tmp/alignment-packs.jsonl
python -m explncc alignment-pack examples/chapter11_alignment/aligned_intrinsic/fixtures/main.opt.yaml \
--include-source --source-root examples/chapter11_alignment/aligned_intrinsic \
--format markdown
# JSONL for fine-tuning / instruction tuning (OpenAI-style chat messages + optional metadata)
python -m explncc dataset build/examples/vectorize_aliasing_fail/ \
-o /tmp/ch11_train.jsonl \
--focus alignment \
--template guided \
--format explncc-record
# Prompt A/B fixtures + evaluator
python -m explncc bench-prompts examples/chapter11_alignment/ \
--focus alignment --templates minimal,guided,rubric,adversarial,missing-context \
-o /tmp/ch11_bench.jsonl
python -m explncc eval-alignment tests/fixtures/alignment_predictions.jsonl --format markdown
# Full fixture pipeline (no Clang required)
make chapter11See docs/chapter-11-alignment.md for the full pipeline guide and docs/chapter-11-notes.md for a short companion.
explncc report turns normalized remarks into CI artifacts (Markdown, JSON, GitHub, HTML). report-diff compares two .opt.yaml trees for compiler-semantic drift (what the optimizer decided changed), complementing source diffs. Policy gates are deterministic only — models never fail the build.
# GitHub Actions job summary (default: --no-explain, no network)
python -m explncc report build/app.opt.yaml --format markdown --title "Build remarks" \
--git-sha "$GITHUB_SHA" --branch "$GITHUB_REF_NAME" --ci-provider github \
>> "$GITHUB_STEP_SUMMARY"
# Stable JSON for dashboards (schema_version, summary, policy, metadata)
python -m explncc report build/app.opt.yaml --format json \
--git-sha "$GITHUB_SHA" --ci-provider github \
-o report.json --write-manifest manifest.json
# Collapsible PR comment body (post with gh pr comment --body-file)
python -m explncc report build/app.opt.yaml --format github --top-missed 10 -o pr-comment.md
# Deterministic gate (same thresholds as check; writes artifact even on failure)
python -m explncc report build/app.opt.yaml -o gate.md \
--fail-on-check --max-missed-inline 80 --max-missed-vectorize 20
# Semantic diff: baseline vs PR build (regression / improvement classification)
python -m explncc report-diff build/baseline/app.opt.yaml build/pr/app.opt.yaml \
--before-label main --after-label pr --format github --top-changes 15 \
-o pr-diff-comment.md
# Optional triage only when policy fails (rule backend, no raw YAML to models)
python -m explncc report build/app.opt.yaml --format markdown \
--fail-on-check --max-missed-inline 80 \
--explain-backend rule --explain-only-on-failure -o gate.md
# Stable digests over collected .opt.yaml (CI cache keys) and masked backend env
python -m explncc digest build/
python -m explncc doctorCopy-ready workflows: examples/ci/ (explncc-report.yml, explncc-gated.yml, explncc-diff-pr.yml). Full guide: docs/chapter-12-ci.md. Short checklist: docs/chapter-12-notes.md.
explncc viz emits Mermaid diagrams, HTML with Mermaid.js, or JSON for your own graph UI — all from the same normalized remarks as the rest of the tool (not from LLVM IR bitcode).
python -m explncc viz build/examples/ --style pass-summary --format mermaid --top 12 -o remarks.mmd
python -m explncc viz build/app.opt.yaml --style pass-remark --format json -o viz.json
python -m explncc viz build/app.opt.yaml --style missed-top --format html --explain-backend rule -o viz.htmlAuthor notes: docs/chapter-14-notes.md. Demo: make chapter14-demo.
Rich tables list kind, pass, remark, function, location, and a truncated message. Use --json or --jsonl for stable downstream tooling.
| Module | Role |
|---|---|
explncc/parser.py |
YAML stream loader with !Missed / !Passed / !Analysis |
explncc/normalizer.py |
Raw document → OptimizationRecord |
explncc/models.py |
Pydantic schema |
explncc/summary.py / stats.py |
Filtering and aggregates |
explncc/diffing.py |
Build-vs-build missed deltas and counters |
explncc/report_diff.py |
Semantic optimization diff for CI (report-diff) |
explncc/exporters.py |
json, jsonl, csv |
explncc/checks.py |
Deterministic CI policy thresholds |
explncc/explain/ |
Rule text + optional HTTP backends |
explncc/context_snippets.py |
Source / IR / assembly snippet extraction + asm signals |
explncc/evidence.py |
Chapter 10 evidence packs from normalized remarks |
explncc/alignment.py |
Heuristic SIMD / alignment-related remark slice + labels |
explncc/alignment_pack.py |
Chapter 11 alignment evidence packs |
explncc/prompt_templates.py |
Named Chapter 11 user prompts (minimal, guided, rubric) |
explncc/dataset_llm.py |
JSONL builders for training / bench rows |
explncc/ci_report.py |
Markdown / JSON / HTML / GitHub CI reports |
explncc/ci_manifest.py |
CI artifact manifest (--write-manifest, ci-manifest) |
explncc/report_types.py |
Stable JSON report schema + metadata types |
explncc/digest.py |
Per-file and aggregate SHA-256 over .opt.yaml inputs |
explncc/config.py |
Backend env + doctor payload |
explncc/viz.py |
Mermaid / HTML / JSON visualization bundles (viz command) |
explncc/cli.py |
Typer commands |
Subpackages stay small so a book chapter can point to one file at a time.
- Clang/LLVM
-fsave-optimization-record/-foptimization-record-file=…output (.opt.yaml) - One file or a directory tree; only
*.opt.yamlfiles are read
- Heuristics depend on Clang’s YAML shape; newer LLVM versions may add fields (handled conservatively).
alignmentslice is keyword/pass-based, not semantic analysis; validate on your corpus before publishing benchmark numbers.- Diff compares fingerprints of normalized rows; identical logical events with different wording may look distinct.
- Context attachment needs correct
--source-rootand external.ll/.sfiles; wrong paths yield empty snippets, not invented code. - Evidence / alignment packs list
missing_contextexplicitly; teachers and evaluators are conservative heuristics, not oracle labels. - AI backends augment text only; they consume normalized records or packs, not raw
.opt.yaml, and never drive CI pass/fail. dataset/bench-promptsemit structure for training; they do not guarantee your fine-tuning provider’s latest JSONL schema — verify against current API docs.reportwith explanation enabled can call remote model APIs; prefer--no-explainon high-frequency CI unless you control keys, quotas, and data-retention policy.
- Deeper remark-specific extractors (more structured fields from
Args) - Optional SARIF or LSP-adjacent bridges
- Tighter CI recipes (
explncc checkpresets)
Use the bundled examples/ to emit real .opt.yaml on your machine, then run explncc to connect source patterns to compiler vocabulary.
| Chapter | Doc |
|---|---|
| 10 — LST, evidence packs, context | chapter-10-notes.md |
| 11 — alignment pipeline, context, datasets | chapter-11-alignment.md, chapter-11-notes.md |
| 12 — CI reports, semantic diff, gates | chapter-12-ci.md, chapter-12-notes.md |
| 13–14 — architecture, viz | chapter-13-notes.md, chapter-14-notes.md |
You can — and you should, once — to see the raw stream. explncc exists so you can filter, count, diff across builds, and export the same information reliably for notes, CI, and (optionally) model-assisted prose.
- Deterministic core first — every command works without network access.
- No invented fields — missing data stays absent;
args_rawpreserves the source. - AI as augmentation — rule text is always available; HTTP backends only enrich labeled sections.
- LST context leaves — attach source/IR/asm when available; never fabricate missing layers.
- Semantic history — one
.opt.yamlis an LST snapshot; sequences across builds supportreport-diffdrift analysis.
- Ollama (local): set
OLLAMA_HOST,OLLAMA_MODEL(defaultqwen2.5-coder:7b-instruct). - OpenAI: set
OPENAI_API_KEY; optionalOPENAI_MODEL(defaultgpt-4o-mini).
brew install llvm # macOS
make examples # writes under build/examples/<name>/Details: docs/getting-started.md and docs/examples.md.
make check— ruff, format check, mypy, pytestmake docs-check— required doc files present- Prefer focused changes with tests beside
tests/fixtures/*.opt.yaml
make install-dev
make check
make demo # needs `make examples` first
make chapter11-demo PYTHON="$(pwd)/.venv/bin/python3" # alignment + bench-prompts sample
make chapter12-demo PYTHON="$(pwd)/.venv/bin/python3" # CI-style github report (fixture)make check
python -m explncc alignment tests/fixtures/simd_vectorized.opt.yaml --json
python -m explncc dataset tests/fixtures/simd_vectorized.opt.yaml -o /tmp/t.jsonl --focus all --format openai-messages --template minimal
python -m explncc bench-prompts tests/fixtures/simd_vectorized.opt.yaml --focus all --templates minimalpython -m explncc evidence tests/fixtures/simd_vectorized.opt.yaml --format json | head -c 800
python -m pytest -q tests/test_evidence.py tests/test_context_snippets.pypython -m explncc report tests/fixtures/inline_miss_no_definition.opt.yaml --format markdown
python -m explncc report tests/fixtures/inline_miss_no_definition.opt.yaml --format github | head -n 20
python -m explncc report-diff tests/fixtures/inline_miss_no_definition.opt.yaml \
tests/fixtures/inline_miss_no_definition.opt.yaml --format markdown
python -m pytest -q tests/test_ci_report.py tests/test_report_cli.py tests/test_chapter12_ci.pyMIT