feat: seed the base static-analysis pkl so PR head analysis runs incrementally by brovatten · Pull Request #18 · CodeBoarding/CodeBoarding-action

brovatten · 2026-06-10T06:32:58Z

Problem

Every action run on a repo with a committed .codeboarding/analysis.json falls back to a full LLM analysis on the PR head — incremental never runs. Verified across all recent org runs (e.g. CodeBoarding run 27228524515, the action's own dogfood run 27104012564, vscode 27159035253, DashBoard 27159138136):

Using committed .codeboarding/analysis.json at cc61434…
Incremental analysis cannot proceed: static_analysis.pkl … has no cluster baseline …
…; running full analysis on head.

Root cause: the engine's incremental path (v0.12.x) hard-requires the base static_analysis.pkl, with a populated cluster baseline, next to analysis.json. The committed-baseline path extracts only the JSON (git show BASE_SHA:.codeboarding/analysis.json), and the base-artifact cache save was gated on committed == 'false', so repos with committed baselines could never obtain a pkl by any path. The head run then does a cold LSP pass, gets an empty cluster snapshot, raises IncrementalCacheMissingError, and cb_engine.py falls back to run_full. (The error text "pkl loaded but has no cluster baseline" is misleading: that pkl is the one the head run itself just flushed pre-clustering.)
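The fallback described above can be modelled in a few lines. This is an illustrative sketch, not the engine's code: `run_incremental`, `analyze_head`, and the dict-shaped pkl are hypothetical stand-ins, and only the `IncrementalCacheMissingError` and `run_full` names come from the description.

```python
class IncrementalCacheMissingError(RuntimeError):
    """Stand-in for the engine error named above (the real class lives in the engine)."""


def run_incremental(base_pkl):
    # Hypothetical model: base_pkl is a dict, or None when no pkl was restored.
    clusters = (base_pkl or {}).get("clusters") or {}
    if not clusters:
        # An absent pkl and a cluster-less pkl both end up here.
        raise IncrementalCacheMissingError("no cluster baseline")
    return "incremental"


def run_full():
    # Placeholder for the full (LLM) analysis.
    return "full"


def analyze_head(base_pkl):
    """Try the incremental path first; fall back to a full run on a missing baseline."""
    try:
        return run_incremental(base_pkl)
    except IncrementalCacheMissingError:
        return run_full()
```

In this model a committed baseline without a pkl (`None`) and a pkl flushed before clustering (`{"clusters": {}}`) both land on the full run, which matches the behaviour reported in the logs.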

Fix

The cluster baseline is deterministic — LSP indexing + Leiden clustering, no LLM calls — so the action can manufacture it:

- cb_engine.py seed: get_static_analysis → build_all_cluster_results (the same clustering call the full run's abstraction agent makes) → explicit StaticAnalysisCache.save after clustering (the engine's own teardown save happens pre-clustering and would recreate the cluster-less pkl).
- New seedbase step in action.yml, gated on committed == 'true' && cache miss, running on a worktree of the base SHA. Fail-open: on any failure it warns, removes partial pkl/sha, and the run behaves exactly as before. No LLM key is read.
- Cache-save gate widened to also save seeded base artifacts, guarded on seed_ok so a failed seed retries next run instead of poisoning the base-SHA cache key.
- run_local.sh mirrors the step; COMMIT_STRATEGY.md's incorrect "cache miss just falls back to a cold LSP pass" claim is corrected.
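The ordering, fail-open behaviour, and widened cache gate listed above can be sketched as below. This is a minimal sketch under stated assumptions: the signatures of get_static_analysis, build_all_cluster_results, and StaticAnalysisCache.save are not shown in this PR, so the steps are injected as plain callables, and `seed_base`, `seed_fail_open`, and `should_save_base_cache` are hypothetical names.

```python
from pathlib import Path
from typing import Any, Callable


def seed_base(
    analyze: Callable[[], Any],
    cluster: Callable[[Any], Any],
    save: Callable[[Any, Any], None],
) -> None:
    """Run the seed in the order described: analyze, cluster, then save."""
    static = analyze()          # LSP indexing, no LLM calls
    clusters = cluster(static)  # clustering must finish before the save
    save(static, clusters)      # explicit save AFTER clustering


def seed_fail_open(seed: Callable[[], None], artifacts: list[Path]) -> bool:
    """Return True on success; on any failure warn, remove partial artifacts, return False."""
    try:
        seed()
        return True
    except Exception as exc:
        print(f"warning: seed failed, continuing without a seeded pkl: {exc}")
        for path in artifacts:
            path.unlink(missing_ok=True)  # tolerate artifacts that were never written
        return False


def should_save_base_cache(committed: bool, seed_ok: bool) -> bool:
    """Widened gate: save non-committed bases as before, plus successfully seeded ones."""
    return (not committed) or seed_ok
```

Returning a boolean from the fail-open wrapper lets the caller feed it straight into the cache gate, so a failed seed is never written under the base-SHA key.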

Verification

- Unit tests (TestSeed): analyze→cluster→save ordering, error propagation, argparse wiring.
- Standalone seed on a test repo: pkl with populated leiden cluster cache + v2\n<base-sha> tag.
- No-LLM simulation of the head run's three fallback gates against the seeded pkl: all pass (cluster snapshot non-empty → the production raise condition no longer fires).
- Full local E2E via run_local.sh with real LLM runs, both paths:
  - no committed baseline: full base + incremental head (regression check) ✅
  - committed baseline: seed runs (no LLM), head runs incrementally: head pkl strategy incremental_seeded, n_changed: 1 for a one-function PR (component IDs stay stitched) ✅

Known limits / follow-ups

- Old-schema committed baselines without metadata (CodeBoarding-wrapper, licensing-aws) still fall back: they fail with BaselineUnavailableError before the pkl is consulted. This needs a separate fix (metadata injection or baseline regeneration).
- A PR touching every clustered file still prunes the cluster baseline to empty → engine refuses incremental → full fallback (rare, correct).
- cb_engine.py imports two engine-internal symbols. This is stable while engine_ref is pinned at v0.12.0; a public seed_baseline engine API should replace it at the next engine bump.
- First run per base SHA pays a CPU-only LSP pass; later runs hit the actions/cache.
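The prune-to-empty limit above can be illustrated with a toy model. The data shape here is an assumption: the PR does not show how the engine stores or prunes the cluster baseline, so the cluster-id to file-set mapping and both function names are hypothetical.

```python
def prune_baseline(baseline: dict[str, set[str]], changed: set[str]) -> dict[str, set[str]]:
    """Drop changed files from each cluster and discard clusters left empty."""
    pruned = {cid: files - changed for cid, files in baseline.items()}
    return {cid: files for cid, files in pruned.items() if files}


def choose_strategy(baseline: dict[str, set[str]], changed: set[str]) -> str:
    """Incremental needs a non-empty baseline after pruning; otherwise run in full."""
    return "incremental" if prune_baseline(baseline, changed) else "full"
```

With a baseline of two clusters over `a.py`, `b.py`, and `c.py`, changing only `a.py` leaves a non-empty baseline, while changing all three files empties it and selects the full run.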

…s pkl When the base analysis comes from a committed .codeboarding/analysis.json, the head run gets no static_analysis.pkl — and the engine's incremental path hard-requires the pkl's cluster baseline, so every head analysis fell back to a full (LLM) run. 'seed' builds that pkl deterministically: LSP indexing + build_all_cluster_results (the same clustering call the full run's abstraction agent makes), then an explicit save AFTER clustering — the engine's own teardown save happens pre-clustering and would recreate the cluster-less pkl this fixes. No LLM key is needed or read.

…ath has no cached one Every observed action run took the committed-baseline fast path, which extracts only analysis.json — so the head run could never warm-start and always fell back to a full LLM analysis (IncrementalCacheMissingError / BaselineUnavailableError in the engine). The new step builds the pkl+sha pair via 'cb_engine.py seed' on a worktree of the base SHA, fail-open: on any failure it removes partial artifacts, warns, and the run behaves exactly as before. Also widen the base-artifact cache save to cover the seeded case (it was gated on committed == 'false', which starved the cache forever for repos with committed baselines), gated on seed_ok so a failed seed is retried on the next run instead of being cached away.

…e path Keeps the local harness faithful to action.yml (its stated purpose) — the committed-baseline branch had the same bug: no pkl, so the head run always fell back to a full analysis.

…behavior The doc claimed a pkl cache miss 'just falls back to a cold LSP pass' with the PR 'still diffing correctly' — the assumption the incremental-fallback bug hid behind. The engine refuses cluster-driven incremental without the pkl's cluster baseline; document the seed step that now guarantees it.

brovatten · 2026-06-10T06:49:10Z

/codeboarding

github-actions · 2026-06-10T06:49:23Z

Architecture review · analyzing…

⏳ CodeBoarding is analyzing the architecture changes in this PR. This usually takes a few minutes.

codeboarding-action · run 27258611232

github-actions · 2026-06-10T06:50:01Z

Architecture review · 3 components changed

graph LR
    n_Remote_Job_Orchestrator["Remote Job Orchestrator"]
    del_Environment_Context_Resolver["Environment #amp; Context Resolver"]
    del_Artifact_Result_Processor["Artifact #amp; Result Processor"]
    n_Remote_Job_Orchestrator -- "manages job configuration and polls for status…" --> n_Remote_Job_Orchestrator
    del_Environment_Context_Resolver -- "passes resolved repository URL and branch to in…" --> n_Remote_Job_Orchestrator
    n_Remote_Job_Orchestrator -- "hands off JSON response containing full job res…" --> del_Artifact_Result_Processor
    del_Artifact_Result_Processor -- "writes extracted documentation data to runner f…" --> del_Artifact_Result_Processor
    classDef added fill:#1f883d,stroke:#0b5d23,color:#ffffff;
    classDef modified fill:#bf8700,stroke:#7d4e00,color:#ffffff;
    classDef deleted fill:#cf222e,stroke:#82071e,color:#ffffff,stroke-dasharray:5 3;
    class n_Remote_Job_Orchestrator modified;
    class del_Environment_Context_Resolver,del_Artifact_Result_Processor deleted;
    linkStyle 1,2,3 stroke:#82071e,stroke-width:2px,stroke-dasharray:5 3;

Colors indicate component changes compared to main: 🟩 Added · 🟨 Modified · 🟥 Removed

See this architecture in your editor: Open in VS Code →

codeboarding-action · run 27258638860

brovatten added 4 commits June 10, 2026 08:42

feat: mirror the pkl seeding step in run_local.sh's committed-baselin…

0705dae

brovatten force-pushed the feat/seed-base-pkl branch from c42940e to 880710d June 10, 2026 06:43

brovatten closed this Jun 10, 2026

brovatten reopened this Jun 10, 2026

Validate committed base analysis SHA

78c1378

brovatten merged commit 2164398 into main Jun 10, 2026
2 checks passed

github-actions Bot mentioned this pull request Jun 10, 2026

chore(main): release 1.2.0 #17

Merged

brovatten merged 5 commits into main from feat/seed-base-pkl
