feat: seed the base static-analysis pkl so PR head analysis runs incrementally #18
Merged
…s pkl
When the base analysis comes from a committed .codeboarding/analysis.json, the head run gets no static_analysis.pkl, and the engine's incremental path hard-requires the pkl's cluster baseline, so every head analysis fell back to a full (LLM) run. 'seed' builds that pkl deterministically: LSP indexing + build_all_cluster_results (the same clustering call the full run's abstraction agent makes), then an explicit save AFTER clustering; the engine's own teardown save happens pre-clustering and would recreate the cluster-less pkl this fixes. No LLM key is needed or read.

…ath has no cached one
Every observed action run took the committed-baseline fast path, which extracts only analysis.json, so the head run could never warm-start and always fell back to a full LLM analysis (IncrementalCacheMissingError / BaselineUnavailableError in the engine). The new step builds the pkl+sha pair via 'cb_engine.py seed' on a worktree of the base SHA, fail-open: on any failure it removes partial artifacts, warns, and the run behaves exactly as before. Also widen the base-artifact cache save to cover the seeded case (it was gated on committed == 'false', which starved the cache forever for repos with committed baselines), gated on seed_ok so a failed seed is retried on the next run instead of being cached away.

…e path
Keeps the local harness faithful to action.yml (its stated purpose); the committed-baseline branch had the same bug: no pkl, so the head run always fell back to a full analysis.

…behavior
The doc claimed a pkl cache miss 'just falls back to a cold LSP pass' with the PR 'still diffing correctly', which is the assumption the incremental-fallback bug hid behind. The engine refuses cluster-driven incremental without the pkl's cluster baseline; document the seed step that now guarantees it.
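The gating described in the commit messages above can be summarized in a few lines. This is a minimal sketch, not the action's actual implementation: the real gates live in `action.yml` expressions, and the function names and boolean parameters below are hypothetical stand-ins for the `committed`, cache-hit, and `seed_ok` values mentioned in the commits.

```python
def should_seed(committed: bool, cache_hit: bool) -> bool:
    """Seed only when the baseline is committed and no cached pkl was restored."""
    return committed and not cache_hit


def old_should_save_cache(committed: bool) -> bool:
    """Previous gate: save only for non-committed baselines (starves committed repos)."""
    return not committed


def should_save_cache(committed: bool, seed_ok: bool) -> bool:
    """Widened gate: also save after a successful seed.

    A failed seed (seed_ok False) is deliberately not cached, so the next
    run on the same base SHA retries the seed.
    """
    return (not committed) or seed_ok
```

Under these assumptions, a repo with a committed baseline and a successful seed now populates the cache, while a failed seed leaves the cache key empty for the next run.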
c42940e to 880710d
/codeboarding

Architecture review · analyzing… CodeBoarding is analyzing the architecture changes in this PR. This usually takes a few minutes. codeboarding-action · run 27258611232
Architecture review · 3 components changed
graph LR
n_Remote_Job_Orchestrator["Remote Job Orchestrator"]
del_Environment_Context_Resolver["Environment #amp; Context Resolver"]
del_Artifact_Result_Processor["Artifact #amp; Result Processor"]
n_Remote_Job_Orchestrator -- "manages job configuration and polls for status…" --> n_Remote_Job_Orchestrator
del_Environment_Context_Resolver -- "passes resolved repository URL and branch to in…" --> n_Remote_Job_Orchestrator
n_Remote_Job_Orchestrator -- "hands off JSON response containing full job res…" --> del_Artifact_Result_Processor
del_Artifact_Result_Processor -- "writes extracted documentation data to runner f…" --> del_Artifact_Result_Processor
classDef added fill:#1f883d,stroke:#0b5d23,color:#ffffff;
classDef modified fill:#bf8700,stroke:#7d4e00,color:#ffffff;
classDef deleted fill:#cf222e,stroke:#82071e,color:#ffffff,stroke-dasharray:5 3;
class n_Remote_Job_Orchestrator modified;
class del_Environment_Context_Resolver,del_Artifact_Result_Processor deleted;
linkStyle 1,2,3 stroke:#82071e,stroke-width:2px,stroke-dasharray:5 3;
Colors indicate component changes compared to …
codeboarding-action · run 27258638860
Problem
Every action run on a repo with a committed `.codeboarding/analysis.json` falls back to a full LLM analysis on the PR head; incremental never runs. Verified across all recent org runs (e.g. CodeBoarding run 27228524515, the action's own dogfood run 27104012564, vscode 27159035253, DashBoard 27159138136).

Root cause: the engine's incremental path (v0.12.x) hard-requires the base `static_analysis.pkl`, with a populated cluster baseline, next to `analysis.json`. The committed-baseline path extracts only the JSON (`git show BASE_SHA:.codeboarding/analysis.json`), and the base-artifact cache save was gated on `committed == 'false'`, so repos with committed baselines could never obtain a pkl by any path. The head run then cold-LSPs, gets an empty cluster snapshot, raises `IncrementalCacheMissingError`, and `cb_engine.py` falls back to `run_full`. (The error text "pkl loaded but has no cluster baseline" is misleading: that pkl is the one the head run itself just flushed pre-clustering.)

Fix
The cluster baseline is deterministic (LSP indexing + Leiden clustering, no LLM calls), so the action can manufacture it:

- `cb_engine.py seed`: `get_static_analysis` → `build_all_cluster_results` (the same clustering call the full run's abstraction agent makes) → explicit `StaticAnalysisCache.save` after clustering (the engine's own teardown save happens pre-clustering and would recreate the cluster-less pkl).
- New `seedbase` step in `action.yml`, gated on `committed == 'true' && cache miss`, running on a worktree of the base SHA. Fail-open: on any failure it warns, removes partial pkl/sha, and the run behaves exactly as before. No LLM key is read.
- Base-artifact cache save widened to cover the seeded case, gated on `seed_ok` so a failed seed retries next run instead of poisoning the base-SHA cache key.
- `run_local.sh` mirrors the step; `COMMIT_STRATEGY.md`'s incorrect "cache miss just falls back to a cold LSP pass" claim corrected.

Verification
- … (`TestSeed`): analyze→cluster→save ordering, error propagation, argparse wiring.
- … `leiden` cluster cache + `v2\n<base-sha>` tag.
- … `run_local.sh` with real LLM runs, both paths:
  - … `incremental_seeded`, `n_changed: 1` for a one-function PR (component IDs stay stitched) ✅

Known limits / follow-ups
- … `metadata` (CodeBoarding-wrapper, licensing-aws) still fall back: they fail with `BaselineUnavailableError` before the pkl is consulted; separate fix (metadata injection or baseline regeneration).
- `cb_engine.py` imports two engine-internal symbols; stable while `engine_ref` is pinned at v0.12.0. A public `seed_baseline` engine API should replace this at the next engine bump.
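
The ordering and fail-open behavior described under Fix can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `cb_engine.py`: `analyze` and `cluster` are hypothetical callables standing in for `get_static_analysis` and `build_all_cluster_results` (whose real signatures are not shown in this PR), plain `pickle` stands in for `StaticAnalysisCache.save`, and the `v2\n<base-sha>` tag layout is assumed from the verification notes.

```python
import pickle
from pathlib import Path
from typing import Callable


def seed(repo: Path, pkl_path: Path, sha_path: Path, base_sha: str,
         analyze: Callable[[Path], dict],
         cluster: Callable[[dict], dict]) -> None:
    """Index, cluster, then save. The save must come last."""
    analysis = analyze(repo)                  # stand-in for LSP indexing
    analysis["clusters"] = cluster(analysis)  # stand-in for the clustering call
    # Explicit save AFTER clustering: saving earlier would persist a pkl
    # without the cluster baseline the incremental path requires.
    pkl_path.write_bytes(pickle.dumps(analysis))
    sha_path.write_text(f"v2\n{base_sha}")    # assumed tag layout


def seed_fail_open(repo: Path, pkl_path: Path, sha_path: Path, base_sha: str,
                   analyze: Callable[[Path], dict],
                   cluster: Callable[[dict], dict]) -> bool:
    """Return True if seeding succeeded; on any failure warn, clean up, return False."""
    try:
        seed(repo, pkl_path, sha_path, base_sha, analyze, cluster)
        return True
    except Exception as exc:
        print(f"warning: seed failed, continuing without a seeded pkl: {exc}")
        # Remove partial artifacts so a half-written pkl/sha pair is never reused.
        for path in (pkl_path, sha_path):
            path.unlink(missing_ok=True)
        return False
```

The boolean return plays the role of `seed_ok`: a caller would cache the pkl/sha pair only when it is true, so a failed seed is retried on the next run.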