Add deterministic changelog-draft scripts (QUALITY-944)#13282
Add deterministic changelog-draft scripts (QUALITY-944)#13282warp-dev-github-integration[bot] wants to merge 2 commits into
Conversation
Add three new Python scripts (stdlib only) and tests to replace free-form agent steps in the changelog-draft skill: - resolve_release_range.py (Step 1): deterministically resolves the previous release cut given a release tag and channel. Validates tag format, channel match, and head tag existence; picks the greatest prior _00 cut with a different date-time prefix. - classify_pr.py (Step 6): two-pass deterministic classifier. Pass 1 applies mechanical exclusion rules (bot authors, CI/test/docs/tooling- only files, channel feature-flag gates) and emits candidates for agent subjective classification. Pass 2 validates and merges agent answers, enforcing that mechanical excludes always win over agent choices. - assemble_changelog.py (Steps 7+8 merged): assembles changelog-draft.md and changelog-draft.json from all intermediate artifacts. Enforces the accounting invariant (entries + skipped + needs_review == total PRs) and exits non-zero on duplicate PR numbers or missing buckets. Also: - Update changelog-draft/SKILL.md to call the three new scripts at Steps 1, 6, and 7+8; keep Step 9 convert_to_release_json.py unchanged. - Reframe classify-changelog-pr/SKILL.md as subjective-candidate guidance only; remove bot/CI/flag rules that are now script-enforced. - Add tests/: 68 stdlib unittest tests covering all spec validation criteria (temp-git fixtures, real-tag smoke test, rule coverage, enforcement test, subjective merge test, golden-file assembly tests, converter compatibility, py_compile). Co-Authored-By: Oz <oz-agent@warp.dev>
There was a problem hiding this comment.
Overview
This PR adds three deterministic Python scripts to replace free-form agent steps in the changelog-draft skill, along with 68 stdlib unittest tests and updated SKILL.md files. The overall design is solid, all tests pass, and CI is green.
All acceptance criteria from the spec are met:
resolve_release_range.pycorrectly encapsulates the Step 1 git-tag logicclassify_pr.pytwo-pass design enforces mechanical excludes over agent answersassemble_changelog.pycorrectly merges Steps 7+8 with accounting invariantclassify-changelog-pr/SKILL.mdproperly reframed as subjective-candidate guidanceconvert_to_release_json.pycompatibility verified by tests
Two important code quality issues to fix before merge, plus a few nits.
Concerns
IMPORTANT — Dead candidate dict in classify_one_pr: The function builds a candidate dict (lines 324–338 of classify_pr.py) but always returns None. The variable is computed and immediately discarded. Both run_pass1 and run_pass2 then re-derive the same candidate data from scratch, duplicating the flag-detection and annotation logic. This is confusing because the function signature (dict | None) and docstring suggest it could return the candidate, but it never does. If a new candidate field is added to the classify_one_pr block, callers won't pick it up automatically. The fix is to either: (a) return the candidate dict when it needs agent judgment (changing callers to use the return value), or (b) drop the dead assignment and keep all candidate-building in run_pass1/run_pass2.
IMPORTANT — Community section synthesizes PR URLs from pr_number: The SKILL.md constraint says "If url is empty, omit the PR link from user-facing markdown rather than synthesizing one." The community contributor section in _build_markdown (around line 406 of assemble_changelog.py) hardcodes https://github.com/warpdotdev/warp/pull/{pn} from the PR number instead of using the url already stored on each entry. Since ext_contributors only stores PR numbers (not URLs), the URL must be reconstructed. In practice this always produces the correct URL for public warp PRs, but it violates the spec contract and would break if a contributor's PR URL were empty or non-standard. Fix: store (pr_number, url) pairs in ext_contributors so the community section can use the stored URL (and omit the link when empty).
Found: 0 critical, 2 important, 1 suggestion, 2 nits
Approve with nits — The implementation is functionally correct and the spec criteria are met. The IMPORTANT issues are code quality concerns that should be addressed before merge but don't affect the current correctness of the skill workflow.
| candidate: dict = { | ||
| "pr_number": pr_number, | ||
| "title": pr.get("title", ""), | ||
| "url": pr.get("url", ""), | ||
| "author": author, | ||
| "changed_files": changed_files, | ||
| "source_repo": pr.get("source_repo", ""), | ||
| } | ||
| if detected_flags: | ||
| candidate["detected_feature_flags"] = detected_flags | ||
| if uncertain_flag_touch: | ||
| candidate["uncertain_flag_touch"] = True | ||
| if author in unknown_bucket: | ||
| candidate["unknown_contributor"] = True | ||
|
|
||
| return None # signal: this PR needs agent judgment |
There was a problem hiding this comment.
candidate dict — computed but never returned or used.
The function builds candidate (and annotates it with flags/unknown-contributor info), then returns None. Both run_pass1 and run_pass2 re-derive the same data from scratch by re-running detect_feature_flags and re-checking the same conditions. This creates a maintenance hazard: if a new annotation field is added here, callers won't pick it up.
Two clean options:
- Return the candidate: change the return type to
dict | dict[str, object]and return the candidate; callers use it directly instead of rebuilding. - Drop the dead block: remove lines 324–343 from this function and keep all candidate-building solely in the callers.
Option 2 is simpler — the function's single responsibility is detecting mechanical excludes.
| for author, pr_numbers in sorted(ext_contributors.items()): | ||
| pr_links = ", ".join( | ||
| f"[#{pn}](https://github.com/warpdotdev/warp/pull/{pn})" | ||
| if pr_numbers | ||
| else "" | ||
| for pn in sorted(pr_numbers) | ||
| ) | ||
| lines.append(f"- {github_profile_link(author)} — {pr_links} ✨") |
There was a problem hiding this comment.
pr_number instead of using the stored url.
The SKILL.md contract ("If url is empty, omit the PR link … rather than synthesizing one") applies here too. The entries list already carries a url field; but ext_contributors only stores pr_number, so the URL must be reconstructed.
Suggested fix: store (pr_number, url) pairs in ext_contributors and use the stored URL when present (omitting the link if it's empty):
| for author, pr_numbers in sorted(ext_contributors.items()): | |
| pr_links = ", ".join( | |
| f"[#{pn}](https://github.com/warpdotdev/warp/pull/{pn})" | |
| if pr_numbers | |
| else "" | |
| for pn in sorted(pr_numbers) | |
| ) | |
| lines.append(f"- {github_profile_link(author)} — {pr_links} ✨") | |
| for author, pr_entries in sorted(ext_contributors.items()): | |
| pr_links = ", ".join( | |
| f"[#{pn}]({url})" if url else f"#{pn}" | |
| for pn, url in sorted(pr_entries) | |
| ) |
This also requires changing how ext_contributors is populated:
ext_contributors.setdefault(a, [])
if pn not in (e[0] for e in ext_contributors[a]):
ext_contributors[a].append((pn, entry.get("url") or ""))| lines.append("") | ||
|
|
||
| if has_reporters: |
There was a problem hiding this comment.
🧹 [NIT] if pr_numbers is always True inside this generator — it's the outer list variable, which is guaranteed non-empty since we're iterating over sorted(pr_numbers). The else "" branch is dead code.
| lines.append("") | |
| if has_reporters: | |
| f"[#{pn}](https://github.com/warpdotdev/warp/pull/{pn})" | |
| for pn in sorted(pr_numbers) |
| print( | ||
| f"assemble_changelog: {n_entries} entries, {n_skipped} skipped, " | ||
| f"{n_review} needs_review ({total} total PRs)" | ||
| ) | ||
| print(f" wrote: {md_path}") | ||
| print(f" wrote: {json_path}") |
There was a problem hiding this comment.
💡 [SUGGESTION] Summary goes to sys.stdout but classify_pr.py routes its summary to sys.stderr when --output is used (keeping stdout clean for data). Since assemble_changelog.py always writes to files, the summary is a progress message and should go to sys.stderr for consistency.
| print( | |
| f"assemble_changelog: {n_entries} entries, {n_skipped} skipped, " | |
| f"{n_review} needs_review ({total} total PRs)" | |
| ) | |
| print(f" wrote: {md_path}") | |
| print(f" wrote: {json_path}") | |
| n_entries = len(draft_dict.get("entries", [])) | |
| n_skipped = len(draft_dict.get("skipped", [])) | |
| n_review = len(draft_dict.get("needs_review", [])) | |
| total = n_entries + n_skipped + n_review | |
| print( | |
| f"assemble_changelog: {n_entries} entries, {n_skipped} skipped, " | |
| f"{n_review} needs_review ({total} total PRs)", | |
| file=sys.stderr, | |
| ) | |
| print(f" wrote: {md_path}", file=sys.stderr) | |
| print(f" wrote: {json_path}", file=sys.stderr) |
| def _run(args: list[str], repo_dir: str | None = None) -> subprocess.CompletedProcess: | ||
| """Run the resolve_release_range.py script with given args.""" | ||
| cmd = [sys.executable, _SCRIPT] + args | ||
| return subprocess.run(cmd, capture_output=True, text=True) |
There was a problem hiding this comment.
🧹 [NIT] The repo_dir parameter is accepted but never used — all callers embed --repo-dir inside the args list instead. Either remove the unused parameter or actually pass it through as cwd to subprocess.run.
…ITY-944) classify_pr.py: - classify_one_pr now returns the candidate dict (source='candidate') instead of None, so the value it built is actually used - run_pass1 and run_pass2 now use det.get('source') == 'deterministic' to distinguish excludes from candidates, removing duplicate candidate-building code that both callers had assemble_changelog.py: - ext_contributors now stores {pr_number, url} dicts instead of bare ints - Community > Contributors section uses the stored url field; omits the hyperlink (shows plain #N) when url is empty, per the spec's 'no synthesized PR URLs' constraint Tests: - TestClassifyOnePrReturnValue: verify classify_one_pr returns a non-None dict with source='candidate' and expected fields for agent-required PRs - TestCommunityContributorUrls: verify stored URL is used in Contributors section, and that no synthesized URL appears when url is empty Co-Authored-By: Oz <oz-agent@warp.dev>
There was a problem hiding this comment.
Overview
This PR adds three deterministic Python scripts (resolve_release_range.py, classify_pr.py, assemble_changelog.py) plus a comprehensive test suite (72 tests) to the changelog-draft skill, replacing the ad-hoc in-skill logic with reproducible, independently testable steps.
Prior Issues — Confirmed Resolved
Issue 1 — classify_one_pr returning None instead of the candidate dict: Fixed. classify_one_pr now builds the candidate dict and returns it with source='candidate' (lines 329–345 of classify_pr.py). The duplicated dict-building code in the callers (run_pass1, run_pass2) has been removed; both now branch on det.get("source") == "deterministic" and route accordingly. Regression test TestClassifyOnePrReturnValue.test_candidate_pr_returns_dict_not_none directly validates this fix.
Issue 2 — Community section synthesizing PR URLs from pr_number: Fixed. The Community > Contributors section now uses e['url'] from the stored entry and falls back to plain #N (no hyperlink) when the URL is empty (lines 406–409 of assemble_changelog.py). Regression tests TestCommunityContributorUrls.test_community_section_uses_stored_url and test_community_section_omits_hyperlink_when_url_empty confirm both cases.
Test Results
72/72 tests pass.
Full Rubric Review
Correctness: ✅ Both prior bugs are fixed. run_pass2 correctly re-runs deterministic checks and reads flag annotations from the returned candidate dict. The seen dict in assemble correctly enforces the one-bucket invariant.
Standards/style: ✅ Clean module-level constants, good docstrings, consistent helper factoring.
Complexity/clarity: ✅ Two-pass design is well-structured and documented.
Naming: ✅ Function and variable names are consistent and descriptive.
Comments/docs: Minor only — the classify_pr.py module-level docstring summary schema shows included and needs_review fields (pass 2 only) and omits skipped_explicit_entries (present in both passes). No functional impact.
Tests: ✅ Regression tests for both prior issues are correctly isolated and clearly assertive. _make_pr, _pass1, _pass2, _assemble helpers reduce duplication. TestPySyntax.test_py_compile_all_scripts provides a fast syntax gate on all scripts.
Security: ✅ subprocess.run calls use list arguments (no shell=True) and handle non-zero exits explicitly.
Minor Nit (non-blocking)
The classify_pr.py module-level docstring summary schema (lines 64–70) includes included and needs_review (pass 2 fields) but omits skipped_explicit_entries (present in both passes). The runtime behavior is correct; this is a documentation-only inconsistency. Consider updating the docstring to reflect the actual per-pass schema.
Concerns
- Found: 0 critical, 0 important, 1 nit
Verdict
Both prior blocking issues are confirmed resolved with regression tests. All 72 tests pass. The implementation matches the approved spec.
Approve
Summary
Add three deterministic Python scripts (stdlib only) to the
changelog-draftskill to replace free-form agent steps, making changelog generation faster and more reliable.Changes
New scripts:
resolve_release_range.py(Step 1): determines the previous release cut given a release tag + channelclassify_pr.py(Step 6): two-pass deterministic classifier with mechanical exclusion rules (bots, CI/test/docs/tooling files, channel flag gates); enforces that mechanical excludes win over agent answersassemble_changelog.py(Steps 7+8): assembleschangelog-draft.md+changelog-draft.json; enforces accounting invariantUpdated SKILL.md files:
changelog-draft/SKILL.md: Steps 1/6/7+8 call the new scripts; Step 9 unchangedclassify-changelog-pr/SKILL.md: reframed as subjective-candidate guidance onlyNew tests (72 stdlib unittest tests):
test_resolve_release_range.py: temp-git fixtures + real-tag smoke testtest_classify_pr.py: all rule scenarios + enforcement test + subjective merge testtest_assemble_changelog.py: golden-file tests + accounting invariant + converter compatibility + py_compileReview rework (cycle 1)
Addressed two issues raised in review:
1. Dead
candidatedict inclassify_one_pr(classify_pr.py):classify_one_prbuilt a candidate dict then returnedNone, discarding it — while both callers (run_pass1andrun_pass2) duplicated the same dict-building logic. Fixed by returningcandidate(taggedsource='candidate') instead ofNone; callers now distinguish deterministic excludes vs. candidates viadet.get('source') == 'deterministic', removing all duplicated code.2. Community section synthesized PR URLs (
assemble_changelog.py):The Contributors sub-section synthesized URLs as
https://github.com/warpdotdev/warp/pull/{pn}from the PR number, violating the spec's "no synthesized PR URLs" constraint. Fixed by storing{pr_number, url}dicts inext_contributorsand using the storedurlfield; whenurlis empty the PR reference appears as plain#Nwith no hyperlink.Regression tests added for both fixes.
Verification
python3 -m unittest discover -s .agents/skills/changelog-draft/tests./script/presubmitskipped: changes are purely Python scripts and SKILL.md files (no Rust source modified); narrower checks above cover the full scope.Conversation: https://staging.warp.dev/conversation/ba0d7066-ac2f-44ce-a015-c4200ce07323