Skip to content

Add deterministic changelog-draft scripts (QUALITY-944)#13282

Draft
warp-dev-github-integration[bot] wants to merge 2 commits into
masterfrom
oz-agent/QUALITY-944-deterministic-changelog-scripts
Draft

Add deterministic changelog-draft scripts (QUALITY-944)#13282
warp-dev-github-integration[bot] wants to merge 2 commits into
masterfrom
oz-agent/QUALITY-944-deterministic-changelog-scripts

Conversation

@warp-dev-github-integration

@warp-dev-github-integration warp-dev-github-integration Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Add three deterministic Python scripts (stdlib only) to the changelog-draft skill to replace free-form agent steps, making changelog generation faster and more reliable.

Changes

New scripts:

  • resolve_release_range.py (Step 1): determines the previous release cut given a release tag + channel
  • classify_pr.py (Step 6): two-pass deterministic classifier with mechanical exclusion rules (bots, CI/test/docs/tooling files, channel flag gates); enforces that mechanical excludes win over agent answers
  • assemble_changelog.py (Steps 7+8): assembles changelog-draft.md + changelog-draft.json; enforces accounting invariant

Updated SKILL.md files:

  • changelog-draft/SKILL.md: Steps 1/6/7+8 call the new scripts; Step 9 unchanged
  • classify-changelog-pr/SKILL.md: reframed as subjective-candidate guidance only

New tests (72 stdlib unittest tests):

  • test_resolve_release_range.py: temp-git fixtures + real-tag smoke test
  • test_classify_pr.py: all rule scenarios + enforcement test + subjective merge test
  • test_assemble_changelog.py: golden-file tests + accounting invariant + converter compatibility + py_compile

Review rework (cycle 1)

Addressed two issues raised in review:

1. Dead candidate dict in classify_one_pr (classify_pr.py):
classify_one_pr built a candidate dict then returned None, discarding it — while both callers (run_pass1 and run_pass2) duplicated the same dict-building logic. Fixed by returning candidate (tagged source='candidate') instead of None; callers now distinguish deterministic excludes vs. candidates via det.get('source') == 'deterministic', removing all duplicated code.

2. Community section synthesized PR URLs (assemble_changelog.py):
The Contributors sub-section synthesized URLs as https://github.com/warpdotdev/warp/pull/{pn} from the PR number, violating the spec's "no synthesized PR URLs" constraint. Fixed by storing {pr_number, url} dicts in ext_contributors and using the stored url field; when url is empty the PR reference appears as plain #N with no hyperlink.

Regression tests added for both fixes.

Verification

  • Unit tests: 72/72 pass — python3 -m unittest discover -s .agents/skills/changelog-draft/tests
  • py_compile: all scripts compile without errors ✓
  • ./script/presubmit skipped: changes are purely Python scripts and SKILL.md files (no Rust source modified); narrower checks above cover the full scope.

Conversation: https://staging.warp.dev/conversation/ba0d7066-ac2f-44ce-a015-c4200ce07323

Add three new Python scripts (stdlib only) and tests to replace free-form
agent steps in the changelog-draft skill:

- resolve_release_range.py (Step 1): deterministically resolves the
  previous release cut given a release tag and channel. Validates tag
  format, channel match, and head tag existence; picks the greatest
  prior _00 cut with a different date-time prefix.

- classify_pr.py (Step 6): two-pass deterministic classifier. Pass 1
  applies mechanical exclusion rules (bot authors, CI/test/docs/tooling-
  only files, channel feature-flag gates) and emits candidates for agent
  subjective classification. Pass 2 validates and merges agent answers,
  enforcing that mechanical excludes always win over agent choices.

- assemble_changelog.py (Steps 7+8 merged): assembles changelog-draft.md
  and changelog-draft.json from all intermediate artifacts. Enforces the
  accounting invariant (entries + skipped + needs_review == total PRs)
  and exits non-zero on duplicate PR numbers or missing buckets.

Also:
- Update changelog-draft/SKILL.md to call the three new scripts at Steps
  1, 6, and 7+8; keep Step 9 convert_to_release_json.py unchanged.
- Reframe classify-changelog-pr/SKILL.md as subjective-candidate guidance
  only; remove bot/CI/flag rules that are now script-enforced.
- Add tests/: 68 stdlib unittest tests covering all spec validation
  criteria (temp-git fixtures, real-tag smoke test, rule coverage,
  enforcement test, subjective merge test, golden-file assembly tests,
  converter compatibility, py_compile).

Co-Authored-By: Oz <oz-agent@warp.dev>
@cla-bot cla-bot Bot added the cla-signed label Jul 1, 2026

@warp-dev-github-integration warp-dev-github-integration Bot left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

This PR adds three deterministic Python scripts to replace free-form agent steps in the changelog-draft skill, along with 68 stdlib unittest tests and updated SKILL.md files. The overall design is solid, all tests pass, and CI is green.

All acceptance criteria from the spec are met:

  • resolve_release_range.py correctly encapsulates the Step 1 git-tag logic
  • classify_pr.py two-pass design enforces mechanical excludes over agent answers
  • assemble_changelog.py correctly merges Steps 7+8 with accounting invariant
  • classify-changelog-pr/SKILL.md properly reframed as subjective-candidate guidance
  • convert_to_release_json.py compatibility verified by tests

Two important code quality issues to fix before merge, plus a few nits.

Concerns

IMPORTANT — Dead candidate dict in classify_one_pr: The function builds a candidate dict (lines 324–338 of classify_pr.py) but always returns None. The variable is computed and immediately discarded. Both run_pass1 and run_pass2 then re-derive the same candidate data from scratch, duplicating the flag-detection and annotation logic. This is confusing because the function signature (dict | None) and docstring suggest it could return the candidate, but it never does. If a new candidate field is added to the classify_one_pr block, callers won't pick it up automatically. The fix is to either: (a) return the candidate dict when it needs agent judgment (changing callers to use the return value), or (b) drop the dead assignment and keep all candidate-building in run_pass1/run_pass2.

IMPORTANT — Community section synthesizes PR URLs from pr_number: The SKILL.md constraint says "If url is empty, omit the PR link from user-facing markdown rather than synthesizing one." The community contributor section in _build_markdown (around line 406 of assemble_changelog.py) hardcodes https://github.com/warpdotdev/warp/pull/{pn} from the PR number instead of using the url already stored on each entry. Since ext_contributors only stores PR numbers (not URLs), the URL must be reconstructed. In practice this always produces the correct URL for public warp PRs, but it violates the spec contract and would break if a contributor's PR URL were empty or non-standard. Fix: store (pr_number, url) pairs in ext_contributors so the community section can use the stored URL (and omit the link when empty).

Found: 0 critical, 2 important, 1 suggestion, 2 nits

Approve with nits — The implementation is functionally correct and the spec criteria are met. The IMPORTANT issues are code quality concerns that should be addressed before merge but don't affect the current correctness of the skill workflow.

Comment on lines +324 to +339
candidate: dict = {
"pr_number": pr_number,
"title": pr.get("title", ""),
"url": pr.get("url", ""),
"author": author,
"changed_files": changed_files,
"source_repo": pr.get("source_repo", ""),
}
if detected_flags:
candidate["detected_feature_flags"] = detected_flags
if uncertain_flag_touch:
candidate["uncertain_flag_touch"] = True
if author in unknown_bucket:
candidate["unknown_contributor"] = True

return None # signal: this PR needs agent judgment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ [IMPORTANT] Dead candidate dict — computed but never returned or used.

The function builds candidate (and annotates it with flags/unknown-contributor info), then returns None. Both run_pass1 and run_pass2 re-derive the same data from scratch by re-running detect_feature_flags and re-checking the same conditions. This creates a maintenance hazard: if a new annotation field is added here, callers won't pick it up.

Two clean options:

  1. Return the candidate: change the return type to dict | dict[str, object] and return the candidate; callers use it directly instead of rebuilding.
  2. Drop the dead block: remove lines 324–343 from this function and keep all candidate-building solely in the callers.

Option 2 is simpler — the function's single responsibility is detecting mechanical excludes.

Comment on lines +404 to +411
for author, pr_numbers in sorted(ext_contributors.items()):
pr_links = ", ".join(
f"[#{pn}](https://github.com/warpdotdev/warp/pull/{pn})"
if pr_numbers
else ""
for pn in sorted(pr_numbers)
)
lines.append(f"- {github_profile_link(author)} — {pr_links} ✨")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ [IMPORTANT] Community section synthesizes PR URLs from pr_number instead of using the stored url.

The SKILL.md contract ("If url is empty, omit the PR link … rather than synthesizing one") applies here too. The entries list already carries a url field; but ext_contributors only stores pr_number, so the URL must be reconstructed.

Suggested fix: store (pr_number, url) pairs in ext_contributors and use the stored URL when present (omitting the link if it's empty):

Suggested change
for author, pr_numbers in sorted(ext_contributors.items()):
pr_links = ", ".join(
f"[#{pn}](https://github.com/warpdotdev/warp/pull/{pn})"
if pr_numbers
else ""
for pn in sorted(pr_numbers)
)
lines.append(f"- {github_profile_link(author)}{pr_links} ✨")
for author, pr_entries in sorted(ext_contributors.items()):
pr_links = ", ".join(
f"[#{pn}]({url})" if url else f"#{pn}"
for pn, url in sorted(pr_entries)
)

This also requires changing how ext_contributors is populated:

ext_contributors.setdefault(a, [])
if pn not in (e[0] for e in ext_contributors[a]):
    ext_contributors[a].append((pn, entry.get("url") or ""))

Comment on lines +412 to +414
lines.append("")

if has_reporters:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 [NIT] if pr_numbers is always True inside this generator — it's the outer list variable, which is guaranteed non-empty since we're iterating over sorted(pr_numbers). The else "" branch is dead code.

Suggested change
lines.append("")
if has_reporters:
f"[#{pn}](https://github.com/warpdotdev/warp/pull/{pn})"
for pn in sorted(pr_numbers)

Comment on lines +558 to +563
print(
f"assemble_changelog: {n_entries} entries, {n_skipped} skipped, "
f"{n_review} needs_review ({total} total PRs)"
)
print(f" wrote: {md_path}")
print(f" wrote: {json_path}")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 [SUGGESTION] Summary goes to sys.stdout but classify_pr.py routes its summary to sys.stderr when --output is used (keeping stdout clean for data). Since assemble_changelog.py always writes to files, the summary is a progress message and should go to sys.stderr for consistency.

Suggested change
print(
f"assemble_changelog: {n_entries} entries, {n_skipped} skipped, "
f"{n_review} needs_review ({total} total PRs)"
)
print(f" wrote: {md_path}")
print(f" wrote: {json_path}")
n_entries = len(draft_dict.get("entries", []))
n_skipped = len(draft_dict.get("skipped", []))
n_review = len(draft_dict.get("needs_review", []))
total = n_entries + n_skipped + n_review
print(
f"assemble_changelog: {n_entries} entries, {n_skipped} skipped, "
f"{n_review} needs_review ({total} total PRs)",
file=sys.stderr,
)
print(f" wrote: {md_path}", file=sys.stderr)
print(f" wrote: {json_path}", file=sys.stderr)

Comment on lines +22 to +25
def _run(args: list[str], repo_dir: str | None = None) -> subprocess.CompletedProcess:
"""Run the resolve_release_range.py script with given args."""
cmd = [sys.executable, _SCRIPT] + args
return subprocess.run(cmd, capture_output=True, text=True)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 [NIT] The repo_dir parameter is accepted but never used — all callers embed --repo-dir inside the args list instead. Either remove the unused parameter or actually pass it through as cwd to subprocess.run.

…ITY-944)

classify_pr.py:
- classify_one_pr now returns the candidate dict (source='candidate') instead
  of None, so the value it built is actually used
- run_pass1 and run_pass2 now use det.get('source') == 'deterministic' to
  distinguish excludes from candidates, removing duplicate candidate-building
  code that both callers had

assemble_changelog.py:
- ext_contributors now stores {pr_number, url} dicts instead of bare ints
- Community > Contributors section uses the stored url field; omits the
  hyperlink (shows plain #N) when url is empty, per the spec's
  'no synthesized PR URLs' constraint

Tests:
- TestClassifyOnePrReturnValue: verify classify_one_pr returns a non-None dict
  with source='candidate' and expected fields for agent-required PRs
- TestCommunityContributorUrls: verify stored URL is used in Contributors
  section, and that no synthesized URL appears when url is empty

Co-Authored-By: Oz <oz-agent@warp.dev>

@warp-dev-github-integration warp-dev-github-integration Bot left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

This PR adds three deterministic Python scripts (resolve_release_range.py, classify_pr.py, assemble_changelog.py) plus a comprehensive test suite (72 tests) to the changelog-draft skill, replacing the ad-hoc in-skill logic with reproducible, independently testable steps.

Prior Issues — Confirmed Resolved

Issue 1 — classify_one_pr returning None instead of the candidate dict: Fixed. classify_one_pr now builds the candidate dict and returns it with source='candidate' (lines 329–345 of classify_pr.py). The duplicated dict-building code in the callers (run_pass1, run_pass2) has been removed; both now branch on det.get("source") == "deterministic" and route accordingly. Regression test TestClassifyOnePrReturnValue.test_candidate_pr_returns_dict_not_none directly validates this fix.

Issue 2 — Community section synthesizing PR URLs from pr_number: Fixed. The Community > Contributors section now uses e['url'] from the stored entry and falls back to plain #N (no hyperlink) when the URL is empty (lines 406–409 of assemble_changelog.py). Regression tests TestCommunityContributorUrls.test_community_section_uses_stored_url and test_community_section_omits_hyperlink_when_url_empty confirm both cases.

Test Results

72/72 tests pass.

Full Rubric Review

Correctness: ✅ Both prior bugs are fixed. run_pass2 correctly re-runs deterministic checks and reads flag annotations from the returned candidate dict. The seen dict in assemble correctly enforces the one-bucket invariant.

Standards/style: ✅ Clean module-level constants, good docstrings, consistent helper factoring.

Complexity/clarity: ✅ Two-pass design is well-structured and documented.

Naming: ✅ Function and variable names are consistent and descriptive.

Comments/docs: Minor only — the classify_pr.py module-level docstring summary schema shows included and needs_review fields (pass 2 only) and omits skipped_explicit_entries (present in both passes). No functional impact.

Tests: ✅ Regression tests for both prior issues are correctly isolated and clearly assertive. _make_pr, _pass1, _pass2, _assemble helpers reduce duplication. TestPySyntax.test_py_compile_all_scripts provides a fast syntax gate on all scripts.

Security:subprocess.run calls use list arguments (no shell=True) and handle non-zero exits explicitly.

Minor Nit (non-blocking)

The classify_pr.py module-level docstring summary schema (lines 64–70) includes included and needs_review (pass 2 fields) but omits skipped_explicit_entries (present in both passes). The runtime behavior is correct; this is a documentation-only inconsistency. Consider updating the docstring to reflect the actual per-pass schema.

Concerns

  • Found: 0 critical, 0 important, 1 nit

Verdict

Both prior blocking issues are confirmed resolved with regression tests. All 72 tests pass. The implementation matches the approved spec.

Approve

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant