Bump AI PR review to gpt-5.5 + add Single-Pass Completeness Mandate #404
Conversation
Updates the OpenAI Codex model from gpt-5.4 to gpt-5.5 across both the CI workflow and the local review script (PRICING entries and reasoning-model detection extended to cover the gpt-5.x family; sketched below). Adds a Single-Pass Completeness Mandate to the CI prompt that instructs the reviewer to enumerate sibling-surface mirrors (BR/DR, schema/render, default/precomputed, summary/full), run pattern-wide greps, check reciprocal/symmetry directions, and sweep transitive workflow deps in the initial pass rather than across multiple /ai-review rounds. Re-review Scope is amended to apply the same audits to newly added code while preserving the existing [Newly identified] convention for unchanged code.

Behavior change: _is_reasoning_model() now correctly classifies gpt-5.4 and gpt-5.5 as reasoning models (latent bug fix per OpenAI docs). For local-script users still passing --model gpt-5.4: max_output_tokens doubles to 32768, temperature=0 is omitted from the API payload, and the "consider --timeout 900" advisory newly fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
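A minimal sketch of how the pricing table and reasoning-model check could cover the gpt-5.x family. `PRICING` and `_is_reasoning_model()` are named in this PR; the prefix list, tuple layout, and the `_pricing_for()` helper are illustrative assumptions, with rates taken from figures quoted later in this thread:

```python
# Sketch only: PRICING and _is_reasoning_model are real names from this PR;
# the layout, prefix list, and _pricing_for helper are assumptions.

# (input, output) USD per 1M tokens, per the rates quoted in this thread.
PRICING = {
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.5": (5.00, 30.00),
    "gpt-5.5-pro": (30.00, 180.00),
}

_REASONING_PREFIXES = ("o1", "o3", "gpt-5")  # hypothetical prefix list

def _is_reasoning_model(model: str) -> bool:
    """True for models that run a reasoning pass (incl. the gpt-5.x family)."""
    return model.startswith(_REASONING_PREFIXES)

def _pricing_for(model: str) -> tuple[float, float] | None:
    """Longest-prefix match, so a dated snapshot such as
    gpt-5.5-pro-2026-04-23 resolves to its base model's rate."""
    for base in sorted(PRICING, key=len, reverse=True):
        if model.startswith(base):
            return PRICING[base]
    return None
```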
Two related fixes recognizing that .claude/scripts/openai_review.py runs as a static-prompt API call without shell or file-loading access, unlike CI Codex, which has sandbox tool access.

1. Reasoning-aware default --timeout. Previously the script defaulted to 300s regardless of model, so direct script invocations on the new gpt-5.5 default could time out on reasoning runs (which can take 10-15 min). A new _resolve_timeout() helper picks 900s for reasoning models and 300s for standard models when --timeout is omitted; explicit --timeout values are preserved unchanged (sketched below).
2. Soften the Single-Pass Completeness Mandate for local mode. The CI Mandate instructs the reviewer to run shell greps, load sibling files, and sweep transitive paths. The local raw-API path cannot do any of that, so leaving the CI wording in place caused the model to claim audits it could not perform. A new _SUBSTITUTIONS entry replaces the CI Mandate block with a local-mode note that scopes audits to the loaded prompt content and explicitly forbids claiming shell/file-loading tool runs.

Adds 8 regression tests: TestResolveTimeout (6 cases covering omitted + reasoning, omitted + non-reasoning, explicit-preserved on both paths, zero-preserved, and gpt-5.4 routing) and two additions to TestAdaptReviewCriteria asserting the local prompt strips the CI audit instructions and surfaces the local-mode note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
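A minimal sketch of the described helper, assuming the constant names and signature below; only the 300s/900s defaults and the explicit-value-preserved behavior come from this PR:

```python
DEFAULT_TIMEOUT = 300        # seconds, standard models
REASONING_TIMEOUT = 900      # seconds; reasoning runs can take 10-15 min

def _resolve_timeout(timeout_arg: int | None, model: str) -> int:
    # Explicit --timeout values (including 0) are preserved unchanged.
    if timeout_arg is not None:
        return timeout_arg
    return REASONING_TIMEOUT if _is_reasoning_model(model) else DEFAULT_TIMEOUT
```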
/ai-review
🔁 AI review rerun (requested by @igerber). Head SHA:

Overall Assessment: ✅ Looks good

Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Two P2 polish items from CI review on 9b76cd4:

1. call_openai() still defaulted its signature to timeout=DEFAULT_TIMEOUT, leaving any future direct caller exposed to the legacy 300s default even after the CLI gained model-aware resolution. Route the helper's default through _resolve_timeout() so the model-aware behavior applies uniformly regardless of caller (sketched below).
2. .claude/commands/ai-review-local.md described --timeout as having a fixed 300s default, which is stale now that the script auto-selects 900s for reasoning models. Update the description to reflect the dynamic default and enumerate the reasoning-model prefixes.

Adds 2 regression tests asserting that omitted-timeout call_openai() calls propagate 900s on gpt-5.5 and 300s on gpt-4.1 to urlopen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
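A sketch of that routing under the same assumptions as the earlier helper sketch. The Responses API endpoint follows a later note in this thread; the exact signature, payload, and header handling are illustrative:

```python
import json
import os
import urllib.request

ENDPOINT = "https://api.openai.com/v1/responses"  # Responses API, per this thread

def call_openai(prompt: str, model: str, timeout: int | None = None) -> str:
    # Was `timeout=DEFAULT_TIMEOUT`; resolving here gives direct callers the
    # same model-aware default (900s reasoning / 300s standard) as the CLI.
    timeout = _resolve_timeout(timeout, model)
    payload = {"model": model, "input": prompt}  # simplified payload
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```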
/ai-review
🔁 AI review rerun (requested by @igerber). Head SHA:

Overall Assessment: ✅ Looks good

Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Verification note:
Two P2 polish items from CI review on 8fea196:

1. Skill doc Step 5 reasoning-model handling referred to "the model" without defining it, leaving the wrapper/script contract ambiguous for default local runs. Define effective_model = the parsed --model OR the default gpt-5.5, and note that --model/--timeout/--dry-run pass through. Also reflect that the script's _resolve_timeout() now handles the 900s timeout selection internally, making explicit --timeout 900 pass-through optional.
2. Pricing tests previously only asserted gpt-5.5 != gpt-5.4 and gpt-5.5-pro != gpt-5.5, which would silently accept incorrect distinct rates. Add exact-tuple assertions matching the OpenAI published rates ($5/$30 for gpt-5.5, $30/$180 for gpt-5.5-pro) and a snapshot-prefix test confirming gpt-5.5-pro-2026-04-23 resolves to the same rate as gpt-5.5-pro (sketched below).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
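A minimal sketch of what those tightened assertions could look like, reusing the hypothetical `PRICING`/`_pricing_for` shapes from the earlier sketch; the class and method names here are illustrative, not the PR's:

```python
import unittest

class TestPricingRates(unittest.TestCase):
    def test_gpt55_exact_rate(self):
        # Exact tuple, not just != gpt-5.4: (input, output) USD per 1M tokens.
        self.assertEqual(PRICING["gpt-5.5"], (5.00, 30.00))

    def test_gpt55_pro_exact_rate(self):
        self.assertEqual(PRICING["gpt-5.5-pro"], (30.00, 180.00))

    def test_snapshot_prefix_resolves_to_base(self):
        # A dated snapshot must price identically to its base model.
        self.assertEqual(_pricing_for("gpt-5.5-pro-2026-04-23"),
                         PRICING["gpt-5.5-pro"])
```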
/ai-review
🔁 AI review rerun (requested by @igerber). Head SHA:

Overall Assessment: ✅ Looks good

Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Verification note:
The data-transmission note in .claude/commands/ai-review-local.md said the script POSTs to OpenAI's Chat Completions API. The script has used the Responses API (ENDPOINT = .../v1/responses) for some time; the note was a leftover from before that migration. Adds a TestSkillDocAPIConsistency regression test that grep-asserts the skill doc never says "Chat Completions API" again (sketched below).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
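A minimal sketch of such a grep-assert test. TestSkillDocAPIConsistency is the PR's class name; the method name and path handling are assumptions:

```python
import pathlib
import unittest

SKILL_DOC = pathlib.Path(".claude/commands/ai-review-local.md")

class TestSkillDocAPIConsistency(unittest.TestCase):
    def test_doc_never_says_chat_completions(self):
        # Regression guard: the stale endpoint name must not reappear.
        text = SKILL_DOC.read_text(encoding="utf-8")
        self.assertNotIn("Chat Completions API", text)
```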
/ai-review
🔁 AI review rerun (requested by @igerber). Head SHA:

Overall Assessment: ✅ Looks good
This re-review resolves the prior P3 doc inconsistency, and I did not find any unmitigated P0/P1 issues in the changed surfaces.

Executive Summary
Methodology
No findings. Affected methods: none. This PR does not touch estimator implementations, weighting, variance/SE logic, identification checks, or …

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings. The sibling-surface mirror audit passed across workflow/script/doc/test surfaces, and the local prompt adapter now explicitly strips CI-only audit instructions while tests lock that behavior in: …

Tech Debt
No findings.

Security
No findings. The PR does not widen data exposure, and the user-facing transmission note now matches the actual Responses API implementation in …

Documentation/Tests
No findings. The previous P3 is resolved, and the added tests cover GPT-5.5 pricing, GPT-5.4 / GPT-5.5 reasoning classification, timeout resolution, and the local prompt's no-shell-access replacement block: …
Summary
- Updated the OpenAI Codex model from `gpt-5.4` to `gpt-5.5` in both the CI workflow (`.github/workflows/ai_pr_review.yml`) and the local review script (`.claude/scripts/openai_review.py`), with `PRICING` entries added for `gpt-5.5` / `gpt-5.5-pro`.
- Added a Single-Pass Completeness Mandate to `.github/codex/prompts/pr_review.md` instructing the reviewer to enumerate sibling-surface mirrors (BR/DR, schema/render, default/precomputed, summary/full), run pattern-wide greps, check reciprocal/symmetry directions, and sweep transitive workflow deps in the initial pass, addressing the round-by-round finding-drip pattern observed on prior PRs.
- Amended Re-review Scope to apply the same audits to newly added code while preserving the existing `[Newly identified]` convention for unchanged code.
- Updated `.claude/commands/ai-review-local.md` to keep parity (default model, reasoning-model heuristic, examples).

Behavior change to be aware of
`_is_reasoning_model()` now correctly classifies `gpt-5.4` and `gpt-5.5` as reasoning models (latent bug fix per OpenAI docs). For local-script users still passing `--model gpt-5.4` (payload split sketched below):

- `max_output_tokens` doubles to 32768 (was 16384)
- `temperature=0` is omitted from the API payload
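A minimal sketch of that payload split, assuming a builder of this shape; the 16384/32768 limits and the temperature omission come from this thread, while `_build_payload` itself is illustrative:

```python
def _build_payload(prompt: str, model: str) -> dict:
    payload = {"model": model, "input": prompt}
    if _is_reasoning_model(model):
        payload["max_output_tokens"] = 32768  # doubled for reasoning models
        # temperature is omitted entirely, per the behavior change above
    else:
        payload["max_output_tokens"] = 16384
        payload["temperature"] = 0            # deterministic standard runs
    return payload
```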
Methodology references

No `diff_diff/` source files changed.

Validation
- `tests/test_openai_review.py`: flipped `test_gpt54_is_not_reasoning` → `test_gpt54_is_reasoning`; added `test_gpt55_is_reasoning` / `test_gpt55_pro_is_reasoning` / `test_gpt55_has_own_pricing` / `test_gpt55_pro_has_own_pricing`; switched `test_standard_model_payload` to `gpt-4.1` and `test_reasoning_model_payload` to `gpt-5.5`. All 156 tests pass.
- `--dry-run` path: banner shows `Model: gpt-5.5`, the cost estimator returns the expected reasoning-tier number ($0.98 max: output = 32768 tokens × $30/1M, checked below), and no substitution-mismatch warning fires (confirms the `_SUBSTITUTIONS` text update tracks the new prompt).
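The dry-run ceiling is simple arithmetic over the rates above; a one-liner to reproduce it:

```python
# Worst-case output cost for the dry-run banner: 32768 max output tokens
# at gpt-5.5's $30 per 1M output tokens.
print(f"${32768 * 30.00 / 1_000_000:.2f}")  # -> $0.98
```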
Cost note

`gpt-5.5` is 2x the per-token cost of `gpt-5.4` ($5/$30 vs $2.50/$15 per 1M). The Mandate's authorized broader audits will add input tokens, informally estimated at +50%-150% on top of the model bump. Quantitative re-evaluation criterion documented in the plan: if >2 of the next 5 PRs surface new P1+ findings on unchanged code in round 2, revert the model bump.

Security / privacy
🤖 Generated with Claude Code