
Bump AI PR review to gpt-5.5 + add Single-Pass Completeness Mandate#404

Merged
igerber merged 5 commits into main from update-codex-review on May 9, 2026

Conversation

igerber (Owner) commented May 9, 2026

Summary

  • Bumps OpenAI Codex model from gpt-5.4 to gpt-5.5 in both the CI workflow (.github/workflows/ai_pr_review.yml) and the local review script (.claude/scripts/openai_review.py), with PRICING entries added for gpt-5.5 / gpt-5.5-pro.
  • Adds a Single-Pass Completeness Mandate to .github/codex/prompts/pr_review.md instructing the reviewer to enumerate sibling-surface mirrors (BR/DR, schema/render, default/precomputed, summary/full), run pattern-wide greps, check reciprocal/symmetry directions, and sweep transitive workflow deps in the initial pass — addressing the round-by-round finding-drip pattern observed on prior PRs.
  • Re-review Scope amended: applies the same audits to newly-added code while preserving the existing [Newly identified] convention for unchanged code.
  • Skill doc .claude/commands/ai-review-local.md updated to keep parity (default model, reasoning-model heuristic, examples).

Behavior change to be aware of

_is_reasoning_model() now correctly classifies gpt-5.4 and gpt-5.5 as reasoning models (a latent bug fix per OpenAI docs). For local-script users still passing --model gpt-5.4, the request changes as follows (payload split sketched after this list):

  • max_output_tokens doubles to 32768 (was 16384)
  • temperature=0 is omitted from the API payload
  • The "consider --timeout 900" advisory now fires

Methodology references

  • N/A — no diff_diff/ source files changed.

Validation

  • Tests added/updated in tests/test_openai_review.py: flipped test_gpt54_is_not_reasoning → test_gpt54_is_reasoning, added test_gpt55_is_reasoning / test_gpt55_pro_is_reasoning / test_gpt55_has_own_pricing / test_gpt55_pro_has_own_pricing, and switched test_standard_model_payload to gpt-4.1 and test_reasoning_model_payload to gpt-5.5. All 156 tests pass.
  • Smoke-tested the script's --dry-run path: the banner shows Model: gpt-5.5, the cost estimator returns the expected reasoning-tier figure ($0.98 max-output cost = 32768 tokens × $30/1M; arithmetic worked below), and no substitution-mismatch warning fires (confirming the _SUBSTITUTIONS text update tracks the new prompt).
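The quoted $0.98 is easy to reproduce; a one-off check under the rates stated in this PR:

```python
# Max-output cost for one reasoning-tier run at the gpt-5.5 output rate.
max_output_tokens = 32768
output_rate_per_1m_tokens = 30.00  # $30 per 1M output tokens (gpt-5.5)
cost = max_output_tokens * output_rate_per_1m_tokens / 1_000_000
print(f"${cost:.2f}")  # -> $0.98
```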

Cost note

gpt-5.5 costs 2x gpt-5.4 per token ($5/$30 vs $2.50/$15 per 1M input/output). The Mandate's authorized broader audits will also add input tokens, informally estimated at +50-150% on top of the model bump. A quantitative re-evaluation criterion is documented in the plan: if more than 2 of the next 5 PRs surface new P1+ findings on unchanged code in round 2, revert the model bump.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

Updates the OpenAI Codex model from gpt-5.4 to gpt-5.5 across both the
CI workflow and the local review script (PRICING entries and reasoning-
model detection extended to cover the gpt-5.x family).

Adds a Single-Pass Completeness Mandate to the CI prompt that instructs
the reviewer to enumerate sibling-surface mirrors (BR/DR, schema/render,
default/precomputed, summary/full), run pattern-wide greps, check
reciprocal/symmetry directions, and sweep transitive workflow deps in
the initial pass rather than across multiple /ai-review rounds. Re-review
Scope amended to apply the same audits to newly-added code while
preserving the existing [Newly identified] convention for unchanged code.

Behavior change: _is_reasoning_model() now correctly classifies gpt-5.4
and gpt-5.5 as reasoning models (latent bug fix per OpenAI docs). For
local-script users still passing --model gpt-5.4: max_output_tokens
doubles to 32768, temperature=0 is omitted from the API payload, and
the "consider --timeout 900" advisory newly fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions Bot commented May 9, 2026

Overall Assessment: ⚠️ Needs changes

Executive Summary

  • No causal estimator, math, weighting, SE/variance, or identification code changed. Methodology registry cross-check found no affected method.
  • The gpt-5.5 model bump and short-context pricing entries are consistent with official OpenAI docs.
  • P1: the new Single-Pass Completeness Mandate is applied to the local raw OpenAI review path, but that path cannot run shell greps or load arbitrary sibling/transitive files.
  • P1: the script default model is now reasoning (gpt-5.5), but direct script execution still defaults to a 300s timeout and only warns after parse.
  • Completeness audits run: sibling-surface mirror, pattern-wide grep, reciprocal reasoning/non-reasoning payload check, and transitive workflow/pytest dependency sweep.

Methodology

No findings.

Severity: P3 informational
Impact: No diff_diff/ estimator files changed, and docs/methodology/REGISTRY.md lists only estimator methodology requirements unrelated to this PR.
Concrete fix: None.

Code Quality

P1: Local review path cannot satisfy the new Single-Pass Mandate

Locations: .github/codex/prompts/pr_review.md:L57-L81, .claude/scripts/openai_review.py:L941-L956, .claude/scripts/openai_review.py:L1458-L1480, .claude/commands/ai-review-local.md:L311-L344

Impact: The prompt now requires reviewers to run sibling-surface audits, pattern-wide greps, reciprocal checks, and transitive dependency sweeps. CI Codex can do that with repo/tool access, but .claude/scripts/openai_review.py sends a static prompt to the Responses API. It only includes the diff, selected registry text, changed diff_diff/**/*.py files, and first-level imports. Local /ai-review-local reviews can therefore be instructed to “confirm” audits they cannot actually perform, creating incomplete or misleading local review output.

Concrete fix: Either remove/soften the Single-Pass “run audits” wording from the locally adapted prompt, or add an explicit local audit-context builder that injects grep outputs and relevant sibling/transitive files into the prompt. Add tests proving a local prompt for workflow/test changes includes pyproject.toml / tests/conftest.py when required, and that local prompts do not claim unavailable tool execution.

Performance

P1: Default reasoning model still uses the old 300s direct-script timeout

Locations: .claude/scripts/openai_review.py:L1100-L1103, .claude/scripts/openai_review.py:L1345-L1350, .claude/scripts/openai_review.py:L1619-L1637, .claude/commands/ai-review-local.md:L25-L30

Impact: DEFAULT_MODEL is now gpt-5.5, and _is_reasoning_model() correctly treats it as reasoning, but direct script users who omit --timeout still call the API with DEFAULT_TIMEOUT = 300. The docs say reasoning reviews may take 10-15 minutes and the command wrapper compensates, but the script’s own default path can time out on normal reviews.

Concrete fix: Track whether --timeout was explicitly provided; if omitted and _is_reasoning_model(args.model) is true, use 900 seconds before calling call_openai(). Add tests for omitted reasoning timeout -> 900, explicit timeout preserved, and non-reasoning default -> 300.
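A sketch of the suggested fix (the follow-up commit below names this helper _resolve_timeout); the constant names and the classifier stub are assumptions, and the argparse wiring is omitted:

```python
DEFAULT_TIMEOUT = 300    # current direct-script default, per the finding
REASONING_TIMEOUT = 900  # hypothetical constant name

def _is_reasoning_model(model: str) -> bool:
    # Stand-in for the script's real classifier (prefix check assumed).
    return model.startswith("gpt-5")

def _resolve_timeout(explicit: int | None, model: str) -> int:
    """Honor an explicit --timeout (including 0); otherwise pick by model class."""
    if explicit is not None:
        return explicit
    return REASONING_TIMEOUT if _is_reasoning_model(model) else DEFAULT_TIMEOUT
```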

Maintainability

No additional findings.

Severity: P3 informational
Impact: Model/default changes are otherwise mirrored across workflow, script constants, local command docs, and tests.
Concrete fix: None beyond the P1 fixes above.

Tech Debt

No findings.

Severity: P3 informational
Impact: No new deferred item was added to TODO.md; the P1 issues above should be fixed rather than deferred because they affect review-tool behavior.
Concrete fix: None.

Security

No findings.

Severity: P3 informational
Impact: No secrets or credential material introduced in the changed files. Existing docs use placeholder-style examples only.
Concrete fix: None.

Documentation/Tests

P2: Tests do not cover the two changed default-contract paths

Locations: tests/test_openai_review.py:L212-L222, tests/test_openai_review.py:L1527-L1698

Impact: Tests cover substitution drift, pricing prefix behavior, and reasoning payload construction, but they do not catch the local mandate/tooling mismatch or the default reasoning timeout mismatch.

Concrete fix: Add tests that compile/adapt the real prompt for local mode and assert impossible tool-run language is removed or backed by injected audit context. Add timeout-selection tests as described above.

Path to Approval

  1. Fix local prompt execution parity: either inject audit context for the Single-Pass Mandate into .claude/scripts/openai_review.py, or adapt the local prompt so it does not require unavailable shell/file audits.
  2. Fix direct-script timeout propagation: default omitted reasoning-model runs to 900s while preserving explicit --timeout.
  3. Add regression tests for both behaviors in tests/test_openai_review.py.

Two related fixes recognizing that .claude/scripts/openai_review.py runs
as a static-prompt API call without shell or file-loading access — unlike
CI Codex which has sandbox tool access.

1. Reasoning-aware default --timeout. Previously the script defaulted to
   300s regardless of model, so direct script invocations on the new
   gpt-5.5 default could time out on reasoning runs (which can take
   10-15 min). New _resolve_timeout() helper picks 900s for reasoning
   models and 300s for standard models when --timeout is omitted;
   explicit --timeout values are preserved unchanged.

2. Soften the Single-Pass Completeness Mandate for local mode. The CI
   Mandate instructs the reviewer to run shell greps, load sibling
   files, and sweep transitive paths. The local raw-API path cannot do
   any of that, so leaving the CI wording in place caused the model to
   claim audits it could not perform. New _SUBSTITUTIONS entry replaces
   the CI Mandate block with a local-mode note that scopes audits to
   the loaded prompt content and explicitly forbids claiming
   shell/file-loading tool runs (substitution shape sketched below).

Adds 8 regression tests: TestResolveTimeout (6 cases covering omitted
+ reasoning, omitted + non-reasoning, explicit-preserved both paths,
zero-preserved, gpt-5.4 routing) and two additions to
TestAdaptReviewCriteria asserting the local prompt strips the CI
audit instructions and surfaces the local-mode note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
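A hedged illustration of the substitution shape item 2 describes; the CI-block key and the note wording here are placeholders, not the real prompt text, and _adapt_review_criteria is shown only in outline:

```python
_LOCAL_NO_TOOLS_NOTE = (
    "Local mode: no shell or file-loading tools are available. Scope the "
    "completeness audits to the content already included in this prompt, and "
    "do not claim to have run greps or loaded sibling/transitive files."
)

# Assumed shape: exact CI prompt text mapped to its local replacement.
_SUBSTITUTIONS = {
    "<CI Single-Pass Completeness Mandate block>": _LOCAL_NO_TOOLS_NOTE,
}

def _adapt_review_criteria(prompt: str) -> str:
    """Rewrite the CI prompt for local raw-API use (sketch)."""
    for ci_text, local_text in _SUBSTITUTIONS.items():
        prompt = prompt.replace(ci_text, local_text)
    return prompt
```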
igerber (Owner, Author) commented May 9, 2026

/ai-review

github-actions Bot commented May 9, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 9b76cd49015c2dc9d83879538acbbb7d1c2e29ad


Overall Assessment: ✅ Looks good

Executive Summary

  • Re-review status: both prior P1s are resolved. The local prompt now replaces the CI-only Single-Pass mandate with a local no-tool-access note in .claude/scripts/openai_review.py:L898-L989, and that path is covered in tests/test_openai_review.py:L212-L250.
  • The direct-script timeout regression is resolved on the actual CLI path: --timeout now defaults to None, is resolved by model in _resolve_timeout(), and is passed explicitly into call_openai() in .claude/scripts/openai_review.py:L1155-L1179, .claude/scripts/openai_review.py:L1417-L1456, .claude/scripts/openai_review.py:L1697-L1710.
  • No estimator, weighting, variance/SE, identification, or default behavior in diff_diff/ changed, so the Methodology Registry does not surface a methodology defect for this PR.
  • P2: the local skill’s argument summary still says --timeout defaults to 300 seconds, but the script now auto-selects 900 for reasoning models and the default model is gpt-5.5 .claude/commands/ai-review-local.md:L25-L26, .claude/scripts/openai_review.py:L1417-L1456.
  • P2: call_openai() still exposes a stale lower-level default timeout=DEFAULT_TIMEOUT, even though the CLI now uses model-aware timeout resolution .claude/scripts/openai_review.py:L1206-L1210, .claude/scripts/openai_review.py:L1454-L1456.

Methodology

  • No findings. Only review tooling, prompt, workflow, and tests changed; no estimator or inference code under diff_diff/ was modified, so there is no new methodology obligation from docs/methodology/REGISTRY.md.

Code Quality

  • Severity: P2. Impact: the CLI path is fixed, but call_openai() still carries timeout: int = DEFAULT_TIMEOUT at .claude/scripts/openai_review.py:L1206-L1210 while main() now resolves a model-aware timeout before calling it at .claude/scripts/openai_review.py:L1454-L1456. That leaves the old 300s default reachable for any future direct caller of call_openai(), even though the top-level contract changed. Concrete fix: make timeout required or None in call_openai(), resolve it inside the helper, and add a regression test that an omitted timeout on gpt-5.5 propagates 900s to urlopen.

Performance

  • No findings. The previously reported reasoning-model timeout issue on the direct script path is addressed by _resolve_timeout() and explicit timeout propagation in .claude/scripts/openai_review.py:L1168-L1179, .claude/scripts/openai_review.py:L1454-L1456.

Maintainability

  • No findings. The model bump itself is mirrored across the CI workflow, script defaults, and reasoning-model tests in .github/workflows/ai_pr_review.yml:L163-L169, .claude/scripts/openai_review.py:L1155-L1165, and tests/test_openai_review.py:L1555-L1649.

Tech Debt

  • No findings. I did not find a related deferral already tracked in TODO.md; the remaining items above are low-severity cleanup, not blockers.

Security

  • No findings. I did not see secrets, credential material, or new unsafe file-loading/network behavior introduced by this diff.

Documentation/Tests

  • Severity: P2. Impact: the top-level /ai-review-local argument docs still say --timeout has a fixed 300s default in .claude/commands/ai-review-local.md:L25-L26, but the script now uses a dynamic default based on model in .claude/scripts/openai_review.py:L1417-L1456. Because the default model is now gpt-5.5, the first user-facing description of the flag is stale. Concrete fix: update that argument line to describe the dynamic default (900 for reasoning models, 300 otherwise), or explicitly note that the default gpt-5.5 path uses 900s.
  • Verification note: I could not run pytest in this sandbox (python -m pytest is unavailable and the filesystem is read-only), but openai_review.py parsed and imported cleanly, _resolve_timeout(None, "gpt-5.5") returned 900, _resolve_timeout(None, "gpt-4.1") returned 300, and _adapt_review_criteria() inserted the local Single-Pass audit note.

Two P2 polish items from CI review on 9b76cd4:

1. call_openai() signature still defaulted to timeout=DEFAULT_TIMEOUT,
   leaving any future direct caller exposed to the legacy 300s default
   even after the CLI gained model-aware resolution. Route the helper's
   default through _resolve_timeout() so the model-aware behavior
   applies uniformly regardless of caller.

2. .claude/commands/ai-review-local.md described --timeout as having a
   fixed 300s default, which is stale now that the script auto-selects
   900s for reasoning models. Update the description to reflect the
   dynamic default and enumerate the reasoning-model prefixes.

Adds 2 regression tests asserting omitted-timeout call_openai calls
propagate 900s on gpt-5.5 and 300s on gpt-4.1 to urlopen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
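One plausible shape for those two tests, using unittest.mock to intercept the urlopen call. The patch target, the call_openai signature, and the minimal fake response are all assumptions based on this thread; real response parsing is elided.

```python
from unittest import mock

import openai_review  # module under test; import path assumed


def _fake_response():
    # Context-manager mock returning an empty JSON body.
    resp = mock.MagicMock()
    resp.__enter__.return_value.read.return_value = b"{}"
    return resp


def test_omitted_timeout_uses_900_for_gpt55():
    with mock.patch("openai_review.urlopen", return_value=_fake_response()) as u:
        openai_review.call_openai(prompt="hi", model="gpt-5.5", timeout=None)
        assert u.call_args.kwargs["timeout"] == 900


def test_omitted_timeout_uses_300_for_gpt41():
    with mock.patch("openai_review.urlopen", return_value=_fake_response()) as u:
        openai_review.call_openai(prompt="hi", model="gpt-4.1", timeout=None)
        assert u.call_args.kwargs["timeout"] == 300
```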
igerber (Owner, Author) commented May 9, 2026

/ai-review

github-actions Bot commented May 9, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 8fea196403c6387e5e6e62456f95c5772ed48a13


Overall Assessment: ✅ Looks good

Executive Summary

  • Re-review status: the prior timeout/default-model issues are resolved. The local docs now describe dynamic timeout defaults, call_openai() resolves omitted timeouts internally, and the new helper/payload tests cover that path in .claude/scripts/openai_review.py:L1206-L1223, .claude/scripts/openai_review.py:L1423-L1462, and tests/test_openai_review.py:L1593-L1772.
  • No methodology-facing library code changed. Nothing under diff_diff/ moved, so docs/methodology/REGISTRY.md does not surface a methodology defect for this PR.
  • The main model bump itself looks externally consistent: official model pages match the added gpt-5.5 / gpt-5.5-pro price points and show GPT-5.4 / GPT-5.5 as reasoning-capable models. (developers.openai.com)
  • P2: the local skill now advertises gpt-5.5 as the default, but Step 5 still describes reasoning-model wrapper behavior in terms of an implicit “model” and never defines the effective default before applying that branch in .claude/commands/ai-review-local.md:L49-L50 and .claude/commands/ai-review-local.md:L302-L341.
  • P2: the new pricing tests only prove the new entries differ from neighboring models, not that they match the documented rates, in tests/test_openai_review.py:L1635-L1649.

Methodology

  • No findings. This PR only changes review tooling, prompt text, workflow config, and tests; no estimator, weighting, variance/SE, identification, or diff_diff/ default behavior changed.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P2. Impact: .claude/commands/ai-review-local.md now says the no-argument default is gpt-5.5, but Step 5’s special handling still keys off an undefined “model” instead of an explicit effective-model variable in .claude/commands/ai-review-local.md:L49-L50 and .claude/commands/ai-review-local.md:L302-L341. That makes the wrapper/script contract underspecified for default local runs and increases drift risk around background execution and timeout handling. Concrete fix: define an effective_model during argument parsing (parsed --model or default gpt-5.5) and use that same variable for the Step 5 reasoning/background logic (resolution rule sketched below); also state explicitly that --model, --timeout, and --dry-run pass through when provided.
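Expressed as Python for concreteness (the skill doc itself states this as wrapper instructions, so the function name here is illustrative), the suggested rule is just:

```python
DEFAULT_MODEL = "gpt-5.5"

def resolve_effective_model(parsed_model: str | None) -> str:
    """Explicit --model wins; otherwise fall back to the documented default."""
    return parsed_model if parsed_model is not None else DEFAULT_MODEL
```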

Tech Debt

  • No findings. I did not find a matching deferred item in TODO.md for the issues above.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new pricing tests in tests/test_openai_review.py:L1635-L1649 only assert inequality, so incorrect but distinct gpt-5.5 / gpt-5.5-pro prices would still pass. Official model docs publish exact values for both models. Concrete fix: assert the exact PRICING tuples or exact rendered cost strings for gpt-5.5 and gpt-5.5-pro, and add one snapshot-prefix case for gpt-5.5-pro-2026-04-23. (developers.openai.com)

Verification note: pytest is unavailable in this sandbox (python3 -m pytest is missing), so I verified the changed script by direct import instead: DEFAULT_MODEL == "gpt-5.5", _resolve_timeout(None, "gpt-5.5") == 900, _resolve_timeout(None, "gpt-4.1") == 300, and _adapt_review_criteria() inserts the local Single-Pass audit note.

Two P2 polish items from CI review on 8fea196:

1. Skill doc Step 5 reasoning-model handling referred to "the model"
   without defining it, leaving the wrapper/script contract ambiguous
   for default local runs. Define effective_model = parsed --model OR
   default gpt-5.5, and note that --model/--timeout/--dry-run pass
   through. Also reflect that the script's _resolve_timeout() now
   handles the 900s timeout selection internally, making explicit
   --timeout 900 pass-through optional.

2. Pricing tests previously only asserted gpt-5.5 != gpt-5.4 and
   gpt-5.5-pro != gpt-5.5, which would silently accept incorrect
   distinct rates. Add exact-tuple assertions matching the OpenAI
   published rates ($5/$30 for gpt-5.5, $30/$180 for gpt-5.5-pro)
   and a snapshot-prefix test confirming gpt-5.5-pro-2026-04-23
   resolves to the same rate as gpt-5.5-pro (sketched below).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
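A sketch of the exact-rate assertions described in item 2, assuming PRICING maps model name to an (input, output) dollars-per-1M tuple; both that shape and the lookup_pricing helper name are assumptions, though the dollar figures match those quoted in this thread.

```python
from openai_review import PRICING  # table shape assumed: (input, output) per 1M tokens


def test_gpt55_pricing_exact():
    assert PRICING["gpt-5.5"] == (5.00, 30.00)


def test_gpt55_pro_pricing_exact():
    assert PRICING["gpt-5.5-pro"] == (30.00, 180.00)


def test_gpt55_pro_snapshot_prefix():
    # Dated snapshots should resolve to the base model's rates.
    from openai_review import lookup_pricing  # hypothetical helper name
    assert lookup_pricing("gpt-5.5-pro-2026-04-23") == PRICING["gpt-5.5-pro"]
```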
igerber (Owner, Author) commented May 9, 2026

/ai-review

github-actions Bot commented May 9, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 600a4acfc891e2b56a0d825b937e61f4106a321d


Overall Assessment: ✅ Looks good

Executive Summary

  • Re-review status: the prior P2s are resolved. The local skill now defines effective_model explicitly and documents --model / --timeout / --dry-run pass-through behavior in .claude/commands/ai-review-local.md:L337-L345.
  • The earlier pricing-test gap is also resolved: exact gpt-5.5 / gpt-5.5-pro assertions and a gpt-5.5-pro-2026-04-23 prefix case were added in tests/test_openai_review.py:L1651-L1663.
  • No methodology-facing library code changed. Nothing under diff_diff/ moved, so docs/methodology/REGISTRY.md does not surface a methodology defect for this PR.
  • The new Single-Pass prompt/workflow surfaces are mirrored cleanly: the carve-outs in .github/codex/prompts/pr_review.md:L57-L109 match the workflow diff exclusions in .github/workflows/ai_pr_review.yml:L150-L156, and the local adapter/tests now replace the CI-only audit block with a no-tools note in .claude/scripts/openai_review.py:L938-L1012 and tests/test_openai_review.py:L224-L250.
  • Severity: P3 [Newly identified]. The local skill’s transmission note still says “Chat Completions API” in .claude/commands/ai-review-local.md:L507-L511, which is stale relative to the Responses API implementation in .claude/scripts/openai_review.py:L1155-L1243.

Methodology
No findings. This PR only changes review tooling, docs, workflow, and tests; it does not change estimator math, weighting, SE/variance logic, identification assumptions, or diff_diff/ defaults.

Code Quality
No findings. Timeout selection is now centralized in _resolve_timeout() and reused by both CLI parsing and call_openai() in .claude/scripts/openai_review.py:L1168-L1223 and .claude/scripts/openai_review.py:L1423-L1462, which resolves the earlier direct-caller/default drift.

Performance
No findings. The behavior change is request timeout/model selection rather than local algorithmic work, and the workflow’s diff-pruning surface remains aligned with the prompt carve-outs.

Maintainability
No findings. The previous effective-model ambiguity is resolved in .claude/commands/ai-review-local.md:L337-L345, and the CI/local prompt split is covered by tests/test_openai_review.py:L212-L250.

Tech Debt
No findings. I did not find a matching TODO.md item relevant to this PR, and nothing here needs deferred tracking to be acceptable.

Security
No findings. No new secret-handling, path-confinement, or data-exfiltration regression is introduced by the diff.

Documentation/Tests

  • Severity: P3 [Newly identified]. Impact: .claude/commands/ai-review-local.md:L507-L511 still tells users their payload is sent via the Chat Completions API, while the script and the rest of the doc use the Responses API in .claude/scripts/openai_review.py:L1155-L1243 and .claude/commands/ai-review-local.md:L8-L9. That leaves the user-facing transmission note internally inconsistent after the migration. Concrete fix: change that note to “Responses API” and add a lightweight text regression test or doc grep so this surface stays aligned with the script.
  • No other findings. The prior pricing-test concern is resolved by the exact-rate and snapshot-prefix assertions in tests/test_openai_review.py:L1651-L1663.

Verification note: python3 -m pytest tests/test_openai_review.py -q could not run here because pytest is not installed. I verified the changed behavior by direct import instead: DEFAULT_MODEL == "gpt-5.5", _resolve_timeout(None, "gpt-5.5") == 900, _resolve_timeout(None, "gpt-4.1") == 300, _is_reasoning_model("gpt-5.4") is True, estimate_cost(..., "gpt-5.5-pro-2026-04-23") resolves to pro pricing, and _adapt_review_criteria() inserts the local Single-Pass audit note while removing the CI-only grep instruction.

The data-transmission note in .claude/commands/ai-review-local.md said
the script POSTs to OpenAI's Chat Completions API. The script has used
the Responses API (ENDPOINT = .../v1/responses) for some time; this is
a leftover from before that migration.

Adds a TestSkillDocAPIConsistency regression test that grep-asserts the
skill doc never says "Chat Completions API" again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
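A plausible shape for that regression test; the class name follows the commit, while the path handling and the assertion message are assumptions.

```python
from pathlib import Path

SKILL_DOC = Path(".claude/commands/ai-review-local.md")


class TestSkillDocAPIConsistency:
    def test_no_stale_chat_completions_reference(self):
        # Grep-style assertion: the stale wording must never reappear.
        text = SKILL_DOC.read_text(encoding="utf-8")
        assert "Chat Completions API" not in text, (
            "skill doc drifted back to pre-Responses-API wording"
        )
```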
igerber (Owner, Author) commented May 9, 2026

/ai-review

github-actions Bot commented May 9, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: dacba16a923447cab5b509ecfc95bfb1663c3beb


Overall Assessment: ✅ Looks good

This re-review resolves the prior P3 doc inconsistency, and I did not find any unmitigated P0/P1 issues in the changed surfaces.

Executive Summary

  • Prior P3 is resolved: .claude/commands/ai-review-local.md:L507-L511 now says Responses API, and tests/test_openai_review.py:L1666-L1681 adds a regression test for that exact drift.
  • No methodology-facing library code changed. The diff is limited to review tooling, workflow, prompt text, and tests, so there is no estimator/math/SE/identification/default-behavior change under diff_diff/ to cross-check against docs/methodology/REGISTRY.md.
  • The model bump is mirrored cleanly across the CI workflow, local script default, skill docs/examples, pricing table, reasoning-model classification, and tests: .github/workflows/ai_pr_review.yml:L159-L169, .claude/scripts/openai_review.py:L852-L889,L1155-L1223,L1423-L1462, .claude/commands/ai-review-local.md:L25-L32,L337-L345, tests/test_openai_review.py:L224-L250,L1555-L1804.
  • The new GPT-5.5 / GPT-5.5-pro pricing entries and reasoning-model classification match current official OpenAI model docs, and OpenAI describes GPT-5.4 as a reasoning model, so the new timeout path is consistent with upstream model behavior. (developers.openai.com)
  • The Single-Pass mirror audit passed: the CI mandate added in .github/codex/prompts/pr_review.md:L57-L109 is paired with the local no-tools replacement in .claude/scripts/openai_review.py:L938-L1012, and the workflow diff carve-outs still match .github/workflows/ai_pr_review.yml:L150-L156.
  • Verification note: python3 -m pytest tests/test_openai_review.py -q could not run here because pytest is not installed. I verified the changed behavior by direct import and --dry-run checks instead.

Methodology

No findings. Affected methods: none. This PR does not touch estimator implementations, weighting, variance/SE logic, identification checks, or diff_diff/ defaults.

Code Quality

No findings. _resolve_timeout() is now the single resolver for both CLI parsing and direct call_openai() callers in .claude/scripts/openai_review.py:L1168-L1223,L1423-L1462, which removes the earlier caller/default drift. The upstream model docs also support the chosen GPT-5.5 / GPT-5.5-pro price points and reasoning classification. (developers.openai.com)

Performance

No findings. .github/workflows/ai_pr_review.yml changes only the selected model line; there are no paths: or pytest-selection changes, so the transitive workflow-dependency audit is not implicated here.

Maintainability

No findings. The sibling-surface mirror audit passed across workflow/script/doc/test surfaces, and the local prompt adapter now explicitly strips CI-only audit instructions while tests lock that behavior in: .claude/scripts/openai_review.py:L938-L1012, tests/test_openai_review.py:L224-L250.

Tech Debt

No findings. TODO.md does not need a new entry for anything in this tooling-only diff.

Security

No findings. The PR does not widen data exposure, and the user-facing transmission note now matches the actual Responses API implementation in .claude/scripts/openai_review.py:L1155-L1243 and .claude/commands/ai-review-local.md:L507-L511.

Documentation/Tests

No findings. The previous P3 is resolved, and the added tests cover GPT-5.5 pricing, GPT-5.4 / GPT-5.5 reasoning classification, timeout resolution, and the local prompt’s no-shell-access replacement block: tests/test_openai_review.py:L224-L250, tests/test_openai_review.py:L1555-L1804.

igerber added the ready-for-ci label on May 9, 2026
igerber merged commit 7b6805c into main on May 9, 2026 (24 of 25 checks passed)
igerber deleted the update-codex-review branch on May 9, 2026 at 17:51