26 changes: 16 additions & 10 deletions .claude/commands/ai-review-local.md
@@ -22,14 +22,14 @@ pre-PR use. Designed for iterative review/revision cycles before submitting a PR
files (default: 200000). Changed source files are always included regardless of budget.
- `--force-fresh`: Skip delta-diff mode, run a full fresh review even if previous state exists
- `--full-registry`: Include the entire REGISTRY.md instead of selective sections
- `--model <name>`: Override the OpenAI model (default: `gpt-5.4`)
- `--timeout <seconds>`: HTTP request timeout (default: 300). Use 900 for reasoning models.
- `--model <name>`: Override the OpenAI model (default: `gpt-5.5`)
- `--timeout <seconds>`: HTTP request timeout. If omitted, defaults to 900 for reasoning models (gpt-5.4, gpt-5.5, *-pro, o1/o3/o4) and 300 otherwise.
- `--dry-run`: Print the compiled prompt without calling the API

**Reasoning models** (`gpt-5.4-pro`, `o3`, `o4-mini`, etc.): Reviews may take 10-15
**Reasoning models** (`gpt-5.5`, `gpt-5.5-pro`, `o3`, `o4-mini`, etc.): Reviews may take 10-15
minutes. For deep reviews with reasoning models, combine `--token-budget` with `--model`:
```
/ai-review-local --model gpt-5.4-pro --token-budget 500000 --context deep
/ai-review-local --model gpt-5.5-pro --token-budget 500000 --context deep
```

## Constraints
@@ -47,7 +47,7 @@ before any data is sent externally.
### Step 1: Parse Arguments

Parse `$ARGUMENTS` for the optional flags listed above. All flags are optional —
the default behavior (standard context, selective registry, gpt-5.4, live API call)
the default behavior (standard context, selective registry, gpt-5.5, live API call)
requires no arguments.

### Step 2: Validate Prerequisites
@@ -334,9 +334,15 @@ python3 .claude/scripts/openai_review.py \
Note: `--force-fresh` is a skill-only flag — it controls whether delta diffs are
generated in Step 4 and is NOT passed to the script.

**Reasoning model handling:** If the model contains `-pro` or starts with `o1`/`o3`/`o4`
(e.g., `gpt-5.4-pro`, `o3`, `o4-mini`):
- Pass `--timeout 900` to the script (unless the user explicitly specified `--timeout`)
**Reasoning model handling:** Resolve the effective model first — `effective_model` is
the value of `--model` if the user provided one, otherwise the script default `gpt-5.5`.
The `--model`, `--timeout`, and `--dry-run` flags pass through to the script when provided.

If `effective_model` contains `-pro`, starts with `o1`/`o3`/`o4`, or starts with
`gpt-5.4`/`gpt-5.5` (e.g., `gpt-5.5`, `gpt-5.5-pro`, `o3`, `o4-mini`):
- The script's `_resolve_timeout()` already auto-selects 900s for these models when
`--timeout` is omitted, so no wrapper timeout pass-through is required. (Passing
`--timeout 900` explicitly remains harmless and is fine for backward compatibility.)
- Run the Bash command with `run_in_background: true` (bypasses the 600s Bash tool timeout cap)
- After the background command completes, continue to Step 6
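
For illustration, the wrapper-side decision described above can be sketched in Python. This is a hypothetical sketch, not skill code (`wrapper_plan` is an invented name); the authoritative check is `_is_reasoning_model()` in `openai_review.py`, shown later in this diff:

```python
# Hypothetical sketch of the wrapper-side logic described above; the
# authoritative check is _is_reasoning_model() in openai_review.py (below).
def wrapper_plan(user_model: "str | None") -> dict:
    effective_model = user_model or "gpt-5.5"  # script default when --model is omitted
    is_reasoning = (
        "-pro" in effective_model
        or effective_model.startswith(("o1", "o3", "o4", "gpt-5.4", "gpt-5.5"))
    )
    return {
        "model": effective_model,
        # No --timeout pass-through needed: the script auto-selects 900s
        # for reasoning models when --timeout is omitted.
        "run_in_background": is_reasoning,  # bypasses the 600s Bash tool cap
    }
```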

@@ -466,7 +472,7 @@ runs `--force-fresh` or when a rebase invalidates the tracked commit.
/ai-review-local --model gpt-4.1 --full-registry

# Deep review with reasoning model (may take 10-15 minutes)
/ai-review-local --model gpt-5.4-pro --token-budget 500000 --context deep
/ai-review-local --model gpt-5.5-pro --token-budget 500000 --context deep

# Limit token budget for faster/cheaper reviews
/ai-review-local --token-budget 100000
Expand Down Expand Up @@ -501,7 +507,7 @@ runs `--force-fresh` or when a rebase invalidates the tracked commit.
- **Data transmission**: In non-dry-run mode, this skill transmits the unified diff,
changed-file metadata, full source file contents (in standard/deep mode),
import-context files (in deep mode), selected methodology registry text, and
prior review context (if present) to OpenAI via the Chat Completions API.
prior review context (if present) to OpenAI via the Responses API.
Use `--dry-run` to preview exactly what would be sent.
- This skill pairs naturally with the iterative workflow:
`/ai-review-local` -> address findings -> `/ai-review-local` -> `/submit-pr`
111 changes: 96 additions & 15 deletions .claude/scripts/openai_review.py
@@ -853,6 +853,8 @@ def apply_token_budget(
# Source: https://platform.openai.com/docs/pricing
# MAINTENANCE: Update when OpenAI changes pricing.
PRICING = {
"gpt-5.5": (5.00, 30.00),
"gpt-5.5-pro": (30.00, 180.00),
"gpt-5.4": (2.50, 15.00),
"gpt-5.4-pro": (30.00, 180.00),
"gpt-4.1": (2.00, 8.00),
@@ -900,8 +902,8 @@ def estimate_cost(
"code changes that have not yet been submitted as a pull request.",
),
(
"Review ONLY the changes introduced by this PR (diff)",
"Review ONLY the changes shown in the diff below",
"Review the changes introduced by this PR (diff)",
"Review the code changes shown in the diff below",
),
(
"If the PR changes an estimator",
@@ -933,6 +935,62 @@ def estimate_cost(
"Use the branch name only to understand which "
"methods/papers are intended.",
),
# Replace the CI Single-Pass Completeness Mandate with a local-mode note.
# The CI Mandate instructs the reviewer to run shell greps, load sibling
# files, and sweep transitive paths — none of which the local raw API
# path can do. Leaving the CI wording in place would cause the model to
# claim audits it cannot perform.
(
"""## Single-Pass Completeness Mandate (Initial Review Only)

This is an INITIAL review. Treat this as the only chance to enumerate findings.
Follow-up rounds are expensive — find ALL P0/P1/P2 issues in this pass.

Before finalizing, confirm you have run each of these audits on the diff:

1. **Sibling-surface mirror audit**: For every fix or change in a method, schema,
default-value path, or report block, identify the parallel surface in the same
codebase (BR ↔ DR, schema ↔ renderer, default ↔ precomputed, summary ↔ full)
and check whether the same change applies there. Flag the unmirrored side as P1.

2. **Pattern-wide grep**: When you flag any anti-pattern or bug class, use `grep`
on `diff_diff/**.py` to identify sibling occurrences of the same pattern and
enumerate them in the SAME finding. Only LOAD a sibling file's full contents
if grep returns a hit and you need surrounding context to verify the issue.
Do not defer pattern-class findings to a follow-up round.

3. **Reciprocal/symmetry check**: For dispatch code, validation, or guards in
one direction (A-on-B), explicitly enumerate the reciprocal direction (B-on-A)
and confirm coverage.

4. **Transitive workflow deps**: For GH Actions workflow `paths:` or pytest
selection changes, sweep transitive auto-loaded files (conftest.py,
pyproject.toml, ancestor conftests) and confirm they are included.

5. **Scope override (with carve-outs)**: The audits above explicitly authorize
loading files outside the diff to verify completeness. This overrides the
"minimum surrounding context" default in the Rules section below.

**DO NOT load these paths** (the workflow's diff-build deliberately excludes
them; they are noise or out-of-scope):
- `docs/tutorials/*.ipynb` (notebook outputs are large JSON blobs)
- `benchmarks/data/real/*.json`
- `benchmarks/data/real/*.csv`""",
"""## Single-Pass Completeness Audit (Local Review)

This is a local review running as a static-prompt API call. You do NOT have
shell or file-loading access — only the prompt content below is available
(diff + changed source files + first-level imports).

Find ALL P0/P1/P2 issues within the loaded context. Audit sibling surfaces,
parallel patterns, and reciprocal directions THAT ARE VISIBLE in the loaded
files.

Do NOT claim to have run shell greps, loaded sibling files outside the
prompt, or audited paths not present here. If a relevant audit is impossible
because the necessary context is not in the prompt, say so explicitly rather
than asserting completeness.""",
),
]
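
The tuples above read as (CI wording, local wording) pairs. The code that applies them is outside this diff; a minimal sketch of the likely mechanism, assuming plain substring replacement, would be:

```python
# Minimal sketch, assuming the list holds (ci_text, local_text) pairs applied
# via plain string replacement. The actual application code is not in this diff.
def adapt_prompt_for_local(prompt: str, replacements: "list[tuple[str, str]]") -> str:
    for ci_text, local_text in replacements:
        prompt = prompt.replace(ci_text, local_text)
    return prompt
```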


@@ -1095,15 +1153,30 @@ def compile_prompt(
# ---------------------------------------------------------------------------

ENDPOINT = "https://api.openai.com/v1/responses"
DEFAULT_MODEL = "gpt-5.4"
DEFAULT_TIMEOUT = 300 # seconds
DEFAULT_MODEL = "gpt-5.5"
DEFAULT_TIMEOUT = 300 # seconds — non-reasoning models
REASONING_TIMEOUT = 900 # seconds — reasoning models can take 10-15 min
DEFAULT_MAX_TOKENS = 16384
REASONING_MAX_TOKENS = 32768


def _is_reasoning_model(model: str) -> bool:
"""Return True for models that use internal chain-of-thought reasoning."""
return model.startswith(("o1", "o3", "o4")) or "-pro" in model
return model.startswith(("o1", "o3", "o4", "gpt-5.4", "gpt-5.5")) or "-pro" in model


def _resolve_timeout(timeout: "int | None", model: str) -> int:
"""Resolve the effective HTTP timeout for an API call.

If --timeout was explicitly provided, use it. Otherwise pick
REASONING_TIMEOUT (900s) for reasoning models and DEFAULT_TIMEOUT
(300s) for standard models. This prevents reasoning-model reviews
from hitting a too-short default when the wrapper command does not
pass --timeout.
"""
if timeout is not None:
return timeout
return REASONING_TIMEOUT if _is_reasoning_model(model) else DEFAULT_TIMEOUT
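
For reference, the resolution rules above give:

```python
# Behavior of _resolve_timeout as defined above:
_resolve_timeout(None, "gpt-5.5")     # 900: reasoning model, no explicit --timeout
_resolve_timeout(None, "gpt-4.1")     # 300: standard-model default
_resolve_timeout(120, "gpt-5.5-pro")  # 120: an explicit --timeout always wins
```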


def estimate_tokens(text: str) -> int:
@@ -1134,13 +1207,19 @@ def call_openai(
prompt: str,
model: str,
api_key: str,
timeout: int = DEFAULT_TIMEOUT,
timeout: "int | None" = None,
) -> "tuple[str, dict]":
"""Call the OpenAI Responses API.

If ``timeout`` is None, resolves to REASONING_TIMEOUT (900s) for
reasoning models and DEFAULT_TIMEOUT (300s) otherwise — same logic
as the CLI ``--timeout`` flag. This guards future direct callers
against the old 300s-everywhere default.

Returns (content, usage) where usage is the API response's usage dict
containing input_tokens and output_tokens.
"""
timeout = _resolve_timeout(timeout, model)
reasoning = _is_reasoning_model(model)
max_tokens = REASONING_MAX_TOKENS if reasoning else DEFAULT_MAX_TOKENS

@@ -1343,8 +1422,12 @@ def main() -> None:
parser.add_argument(
"--timeout",
type=int,
default=DEFAULT_TIMEOUT,
help=f"HTTP request timeout in seconds (default: {DEFAULT_TIMEOUT})",
default=None,
help=(
f"HTTP request timeout in seconds. If omitted, defaults to "
f"{REASONING_TIMEOUT} for reasoning models and {DEFAULT_TIMEOUT} for "
f"standard models."
),
)
parser.add_argument(
"--delta-diff",
@@ -1374,6 +1457,10 @@

args = parser.parse_args()

# Resolve --timeout: reasoning models default to REASONING_TIMEOUT (900s),
# standard models to DEFAULT_TIMEOUT (300s). Explicit --timeout overrides.
args.timeout = _resolve_timeout(args.timeout, args.model)

# Post-parse validation
if args.context != "minimal" and not args.repo_root:
parser.error(
@@ -1614,13 +1701,7 @@ def main() -> None:
sys.exit(0)

# Call OpenAI API
if _is_reasoning_model(args.model) and args.timeout == DEFAULT_TIMEOUT:
print(
f"Note: {args.model} is a reasoning model. Consider --timeout 900 "
"for large reviews.",
file=sys.stderr,
)
print(f"Sending review to {args.model}...", file=sys.stderr)
print(f"Sending review to {args.model} (timeout: {args.timeout}s)...", file=sys.stderr)
print(f"Estimated input tokens: ~{est_tokens:,}", file=sys.stderr)
if cost_str:
print(f"Estimated cost: {cost_str}", file=sys.stderr)
46 changes: 44 additions & 2 deletions .github/codex/prompts/pr_review.md
@@ -54,6 +54,42 @@ When reviewing new features or code paths, specifically check:
- Command to check: `grep -n "pattern" diff_diff/*.py`
- Flag as P1 if only partial fixes were made

## Single-Pass Completeness Mandate (Initial Review Only)

This is an INITIAL review. Treat this as the only chance to enumerate findings.
Follow-up rounds are expensive — find ALL P0/P1/P2 issues in this pass.

Before finalizing, confirm you have run each of these audits on the diff:

1. **Sibling-surface mirror audit**: For every fix or change in a method, schema,
default-value path, or report block, identify the parallel surface in the same
codebase (BR ↔ DR, schema ↔ renderer, default ↔ precomputed, summary ↔ full)
and check whether the same change applies there. Flag the unmirrored side as P1.

2. **Pattern-wide grep**: When you flag any anti-pattern or bug class, use `grep`
on `diff_diff/**.py` to identify sibling occurrences of the same pattern and
enumerate them in the SAME finding. Only LOAD a sibling file's full contents
if grep returns a hit and you need surrounding context to verify the issue.
Do not defer pattern-class findings to a follow-up round.

3. **Reciprocal/symmetry check**: For dispatch code, validation, or guards in
one direction (A-on-B), explicitly enumerate the reciprocal direction (B-on-A)
and confirm coverage.

4. **Transitive workflow deps**: For GH Actions workflow `paths:` or pytest
selection changes, sweep transitive auto-loaded files (conftest.py,
pyproject.toml, ancestor conftests) and confirm they are included.

5. **Scope override (with carve-outs)**: The audits above explicitly authorize
loading files outside the diff to verify completeness. This overrides the
"minimum surrounding context" default in the Rules section below.

**DO NOT load these paths** (the workflow's diff-build deliberately excludes
them; they are noise or out-of-scope):
- `docs/tutorials/*.ipynb` (notebook outputs are large JSON blobs)
- `benchmarks/data/real/*.json`
- `benchmarks/data/real/*.csv`

## Deferred Work Acceptance

This project tracks deferred technical debt in `TODO.md` under "Tech Debt from Code Reviews."
@@ -68,7 +104,9 @@ This project tracks deferred technical debt in `TODO.md` under "Tech Debt from C
and incorrect statistical output are not.

Rules:
- Review ONLY the changes introduced by this PR (diff) and the minimum surrounding context needed.
- Review the changes introduced by this PR (diff). The Single-Pass Completeness
Mandate above authorizes broader audits (sibling surfaces, pattern-wide greps,
reciprocal checks, transitive deps) — do those upfront rather than deferring.
- Provide a single Markdown report with:
- Overall assessment (see Assessment Criteria below)
- Executive summary (3–6 bullets)
@@ -116,7 +154,11 @@ When this is a re-review (the PR has prior AI review comments):
- New P1+ findings on unchanged code MAY be raised but must be marked "[Newly identified]"
to distinguish from moving goalposts. Limit these to clear, concrete issues — not
speculative concerns or stylistic preferences.
- New code added since the last review IS in scope for new findings.
- New code added since the last review IS in scope for new findings — apply the
Single-Pass Completeness Mandate's audits (sibling surfaces, pattern-wide greps,
reciprocal checks) to that new code in this re-review pass. For UNCHANGED code,
the existing [Newly identified] convention from the bullet above still applies:
new P1+ findings MAY be raised but must be marked "[Newly identified]".
- If all previous P1+ findings are resolved, the assessment should be ✅ even if new
P2/P3 items are noticed.

2 changes: 1 addition & 1 deletion .github/workflows/ai_pr_review.yml
@@ -165,7 +165,7 @@ jobs:
sandbox: read-only
safety-strategy: drop-sudo
# Recommended by OpenAI for review quality/consistency:
model: gpt-5.4
model: gpt-5.5
effort: xhigh

- name: Post PR comment (new on /ai-review, update on opened)