🤖 bench: use GPT-5.5 for tbench #3193

Open
ibetitsmike wants to merge 1 commit into main from mike/tbench-eq2r

Conversation

@ibetitsmike
Contributor

Mux working on behalf of Mike.

Summary

Updates nightly Terminal-Bench defaults to run Opus 4.7 and GPT-5.5 at xhigh thinking while dropping the older GPT Codex model from the default matrix. Adds leaderboard metadata for Opus 4.7 and GPT-5.5, and refreshes TBench workflow and skill examples.

Validation

  • make static-check
  • python3 -m py_compile benchmarks/terminal_bench/prepare_leaderboard_submission.py
  • go run github.com/rhysd/actionlint/cmd/actionlint@v1.7.7 .github/workflows/nightly-terminal-bench.yml .github/workflows/terminal-bench.yml
  • /home/coder/.local/bin/uvx ruff format --check benchmarks/terminal_bench/prepare_leaderboard_submission.py
  • git diff --check

Generated with mux • Model: openai:gpt-5.5 • Thinking: xhigh • Cost: $9.21

Switch nightly Terminal-Bench defaults to GPT-5.5 and Opus 4.7 with xhigh thinking. Add leaderboard metadata for both models and update tbench examples.

@ibetitsmike
Contributor Author

@codex review

Mux working on behalf of Mike.

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. What shall we delve into next?

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
