Add Shipwright v2 benchmark harness #3

Merged
EdgeCaser merged 1 commit into main from codex/shipwright-v2-benchmarks
Apr 2, 2026

@EdgeCaser
Owner

Summary

  • Deterministic benchmark harness in scripts/run-benchmarks.mjs: loads frozen fixture scenarios, runs the existing validator, computes PASS/FAIL/DNF status, normalizes blind ratings, and emits the scoring-spec result shape
  • Six frozen scenarios under benchmarks/scenarios/ with fixture artifacts covering hidden scope creep, partial-data pricing, board ambiguity, churn contradictions, weak evidence, and handoff contradictions
  • All fixture artifacts use schema_version: 2.0.0; each scenario exercises a detection → resolution cycle (the first pass surfaces validator issues, the final pass resolves them)
  • Default suite produces 4 PASS / 1 FAIL / 1 DNF; blind ratings left null until human review runs
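
For reviewers who want to sanity-check the expected mix, here is a minimal sketch of tallying per-scenario statuses into the 4 / 1 / 1 summary. The function name `summarizeStatuses`, the `scenarioId` values, and the shape of each result object are assumptions for illustration; they are not taken from scripts/run-benchmarks.mjs.

```javascript
// Hypothetical helper: counts PASS / FAIL / DNF across scenario results.
// The result shape ({ scenarioId, status }) is assumed, not the harness's real output.
function summarizeStatuses(results) {
  const counts = { PASS: 0, FAIL: 0, DNF: 0 };
  for (const { status } of results) {
    if (!Object.hasOwn(counts, status)) {
      throw new Error(`Unknown status: ${status}`);
    }
    counts[status] += 1;
  }
  return counts;
}

// Placeholder scenario ids; which scenario fails or does not finish is not implied.
const exampleResults = [
  { scenarioId: 'scenario-a', status: 'PASS' },
  { scenarioId: 'scenario-b', status: 'PASS' },
  { scenarioId: 'scenario-c', status: 'PASS' },
  { scenarioId: 'scenario-d', status: 'PASS' },
  { scenarioId: 'scenario-e', status: 'FAIL' },
  { scenarioId: 'scenario-f', status: 'DNF' },
];

console.log(summarizeStatuses(exampleResults)); // { PASS: 4, FAIL: 1, DNF: 1 }
```

Throwing on an unknown status keeps a typo in a fixture from silently shrinking the totals.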

Test plan

  • node --test tests/run-benchmarks.test.mjs — 6 tests: suite execution, blind-rating normalization, DNF diagnostic scoring, scenarioIds filtering, invalid scenario loading
  • node --test tests/*.test.mjs — 163 tests, all passing
  • node scripts/run-benchmarks.mjs --format json — expected mix confirmed
  • bash scripts/validate.sh — green

🤖 Generated with Claude Code

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8d64281b84

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +287 to +288
function deriveScenarioStatus(finalPass) {
  if (!finalPass.usable) return 'DNF';

P1: Determine DNF from all passes, not final pass only

This marks a scenario as DNF whenever the final artifact is unusable, even if the first pass was usable. Under the scoring spec, DNF should only occur when no artifact reaches usability; a regression in a later draft should not erase an earlier usable artifact. With the current logic, scenarios that were usable on first pass can be misclassified as DNF and their time_to_first_usable_artifact_seconds is later nulled out, corrupting benchmark outcomes.
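
One way the suggestion could look, as a sketch only: derive DNF from every pass and keep the time of the first usable artifact even if a later draft regresses. The `passes` array, the `openIssues` and `elapsedSeconds` fields, and the PASS/FAIL rule are assumptions for illustration; only `usable`, the status names, and `time_to_first_usable_artifact_seconds` come from the PR.

```javascript
// Hypothetical variant of deriveScenarioStatus that inspects all passes.
// Assumed pass shape: { usable: boolean, openIssues: number, elapsedSeconds: number }.
function deriveScenarioStatus(passes) {
  // DNF only when no pass at all produced a usable artifact.
  const firstUsable = passes.find((pass) => pass.usable);
  if (!firstUsable) {
    return { status: 'DNF', time_to_first_usable_artifact_seconds: null };
  }

  // Assumed rule: PASS requires the final pass to be usable with no open validator issues.
  const finalPass = passes[passes.length - 1];
  const status = finalPass.usable && finalPass.openIssues === 0 ? 'PASS' : 'FAIL';

  return {
    status,
    // Preserved even when the final pass regressed.
    time_to_first_usable_artifact_seconds: firstUsable.elapsedSeconds,
  };
}

// A first pass that was usable followed by a regressed final pass.
const regressed = deriveScenarioStatus([
  { usable: true, openIssues: 2, elapsedSeconds: 40 },
  { usable: false, openIssues: 1, elapsedSeconds: 90 },
]);
console.log(regressed); // { status: 'FAIL', time_to_first_usable_artifact_seconds: 40 }
```

Whether a regressed final pass should score as FAIL rather than PASS is a scoring-spec decision; the sketch only shows that it need not collapse into DNF.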

Useful? React with 👍 / 👎.

@EdgeCaser EdgeCaser merged commit fec0169 into main Apr 2, 2026
4 checks passed
@EdgeCaser EdgeCaser deleted the codex/shipwright-v2-benchmarks branch April 2, 2026 17:42