Add Shipwright v2 benchmark harness by EdgeCaser · Pull Request #3 · EdgeCaser/shipwright

EdgeCaser · 2026-04-02T17:33:45Z

Summary

Deterministic benchmark harness in scripts/run-benchmarks.mjs: loads frozen fixture scenarios, runs the existing validator, computes PASS/FAIL/DNF status, normalizes blind ratings, and emits the scoring-spec result shape
Six frozen scenarios under benchmarks/scenarios/ with fixture artifacts covering hidden scope creep, partial-data pricing, board ambiguity, churn contradictions, weak evidence, and handoff contradictions
All fixture artifacts use schema_version: 2.0.0; each scenario exercises a detection → resolution cycle (the first pass surfaces validator issues, the final pass resolves them)
Default suite produces 4 PASS / 1 FAIL / 1 DNF; blind ratings left null until human review runs
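
To make the expected mix easy to check, the status tally can be sketched as below. This is only an illustration: `summarizeStatuses` and the `{ status }` result shape are hypothetical names, not taken from scripts/run-benchmarks.mjs, and the harness's real scoring-spec output may be structured differently.

```javascript
// Hypothetical helper: counts scenario results by status.
// Assumes each result is an object with a `status` of 'PASS', 'FAIL', or 'DNF'.
function summarizeStatuses(results) {
  const counts = { PASS: 0, FAIL: 0, DNF: 0 };
  for (const result of results) {
    // Reject anything outside the three known statuses instead of silently ignoring it.
    if (!Object.prototype.hasOwnProperty.call(counts, result.status)) {
      throw new Error(`Unknown status: ${result.status}`);
    }
    counts[result.status] += 1;
  }
  return counts;
}

// Example input mirroring the default suite's reported mix.
const exampleResults = [
  { status: 'PASS' }, { status: 'PASS' }, { status: 'PASS' }, { status: 'PASS' },
  { status: 'FAIL' },
  { status: 'DNF' },
];
const exampleCounts = summarizeStatuses(exampleResults);
```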

Test plan

node --test tests/run-benchmarks.test.mjs — 6 tests: suite execution, blind-rating normalization, DNF diagnostic scoring, scenarioIds filtering, invalid scenario loading
node --test tests/*.test.mjs — 163 tests, all passing
node scripts/run-benchmarks.mjs --format json — expected mix confirmed
bash scripts/validate.sh — green

🤖 Generated with Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8d64281b84

chatgpt-codex-connector · 2026-04-02T17:36:32Z

scripts/run-benchmarks.mjs

+function deriveScenarioStatus(finalPass) {
+  if (!finalPass.usable) return 'DNF';


Determine DNF from all passes, not final pass only

This marks a scenario as DNF whenever the final artifact is unusable, even if the first pass was usable. Under the scoring spec, DNF should only occur when no artifact reaches usability; a regression in a later draft should not erase an earlier usable artifact. With the current logic, scenarios that were usable on first pass can be misclassified as DNF and their time_to_first_usable_artifact_seconds is later nulled out, corrupting benchmark outcomes.
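
One way to address this is to derive the status from every pass rather than only the last one. The sketch below is an assumption-laden illustration, not the fix that landed: the `passes` array, the `elapsedSeconds` field, and the `hasBlockingIssues` field are hypothetical, and the real PASS/FAIL rule lives in the scoring spec, which is not shown here.

```javascript
// Hypothetical rework of deriveScenarioStatus: takes all passes, in order.
// Assumed pass shape: { usable: boolean, hasBlockingIssues: boolean, elapsedSeconds: number }.
function deriveScenarioStatus(passes) {
  // DNF only when no pass ever produced a usable artifact.
  const anyUsable = passes.some((pass) => pass.usable);
  if (!anyUsable) return 'DNF';

  // Assumption: PASS requires the final pass to be usable and free of blocking issues.
  // A later regression therefore yields FAIL, not DNF.
  const finalPass = passes[passes.length - 1];
  return finalPass.usable && !finalPass.hasBlockingIssues ? 'PASS' : 'FAIL';
}

// Hypothetical companion: keeps the timing of the earliest usable pass,
// returning null only when nothing was ever usable.
function timeToFirstUsableArtifactSeconds(passes) {
  const firstUsable = passes.find((pass) => pass.usable);
  return firstUsable ? firstUsable.elapsedSeconds : null;
}
```

With this shape, a scenario whose first pass is usable and whose final pass regresses is reported as FAIL and keeps its time_to_first_usable_artifact_seconds value.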

Add Shipwright v2 benchmark harness

8d64281

chatgpt-codex-connector bot reviewed Apr 2, 2026

EdgeCaser merged commit fec0169 into main Apr 2, 2026
4 checks passed

EdgeCaser deleted the codex/shipwright-v2-benchmarks branch April 2, 2026 17:42

EdgeCaser merged 1 commit into main from codex/shipwright-v2-benchmarks