💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8d64281b84
    function deriveScenarioStatus(finalPass) {
      if (!finalPass.usable) return 'DNF';
Determine DNF from all passes, not final pass only
This marks a scenario as DNF whenever the final artifact is unusable, even if the first pass was usable. Under the scoring spec, DNF should only occur when no artifact reaches usability; a regression in a later draft should not erase an earlier usable artifact. With the current logic, scenarios that were usable on first pass can be misclassified as DNF and their time_to_first_usable_artifact_seconds is later nulled out, corrupting benchmark outcomes.
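One possible shape for the fix, as a minimal sketch rather than the repository's actual code: pass every recorded pass into the function and return DNF only when none of them is usable. The array parameter `passes` and the `valid` field are assumptions made for illustration, and judging PASS/FAIL on the final pass is likewise assumed, not taken from the scoring spec.

```javascript
// Sketch only. Hypothetical pass shape: { usable: boolean, valid: boolean }.
// `valid` is an assumed stand-in for whatever the validator actually reports.
function deriveScenarioStatus(passes) {
  // DNF only when no pass ever produced a usable artifact.
  const everUsable = passes.some((pass) => pass.usable);
  if (!everUsable) return 'DNF';

  // Assumption: PASS/FAIL is still decided by the final pass, so a late
  // regression yields FAIL rather than erasing the earlier usable artifact.
  const finalPass = passes[passes.length - 1];
  return finalPass.usable && finalPass.valid ? 'PASS' : 'FAIL';
}
```

With this shape, a scenario that was usable on its first pass keeps a non-DNF status, so time_to_first_usable_artifact_seconds would not need to be nulled out for it.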
Summary
- scripts/run-benchmarks.mjs: loads frozen fixture scenarios, runs the existing validator, computes PASS/FAIL/DNF status, normalizes blind ratings, and emits the scoring-spec result shape
- benchmarks/scenarios/ with fixture artifacts covering hidden scope creep, partial-data pricing, board ambiguity, churn contradictions, weak evidence, and handoff contradictions
- schema_version: 2.0.0; each scenario exercises a detection → resolution cycle (first pass surfaces validator issues, final pass resolves them)

Test plan
- node --test tests/run-benchmarks.test.mjs (6 tests: suite execution, blind-rating normalization, DNF diagnostic scoring, scenarioIds filtering, invalid scenario loading)
- node --test tests/*.test.mjs (163 tests, all passing)
- node scripts/run-benchmarks.mjs --format json (expected mix confirmed)
- bash scripts/validate.sh (green)

🤖 Generated with Claude Code