ds4-eval (fix): q13 provides wrong answer#233

Merged: antirez merged 1 commit into antirez:main from alantsev:eval-fix on May 24, 2026

@alantsev
Contributor

The original expected answer for q13 was outside the claimed energy precision.

The model tries really hard to reach that answer, which is impossible.

This PR changes the expected answer to the correct value.

The evaluation after the fix, using a configuration that gives a smooth distribution over the tokens:


```
$ ./ds4-eval --temp 3.0 --min-p 0.25 --nothink
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-eval: context auto-sized to 16777 tokens (largest prompt=777 tokens, case=70, generation budget=16000)
ds4-eval: context buffers 479.38 MiB (ctx=16777, backend=cuda, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=4196)
ds4-eval: 17/92 passed, 1 failed, runtime 00h:34m
#   state      prompt      gen    total given    correct  test
  1 PASSED        201      733      934 B        B        GPQA Diamond/recNu3MXkvWUzHZr9
  2 PASSED        149       87      236 C        C        SuperGPQA/001b51d76b4d422988f2c11f104a2c6c
  3 PASSED         81      574      655 70       70       AIME2025/aime2025-01
  4 PASSED        313      239      552 C        C        GPQA Diamond/recoiTJPGUmzAkief
  5 PASSED        272      177      449 J        J        SuperGPQA/b7e20eac98764fb0bf30e8366d951daa
  6 PASSED        146     1140     1286 468      468      AIME2025/aime2025-16
  7 PASSED        156      646      802 B        B        GPQA Diamond/rec4UqStf9WUVif1f
  8 PASSED        127       52      179 E        E        SuperGPQA/4a1d1780a93f4093b6fb7d3c314cbea8
  9 PASSED        633     4780     5413 588      588      AIME2025/aime2025-02
 10 PASSED        182      322      504 B        B        GPQA Diamond/recgI6tUQ7RLJRWGx
 11 PASSED        137       68      205 A        A        SuperGPQA/6082513c8dba4ec68aa68f1bf5854d09
 12 PASSED        165      747      912 16       16       AIME2025/aime2025-03
 13 PASSED        149      672      821 A        A        GPQA Diamond (modified)/recDytVnNYZe2HuUU
 14 PASSED        167       68      235 J        J        SuperGPQA/bebf1ed45ae14ad7b4f205f3909cb58a
 15 FAILED        305     4837     5142 86       82       AIME2025/aime2025-18
 16 PASSED        131      671      802 D        D        GPQA Diamond/recNFJjE5PPTqVJGv
 17 PASSED        175       67      242 I        I        SuperGPQA/7ca71b86327744b78e93185a45bc5cef
 18 PASSED        102     1199     1301 117      117      AIME2025/aime2025-04
 19 STOPPED       187       80      267 -        B        GPQA Diamond/rec2UlKqC6RFHdcro
 20 PENDING         0        0        0 -        E        SuperGPQA/d44b94f7749345a39a65f6312bda8764
 21 PENDING         0        0        0 -        106      AIME2025/aime2025-19
 22 PENDING         0        0        0 -        B        GPQA Diamond/recv7GsQg3f0fvB1f
 23 PENDING         0        0        0 -        B        SuperGPQA/febe406f44d74a40b50bb5b7c69d5dc1
```
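For readers wondering what a "smooth distribution over the tokens" means in practice, here is a minimal sketch of temperature scaling followed by a min-p cut. It assumes that `--temp` and `--min-p` follow the common convention (temperature divides the logits, min-p keeps tokens whose probability is at least `min_p` times the top probability). The function names, the order of the two steps, and the example logits are assumptions made for illustration; this is not taken from the ds4-eval source.

```python
import math
import random


def min_p_distribution(logits, temp=3.0, min_p=0.25):
    """Return sampling probabilities after temperature scaling and a min-p cut.

    Hypothetical helper: the order (temperature first, then min-p) is an
    assumption, not something confirmed for ds4-eval.
    """
    # A temperature above 1 flattens the softmax, spreading mass over more tokens.
    scaled = [x / temp for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(weights)
    probs = [w / total for w in weights]

    # min-p keeps only tokens at least min_p times as likely as the top token.
    cutoff = min_p * max(probs)
    kept = [p if p >= cutoff else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]


def sample_token(logits, temp=3.0, min_p=0.25, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = min_p_distribution(logits, temp, min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With made-up logits such as `[5.0, 4.0, 2.0, -1.0]`, a temperature of 1.0 leaves two candidate tokens after the cut, while a temperature of 3.0 leaves three, which is the sense in which the higher temperature smooths the distribution.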
antirez merged commit e3efafe into antirez:main on May 24, 2026
@antirez
Owner

antirez commented May 24, 2026

Thanks, there are a lot of broken entries; I had to remove several when I started the eval thing.
