[Non-Record] Extended Compute Scaling Analysis: 1.0853 BPB at 50K steps (11.5 hours) on 4×A100MIG by OnlyJundong · Pull Request #1005 · openai/parameter-golf

OnlyJundong · 2026-03-28T05:40:07Z

Summary

This is a non-record submission. It studies how the current record-track SOTA (PR #549 by @abaybektursun) scales under extended compute once the 10-minute wall-clock constraint is removed. The same architecture and code are trained for 20K–50K steps (5.5–11.5 hours of training) on 4×A100 MIG instances, which are approximately 10× slower per step than 8×H100 SXM.

Results

Best run: 50K steps and 11.5 hours (4×A100 MIG, seed 1337)

Phase	val_loss	val_bpb	Artifact (bytes)
Pre-TTT (EMA)	1.8469	1.0939	14,348,646
Int6 roundtrip	1.8963	1.1231	14,348,646
Sliding window (s=64)	1.8566	1.0996	14,348,646
Legal TTT	1.8325	1.0853	14,348,646
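
The relationship between the val_loss and val_bpb columns can be sanity-checked with a short sketch. It assumes (this is not stated in the PR) that val_loss is a mean cross-entropy in nats per token and that val_bpb is obtained by converting to bits and dividing by the average number of bytes per token. Under that assumption every row should imply the same bytes-per-token ratio. The names `phases` and `implied_bytes_per_token` are illustrative only.

```python
import math

# (val_loss, val_bpb) pairs copied from the 50K-step table above.
phases = {
    "Pre-TTT (EMA)": (1.8469, 1.0939),
    "Int6 roundtrip": (1.8963, 1.1231),
    "Sliding window (s=64)": (1.8566, 1.0996),
    "Legal TTT": (1.8325, 1.0853),
}


def implied_bytes_per_token(val_loss: float, val_bpb: float) -> float:
    """Bytes per token implied by one table row.

    Assumption: val_bpb = (val_loss / ln 2) / bytes_per_token,
    with val_loss measured in nats per token.
    """
    bits_per_token = val_loss / math.log(2)
    return bits_per_token / val_bpb


ratios = {name: implied_bytes_per_token(loss, bpb)
          for name, (loss, bpb) in phases.items()}
for name, ratio in ratios.items():
    print(f"{name:24s} {ratio:.4f} bytes/token")

# Differences in bpb relative to the pre-TTT EMA checkpoint.
pre_bpb = phases["Pre-TTT (EMA)"][1]
ttt_gain = phases["Legal TTT"][1] - pre_bpb        # negative = improvement
int6_penalty = phases["Int6 roundtrip"][1] - pre_bpb  # positive = degradation
print(f"TTT gain:     {ttt_gain:+.4f} bpb")
print(f"Int6 penalty: {int6_penalty:+.4f} bpb")
```

If the assumption holds, the four rows agree on roughly 2.44 bytes per token, which suggests the two columns are the same measurement in different units.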

20K steps and 5.5 hours (4×A100 MIG, 2-seed comparison)

Seed	step_avg	steps	Pre-TTT bpb	Post-TTT bpb	TTT gain (bpb)	Artifact (bytes)
1337	828.7ms	20,000	1.1018	1.0957	-0.0061	15,077,933
42	828.8ms	20,000	1.1020	1.0962	-0.0058	15,137,145
Mean	828.8ms	20,000	1.1019	1.0960 (std 0.0004)	-0.0060	15,107,539
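
The Mean row can be reproduced from the two seed rows with a minimal sketch. It assumes the reported std is the sample standard deviation (n − 1 in the denominator), which is a guess that happens to match the rounded value. It also converts step_avg into hours spent stepping; the PR does not break down the quoted 5.5 hours, so any remainder is unexplained here.

```python
import statistics

# Per-seed values copied from the 20K-step table above.
pre_ttt = {1337: 1.1018, 42: 1.1020}
post_ttt = {1337: 1.0957, 42: 1.0962}
step_ms = {1337: 828.7, 42: 828.8}
steps = 20_000

mean_pre = statistics.mean(pre_ttt.values())
mean_post = statistics.mean(post_ttt.values())
# Sample standard deviation (assumed; the PR does not say which estimator).
std_post = statistics.stdev(post_ttt.values())

# TTT gain per seed: negative means TTT lowered bpb.
gains = {seed: post_ttt[seed] - pre_ttt[seed] for seed in pre_ttt}
mean_gain = statistics.mean(gains.values())

# Hours spent in training steps alone, from the mean step time.
mean_step_ms = statistics.mean(step_ms.values())
stepping_hours = steps * mean_step_ms / 1000 / 3600

print(f"mean pre-TTT bpb:  {mean_pre:.4f}")
print(f"mean post-TTT bpb: {mean_post:.5f} (std {std_post:.4f})")
print(f"mean TTT gain:     {mean_gain:+.5f}")
print(f"stepping time:     {stepping_hours:.2f} h")
```

With only two seeds the standard deviation is a rough indicator at best; the per-seed gains are the more informative numbers.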

Plots

BPB vs Steps (ASCII plot)

BPB
4.10 |*
     |
     |
     |
2.50 |
     |
1.26 | *
1.23 |  *
1.22 |   *
1.20 |    *
1.19 |     * * * * *
1.18 |             * *
1.17 |               *
1.16 |                *
1.15 |                 *
1.13 |                  *
1.12 |                   *
1.09 |                    *
     +----+----+----+----+----+-> steps (K)
     0   10   20   30   40   50

     |<early >|<--- plateau --->|<warmdown>|
      (rapid)                    (sharp drop)

Artifact Size vs Steps (ASCII plot)

MB
17.2 |         * * * * * * * * * * * *
16.8 |                               *
16.4 |                                *
16.0 |------------------------------------*--------  16MB limit
15.7 |                                    *
15.1 |                                     *
14.7 |     *                                *
14.1 |  *                                    *
13.1 | *
 4.6 |*
     +----+----+----+----+----+-> steps (K)
     0   10   20   30   40   50

     |<-fits->|<--- OVER 16MB ------>|<fits->|

Commit 6d97a05: non-record: extended compute scaling analysis 20K steps (5.5 hours) -… …50K steps (11.5 hours) on 4xA100 MIG

OnlyJundong wants to merge 1 commit into openai:main from OnlyJundong:nonrecord/extended-compute-scaling-50k
