LLM-based optimizations for BatchCompleter#1293
Open
brandur wants to merge 1 commit into
Open
Conversation
168f657 to
d177425
Compare
1e5df34 to
4a982be
Compare
fb7ef2c to
4acc752
Compare
4a982be to
d165299
Compare
4acc752 to
2bb559c
Compare
This one's just throwing Codex at `BatchCompleter` to see what
optimizations it can find in there as this seems to be by far our
slowest spot in River with respect to benchmarking at least.
It didn't do anything huge, but ended up putting in a variety of minor
optimizations:
- Removed the separate `setStateStartTimes` map by storing `StartTime`
in `batchCompleterSetState`.
- Changed queued state storage from pointer values to map values,
removing a wrapper allocation.
- Collapsed backlog checking + enqueue into one lock acquisition on the
common path.
- Replaced `sliceutil.Map` with a direct event-building loop.
- Use one batch completion timestamp instead of calling `Now()` once per
completed row.
- Avoid repeatedly assigning `params.Schema` inside the batch mapping
loop.
Benchmarks seem to indicate a previous 12-13 us/op coming down to 10-11
us/op, and with one fewer allocation per op.
I'm seeing a definite speedup when I run the full benchmark. Our bench
has some consistency problems (can show considerably different results
per run), but whereas before ~46k jobs/sec was about the best I ever saw
on my commodity MBP here, I'm now seeing up to ~56k jobs/sec, so at best
a 10k jobs/sec improvement:
$ go run ./cmd/river bench --database-url $DATABASE_URL --num-total-jobs 1_000_000
bench: jobs worked [ 0 ], inserted [ 1000000 ], job/sec [ 0.0 ] [0s]
bench: jobs worked [ 106472 ], inserted [ 0 ], job/sec [ 53236.0 ] [2s]
bench: jobs worked [ 108440 ], inserted [ 0 ], job/sec [ 54220.0 ] [2s]
bench: jobs worked [ 114035 ], inserted [ 0 ], job/sec [ 57017.5 ] [2s]
bench: jobs worked [ 107402 ], inserted [ 0 ], job/sec [ 53701.0 ] [2s]
bench: jobs worked [ 114433 ], inserted [ 0 ], job/sec [ 57216.5 ] [2s]
bench: jobs worked [ 105701 ], inserted [ 0 ], job/sec [ 52850.5 ] [2s]
bench: jobs worked [ 116051 ], inserted [ 0 ], job/sec [ 58025.5 ] [2s]
bench: jobs worked [ 108054 ], inserted [ 0 ], job/sec [ 54027.0 ] [2s]
bench: jobs worked [ 119412 ], inserted [ 0 ], job/sec [ 59706.0 ] [2s]
bench: total jobs worked [ 1000000 ], total jobs inserted [ 1000000 ], overall job/sec [ 55710.8 ], running 17.949838958s
The number should be even better on faster computers.
Nothing in the code gets any worse (and I think some of it is actually
an improvement?) so I think it's probably worthwhile to bring these in.
d165299 to
9b268fa
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This one's just throwing Codex at
BatchCompleterto see whatoptimizations it can find in there as this seems to be by far our
slowest spot in River with respect to benchmarking at least.
It didn't do anything huge, but ended up putting in a variety of minor
optimizations:
Removed the separate
setStateStartTimesmap by storingStartTimein
batchCompleterSetState.Changed queued state storage from pointer values to map values,
removing a wrapper allocation.
Collapsed backlog checking + enqueue into one lock acquisition on the
common path.
Replaced
sliceutil.Mapwith a direct event-building loop.Use one batch completion timestamp instead of calling
Now()once percompleted row.
Avoid repeatedly assigning
params.Schemainside the batch mappingloop.
Benchmarks seem to indicate a previous 12-13 us/op coming down to 10-11
us/op, and with one fewer allocation per op.
I'm seeing a definite speedup when I run the full benchmark. Our bench
has some consistency problems (can show considerably different results
per run), but whereas before ~46k jobs/sec was about the best I ever saw
on my commodity MBP here, I'm now seeing up to ~56k jobs/sec, so at best
a 10k jobs/sec improvement:
The number should be even better on faster computers.
Nothing in the code gets any worse (and I think some of it is actually
an improvement?) so I think it's probably worthwhile to bring these in.