Continuous batching #230
Open
Chida82 wants to merge 5 commits into
I started exploring the continuous batching approach, since I sometimes use multiple Pi agent sessions or multi-agent setups.
This PR contains an initial draft, with tests verifying that responses which were previously processed sequentially are no longer strictly sequential.
What I observed is that, with 2 or 3 concurrent requests, the combined TPS of the concurrent requests is about 5% higher than the TPS of a single request.
However, more aggressive changes did not bring any additional improvement.
I stopped here because I started wondering whether, from a product perspective, increasing the complexity of the code and execution flow by introducing the concept of continuous batching is actually justified by this gain. In other words, I am not sure whether keeping requests strictly sequential is a requirement of the project.
To test it, simply run:
./ds4-server --batching --batch-size 3
I’d appreciate your feedback. I’m also including a Markdown document that explains this first part in more detail.
Continuous Batching
This note documents the first DS4 server continuous batching path added for
concurrent streaming requests.
Goal
The previous server worker owned one live `ds4_session` and ran each queued job
to completion before starting the next one. Client sockets could be concurrent,
but model execution and streamed bytes were sequential: a second request would
not emit its first byte until the first request had finished.
The new mode is opt-in through the `--batching` flag.
It keeps the default behavior unchanged when `--batching` is not passed.
Reference Points
The design borrows only the scheduler shape from the reference projects, not
their internal tensor abstractions.
From vLLM:
step is complete.
From llama.cpp:
clients when capacity exists;
broad framework.
DS4 Integration
The implementation is isolated in `ds4_server.c` near the existing queue worker.
The old `worker_main()` path is still used unless `server.batching` is true.
New server fields:
- `batching`: enables the batched worker path;
- `batch_size`: maximum number of active batched requests, default `2`.
New CLI flags:
- `--batching`
- `--batch-size N`
The first implementation supports the safest request class only:
- `/v1/chat/completions` or `/v1/completions`;
Unsupported requests automatically fall back to the existing sequential
`generate_job()` path inside the same worker, so compatibility remains tied to
the old implementation for complex APIs.
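
The flags and the fallback decision described above could be sketched as follows. This is a minimal sketch, not the code in `ds4_server.c`: the names `batch_config`, `parse_batch_flags`, `request_info`, and `can_batch` are hypothetical, as are the `stream` and `has_tools` eligibility fields and the upper bound of 1024. Only the flag names, the default of 2, and the two endpoints come from this note.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mirror of the two new server fields. */
typedef struct {
    bool batching;   /* --batching: enable the batched worker path */
    int  batch_size; /* --batch-size N: max active batched requests */
} batch_config;

/* Parse only the two batching flags; other arguments are ignored here.
 * Returns false on a missing or invalid --batch-size value. */
static bool parse_batch_flags(int argc, char **argv, batch_config *cfg) {
    cfg->batching = false;
    cfg->batch_size = 2; /* default stated in the note */
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "--batching") == 0) {
            cfg->batching = true;
        } else if (strcmp(argv[i], "--batch-size") == 0) {
            if (i + 1 >= argc) return false;
            char *end = NULL;
            long n = strtol(argv[++i], &end, 10);
            /* Reject empty values, trailing junk, and out-of-range sizes.
             * The 1024 cap is an arbitrary assumption for this sketch. */
            if (end == argv[i] || *end != '\0' || n < 1 || n > 1024) return false;
            cfg->batch_size = (int)n;
        }
    }
    return true;
}

/* Hypothetical summary of an already parsed request. */
typedef struct {
    const char *path; /* e.g. "/v1/chat/completions" */
    bool stream;      /* assumption: only streaming requests are batched */
    bool has_tools;   /* assumption: tool requests stay on the old path */
} request_info;

/* True if the request may take the batched path. When this returns false,
 * the caller would fall back to the sequential generate_job() path. */
static bool can_batch(const batch_config *cfg, const request_info *req) {
    if (!cfg->batching) return false;
    if (!req->stream || req->has_tools) return false;
    return strcmp(req->path, "/v1/chat/completions") == 0 ||
           strcmp(req->path, "/v1/completions") == 0;
}
```

Keeping the eligibility check in one predicate means every request that is not explicitly recognized as safe takes the old path by default.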
Runtime Shape
Each active batched request gets its own `ds4_session`. The worker remains a
single scheduler thread, so it does not share mutable session state across client
threads. The loop is:
send the SSE headers;
`batch_size`;
When `--batching` is enabled, deterministic duplicate streaming requests are
also coalesced onto the first matching active decode. The server compares the
prompt tokens, sampling parameters, stop lists, API shape, model, and stream
options; if they match, the later clients receive their own SSE stream and id,
but the model runs only once. This is a serving-layer optimization inspired by
vLLM-style request sharing: it does not change outputs and it is limited to
requests that are safe to fan out.
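
The duplicate check described above can be illustrated with a reduced comparison key. This is a sketch under assumptions: `coalesce_key`, `is_deterministic`, and `can_coalesce` are hypothetical names, the key holds only a subset of the fields the server compares, the stop list is simplified to a single string, and "deterministic" is assumed here to mean a temperature of zero.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced view of what two requests must share. */
typedef struct {
    const int *tokens;  /* prompt token ids */
    size_t n_tokens;
    float temperature;
    float top_p;
    int api_kind;       /* hypothetical: 0 = chat, 1 = completions */
    const char *model;
    const char *stop;   /* simplified: one stop string, or NULL */
} coalesce_key;

/* NULL-safe string equality: two NULLs match, NULL never matches a string. */
static bool str_eq(const char *a, const char *b) {
    if (a == NULL || b == NULL) return a == b;
    return strcmp(a, b) == 0;
}

/* Assumption for this sketch: greedy decoding is the deterministic case. */
static bool is_deterministic(const coalesce_key *k) {
    return k->temperature == 0.0f;
}

/* True if request b can be served by fanning out request a's decode. */
static bool can_coalesce(const coalesce_key *a, const coalesce_key *b) {
    if (!is_deterministic(a) || !is_deterministic(b)) return false;
    if (a->api_kind != b->api_kind) return false;
    if (a->top_p != b->top_p) return false;
    if (!str_eq(a->model, b->model) || !str_eq(a->stop, b->stop)) return false;
    if (a->n_tokens != b->n_tokens) return false;
    return a->n_tokens == 0 ||
           memcmp(a->tokens, b->tokens, a->n_tokens * sizeof(int)) == 0;
}
```

Any mismatch rejects the match, so a request that differs in even one compared field runs its own decode.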
This is continuous batching at the server scheduling layer. It does not yet
combine multiple sequences into one backend kernel launch. That deeper backend
batching would require a DS4 engine API that can evaluate multiple sequence IDs
in one graph call. The current path deliberately avoids exposing backend
internals and keeps the change small enough to validate with the live server
tests.
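
The scheduling shape described in this section (admit waiting requests while capacity exists, step each active request once, retire finished ones) can be simulated without any model. The sketch below is illustrative only: `sim_request` and `run_schedule` are hypothetical, a "decode step" is reduced to decrementing a counter, and the real worker additionally owns sessions and writes SSE output.

```c
#include <stddef.h>

#define MAX_ACTIVE 8

/* Hypothetical stand-in for a request that owns its own session. */
typedef struct {
    int id;
    int remaining; /* tokens still to generate */
} sim_request;

/* Runs a round-robin schedule over reqs and records, in order, which
 * request id produced each token. Returns the number of entries written
 * to trace (at most trace_cap). */
static size_t run_schedule(sim_request *reqs, size_t n_reqs, int batch_size,
                           int *trace, size_t trace_cap) {
    sim_request *active[MAX_ACTIVE];
    size_t n_active = 0, next_waiting = 0, n_trace = 0;
    if (batch_size < 1) batch_size = 1;
    if (batch_size > MAX_ACTIVE) batch_size = MAX_ACTIVE;

    while (n_active > 0 || next_waiting < n_reqs) {
        /* Admit waiting requests while capacity exists. */
        while (n_active < (size_t)batch_size && next_waiting < n_reqs) {
            active[n_active++] = &reqs[next_waiting++];
        }
        /* One decode step per active request, in slot order. */
        size_t kept = 0;
        for (size_t i = 0; i < n_active; i++) {
            sim_request *r = active[i];
            if (r->remaining > 0) {
                if (n_trace < trace_cap) trace[n_trace++] = r->id;
                r->remaining--;
            }
            /* Keep the request only if it still has tokens to emit. */
            if (r->remaining > 0) active[kept++] = r;
        }
        n_active = kept;
    }
    return n_trace;
}
```

With a batch size of 1 the trace degenerates to the old sequential order, and with a larger batch size the ids interleave, which is the behavior the live test checks for. Each step still evaluates one request at a time, matching the note that no multi-sequence kernel launch is involved.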
Non-Batching Behavior
Without `--batching`, requests stay on the existing `worker_main()` path.
The existing shared live cache, disk KV cache, tool replay, Responses live
continuation, Anthropic continuation, thinking checkpoint, and structured stream
logic all stay on the old path.
With `--batching`, only simple OpenAI streaming requests take the new batched
path. Complex requests are still handled by `generate_job()`.
Test Coverage
The server test group now enables batching in the concurrent live test. The key
assertion is that request 2 emits a first byte before request 1 has emitted its
last byte:
The validation command used after implementation was:
The successful run logged:
That shows the second stream started about 6.9 seconds before the first stream
finished.
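
The overlap condition the test relies on reduces to a comparison of two timestamps. A minimal sketch, assuming each client records when its first and last streamed bytes arrive; `stream_timing`, `streams_overlap`, and `overlap_margin_s` are hypothetical names and do not come from the test suite:

```c
#include <stdbool.h>

/* Hypothetical per-client timing record, in seconds from a common clock. */
typedef struct {
    double first_byte_s; /* when the first streamed byte arrived */
    double last_byte_s;  /* when the last streamed byte arrived */
} stream_timing;

/* True if the second stream started before the first one finished. */
static bool streams_overlap(const stream_timing *first,
                            const stream_timing *second) {
    return second->first_byte_s < first->last_byte_s;
}

/* How long before the end of the first stream the second one started.
 * Positive values mean the streams overlapped. */
static double overlap_margin_s(const stream_timing *first,
                               const stream_timing *second) {
    return first->last_byte_s - second->first_byte_s;
}
```

Under the old sequential worker the margin would be zero or negative, since the second request could not emit a byte until the first had finished.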
A live curl benchmark with three identical deterministic streaming requests and
`--batching --batch-size 3` completed all three in about the same wall time as a
single request (`6.89s` single, `6.86-6.87s` for each concurrent client), with
the server logging `batching coalesced duplicate ... fanout=3`.
Current Limits And Next Step
This first version prioritizes correctness, isolation, and observable concurrent
streaming. It intentionally does not batch tools, Anthropic, Responses,
thinking-mode streams, disk KV reuse, or multi-sequence backend kernels. The
duplicate-request coalescing optimization is deliberately conservative: it only
helps repeated deterministic calls that can safely share one generated token
stream.
The next meaningful step is an engine-level multi-sequence decode API. Once DS4
can evaluate several active sequence frontiers in one backend call, the server
scheduler can keep the same waiting/running shape and replace per-request step
evaluation with true backend token batching.