Skip to content

Continuous batching#230

Open
Chida82 wants to merge 5 commits into
antirez:mainfrom
Chida82:concurrencyrequest
Open

Continuous batching#230
Chida82 wants to merge 5 commits into
antirez:mainfrom
Chida82:concurrencyrequest

Conversation

@Chida82
Copy link
Copy Markdown

@Chida82 Chida82 commented May 23, 2026

I started exploring the continuous batching approach, since I sometimes use multiple Pi agent sessions or multi-agent setups.

This PR contains an initial draft with tests that verify that responses were previously processed sequentially, while now they are no longer strictly sequential.

What I observed is that, with 2 or 3 concurrent requests, there is about a 5% gain when comparing the TPS of a single request with the combined TPS of the concurrent requests.

However, more aggressive changes did not bring any additional improvement.

I stopped here because I started wondering whether, from a product perspective, increasing the complexity of the code and execution flow by introducing the concept of continuous batching is actually justified by this gain. In other words, I am not sure whether keeping requests strictly sequential is a requirement of the project.

To test it, simply run:

./ds4-server --batching --batch-size 3

I’d appreciate your feedback. I’m also including a Markdown document that explains this first part in more detail.


Continuous Batching

This note documents the first DS4 server continuous batching path added for
concurrent streaming requests.

Goal

The previous server worker owned one live ds4_session and ran each queued job
to completion before starting the next one. Client sockets could be concurrent,
but model execution and streamed bytes were sequential: a second request would
not emit its first byte until the first request had finished.

The new mode is opt-in with:

./ds4-server --batching

It keeps the default behavior unchanged when --batching is not passed.

Reference Points

The design borrows only the scheduler shape from the reference projects, not
their internal tensor abstractions.

From vLLM:

  • keep separate waiting and running request sets;
  • admit waiting work while other requests are already decoding;
  • schedule in small steps so outputs can be produced per request as soon as a
    step is complete.

From llama.cpp:

  • use a continuous batching loop that decodes active clients, then admits new
    clients when capacity exists;
  • keep a small explicit concurrency limit;
  • prefer a simple, observable scheduler loop over hiding request state behind a
    broad framework.

DS4 Integration

The implementation is isolated in ds4_server.c near the existing queue worker.
The old worker_main() path is still used unless server.batching is true.

New server fields:

  • batching: enables the batched worker path;
  • batch_size: maximum number of active batched requests, default 2.

New CLI flags:

  • --batching
  • --batch-size N

The first implementation supports the safest request class only:

  • OpenAI-compatible /v1/chat/completions or /v1/completions;
  • streaming responses;
  • no tools;
  • no thinking mode.

Unsupported requests automatically fall back to the existing sequential
generate_job() path inside the same worker, so compatibility remains tied to
the old implementation for complex APIs.

Runtime Shape

Each active batched request gets its own ds4_session. The worker remains a
single scheduler thread, so it does not share mutable session state across client
threads. The loop is:

  1. dequeue a request, blocking only when there are no active requests;
  2. if the request is supported, create a per-request session, prefill it, and
    send the SSE headers;
  3. admit more queued requests up to batch_size;
  4. run one decode step per active request;
  5. stream any newly safe UTF-8 text for that request;
  6. finish, clean up the per-request session, and signal the client thread;
  7. keep admitting new work while existing requests are still decoding.

When --batching is enabled, deterministic duplicate streaming requests are
also coalesced onto the first matching active decode. The server compares the
prompt tokens, sampling parameters, stop lists, API shape, model, and stream
options; if they match, the later clients receive their own SSE stream and id,
but the model runs only once. This is a serving-layer optimization inspired by
vLLM-style request sharing: it does not change outputs and it is limited to
requests that are safe to fan out.

This is continuous batching at the server scheduling layer. It does not yet
combine multiple sequences into one backend kernel launch. That deeper backend
batching would require a DS4 engine API that can evaluate multiple sequence IDs
in one graph call. The current path deliberately avoids exposing backend
internals and keeps the change small enough to validate with the live server
tests.

Non-Batching Behavior

Without --batching, the code path remains:

client thread -> enqueue -> single worker -> generate_job() -> done

The existing shared live cache, disk KV cache, tool replay, Responses live
continuation, Anthropic continuation, thinking checkpoint, and structured stream
logic all stay on the old path.

With --batching, only simple OpenAI streaming requests take the new batched
path. Complex requests are still handled by generate_job().

Test Coverage

The server test group now enables batching in the concurrent live test. The key
assertion is that request 2 emits a first byte before request 1 has emitted its
last byte:

req1_last_byte_ts > req2_first_byte_ts

The validation command used after implementation was:

make ds4_test && ./ds4_test --server >/tmp/ds4_server_test.log 2>&1

The successful run logged:

ds4-server: continuous batching enabled batch_size=2
ds4-server: batching start chat prompt=571 max=512 active_limit=2
ds4-server: batching start chat prompt=571 max=512 active_limit=2
ds4-test: concurrent compare req_gap_ms=500 req1_last_byte_ts=1779470996.679727 req2_first_byte_ts=1779470989.736638 sequential=0
server: OK

That shows the second stream started about 6.9 seconds before the first stream
finished.

A live curl benchmark with three identical deterministic streaming requests and
--batching --batch-size 3 completed all three in about the same wall time as a
single request (6.89s single, 6.86-6.87s for each concurrent client), with
the server logging batching coalesced duplicate ... fanout=3.

Current Limits And Next Step

This first version prioritizes correctness, isolation, and observable concurrent
streaming. It intentionally does not batch tools, Anthropic, Responses,
thinking-mode streams, disk KV reuse, or multi-sequence backend kernels. The
duplicate-request coalescing optimization is deliberately conservative: it only
helps repeated deterministic calls that can safely share one generated token
stream.

The next meaningful step is an engine-level multi-sequence decode API. Once DS4
can evaluate several active sequence frontiers in one backend call, the server
scheduler can keep the same waiting/running shape and replace per-request step
evaluation with true backend token batching.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant