Decouple SSE prefill keepalive from progress callback #224

Open
njbrake wants to merge 1 commit into antirez:main from njbrake:fix/prefill-keepalive-wallclock
njbrake commented May 22, 2026

This would unblock the issues I was seeing, but I'm also curious whether you want this to be a configurable parameter (enable/disable), or whether you would prefer that I keep it in my own fork instead of merging it into your repo.

Note: this PR description was drafted by Claude via back-and-forth with @njbrake. The reasoning and decisions are his; the prose is Claude's.

Fixes #222.

Summary

The current keepalive fires only from inside the prefill_chunk progress callback. When a single prefill chunk stalls internally for many minutes (observed: 962s spent on the last 107 tokens of a 1046-token prefill), no progress callback fires, no ": prefill\n\n" comment goes out, and clients with body-idle timeouts drop the socket. The server then logs "client stream write failed" when it tries to emit the first generated token.

This change adds a small wall-clock keepalive thread that runs alongside prefill, ticks roughly once a second, and writes ": prefill\n\n" whenever 5 seconds have elapsed since the last keepalive event, regardless of whether progress callbacks are firing. The thread also sends the initial SSE headers if the callback has not done so yet. The thread and the callback share a mutex, so writes to the fd and updates to the shared headers_sent / stream_failed / last_keepalive fields never interleave.

The thread is attached and detached around every ds4_session_sync call site: the main prefill, the cold-prefix prefill, and the tool-checkpoint rebuild.
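
For readers who have not opened the diff, the arrangement described above can be sketched roughly as follows. This is not the code in this PR: the keepalive struct, every ka_* function name, the millisecond parameters, and the exact HTTP header string are assumptions made for illustration. Only the ": prefill\n\n" comment, the 5s / roughly 1s timing, and the headers_sent, stream_failed and last_keepalive field names come from the description.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical shared state. Only headers_sent, stream_failed and
 * last_keepalive are names taken from the PR text. */
typedef struct {
    pthread_mutex_t mu;        /* guards fd writes and the fields below */
    pthread_t thread;
    int fd;
    bool headers_sent;
    bool stream_failed;
    bool stop;
    long long last_keepalive;  /* monotonic milliseconds */
    long long interval_ms;     /* 5000 in the PR description */
    long long tick_ms;         /* roughly 1000 in the PR description */
} keepalive;

static long long ka_now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

static void ka_sleep_ms(long long ms) {
    struct timespec ts = { (time_t)(ms / 1000), (long)(ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

static bool ka_write_all(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR) continue;   /* retry interrupted writes */
            return false;
        }
        buf += n;
        len -= (size_t)n;
    }
    return true;
}

/* Caller must hold ka->mu. Sends headers once, then a comment if one is due. */
static void ka_emit_if_due_locked(keepalive *ka) {
    /* Assumed header block; the real server's headers may differ. */
    static const char headers[] =
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/event-stream\r\n"
        "Cache-Control: no-cache\r\n\r\n";
    static const char comment[] = ": prefill\n\n";

    if (ka->stream_failed) return;
    if (ka_now_ms() - ka->last_keepalive < ka->interval_ms) return;
    if (!ka->headers_sent) {
        if (!ka_write_all(ka->fd, headers, sizeof headers - 1)) {
            ka->stream_failed = true;
            return;
        }
        ka->headers_sent = true;
    }
    if (!ka_write_all(ka->fd, comment, sizeof comment - 1)) {
        ka->stream_failed = true;
        return;
    }
    ka->last_keepalive = ka_now_ms();
}

/* Progress-callback path: same check, same lock as the timer thread. */
void ka_on_progress(keepalive *ka) {
    pthread_mutex_lock(&ka->mu);
    ka_emit_if_due_locked(ka);
    pthread_mutex_unlock(&ka->mu);
}

/* Timer-thread path: wakes every tick_ms whether or not progress fires. */
static void *ka_thread_main(void *arg) {
    keepalive *ka = arg;
    for (;;) {
        pthread_mutex_lock(&ka->mu);
        bool stop = ka->stop;
        if (!stop) ka_emit_if_due_locked(ka);
        pthread_mutex_unlock(&ka->mu);
        if (stop) break;
        ka_sleep_ms(ka->tick_ms);
    }
    return NULL;
}

/* Call before the long-running sync; returns 0 on success, -1 on failure. */
int ka_attach(keepalive *ka, int fd, long long interval_ms, long long tick_ms) {
    memset(ka, 0, sizeof *ka);
    ka->fd = fd;
    ka->interval_ms = interval_ms;
    ka->tick_ms = tick_ms;
    ka->last_keepalive = ka_now_ms();
    if (pthread_mutex_init(&ka->mu, NULL) != 0) return -1;
    if (pthread_create(&ka->thread, NULL, ka_thread_main, ka) != 0) {
        pthread_mutex_destroy(&ka->mu);
        return -1;
    }
    return 0;
}

/* Call after the sync returns; blocks for at most about one tick. */
void ka_detach(keepalive *ka) {
    pthread_mutex_lock(&ka->mu);
    ka->stop = true;
    pthread_mutex_unlock(&ka->mu);
    pthread_join(ka->thread, NULL);
    pthread_mutex_destroy(&ka->mu);
}
```

In this sketch a call site would wrap the blocking sync as ka_attach(&ka, fd, 5000, 1000), then the sync call, then ka_detach(&ka). Because both paths go through ka_emit_if_due_locked under the same mutex, a keepalive sent by the callback resets the timer for the thread and vice versa. The trade-off of a plain sleep loop is that detach can wait up to one tick; a condition variable with a timed wait would avoid that at the cost of a little more code.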

The existing keepalive only fired from inside the prefill_chunk progress
callback, so a stall inside a single chunk could leave the socket silent
for many minutes (observed: 962s spent on the last 107 tokens of a 1046
token prefill). Clients with body-idle timeouts drop the connection, and
the final stream write fails with "client stream write failed".

Add a small wall-clock keepalive thread that runs alongside prefill,
ticks roughly once a second, and writes ": prefill\n\n" whenever 5s
have elapsed since the last keepalive event, regardless of whether
progress callbacks are firing. The thread also sends the SSE headers
if the callback has not done so yet. The thread and the callback share
a mutex so fd writes and shared state (headers_sent, stream_failed,
last_keepalive) never interleave.

Attached and detached around every ds4_session_sync call site: the
main prefill, the cold-prefix prefill, and the tool checkpoint rebuild.

Fixes antirez#222
njbrake changed the title from "ds4-server: emit SSE keepalive on wall-clock timer" to "Decouple SSE prefill keepalive from progress callback" on May 22, 2026
Development

Successfully merging this pull request may close these issues.

BUG: SSE keepalive misses prefill stalls because the comment is emitted from the progress callback, not a wall-clock timer
