DBOS enqueue-only clients: fix dispatcher race in management commands#131
DBOS enqueue-only clients: fix dispatcher race in management commands#131
Conversation
rafacm
left a comment
There was a problem hiding this comment.
Verdict: code change looks directionally correct for #113, but I would not merge until the failing documentation/comment-discipline checks are fixed; I do not see a DBOS-bootstrap correctness blocker.
Correctness
The worker/client split in episodes/apps.py:39-43 is hardcoded from sys.argv: _WORKER_COMMANDS = {"runserver"} plus sys.argv[0] containing uvicorn, and _CLIENT_COMMANDS = {"submit_episode", "enrich_entities"}. That gives the intended parity for both acceptance-criterion commands: submit_episode and enrich_entities enter client mode and call DBOS.listen_queues([]) at episodes/apps.py:75-81, while uvicorn/runserver still get the normal DBOS.launch() worker path at episodes/apps.py:82.
I checked the DBOS source behind the workaround. Queue.enqueue calls start_workflow(..., execute_workflow=False) (dbos/_queue.py:101-104), and start_workflow still reaches dbos._sys_db before returning a polling handle (dbos/_core.py:855-891). _sys_db is only available after launch (dbos/_dbos.py:493-497, initialized at dbos/_dbos.py:548-557), so the PR description is right that pure “instantiate but do not launch” would fail. listen_queues([]) is also being used in the required order: DBOS rejects it after launch (dbos/_dbos.py:2925-2938).
For queue polling semantics, dbos/_queue.py:200-215 confirms the client will log/listen to zero application queues, then add only the internal DBOS queue. That should remove contention on this app’s episode_pipeline and wikidata_enrichment queues, and it removes the original shutdown race path where this one-shot process dequeues app work and submits it to its executor. There is still DBOS background machinery in launched clients: delayed workflow transitions, notification listener, scheduler, startup recovery, and the internal queue (dbos/_dbos.py:597-633, dbos/_queue.py:220-224). I do not see that as a blocker for this issue because it does not poll the app queues that caused the reported race, but it is worth knowing this is not a fully inert DBOS client.
The edge cases behave as expected from the hardcoded split:
manage.py test,migrate,check, and other built-ins are neither worker nor client, so_init_dbosreturns atepisodes/apps.py:44-45. That avoids a test-discovery regression.manage.py shellalso skips DBOS. That means a REPL user who imports a queue and callsQueue.enqueuewill still hit “DBOS not initialized/launched” behavior. This predates the PR, so I would not block #113 on it.- Any future custom management command that enqueues must be added to
_CLIENT_COMMANDS; otherwise it will have the same shell behavior. That is a maintainability footgun but not a regression in this PR.
The Failed to start admin server: [Errno 48] Address already in use warning is acceptable to leave out of this fix. It is called out in doc/features/2026-05-04-dbos-enqueue-only-clients.md:59 and scoped out in doc/plans/2026-05-04-dbos-enqueue-only-clients.md:50. It should be tracked separately if the warning is noisy in normal use.
Test coverage
CI unit tests are passing according to gh pr checks 131 (test and Unit Test Results both pass). I also attempted a small local targeted run, but this worktree’s local test DB recreation failed with psycopg.errors.AdminShutdown while destroying/recreating test_ragtime, so I’m relying on CI for runtime confirmation.
Coverage is mostly indirect. Existing tests mock the queues (episodes/tests/test_submit_episode.py:13-28, episodes/tests/test_enrichment.py:171-184, episodes/tests/test_signals.py:8-33) and do not assert the new EpisodesConfig._init_dbos role selection. The PR’s manual smoke test is the main evidence that Queue.enqueue works in client mode and that the contention/shutdown logs are gone. I would prefer a focused unit test that patches dbos.DBOS, simulates sys.argv for submit_episode/enrich_entities/test, and asserts listen_queues([]) is only called for client commands. That is not necessarily merge-blocking given the small code surface and passing CI, but it would protect this exact regression.
Documentation
The required files are present and date-prefixed: plan, feature doc, planning transcript, implementation transcript, and changelog entry. CHANGELOG.md:7-11 links all four documents under ### Fixed, which is the right category for this bug fix.
However, the session transcripts do not satisfy the repository instruction that user messages be verbatim. Both files replace the user content with bracketed summaries instead of the actual prompt text:
doc/sessions/2026-05-04-dbos-enqueue-only-clients-planning-session.md:15-17uses[issue #113 body, full text — ...]rather than the verbatim user message.doc/sessions/2026-05-04-dbos-enqueue-only-clients-implementation-session.md:17-19uses[Full implementation brief — ...]rather than the verbatim parent-agent prompt.
There is also a failing CI check for the documentation bundle because both transcripts share the same session UUID (doc/sessions/2026-05-04-dbos-enqueue-only-clients-planning-session.md:5, doc/sessions/2026-05-04-dbos-enqueue-only-clients-implementation-session.md:5). If these were actually one continuous session, the docs should say that explicitly and the planning/implementation split should be justified; otherwise one of the IDs needs correcting. In its current state this is a policy/check blocker, not a code blocker.
Other findings
CI currently has two failing AI checks:
AI Check: Feature PR Documentation Bundlefails on the duplicated Session ID.AI Check: Comment Disciplinefails on the long explanatory comment blocks inepisodes/apps.py:24-38andepisodes/apps.py:75-81.
The code comments are not misleading, but the check is right that they can be shorter. The useful non-obvious bit is the DBOS constraint: listen_queues([]) must be used because Queue.enqueue still needs _sys_db from launch().
Suggested follow-ups
- Fix the transcript metadata and replace bracketed user-message summaries with verbatim prompts before merge.
- Trim the
episodes/apps.pycomments enough to satisfy the failing comment-discipline check. - Add a small bootstrap-role unit test for
submit_episode,enrich_entities,test,migrate, andshellargv cases. - Consider a follow-up issue for suppressing or configuring DBOS admin-server startup in enqueue-only clients.
- Consider making enqueue-capable custom commands opt into client mode via a setting/env var or helper rather than only via
_CLIENT_COMMANDS.
— Codex review (independent second-opinion pass)
AGENTS.md's transcript policy assumed the human-with-Claude workflow where every session has real `### User` messages. Conductor's parallel- launch model breaks that assumption: the implementation session has no direct user, only an agent-orchestrated launching prompt. Add a new "Agent-orchestrated sessions" subsection to AGENTS.md and a matching note to the Feature PR Documentation Bundle AI check. The new convention: use `### Parent agent (orchestrator)` headings instead of `### User`, reproduce the parent-agent prompt verbatim, and declare the session as agent-orchestrated at the top of `## Detailed conversation`. Same verbatim rule applies — summarized parent prompts are still rejected. This eliminates a recurring false-negative on the docs check that was raised on PRs #129, #130, #131, #133. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
8dcae34 to
f7afd05
Compare
AI Checks summary❌ 1 fail · ✅ 5 pass · ⏭️ 0 skip ❌ Comment Discipline
Comments in the diff violate the Comment Discipline rule. DetailsThere are several comments in the Other checks (5 passing · 0 skipped)Show details
|
Closes #113. Both management commands previously called DBOS.launch() and started a queue dispatcher inside their own process, racing with the uvicorn worker over `dbos.workflow_status` rows on the same queue and causing two cosmetic but alarming log lines on every backfill: - "Contention detected in queue thread for wikidata_enrichment" - "cannot schedule new futures after shutdown" Split _DBOS_COMMANDS in episodes/apps.py:_init_dbos into a worker set (runserver, uvicorn) and a client set (submit_episode, enrich_entities). Clients call DBOS.listen_queues([]) before DBOS.launch() so the dispatcher iterates over zero application queues. Pure "skip launch()" (the issue's recommended Option 1) does not work because DBOS._sys_db is created inside launch() and Queue.enqueue requires it. listen_queues([]) is the working refinement.
Updates the planning and implementation session transcripts to comply with the new "Agent-orchestrated sessions" subsection in AGENTS.md (merged in #136 / 93056e3). - Planning transcript: replace summarized user sections with verbatim text from the parent conversation that authorized this work. - Implementation transcript: declare the session as agent-orchestrated at the top of `## Detailed conversation`, replace the `### User` heading with `### Parent agent (orchestrator)`, and reproduce the parent agent's launching prompt verbatim. - Session ID values audited; placeholder `unavailable` used where the real UUID is unrecoverable, replacing duplicated/fabricated values. This is a verification PR for the new convention — the Feature PR Documentation Bundle AI check should flip from FAIL to PASS after this commit. The Comment Discipline check (long explanatory blocks in episodes/apps.py) remains failing and is intentionally deferred to a separate follow-up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Trim verbose multi-paragraph comments in episodes/apps.py:_init_dbos per .ai-checks/comment-discipline.md (one-line WHY only). - Add test_apps_dbos_init.py covering role detection for uvicorn, runserver, submit_episode, enrich_entities, migrate, test, shell. - Merge duplicate ### Fixed section under 2026-05-04 in CHANGELOG.md after rebasing onto main (PR #129's StepFailed fix).
f7afd05 to
98270bc
Compare
Merging with
|
Manual test plan (PR #131)Use this comment to track manual verification of the role-gated DBOS init. Tick boxes as you run each test. Setupgit worktree add /Users/rafa/conductor/workspaces/ragtime/dbos-enqueue-test rafacm/dbos-enqueue-only
cd /Users/rafa/conductor/workspaces/ragtime/dbos-enqueue-test
uv syncWhat this PR doesSplits DBOS initialization into three roles based on
The footgun being fixed: pre-PR, running Verifications (ascending realism)1. Unit-level role-gating coverage
2. Bootstrap commands skip DBOS entirely
3. Worker path (full ASGI)
4. Client path with worker running (the actual bug fix)
5. Client without worker (sanity check)
Cleanupgit worktree remove /Users/rafa/conductor/workspaces/ragtime/dbos-enqueue-testNotes
|
Summary
submit_episodeandenrich_entitiesno longer callDBOS.launch()as workers. They now run as clients that launch DBOS withlisten_queues([]), so the queue dispatcher in their process iterates over zero application queues.enrich_entitiesbackfill while uvicorn was running:Contention detected in queue thread for wikidata_enrichmentandcannot schedule new futures after shutdown.episodes/apps.py:_init_dbosis now structured around_WORKER_COMMANDSvs_CLIENT_COMMANDS. uvicorn (andrunserver) remain the only processes that drainepisode_pipelineandwikidata_enrichment.The issue's recommended Option 1 ("don't call launch() from clients") cannot be implemented as-is —
DBOS._sys_dbis created insidelaunch(), andQueue.enqueuecallsstart_workflowwhich dereferences_sys_db.listen_queues([])is the working refinement of the same architectural posture.Closes #113.
Test plan
uv run python manage.py checkuv run python manage.py test --noinput(371 tests pass)uv run uvicorn ragtime.asgi:application --port 8765in one shell,uv run python manage.py enrich_entities --limit 1 --retry-failedin another. Confirmed the client logsListening to 0 queues:, exits cleanly, and the workflow row reachesdbos.workflow_statuswith statusENQUEUED. NoContention detectedorcannot schedule new futures after shutdownlines.Unit Test Resultsshould show 371 passing tests with this branch's changes.