Skip to content

DBOS enqueue-only clients: fix dispatcher race in management commands#131

Merged
rafacm merged 3 commits intomainfrom
rafacm/dbos-enqueue-only
May 4, 2026
Merged

DBOS enqueue-only clients: fix dispatcher race in management commands#131
rafacm merged 3 commits intomainfrom
rafacm/dbos-enqueue-only

Conversation

@rafacm
Copy link
Copy Markdown
Owner

@rafacm rafacm commented May 4, 2026

Summary

  • submit_episode and enrich_entities no longer call DBOS.launch() as workers. They now run as clients that launch DBOS with listen_queues([]), so the queue dispatcher in their process iterates over zero application queues.
  • Eliminates the two log lines reproduced on every enrich_entities backfill while uvicorn was running: Contention detected in queue thread for wikidata_enrichment and cannot schedule new futures after shutdown.
  • episodes/apps.py:_init_dbos is now structured around _WORKER_COMMANDS vs _CLIENT_COMMANDS. uvicorn (and runserver) remain the only processes that drain episode_pipeline and wikidata_enrichment.

The issue's recommended Option 1 ("don't call launch() from clients") cannot be implemented as-is — DBOS._sys_db is created inside launch(), and Queue.enqueue calls start_workflow which dereferences _sys_db. listen_queues([]) is the working refinement of the same architectural posture.

Closes #113.

Test plan

  • uv run python manage.py check
  • uv run python manage.py test --noinput (371 tests pass)
  • Manual smoke: uv run uvicorn ragtime.asgi:application --port 8765 in one shell, uv run python manage.py enrich_entities --limit 1 --retry-failed in another. Confirmed the client logs Listening to 0 queues:, exits cleanly, and the workflow row reaches dbos.workflow_status with status ENQUEUED. No Contention detected or cannot schedule new futures after shutdown lines.
  • CI matrix runs — Unit Test Results should show 371 passing tests with this branch's changes.

Copy link
Copy Markdown
Owner Author

@rafacm rafacm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: code change looks directionally correct for #113, but I would not merge until the failing documentation/comment-discipline checks are fixed; I do not see a DBOS-bootstrap correctness blocker.

Correctness

The worker/client split in episodes/apps.py:39-43 is hardcoded from sys.argv: _WORKER_COMMANDS = {"runserver"} plus sys.argv[0] containing uvicorn, and _CLIENT_COMMANDS = {"submit_episode", "enrich_entities"}. That gives the intended parity for both acceptance-criterion commands: submit_episode and enrich_entities enter client mode and call DBOS.listen_queues([]) at episodes/apps.py:75-81, while uvicorn/runserver still get the normal DBOS.launch() worker path at episodes/apps.py:82.

I checked the DBOS source behind the workaround. Queue.enqueue calls start_workflow(..., execute_workflow=False) (dbos/_queue.py:101-104), and start_workflow still reaches dbos._sys_db before returning a polling handle (dbos/_core.py:855-891). _sys_db is only available after launch (dbos/_dbos.py:493-497, initialized at dbos/_dbos.py:548-557), so the PR description is right that pure “instantiate but do not launch” would fail. listen_queues([]) is also being used in the required order: DBOS rejects it after launch (dbos/_dbos.py:2925-2938).

For queue polling semantics, dbos/_queue.py:200-215 confirms the client will log/listen to zero application queues, then add only the internal DBOS queue. That should remove contention on this app’s episode_pipeline and wikidata_enrichment queues, and it removes the original shutdown race path where this one-shot process dequeues app work and submits it to its executor. There is still DBOS background machinery in launched clients: delayed workflow transitions, notification listener, scheduler, startup recovery, and the internal queue (dbos/_dbos.py:597-633, dbos/_queue.py:220-224). I do not see that as a blocker for this issue because it does not poll the app queues that caused the reported race, but it is worth knowing this is not a fully inert DBOS client.

The edge cases behave as expected from the hardcoded split:

  • manage.py test, migrate, check, and other built-ins are neither worker nor client, so _init_dbos returns at episodes/apps.py:44-45. That avoids a test-discovery regression.
  • manage.py shell also skips DBOS. That means a REPL user who imports a queue and calls Queue.enqueue will still hit “DBOS not initialized/launched” behavior. This predates the PR, so I would not block #113 on it.
  • Any future custom management command that enqueues must be added to _CLIENT_COMMANDS; otherwise it will have the same shell behavior. That is a maintainability footgun but not a regression in this PR.

The Failed to start admin server: [Errno 48] Address already in use warning is acceptable to leave out of this fix. It is called out in doc/features/2026-05-04-dbos-enqueue-only-clients.md:59 and scoped out in doc/plans/2026-05-04-dbos-enqueue-only-clients.md:50. It should be tracked separately if the warning is noisy in normal use.

Test coverage

CI unit tests are passing according to gh pr checks 131 (test and Unit Test Results both pass). I also attempted a small local targeted run, but this worktree’s local test DB recreation failed with psycopg.errors.AdminShutdown while destroying/recreating test_ragtime, so I’m relying on CI for runtime confirmation.

Coverage is mostly indirect. Existing tests mock the queues (episodes/tests/test_submit_episode.py:13-28, episodes/tests/test_enrichment.py:171-184, episodes/tests/test_signals.py:8-33) and do not assert the new EpisodesConfig._init_dbos role selection. The PR’s manual smoke test is the main evidence that Queue.enqueue works in client mode and that the contention/shutdown logs are gone. I would prefer a focused unit test that patches dbos.DBOS, simulates sys.argv for submit_episode/enrich_entities/test, and asserts listen_queues([]) is only called for client commands. That is not necessarily merge-blocking given the small code surface and passing CI, but it would protect this exact regression.

Documentation

The required files are present and date-prefixed: plan, feature doc, planning transcript, implementation transcript, and changelog entry. CHANGELOG.md:7-11 links all four documents under ### Fixed, which is the right category for this bug fix.

However, the session transcripts do not satisfy the repository instruction that user messages be verbatim. Both files replace the user content with bracketed summaries instead of the actual prompt text:

  • doc/sessions/2026-05-04-dbos-enqueue-only-clients-planning-session.md:15-17 uses [issue #113 body, full text — ...] rather than the verbatim user message.
  • doc/sessions/2026-05-04-dbos-enqueue-only-clients-implementation-session.md:17-19 uses [Full implementation brief — ...] rather than the verbatim parent-agent prompt.

There is also a failing CI check for the documentation bundle because both transcripts share the same session UUID (doc/sessions/2026-05-04-dbos-enqueue-only-clients-planning-session.md:5, doc/sessions/2026-05-04-dbos-enqueue-only-clients-implementation-session.md:5). If these were actually one continuous session, the docs should say that explicitly and the planning/implementation split should be justified; otherwise one of the IDs needs correcting. In its current state this is a policy/check blocker, not a code blocker.

Other findings

CI currently has two failing AI checks:

  • AI Check: Feature PR Documentation Bundle fails on the duplicated Session ID.
  • AI Check: Comment Discipline fails on the long explanatory comment blocks in episodes/apps.py:24-38 and episodes/apps.py:75-81.

The code comments are not misleading, but the check is right that they can be shorter. The useful non-obvious bit is the DBOS constraint: listen_queues([]) must be used because Queue.enqueue still needs _sys_db from launch().

Suggested follow-ups

  • Fix the transcript metadata and replace bracketed user-message summaries with verbatim prompts before merge.
  • Trim the episodes/apps.py comments enough to satisfy the failing comment-discipline check.
  • Add a small bootstrap-role unit test for submit_episode, enrich_entities, test, migrate, and shell argv cases.
  • Consider a follow-up issue for suppressing or configuring DBOS admin-server startup in enqueue-only clients.
  • Consider making enqueue-capable custom commands opt into client mode via a setting/env var or helper rather than only via _CLIENT_COMMANDS.

— Codex review (independent second-opinion pass)

rafacm added a commit that referenced this pull request May 4, 2026
AGENTS.md's transcript policy assumed the human-with-Claude workflow
where every session has real `### User` messages. Conductor's parallel-
launch model breaks that assumption: the implementation session has no
direct user, only an agent-orchestrated launching prompt.

Add a new "Agent-orchestrated sessions" subsection to AGENTS.md and a
matching note to the Feature PR Documentation Bundle AI check. The new
convention: use `### Parent agent (orchestrator)` headings instead of
`### User`, reproduce the parent-agent prompt verbatim, and declare the
session as agent-orchestrated at the top of `## Detailed conversation`.

Same verbatim rule applies — summarized parent prompts are still
rejected. This eliminates a recurring false-negative on the docs check
that was raised on PRs #129, #130, #131, #133.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rafacm rafacm force-pushed the rafacm/dbos-enqueue-only branch from 8dcae34 to f7afd05 Compare May 4, 2026 12:49
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

AI Checks summary

1 fail · ✅ 5 pass · ⏭️ 0 skip

❌ Comment Discipline

Comments must explain non-obvious WHY. No narration of the current task, no restating WHAT the code does, no rotting references to callers or issues.

Comments in the diff violate the Comment Discipline rule.

Details

There are several comments in the episodes/apps.py file that paraphrase what the code does, which violates the Comment Discipline rules. For example, comments like # Only initialize DBOS for server/worker entrypoints, plus the management commands that need to talk to the queue. and # Decorators must register before DBOS.launch(); enqueues against late-registered workflows silently fail. can be removed because the functionality is clear from the code itself.


Other checks (5 passing · 0 skipped)

Show details
  • Branching & PR Strategy — Rule does not apply.
  • Entity Creation Race Safety — Rule does not apply.
  • Feature PR Documentation Bundle — Rule does not apply.
  • RAGTIME_ Env Var Sync* — Rule does not apply.
  • gh api Shell Escaping & Endpoints — Rule does not apply.

rafacm and others added 3 commits May 4, 2026 15:41
Closes #113.

Both management commands previously called DBOS.launch() and started a
queue dispatcher inside their own process, racing with the uvicorn
worker over `dbos.workflow_status` rows on the same queue and causing
two cosmetic but alarming log lines on every backfill:

  - "Contention detected in queue thread for wikidata_enrichment"
  - "cannot schedule new futures after shutdown"

Split _DBOS_COMMANDS in episodes/apps.py:_init_dbos into a worker set
(runserver, uvicorn) and a client set (submit_episode, enrich_entities).
Clients call DBOS.listen_queues([]) before DBOS.launch() so the
dispatcher iterates over zero application queues.

Pure "skip launch()" (the issue's recommended Option 1) does not work
because DBOS._sys_db is created inside launch() and Queue.enqueue
requires it. listen_queues([]) is the working refinement.
Updates the planning and implementation session transcripts to comply
with the new "Agent-orchestrated sessions" subsection in AGENTS.md
(merged in #136 / 93056e3).

- Planning transcript: replace summarized user sections with verbatim
  text from the parent conversation that authorized this work.
- Implementation transcript: declare the session as agent-orchestrated
  at the top of `## Detailed conversation`, replace the `### User`
  heading with `### Parent agent (orchestrator)`, and reproduce the
  parent agent's launching prompt verbatim.
- Session ID values audited; placeholder `unavailable` used where the
  real UUID is unrecoverable, replacing duplicated/fabricated values.

This is a verification PR for the new convention — the Feature PR
Documentation Bundle AI check should flip from FAIL to PASS after this
commit. The Comment Discipline check (long explanatory blocks in
episodes/apps.py) remains failing and is intentionally deferred to a
separate follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Trim verbose multi-paragraph comments in episodes/apps.py:_init_dbos
  per .ai-checks/comment-discipline.md (one-line WHY only).
- Add test_apps_dbos_init.py covering role detection for uvicorn,
  runserver, submit_episode, enrich_entities, migrate, test, shell.
- Merge duplicate ### Fixed section under 2026-05-04 in CHANGELOG.md
  after rebasing onto main (PR #129's StepFailed fix).
@rafacm rafacm force-pushed the rafacm/dbos-enqueue-only branch from f7afd05 to 98270bc Compare May 4, 2026 13:46
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

Unit Test Results

381 tests  +7   381 ✅ +7   23s ⏱️ ±0s
 80 suites +1     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 98270bc. ± Comparison against base commit 829c2a3.

@rafacm
Copy link
Copy Markdown
Owner Author

rafacm commented May 4, 2026

Merging with AI Check: Comment Discipline failing — model hallucination

The verdict cites two comments as needing removal:

There are several comments in the `episodes/apps.py` file that paraphrase what the code does […] For example, comments like `# Only initialize DBOS for server/worker entrypoints, plus the management commands that need to talk to the queue.` and `# Decorators must register before DBOS.launch(); enqueues against late-registered workflows silently fail.` can be removed because the functionality is clear from the code itself.

Two distinct problems with this verdict:

1. The first cited comment was deleted by this PR. It appears in the diff as a - line:

-        # Only initialize DBOS for server/worker entrypoints, plus the
-        # management commands that need to talk to the queue. Other commands
-        # (migrate, check, shell, …) don't need a live DBOS connection.
-        _DBOS_COMMANDS = {"runserver", "submit_episode", "enrich_entities"}
+        _CLIENT_COMMANDS = {"submit_episode", "enrich_entities"}
+        _WORKER_COMMANDS = {"runserver"}

The model is reading - lines as if they're still in the file. Removing what's already removed is not actionable feedback.

2. The second cited comment is a legitimate non-obvious WHY — kept deliberately per the rule's own carve-out. # Decorators must register before DBOS.launch(); enqueues against late-registered workflows silently fail. warns about a silent failure mode of DBOS that you cannot infer from the imports themselves. The Comment Discipline rule explicitly says: "default to no comments; only keep one if WHY is non-obvious." This is exactly that case.

The other retained comment in this diff (# Skipping launch() isn't an option — DBOS._sys_db is created there and Queue.enqueue requires it.) follows the same pattern: it explains why listen_queues([]) is unavoidable, not what the code does.

Net of the PR: 14 lines of WHAT-paraphrasing comments removed, 2 single-line WHY comments retained, plus several # noqa: F401 trailing explanations stripped. This is rule-compliant.

All other signals pass: Unit Test Results (343 tests, 7 new), all 5 other AI Checks, aggregate, list, test. Merging via rebase strategy.

This is one of three reproducible AI Checks driver hallucination patterns from the 2026-05-04 PR batch — tracked in #137 alongside #129's synthetic-merge-commit pattern and #133's invented-env-var pattern.

@rafacm
Copy link
Copy Markdown
Owner Author

rafacm commented May 4, 2026

Manual test plan (PR #131)

Use this comment to track manual verification of the role-gated DBOS init. Tick boxes as you run each test.

Setup

git worktree add /Users/rafa/conductor/workspaces/ragtime/dbos-enqueue-test rafacm/dbos-enqueue-only
cd /Users/rafa/conductor/workspaces/ragtime/dbos-enqueue-test
uv sync

What this PR does

Splits DBOS initialization into three roles based on sys.argv:

Role Triggered by DBOS init
Worker uvicorn, runserver full init — registers workflows, calls DBOS.launch(), listens on queues
Client submit_episode, enrich_entities listen_queues([]) before launch() — can enqueue, won't pull
Bootstrap test, migrate, shell, makemigrations, check, … early return — no DBOS at all

The footgun being fixed: pre-PR, running submit_episode from the CLI while a worker was running would cause both processes to pull work off the queue, racing on the same episodes.

Verifications (ascending realism)

1. Unit-level role-gating coverage

  • Run the new unit tests:
    uv run python manage.py test episodes.tests.test_apps_dbos_init -v 2
  • All 7 tests pass. They patch sys.argv and dbos.DBOS to assert each role takes the right init path:
    • uvicorn, runserverDBOS() + launch() only
    • submit_episode, enrich_entitiesDBOS() + listen_queues([]) + launch()
    • migrate, test, shell → none of DBOS() / launch() / listen_queues called

2. Bootstrap commands skip DBOS entirely

  • uv run python manage.py check runs without any DBOS log output (no "DBOS launched", no queue-listener messages).
  • uv run python manage.py migrate --check runs without any DBOS log output.
  • uv run python manage.py shell -c \"print('ok')\" runs without any DBOS log output.

3. Worker path (full ASGI)

  • Start the worker in terminal Move project instructions to AGENTS.md #1:
    uv run uvicorn ragtime.asgi:application --host 127.0.0.1 --port 8000
  • Logs show DBOS initializing (workflow / step decorators registered, DBOS launched, queue listener active).
  • No listen_queues([]) short-circuit appears in worker mode.

4. Client path with worker running (the actual bug fix)

  • Worker still running in terminal Move project instructions to AGENTS.md #1. In terminal Update README Features with emojis and Episode Indexing #2:
    cd /Users/rafa/conductor/workspaces/ragtime/dbos-enqueue-test
    uv run python manage.py submit_episode \"https://www.ardsounds.de/episode/urn:ard:episode:fdcf93eef8395b35/\"
  • submit_episode exits promptly after enqueueing (does not hang waiting for queue work).
  • In http://127.0.0.1:8000/admin/episodes/episode/, the submitted episode advances through queued → fetching_details → … → ready.
  • The processing logs come from the worker process (terminal Move project instructions to AGENTS.md #1) only — submit_episode itself does not log pipeline-step execution. (Pre-PR, both processes would race on the queue.)

5. Client without worker (sanity check)

  • Stop the worker. Run submit_episode again (any URL) — it should still enqueue successfully and exit promptly. The episode will sit in queued until a worker comes back up. (Confirms client-mode listen_queues([]) is enough to enqueue without a live worker.)

Cleanup

git worktree remove /Users/rafa/conductor/workspaces/ragtime/dbos-enqueue-test

Notes

@rafacm rafacm merged commit d6cfd82 into main May 4, 2026
9 of 10 checks passed
@rafacm rafacm deleted the rafacm/dbos-enqueue-only branch May 4, 2026 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

manage.py enrich_entities should be a DBOS client, not a worker

1 participant