
Decouple handler fanout/scheduling to support per-handler backoff (vs event-level backoff_until) #79

@dillonstreator

Description

Problem

Today txob tracks per-handler completion state in handler_results, but retry/backoff is event-scoped via a single events.backoff_until timestamp.

In src/processor.ts, when any handler errors, the processor computes a list of backoff candidates:

  • per-handler custom backoff via TxOBError.backoffUntil
  • plus the default backoff(event.errors)

…and then sets:

  • lockedEvent.backoff_until = max(backoffs)

Because this is a single column, one slow/rate-limited handler can delay reprocessing for all other remaining handlers (even if those other handlers could run sooner with a different backoff policy).
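Roughly, the current event-level computation looks like this (function and field shapes are assumptions based on the description above, not the actual src/processor.ts code):

```typescript
// Simplified sketch of today's event-level backoff: collect per-handler
// custom backoffs plus the default, then keep the latest (maximum) one.
type HandlerFailure = { backoffUntil?: Date };

function computeEventBackoff(
  failures: HandlerFailure[],
  defaultBackoff: Date,
): Date {
  const candidates = [
    defaultBackoff,
    ...failures
      .map((f) => f.backoffUntil)
      .filter((d): d is Date => d instanceof Date),
  ];
  // This single max is what gets written to events.backoff_until,
  // so every remaining handler waits for the slowest one.
  return new Date(Math.max(...candidates.map((d) => d.getTime())));
}
```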

We want to decouple handler processing / fanout scheduling so we can configure:

  • handler-specific backoff strategies (e.g. webhook handler vs email handler)
  • handler-specific retry counters / max errors
  • (optionally) handler-specific concurrency/priority in the future

Current behavior (references)

  • TxOBError explicitly documents that the processor uses the latest (maximum) backoff among handlers: src/error.ts.
  • processEvent() collects backoffs and sets event-level backoff_until: src/processor.ts.
  • The canonical schema is a single events table with handler_results JSONB + backoff_until TIMESTAMPTZ: README.md.

Goals / success criteria

  • Per-handler backoff without forcing unrelated handlers to wait.
  • Keep at-least-once semantics and existing handler idempotency story.
  • Preserve good query performance (indexable “due work” query).
  • Prefer additive/backwards-compatible migration paths where possible.

Design options

Option A (recommended): Separate table for handler work items

Introduce a new table that materializes handler fanout and scheduling:

  • event_handlers (or event_handler_jobs)
    • event_id (FK)
    • event_type
    • handler_name
    • status (pending|processed|unprocessable)
    • attempts (or errors)
    • backoff_until TIMESTAMPTZ NULL
    • processed_at TIMESTAMPTZ NULL
    • unprocessable_at TIMESTAMPTZ NULL
    • last_error (optional) / error_history (optional)
    • timestamps
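One possible migration sketch for the table above (column names and types are assumptions to be settled in the spike; status is derived from the `processed_at` / `unprocessable_at` timestamps rather than stored separately):

```typescript
// Hypothetical DDL for the event_handlers table sketched above.
const createEventHandlersSQL = `
  CREATE TABLE event_handlers (
    id               BIGSERIAL PRIMARY KEY,
    event_id         UUID NOT NULL REFERENCES events (id),
    event_type       TEXT NOT NULL,
    handler_name     TEXT NOT NULL,
    attempts         INT  NOT NULL DEFAULT 0,
    backoff_until    TIMESTAMPTZ,
    processed_at     TIMESTAMPTZ,  -- status: processed
    unprocessable_at TIMESTAMPTZ,  -- status: unprocessable
    last_error       TEXT,
    created_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (event_id, handler_name)
  )
`;
```

The `UNIQUE (event_id, handler_name)` constraint makes fanout materialization idempotent (`INSERT ... ON CONFLICT DO NOTHING`), which matters for the "when to materialize" open question below.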

Processing model:

  • Poll/query due handler rows:
    • processed_at IS NULL
    • unprocessable_at IS NULL
    • backoff_until IS NULL OR backoff_until < now()
    • attempts < maxAttempts(handler)
  • Lock row FOR UPDATE SKIP LOCKED and execute one handler.
  • Update only that handler row (its backoff, attempts, status).
  • Mark the parent events.processed_at when all handler rows are done (processed or unprocessable) OR when a policy says to stop.
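The polling step above could claim one due handler row like this (table and column names follow the sketch; `$1` is the handler's maxAttempts):

```typescript
// Hypothetical "claim one due handler row" query for the polling loop.
const claimDueHandlerSQL = `
  SELECT id, event_id, event_type, handler_name
  FROM event_handlers
  WHERE processed_at IS NULL
    AND unprocessable_at IS NULL
    AND (backoff_until IS NULL OR backoff_until < now())
    AND attempts < $1
  LIMIT 1
  FOR UPDATE SKIP LOCKED
`;
```

`FOR UPDATE SKIP LOCKED` lets concurrent workers claim different handler rows of the same event without blocking each other, which is exactly the per-handler concurrency Option A enables.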

Indexes:

  • (processed_at, unprocessable_at, backoff_until, attempts) with a partial index where processed_at IS NULL AND unprocessable_at IS NULL.
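A sketch of that partial index (name and exact column order are assumptions):

```typescript
// Hypothetical partial index backing the due-work query: only unfinished
// rows are indexed, keeping the index small as processed rows accumulate.
const dueWorkIndexSQL = `
  CREATE INDEX event_handlers_due_idx
    ON event_handlers (backoff_until, attempts)
    WHERE processed_at IS NULL AND unprocessable_at IS NULL
`;
```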

Pros:

  • True per-handler scheduling/backoff with an index-friendly due-work query.
  • Clean foundation for future features (priorities, per-handler concurrency, dead-lettering per handler).

Cons:

  • Requires schema changes + migration story.
  • Requires defining how/when to create handler rows (on enqueue vs on first processing attempt).

Option B: Keep single events table, add per-handler scheduling inside handler_results

Store backoff_until, attempts, etc. per handler in the JSONB.

Processing model:

  • When an event is locked, run only handlers whose JSONB indicates they are due.
  • Compute event-level “next wakeup” as the minimum next handler backoff (so the event row remains queryable by a single timestamp).
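A sketch of the per-handler JSONB shape and the "next wakeup" computation (field names hypothetical; a missing `backoff_until` means "due now"):

```typescript
// Hypothetical per-handler scheduling state stored in handler_results JSONB.
type HandlerResult = {
  processed_at?: string;  // ISO timestamp, set once the handler succeeds
  attempts: number;
  backoff_until?: string; // ISO timestamp; absent = due immediately
};

// Event-level "next wakeup" = earliest pending handler backoff, so the
// existing single events.backoff_until column stays queryable.
function nextWakeup(results: Record<string, HandlerResult>): Date | null {
  const pending = Object.values(results).filter((r) => !r.processed_at);
  if (pending.length === 0) return null; // all handlers done
  const times = pending.map((r) =>
    r.backoff_until ? new Date(r.backoff_until).getTime() : 0,
  );
  return new Date(Math.min(...times));
}
```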

Pros:

  • No extra table.

Cons:

  • Hard to query/index “events with at least one handler due” without expensive JSONB scans.
  • Hard to evolve cleanly; JSON shape becomes part of the storage contract.

Option C: Split by handler into separate events (“fanout events”)

When processing an event, create child events like UserCreated.sendWelcomeEmail.
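The fanout step might look like this (event shape and field names are hypothetical):

```typescript
// Hypothetical child-event fanout (Option C): one derived event per handler,
// linked back to the parent for correlation.
function fanoutEvents(
  parent: { id: string; type: string; data: unknown },
  handlerNames: string[],
) {
  return handlerNames.map((name) => ({
    type: `${parent.type}.${name}`, // e.g. "UserCreated.sendWelcomeEmail"
    correlation_id: parent.id,      // needed to answer "is the parent done?"
    data: parent.data,
  }));
}
```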

Pros:

  • Reuses existing event queueing/backoff.

Cons:

  • Amplifies event volume; complicates correlation/observability.
  • Harder to treat the original event as “done” only when all children are done.

Open questions

  • When to materialize handler rows?
    • On insert (requires knowing handler map at enqueue time) vs on first processing (requires deriving from handler map at runtime).
    • Potential approach: materialize on first lock of the parent event.
  • Where do handler-specific configs live?
    • API: handlerMap value could become { handler, backoff?, maxErrors?, ... }.
    • Back-compat: accept bare function as today.
  • How do event-level errors / maxErrors change?
    • With per-handler attempts, events.errors may become less meaningful; it could become “processor-level attempts” or be deprecated.
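For the handlerMap API question above, one back-compatible shape (all names hypothetical) is a union that still accepts a bare function:

```typescript
// Hypothetical handler-map entry allowing per-handler policy while
// remaining backward compatible with today's bare handler function.
type Handler = (event: unknown) => Promise<void>;

type HandlerEntry =
  | Handler
  | {
      handler: Handler;
      backoff?: (attempts: number) => Date;
      maxErrors?: number;
    };

// Normalize both forms so the processor deals with one shape internally.
function resolveEntry(entry: HandlerEntry): {
  handler: Handler;
  backoff?: (attempts: number) => Date;
  maxErrors?: number;
} {
  return typeof entry === "function" ? { handler: entry } : entry;
}
```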

Proposed next steps

  • Spike Option A with a minimal Postgres client implementation + migration SQL.
  • Decide API shape for handler-specific policy (backoff/maxErrors).
  • Add docs for migration from JSONB-only tracking.
