Harden schedule/diff-alert webhook delivery: retry, backoff, DLQ, and delivery API #26

Description

@harrymove-ctrl

Context

Scheduled re-scrapes fire diff alerts and HMAC-signed webhooks (runSchedule/createAlertForSchedule/deliverWebhook, apps/api/src/worker.ts:3910-4024) on the */30 cron (wrangler.jsonc triggers). Delivery is fire-once: attempts is hardcoded to 1 on both success and failure (worker.ts:4010/4018), and there is no retry or backoff, no dead-letter state, and no way to list or redeliver a failed webhook. processDueSchedules also silently swallows per-schedule errors (.catch(() => undefined), worker.ts:3906) and processes at most 10 schedules per tick (LIMIT 10).
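
To illustrate the swallowed-error problem, here is a minimal sketch of a loop that keeps every per-schedule outcome instead of discarding rejections. It is not the code in worker.ts: runAllRecorded, RunOutcome, and the runOne callback are hypothetical names, and the real loop may be structured differently.

```typescript
// Outcome of one schedule run; this shape is hypothetical.
interface RunOutcome {
  scheduleId: string;
  ok: boolean;
  error?: string;
}

// Runs every schedule and reports each result, so a rejection
// becomes a recorded failure rather than being dropped.
async function runAllRecorded(
  scheduleIds: string[],
  runOne: (scheduleId: string) => Promise<void>,
): Promise<RunOutcome[]> {
  const settled = await Promise.allSettled(scheduleIds.map((id) => runOne(id)));
  return settled.map((result, i) => {
    const scheduleId = scheduleIds[i] ?? "unknown";
    if (result.status === "fulfilled") {
      return { scheduleId, ok: true };
    }
    const reason: unknown = result.reason;
    const error = reason instanceof Error ? reason.message : String(reason);
    return { scheduleId, ok: false, error };
  });
}
```

One failing schedule does not stop the others, and the caller gets an error message it can log or persist for each failure.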

Goal / user story

As a user relying on change alerts, I want webhook deliveries to retry with backoff and be inspectable/redeliverable, so a transient 500 on my endpoint doesn't permanently drop a "context changed" notification.

Acceptance criteria

  • Migration adds next_attempt_at, max_attempts (default 5), and a payload-snapshot column to contextmem_webhook_deliveries; the status CHECK is extended with 'dead'.
  • deliverWebhook increments attempts (not hardcoded 1) and, on failure (non-2xx or throw), schedules a retry with exponential backoff (e.g. 1m/5m/30m/2h/12h) until max_attempts, then marks 'dead'.
  • A sweep in the existing */30 cron (or a dedicated Queue) re-attempts rows where status IN ('queued','failed') and next_attempt_at <= now.
  • GET /api/webhook-deliveries?scheduleId= (owner-scoped) lists deliveries with status/attempts/last error; POST /api/webhook-deliveries/:id/redeliver re-queues one.
  • Per-schedule failures in processDueSchedules are recorded (the contextmem_schedule_runs status='failed' write at worker.ts:3955 is preserved) rather than silently dropped at the loop .catch (worker.ts:3906).
  • Tests: a 500 endpoint is retried then dead-lettered; a recovered endpoint succeeds on retry; redeliver re-sends with a valid x-contextmem-signature (signWebhookPayload, worker.ts:4026).
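
The retry and sweep rules in the criteria above can be sketched as two pure functions. This is a sketch under stated assumptions, not the worker's implementation: planNextAttempt, isDue, BACKOFF_MS, and the 'delivered' status name are hypothetical, and the delays are taken from the example schedule (1m/5m/30m/2h/12h). Only attempts, max_attempts, next_attempt_at, and the 'queued', 'failed', and 'dead' statuses come from this issue.

```typescript
// Example delays from the issue: 1m, 5m, 30m, 2h, 12h (in milliseconds).
const BACKOFF_MS: number[] = [60_000, 300_000, 1_800_000, 7_200_000, 43_200_000];

// 'delivered' is an assumed name for the success status.
type DeliveryStatus = "queued" | "delivered" | "failed" | "dead";

interface RetryDecision {
  status: DeliveryStatus;
  attempts: number;
  nextAttemptAt: number | null; // epoch ms, null when no retry is scheduled
}

// Decides what to persist after one delivery attempt.
function planNextAttempt(
  prevAttempts: number,
  maxAttempts: number,
  ok: boolean,
  nowMs: number,
): RetryDecision {
  const attempts = prevAttempts + 1; // increment instead of hardcoding 1
  if (ok) {
    return { status: "delivered", attempts, nextAttemptAt: null };
  }
  if (attempts >= maxAttempts) {
    return { status: "dead", attempts, nextAttemptAt: null };
  }
  // Clamp to the last delay if attempts outgrow the table.
  const index = Math.min(attempts - 1, BACKOFF_MS.length - 1);
  const delay = BACKOFF_MS[index] ?? 0;
  return { status: "failed", attempts, nextAttemptAt: nowMs + delay };
}

// Mirrors the sweep predicate: status IN ('queued','failed') AND next_attempt_at <= now.
function isDue(status: DeliveryStatus, nextAttemptAt: number | null, nowMs: number): boolean {
  return (
    (status === "queued" || status === "failed") &&
    nextAttemptAt !== null &&
    nextAttemptAt <= nowMs
  );
}
```

In this sketch, with max_attempts at 5 the fifth failure marks the row 'dead', so only the first four delays are used; the 12h step applies only if max_attempts is raised. On a */30 cron sweep the 1m and 5m delays are lower bounds, since a row is retried no sooner than the next tick.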

Implementation notes

Reuse the existing contextmem_webhook_deliveries table (migrations/0004_growth_loops.sql) and HMAC signer. Prefer the bound Queue pattern (CONTEXTMEM_EXTRACT_QUEUE) or add a dedicated contextmem-webhooks queue in wrangler.jsonc for retry fan-out; for alpha the */30 cron sweep alone is sufficient. Persist the exact request body so redelivery is byte-identical (keeps the signature valid). Keep the next_run_at scheduling math (worker.ts:3946) intact. Raising the LIMIT 10 / paginating processDueSchedules is a follow-up once schedule count grows.
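
A rough sketch of why the stored body should be sent verbatim on redelivery: re-serializing the payload can change its bytes and therefore its HMAC. The signBody function below is a hypothetical stand-in for signWebhookPayload (HMAC-SHA256 with hex output is an assumption, as is the use of node:crypto), and StoredDelivery, payload_snapshot, and buildRedeliveryRequest are illustrative names.

```typescript
import { createHmac } from "node:crypto";

// Hypothetical stand-in for signWebhookPayload; the real signer's
// algorithm and encoding may differ.
function signBody(secret: string, rawBody: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Hypothetical row shape: payload_snapshot holds the exact body
// that was sent on the first attempt.
interface StoredDelivery {
  id: string;
  payload_snapshot: string;
}

// Builds the redelivery request from the stored bytes, never from a re-serialized object.
function buildRedeliveryRequest(row: StoredDelivery, secret: string) {
  return {
    body: row.payload_snapshot,
    headers: {
      "content-type": "application/json",
      "x-contextmem-signature": signBody(secret, row.payload_snapshot),
    },
  };
}

// Same data, different key order: the bytes differ, so the signatures differ.
const originalBody = JSON.stringify({ scheduleId: "s1", changed: true });
const reorderedBody = JSON.stringify({ changed: true, scheduleId: "s1" });
const signaturesMatch = signBody("secret", originalBody) === signBody("secret", reorderedBody);
```

Here signaturesMatch is false, which is the failure a payload-snapshot column avoids: the receiver verifies the signature against the bytes it receives, so those bytes must match what was signed.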

Sui Overflow angle

Reliable, inspectable change-alerts are the visible, demo-able payoff of the freshness pipeline ("watch a page → ContextMeM re-extracts → your agent/Slack gets a signed webhook"), and a webhook can carry the on-chain attribution receipt id once receipts land — making delivery reliability part of the provenance story, not a throwaway notification.

Dependencies

Owner-scoped delivery endpoints depend on the accounts/owner-auth issue. None otherwise.

Part of the ContextMEM roadmap (#4) • Sui Overflow build.

Metadata

Assignees

No one assigned

    Labels

    P2 (Post-demo: distribution, polish, and reliability)
    feature (User- or agent-facing capability)
    platform (Backend platform plumbing: Worker, D1, queues, secrets, metering)
