Skip to content

[Critical] label_events double-counts every label action across the live and backfill paths (timestamp is part of the dedup key but is read from two different clocks) — needs a schema change #129

@dale053

Description

@dale053

Description

label_events is the append-only log used for anti-gaming label-timeline replay. It is
written from two paths that derive the dedup discriminator (timestamp) from two
different clocks
:

  • Live webhookpackages/das/src/webhook/handlers/label.handler.ts:54
    timestamp: new Date().toISOString(), // mirror *receive* time
  • GraphQL backfillpackages/das/src/webhook/github-fetcher.service.ts:1073
    timestamp: node.createdAt, // GitHub *event* time

The dedup guard uq_label_events_natural_key includes timestamp in the key
(packages/db/07_label_events.sql:20-23):

CREATE UNIQUE INDEX IF NOT EXISTS uq_label_events_natural_key
    ON label_events (repo_full_name, target_number, target_type,
                     label_name, action, timestamp)
    NULLS NOT DISTINCT;

Both paths call .orIgnore() and both comments assert idempotency. But for the same
real label action
, the live timestamp (mirror receive time) and the backfill
timestamp (GitHub createdAt) never match — they differ by the webhook delivery
latency (seconds, more on retry). Because timestamp is part of the unique key, the index
treats them as two distinct events and .orIgnore() does nothing. Result: two rows per
label action whenever a window is both live-ingested and backfilled.

This is critical and the correct fix is structural (a schema change): the table's
notion of "event identity" is wrong. No app-only tweak collapses the rows, because the one
field that distinguishes legitimate repeat actions (add → remove → re-add of the same
label) is the same field that is being sourced from two incompatible clocks.

Steps to Reproduce

  1. Install the app on a repo so the live webhook path is active. Apply a label to an
    issue/PR. A row is written with timestamp = <mirror receive time>.
  2. Trigger a backfill that covers that same window:
    POST /api/v1/admin/backfill { "repoFullName": "owner/repo" }
    (packages/das/src/api/admin.controller.ts:79).
  3. Backfill reads the LabeledEvent timeline node and writes a row with
    timestamp = node.createdAt (github-fetcher.service.ts:1073).
  4. Query the table for that (repo_full_name, target_number, label_name, action):
    two rows exist for one real action — the .orIgnore() did not collapse them
    because the timestamp column differs.
  5. Make it worse: re-run the backfill. The admin endpoint mints a fresh BullMQ
    jobId on every call — backfill-${repoFullName}-${Date.now()}
    (admin.controller.ts:91 and :143) with removeOnComplete: true — so the job is
    always re-runnable and dedup-able-by-jobId protection does not apply. Each run can add
    another live/backfill skew row for actions ingested live in between.

Expected Behavior

Each real label action is recorded exactly once, regardless of whether it arrived via
the live webhook, via backfill, or via both. Re-running backfill, or backfilling a window
that overlaps live ingestion, is a no-op for already-recorded actions.

Actual Behavior

The same label action is stored as two (or more) rows because timestamp — part of the
unique key — is taken from the mirror clock on the live path and from GitHub's clock on the
backfill path, and the two never coincide. .orIgnore() silently no-ops, and the table
grows a duplicate for every action that is seen by both paths.

Downstream impact

  • Corrupts the anti-gaming log — the raw record meant to make label timelines
    replayable now contains phantom duplicate events.
  • Pollutes the raw label_events REST API (added in feat(api): raw-data REST endpoints for comments, review-comments, label events (#98) #116) — consumers read
    double-counted history.
  • Can flip current-label / actor attribution in scoring. pr_labels_by_actor and
    issue_labels_by_actor collapse to the latest action with
    DISTINCT ON (...) ORDER BY ... le.timestamp DESC
    (packages/db/24_view_pr_labels_by_actor.sql:22,
    packages/db/25_view_issue_labels_by_actor.sql). Mixing a mirror-clock row and a
    GitHub-clock row for the same action means timestamp DESC can pick the wrong "latest"
    action/actor → wrong "is this label currently applied" and wrong attribution feeding
    miner scoring.

Environment

  • OS: N/A (server-side ingestion bug)
  • Runtime/Node version: as deployed (NestJS + TypeORM + Postgres + BullMQ)
  • Browser (if applicable): N/A
  • Affected files:
    • packages/das/src/webhook/handlers/label.handler.ts:54 (live timestamp source)
    • packages/das/src/webhook/github-fetcher.service.ts:1055-1084 (backfill timestamp source)
    • packages/db/07_label_events.sql:20-23 (unique key includes timestamp)
    • packages/db/24_view_pr_labels_by_actor.sql, 25_view_issue_labels_by_actor.sql (consumers)
    • packages/das/src/api/admin.controller.ts:91,143 (re-runnable backfill jobId)

Root Cause

timestamp is doing two incompatible jobs at once:

  1. It is the only field that distinguishes legitimate repeat actions (the same label
    added, removed, and added again must be three rows), so it cannot simply be dropped from
    the key.
  2. It is also the dedup discriminator across paths, but it is sourced from two clocks
    (new Date() live vs node.createdAt backfill) that disagree for the same event.

So the natural key has no stable, path-independent notion of event identity. The live
webhook payload does not carry the LabeledEvent/UnlabeledEvent GraphQL node id (or
its createdAt), so the two paths cannot produce a byte-identical natural key today — the
mismatch is structural, not a bug in either handler alone.

Proposed Structural Fix (schema change required)

  1. Add a stable identity column github_node_id VARCHAR NULL to label_events.
    Fetch the GraphQL id of LabeledEvent/UnlabeledEvent in the timeline query
    (github-fetcher.service.ts:738-760 and :928-951 currently fetch __typename,
    createdAt, label, actor but not id) and store it on backfill rows. This is
    GitHub's globally-unique, path-independent event id and is the authoritative identity.
  2. Repoint the dedup guard. Add a partial unique index
    UNIQUE (github_node_id) WHERE github_node_id IS NOT NULL so backfill↔backfill is
    collapsed by true identity instead of by (natural key + timestamp).
  3. Treat live rows as provisional and reconcile against backfill (authoritative). The
    live path can only write a provisional row (no node id available in the webhook). When
    backfill later writes the authoritative github_node_id row for the same
    (repo, target, target_type, label, action) within a small time window of
    node.createdAt, delete/merge the provisional live duplicate. (Mitigation, not a
    substitute: source the live timestamp from the webhook's GitHub object time as fix: use GitHub webhook updated_at for live label event timestamps #41/fix: causing out-of-order webhook corruption of current labels #81
    attempted — it narrows the skew but, since the webhook lacks the per-event
    createdAt/node id, it does not by itself make the keys match, which is why the
    structural identity column is required.)
  4. One-time dedup migration to remove existing cross-path duplicates, keeping the
    authoritative (GitHub-clock / github_node_id-bearing) row.
  5. Keep target_type semantics and the idx_label_events_target read index intact.

Acceptance Criteria

  • Applying a label live and then backfilling the same window yields exactly one row.
  • Re-running backfill any number of times adds zero new rows for already-seen actions.
  • A genuine add → remove → re-add of the same label still produces three distinct rows.
  • pr_labels_by_actor / issue_labels_by_actor return the correct current label and actor
    regardless of live/backfill ingestion order.
  • A migration removes pre-existing cross-path duplicates with no loss of distinct events.
  • The raw label_events API (feat(api): raw-data REST endpoints for comments, review-comments, label events (#98) #116) returns de-duplicated history.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions