You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[Critical] label_events double-counts every label action across the live and backfill paths (timestamp is part of the dedup key but is read from two different clocks) — needs a schema change #129
label_events is the append-only log used for anti-gaming label-timeline replay. It is
written from two paths that derive the dedup discriminator (timestamp) from two
different clocks:
Live webhook — packages/das/src/webhook/handlers/label.handler.ts:54
timestamp: newDate().toISOString(),// mirror *receive* time
The dedup guard uq_label_events_natural_key includes timestamp in the key
(packages/db/07_label_events.sql:20-23):
CREATEUNIQUE INDEXIF NOT EXISTS uq_label_events_natural_key
ON label_events (repo_full_name, target_number, target_type,
label_name, action, timestamp)
NULLS NOT DISTINCT;
Both paths call .orIgnore() and both comments assert idempotency. But for the same
real label action, the live timestamp (mirror receive time) and the backfill timestamp (GitHub createdAt) never match — they differ by the webhook delivery
latency (seconds, more on retry). Because timestamp is part of the unique key, the index
treats them as two distinct events and .orIgnore() does nothing. Result: two rows per
label action whenever a window is both live-ingested and backfilled.
This is critical and the correct fix is structural (a schema change): the table's
notion of "event identity" is wrong. No app-only tweak collapses the rows, because the one
field that distinguishes legitimate repeat actions (add → remove → re-add of the same
label) is the same field that is being sourced from two incompatible clocks.
Steps to Reproduce
Install the app on a repo so the live webhook path is active. Apply a label to an
issue/PR. A row is written with timestamp = <mirror receive time>.
Trigger a backfill that covers that same window: POST /api/v1/admin/backfill { "repoFullName": "owner/repo" }
(packages/das/src/api/admin.controller.ts:79).
Backfill reads the LabeledEvent timeline node and writes a row with timestamp = node.createdAt (github-fetcher.service.ts:1073).
Query the table for that (repo_full_name, target_number, label_name, action): two rows exist for one real action — the .orIgnore() did not collapse them
because the timestamp column differs.
Make it worse: re-run the backfill. The admin endpoint mints a fresh BullMQ jobId on every call — backfill-${repoFullName}-${Date.now()}
(admin.controller.ts:91 and :143) with removeOnComplete: true — so the job is
always re-runnable and dedup-able-by-jobId protection does not apply. Each run can add
another live/backfill skew row for actions ingested live in between.
Expected Behavior
Each real label action is recorded exactly once, regardless of whether it arrived via
the live webhook, via backfill, or via both. Re-running backfill, or backfilling a window
that overlaps live ingestion, is a no-op for already-recorded actions.
Actual Behavior
The same label action is stored as two (or more) rows because timestamp — part of the
unique key — is taken from the mirror clock on the live path and from GitHub's clock on the
backfill path, and the two never coincide. .orIgnore() silently no-ops, and the table
grows a duplicate for every action that is seen by both paths.
Downstream impact
Corrupts the anti-gaming log — the raw record meant to make label timelines
replayable now contains phantom duplicate events.
Can flip current-label / actor attribution in scoring.pr_labels_by_actor and issue_labels_by_actor collapse to the latest action with DISTINCT ON (...) ORDER BY ... le.timestamp DESC
(packages/db/24_view_pr_labels_by_actor.sql:22, packages/db/25_view_issue_labels_by_actor.sql). Mixing a mirror-clock row and a
GitHub-clock row for the same action means timestamp DESC can pick the wrong "latest"
action/actor → wrong "is this label currently applied" and wrong attribution feeding
miner scoring.
It is the only field that distinguishes legitimate repeat actions (the same label
added, removed, and added again must be three rows), so it cannot simply be dropped from
the key.
It is also the dedup discriminator across paths, but it is sourced from two clocks
(new Date() live vs node.createdAt backfill) that disagree for the same event.
So the natural key has no stable, path-independent notion of event identity. The live
webhook payload does not carry the LabeledEvent/UnlabeledEvent GraphQL node id (or
its createdAt), so the two paths cannot produce a byte-identical natural key today — the
mismatch is structural, not a bug in either handler alone.
Proposed Structural Fix (schema change required)
Add a stable identity columngithub_node_id VARCHAR NULL to label_events.
Fetch the GraphQL id of LabeledEvent/UnlabeledEvent in the timeline query
(github-fetcher.service.ts:738-760 and :928-951 currently fetch __typename, createdAt, label, actor but notid) and store it on backfill rows. This is
GitHub's globally-unique, path-independent event id and is the authoritative identity.
Repoint the dedup guard. Add a partial unique index UNIQUE (github_node_id) WHERE github_node_id IS NOT NULL so backfill↔backfill is
collapsed by true identity instead of by (natural key + timestamp).
Treat live rows as provisional and reconcile against backfill (authoritative). The
live path can only write a provisional row (no node id available in the webhook). When
backfill later writes the authoritative github_node_id row for the same (repo, target, target_type, label, action) within a small time window of node.createdAt, delete/merge the provisional live duplicate. (Mitigation, not a
substitute: source the live timestamp from the webhook's GitHub object time as fix: use GitHub webhook updated_at for live label event timestamps #41/fix: causing out-of-order webhook corruption of current labels #81
attempted — it narrows the skew but, since the webhook lacks the per-event createdAt/node id, it does not by itself make the keys match, which is why the
structural identity column is required.)
One-time dedup migration to remove existing cross-path duplicates, keeping the
authoritative (GitHub-clock / github_node_id-bearing) row.
Keep target_type semantics and the idx_label_events_target read index intact.
Acceptance Criteria
Applying a label live and then backfilling the same window yields exactly one row.
Re-running backfill any number of times adds zero new rows for already-seen actions.
A genuine add → remove → re-add of the same label still produces three distinct rows.
pr_labels_by_actor / issue_labels_by_actor return the correct current label and actor
regardless of live/backfill ingestion order.
A migration removes pre-existing cross-path duplicates with no loss of distinct events.
Description
label_eventsis the append-only log used for anti-gaming label-timeline replay. It iswritten from two paths that derive the dedup discriminator (
timestamp) from twodifferent clocks:
packages/das/src/webhook/handlers/label.handler.ts:54packages/das/src/webhook/github-fetcher.service.ts:1073The dedup guard
uq_label_events_natural_keyincludestimestampin the key(
packages/db/07_label_events.sql:20-23):Both paths call
.orIgnore()and both comments assert idempotency. But for the samereal label action, the live
timestamp(mirror receive time) and the backfilltimestamp(GitHubcreatedAt) never match — they differ by the webhook deliverylatency (seconds, more on retry). Because
timestampis part of the unique key, the indextreats them as two distinct events and
.orIgnore()does nothing. Result: two rows perlabel action whenever a window is both live-ingested and backfilled.
This is critical and the correct fix is structural (a schema change): the table's
notion of "event identity" is wrong. No app-only tweak collapses the rows, because the one
field that distinguishes legitimate repeat actions (add → remove → re-add of the same
label) is the same field that is being sourced from two incompatible clocks.
Steps to Reproduce
issue/PR. A row is written with
timestamp = <mirror receive time>.POST /api/v1/admin/backfill { "repoFullName": "owner/repo" }(
packages/das/src/api/admin.controller.ts:79).LabeledEventtimeline node and writes a row withtimestamp = node.createdAt(github-fetcher.service.ts:1073).(repo_full_name, target_number, label_name, action):two rows exist for one real action — the
.orIgnore()did not collapse thembecause the
timestampcolumn differs.jobIdon every call —backfill-${repoFullName}-${Date.now()}(
admin.controller.ts:91and:143) withremoveOnComplete: true— so the job isalways re-runnable and dedup-able-by-jobId protection does not apply. Each run can add
another live/backfill skew row for actions ingested live in between.
Expected Behavior
Each real label action is recorded exactly once, regardless of whether it arrived via
the live webhook, via backfill, or via both. Re-running backfill, or backfilling a window
that overlaps live ingestion, is a no-op for already-recorded actions.
Actual Behavior
The same label action is stored as two (or more) rows because
timestamp— part of theunique key — is taken from the mirror clock on the live path and from GitHub's clock on the
backfill path, and the two never coincide.
.orIgnore()silently no-ops, and the tablegrows a duplicate for every action that is seen by both paths.
Downstream impact
replayable now contains phantom duplicate events.
label_eventsREST API (added in feat(api): raw-data REST endpoints for comments, review-comments, label events (#98) #116) — consumers readdouble-counted history.
pr_labels_by_actorandissue_labels_by_actorcollapse to the latest action withDISTINCT ON (...) ORDER BY ... le.timestamp DESC(
packages/db/24_view_pr_labels_by_actor.sql:22,packages/db/25_view_issue_labels_by_actor.sql). Mixing a mirror-clock row and aGitHub-clock row for the same action means
timestamp DESCcan pick the wrong "latest"action/actor → wrong "is this label currently applied" and wrong attribution feeding
miner scoring.
Environment
packages/das/src/webhook/handlers/label.handler.ts:54(live timestamp source)packages/das/src/webhook/github-fetcher.service.ts:1055-1084(backfill timestamp source)packages/db/07_label_events.sql:20-23(unique key includestimestamp)packages/db/24_view_pr_labels_by_actor.sql,25_view_issue_labels_by_actor.sql(consumers)packages/das/src/api/admin.controller.ts:91,143(re-runnable backfill jobId)Root Cause
timestampis doing two incompatible jobs at once:added, removed, and added again must be three rows), so it cannot simply be dropped from
the key.
(
new Date()live vsnode.createdAtbackfill) that disagree for the same event.So the natural key has no stable, path-independent notion of event identity. The live
webhook payload does not carry the
LabeledEvent/UnlabeledEventGraphQL node id (orits
createdAt), so the two paths cannot produce a byte-identical natural key today — themismatch is structural, not a bug in either handler alone.
Proposed Structural Fix (schema change required)
github_node_id VARCHAR NULLtolabel_events.Fetch the GraphQL
idofLabeledEvent/UnlabeledEventin the timeline query(
github-fetcher.service.ts:738-760and:928-951currently fetch__typename,createdAt,label,actorbut notid) and store it on backfill rows. This isGitHub's globally-unique, path-independent event id and is the authoritative identity.
UNIQUE (github_node_id) WHERE github_node_id IS NOT NULLso backfill↔backfill iscollapsed by true identity instead of by
(natural key + timestamp).live path can only write a provisional row (no node id available in the webhook). When
backfill later writes the authoritative
github_node_idrow for the same(repo, target, target_type, label, action)within a small time window ofnode.createdAt, delete/merge the provisional live duplicate. (Mitigation, not asubstitute: source the live timestamp from the webhook's GitHub object time as fix: use GitHub webhook updated_at for live label event timestamps #41/fix: causing out-of-order webhook corruption of current labels #81
attempted — it narrows the skew but, since the webhook lacks the per-event
createdAt/node id, it does not by itself make the keys match, which is why thestructural identity column is required.)
authoritative (GitHub-clock /
github_node_id-bearing) row.target_typesemantics and theidx_label_events_targetread index intact.Acceptance Criteria
pr_labels_by_actor/issue_labels_by_actorreturn the correct current label and actorregardless of live/backfill ingestion order.
label_eventsAPI (feat(api): raw-data REST endpoints for comments, review-comments, label events (#98) #116) returns de-duplicated history.