Skip to content

feat: measure first-seen vs duplicate-arrival lag per source pair (dup_lag)#8

Merged
McSim85 merged 4 commits into
main-qnfrom
feature/dup-lag-metric
Jun 12, 2026
Merged

feat: measure first-seen vs duplicate-arrival lag per source pair (dup_lag)#8
McSim85 merged 4 commits into
main-qnfrom
feature/dup-lag-metric

Conversation

@McSim85

@McSim85 McSim85 commented Jun 11, 2026

Copy link
Copy Markdown

Is this change upstreamable?

Summary

Phase 1 of the shred-provider freshness program — the missing number from the DoubleZero fleet analysis (open question #1): receiver_stats shows which source wins each dedup race; this measures by how much, continuously, per (winner, loser) source pair.

  • DupLagTracker: first-seen map = two time-rotated DashMap shards keyed by ahash of the raw packet bytes (same bytes the deduper hashes → winner/duplicate agree with dedup by construction; no new deps). Rotation bounds retention to [ttl, 2·ttl) with no sweeper thread — single-digit MB at the fleet's measured ≤35k pps with the 1s default TTL.
  • Lifetime-cumulative histogram per (winner_addr, winner_port, loser_addr, loser_port): 13 bands 50µs…500ms (the analysis' stated unknown range) + +Inf/count/sum.
  • Emitted each metrics interval as a Prometheus-shaped histogram: measurement shredstream_proxy-dup_lag_seconds with le tag (fractional seconds) + cumulative bucket field, plus count/sum → VM shredstream_proxy_dup_lag_seconds_bucket{le,…}/_count/_sum after Telegraf's -_ sanitization. Dash-prefix kept deliberately: the QN Telegraf provider-map and Anza namedrop both key on shredstream_proxy-*.
  • Hook placement note (differs from the original sketch): the observe pass runs before dedup_packets_and_count_discards, because Packet::data() returns None once a packet is marked discard — and the duplicates are exactly what must be observed. The tracker decides winner/duplicate from its own map; dedup behavior and forwarding are untouched.
  • Hash-based sampling (--dup-lag-sample-rate, power-of-two, default 1 = all) so both arrivals of a sampled shred are observed. Off by default (--measure-dup-lag); disabled cost = one Option check per batch.

Validation

  • cargo test --all-features: 21/21 incl. 4 new tests (pair attribution + exact bucket placement, +Inf overflow, rotation eviction, seeded-hash sampling consistency).
  • cargo clippy --all-features --all-targets --tests -- -D warnings: clean.
  • Memory math: ≤35k pps × 2×1s ≈ ≤70k entries ≈ single-digit MB; histogram keys are sparse real pairs only.

Rollout (phase-gated, after merge + release-please tag)

  1. infra-k8s-apps: fix the VM-output namepass entry shredstream_proxy-dup_lagshredstream_proxy-dup_lag_seconds (the provider-map/namedrop globs already match).
  2. role_solana: add --measure-dup-lag to the unit template (canary NRT first).
  3. Verify in VM: histogram_quantile(0.95, sum by (le, winner_provider, loser_provider) (rate(shredstream_proxy_dup_lag_seconds_bucket[$__rate_interval]))) in plausible µs–ms range; _count rate ≈ connection_metrics.duplicate rate.

Note

Medium Risk
Adds per-packet work and bounded in-memory state on the forwarder hot path when enabled; default-off limits production impact.

Overview
Adds optional duplicate-arrival lag measurement so operators can see how much earlier one shred source wins vs another, not just who wins dedup races.

When --measure-dup-lag is on, a new DupLagTracker hashes raw packet bytes, keeps a TTL-bounded two-shard first-seen map, and records lag into lifetime-cumulative histograms keyed by (winner_addr, winner_port, loser_addr, loser_port). Observations run before dedup_packets_and_count_discards so duplicate payloads are still visible via Packet::data(). Forwarding and dedup behavior are unchanged; off by default the hot path only checks metrics.dup_lag.

CLI adds --measure-dup-lag, --dup-lag-ttl-ms (default 1s), and --dup-lag-sample-rate (power-of-two, hash-based 1-in-N). Each metrics interval emits Prometheus-style shredstream_proxy-dup_lag_seconds datapoints (cumulative buckets, count, sum) with winner/loser device tags; dup-lag histograms are not cleared on ShredMetrics::reset(). Seven unit tests cover pair attribution, overflow, rotation/eviction, idle gaps, and sampling.

Reviewed by Cursor Bugbot for commit d9e75a6. Configure here.

…p_lag)

Implements the missing latency dimension of the shred-provider value
analysis ("DoubleZero Multicast Value", open question #1): receiver_stats
already shows WHICH source wins each dedup race; this measures BY HOW MUCH,
continuously, per (winner, loser) source pair.

Design (proxy side; provider naming happens in the QN Telegraf pipeline):
- DupLagTracker: first-seen map as two time-rotated DashMap shards keyed by
  ahash of the raw packet bytes (the same bytes the deduper hashes, so
  winner/duplicate agree with dedup by construction; ahash is already a
  dependency). Value = (Instant, source addr, listen port). Rotation drops
  the older shard, bounding retention to [ttl, 2*ttl) with no sweeper
  thread; at the fleet's measured <=35k pps and the 1s default TTL that is
  single-digit MB.
- Lag histogram: lifetime-cumulative atomics per (winner_addr, winner_port,
  loser_addr, loser_port): 13 bands 50us..500ms + overflow + count +
  sum_micros. Bounds follow the fleet analysis' stated unknown range
  ("could be 50us, could be 500ms").
- Emission each metrics interval as Prometheus-histogram shape:
  measurement shredstream_proxy-dup_lag_seconds with le tag (fractional
  seconds, "+Inf") and cumulative `bucket` field, plus count/sum (seconds)
  fields -> VM series shredstream_proxy_dup_lag_seconds_bucket/_count/_sum
  after Telegraf's "-"->"_" sanitization. Cumulative counters (never reset;
  excluded from reset()) so rate()/histogram_quantile() aggregate across
  hosts and time.
- Hook runs BEFORE dedup_packets_and_count_discards: Packet::data()
  returns None once a packet is marked discard, and the duplicates are
  exactly what must be observed. The tracker decides winner/duplicate from
  its own map, so dedup behavior and forwarding are untouched.
- Hash-based sampling (--dup-lag-sample-rate, power of two, default 1):
  hash & mask selection keeps both arrivals of a sampled shred observed,
  which random sampling would break.
- Off by default (--measure-dup-lag); disabled cost is one Option check
  per batch.

New args (CommonArgs): --measure-dup-lag, --dup-lag-ttl-ms (default 1000),
--dup-lag-sample-rate (validated power-of-two at startup).

Tests: winner/loser pair attribution + exact bucket placement (2.4ms ->
(1ms, 2.5ms] band), +Inf overflow, rotation eviction (expired first-seen
records nothing), and seeded-hash sampling consistency (sampled shreds
record exactly once, unsampled never).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread proxy/src/forwarder.rs
Comment thread proxy/src/forwarder.rs
Bugbot review follow-ups:

1. Idle-gap staleness (real): maybe_rotate performed one shard clear per
   call while advancing the deadline by the whole elapsed gap, so after a
   pause > 2*ttl the untouched shard kept entries far older than the
   documented [ttl, 2*ttl) window; a late re-arrival could then record a
   multi-second "lag" into sum/+Inf. The rotation now also clears the
   current shard when the gap overshot a full extra period. Regression
   test: test_dup_lag_idle_gap_drops_both_shards.

2. Concurrent first-seen TOCTOU (the real bug behind the "same batch
   inverts winner" report): two threads observing the same shred could
   both miss the lookup and double-insert, dropping the race record and
   crowning the later arrival. The current-shard check-or-insert now uses
   DashMap's atomic entry(); the older shard only ever loses entries
   (inserts target cur), so its plain read stays race-free. The literal
   in-batch claim does not apply: recvmmsg batches preserve kernel arrival
   order, so the first-iterated packet is the true first arrival.

cargo test 22/22; clippy --all-targets -D warnings clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@McSim85

McSim85 commented Jun 11, 2026

Copy link
Copy Markdown
Author

bugbot run

Comment thread proxy/src/forwarder.rs
Bugbot round 2: with --dup-lag-sample-rate > 1, observe() returned before
maybe_rotate for unsampled packets, making rotation cadence depend on
sampled arrivals. The reported consequence (multi-second stale lag) was
already prevented by the round-1 overshoot fix — the call that could match
a stale entry rotates before its lookup, and a > ttl overshoot clears both
shards — but the retention invariant should not be conditional on the
sampling rate. maybe_rotate now runs before the sample check (one relaxed
atomic load per packet when not due).

New test: test_dup_lag_rotation_runs_on_unsampled_traffic (an unsampled
arrival after a 5-period gap clears both shards). cargo test 23/23;
clippy -D warnings clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@McSim85

McSim85 commented Jun 11, 2026

Copy link
Copy Markdown
Author

bugbot run

Comment thread proxy/src/forwarder.rs
Bugbot round 3 (valid): a stale `cur` read straddling a rotation could
insert the same payload hash into both shards — one observer targets the
just-demoted shard, another the new one — forking first-seen. The atomic
entry() from round 1 only serializes same-shard inserts. Practical impact
was small (the racers are near-simultaneous arrivals whose true order is
ambiguous, and the lag error is bounded by their inter-arrival gap), but
the invariant is now structural instead of probabilistic:

- rotation_lock: RwLock<()>. Observers take a read pin for the
  lookup+insert of sampled packets (~tens of ns uncontended); the CAS
  winner takes write for the flip+clear, once per ttl. A flip can no
  longer interleave with a pinned lookup+insert, so exactly one shard can
  own a given hash.

No nesting (maybe_rotate's write always precedes the read pin on a thread;
record_lag touches only the separate hist map) — no deadlock cycle.
cargo test 23/23; clippy -D warnings clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@McSim85

McSim85 commented Jun 11, 2026

Copy link
Copy Markdown
Author

bugbot run

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit d9e75a6. Configure here.

@McSim85

McSim85 commented Jun 11, 2026

Copy link
Copy Markdown
Author

bugbot run

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit d9e75a6. Configure here.

@McSim85

McSim85 commented Jun 12, 2026

Copy link
Copy Markdown
Author

bugbot run verbose=true

@cursor

cursor Bot commented Jun 12, 2026

Copy link
Copy Markdown

Bugbot request id: serverGenReqId_9a088428-af80-4675-a900-cc726991cff8

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit d9e75a6. Configure here.

@McSim85 McSim85 requested a review from cp0k June 12, 2026 13:06
@McSim85 McSim85 requested a review from cp0k June 12, 2026 13:09
@McSim85 McSim85 merged commit eda6120 into main-qn Jun 12, 2026
3 checks passed
@McSim85 McSim85 deleted the feature/dup-lag-metric branch June 12, 2026 13:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants