feat: measure first-seen vs duplicate-arrival lag per source pair (dup_lag)#8
Conversation
…p_lag)
Implements the missing latency dimension of the shred-provider value
analysis ("DoubleZero Multicast Value", open question #1): receiver_stats
already shows WHICH source wins each dedup race; this measures BY HOW MUCH,
continuously, per (winner, loser) source pair.
Design (proxy side; provider naming happens in the QN Telegraf pipeline):
- DupLagTracker: first-seen map as two time-rotated DashMap shards keyed by
ahash of the raw packet bytes (the same bytes the deduper hashes, so
winner/duplicate agree with dedup by construction; ahash is already a
dependency). Value = (Instant, source addr, listen port). Rotation drops
the older shard, bounding retention to [ttl, 2*ttl) with no sweeper
thread; at the fleet's measured <=35k pps and the 1s default TTL that is
single-digit MB.
- Lag histogram: lifetime-cumulative atomics per (winner_addr, winner_port,
loser_addr, loser_port): 13 bands 50us..500ms + overflow + count +
sum_micros. Bounds follow the fleet analysis' stated unknown range
("could be 50us, could be 500ms").
- Emission each metrics interval as Prometheus-histogram shape:
measurement shredstream_proxy-dup_lag_seconds with le tag (fractional
seconds, "+Inf") and cumulative `bucket` field, plus count/sum (seconds)
fields -> VM series shredstream_proxy_dup_lag_seconds_bucket/_count/_sum
after Telegraf's "-"->"_" sanitization. Cumulative counters (never reset;
excluded from reset()) so rate()/histogram_quantile() aggregate across
hosts and time.
- Hook runs BEFORE dedup_packets_and_count_discards: Packet::data()
returns None once a packet is marked discard, and the duplicates are
exactly what must be observed. The tracker decides winner/duplicate from
its own map, so dedup behavior and forwarding are untouched.
- Hash-based sampling (--dup-lag-sample-rate, power of two, default 1):
hash & mask selection keeps both arrivals of a sampled shred observed,
which random sampling would break.
- Off by default (--measure-dup-lag); disabled cost is one Option check
per batch.
New args (CommonArgs): --measure-dup-lag, --dup-lag-ttl-ms (default 1000),
--dup-lag-sample-rate (validated power-of-two at startup).
Tests: winner/loser pair attribution + exact bucket placement (2.4ms ->
(1ms, 2.5ms] band), +Inf overflow, rotation eviction (expired first-seen
records nothing), and seeded-hash sampling consistency (sampled shreds
record exactly once, unsampled never).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Bugbot review follow-ups: 1. Idle-gap staleness (real): maybe_rotate performed one shard clear per call while advancing the deadline by the whole elapsed gap, so after a pause > 2*ttl the untouched shard kept entries far older than the documented [ttl, 2*ttl) window; a late re-arrival could then record a multi-second "lag" into sum/+Inf. The rotation now also clears the current shard when the gap overshot a full extra period. Regression test: test_dup_lag_idle_gap_drops_both_shards. 2. Concurrent first-seen TOCTOU (the real bug behind the "same batch inverts winner" report): two threads observing the same shred could both miss the lookup and double-insert, dropping the race record and crowning the later arrival. The current-shard check-or-insert now uses DashMap's atomic entry(); the older shard only ever loses entries (inserts target cur), so its plain read stays race-free. The literal in-batch claim does not apply: recvmmsg batches preserve kernel arrival order, so the first-iterated packet is the true first arrival. cargo test 22/22; clippy --all-targets -D warnings clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
bugbot run |
Bugbot round 2: with --dup-lag-sample-rate > 1, observe() returned before maybe_rotate for unsampled packets, making rotation cadence depend on sampled arrivals. The reported consequence (multi-second stale lag) was already prevented by the round-1 overshoot fix — the call that could match a stale entry rotates before its lookup, and a > ttl overshoot clears both shards — but the retention invariant should not be conditional on the sampling rate. maybe_rotate now runs before the sample check (one relaxed atomic load per packet when not due). New test: test_dup_lag_rotation_runs_on_unsampled_traffic (an unsampled arrival after a 5-period gap clears both shards). cargo test 23/23; clippy -D warnings clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
bugbot run |
Bugbot round 3 (valid): a stale `cur` read straddling a rotation could insert the same payload hash into both shards — one observer targets the just-demoted shard, another the new one — forking first-seen. The atomic entry() from round 1 only serializes same-shard inserts. Practical impact was small (the racers are near-simultaneous arrivals whose true order is ambiguous, and the lag error is bounded by their inter-arrival gap), but the invariant is now structural instead of probabilistic: - rotation_lock: RwLock<()>. Observers take a read pin for the lookup+insert of sampled packets (~tens of ns uncontended); the CAS winner takes write for the flip+clear, once per ttl. A flip can no longer interleave with a pinned lookup+insert, so exactly one shard can own a given hash. No nesting (maybe_rotate's write always precedes the read pin on a thread; record_lag touches only the separate hist map) — no deadlock cycle. cargo test 23/23; clippy -D warnings clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
bugbot run |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit d9e75a6. Configure here.
|
bugbot run |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit d9e75a6. Configure here.
|
bugbot run verbose=true |
|
Bugbot request id: serverGenReqId_9a088428-af80-4675-a900-cc726991cff8 |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit d9e75a6. Configure here.
Is this change upstreamable?
main-qn, cherry-pick onto a branch offmasterfor an upstream PR.Summary
Phase 1 of the shred-provider freshness program — the missing number from the DoubleZero fleet analysis (open question #1):
receiver_statsshows which source wins each dedup race; this measures by how much, continuously, per (winner, loser) source pair.DupLagTracker: first-seen map = two time-rotatedDashMapshards keyed byahashof the raw packet bytes (same bytes the deduper hashes → winner/duplicate agree with dedup by construction; no new deps). Rotation bounds retention to[ttl, 2·ttl)with no sweeper thread — single-digit MB at the fleet's measured ≤35k pps with the 1s default TTL.(winner_addr, winner_port, loser_addr, loser_port): 13 bands 50µs…500ms (the analysis' stated unknown range) ++Inf/count/sum.shredstream_proxy-dup_lag_secondswithletag (fractional seconds) + cumulativebucketfield, pluscount/sum→ VMshredstream_proxy_dup_lag_seconds_bucket{le,…}/_count/_sumafter Telegraf's-→_sanitization. Dash-prefix kept deliberately: the QN Telegraf provider-map and Anzanamedropboth key onshredstream_proxy-*.dedup_packets_and_count_discards, becausePacket::data()returnsNoneonce a packet is marked discard — and the duplicates are exactly what must be observed. The tracker decides winner/duplicate from its own map; dedup behavior and forwarding are untouched.--dup-lag-sample-rate, power-of-two, default 1 = all) so both arrivals of a sampled shred are observed. Off by default (--measure-dup-lag); disabled cost = oneOptioncheck per batch.Validation
cargo test --all-features: 21/21 incl. 4 new tests (pair attribution + exact bucket placement, +Inf overflow, rotation eviction, seeded-hash sampling consistency).cargo clippy --all-features --all-targets --tests -- -D warnings: clean.Rollout (phase-gated, after merge + release-please tag)
namepassentryshredstream_proxy-dup_lag→shredstream_proxy-dup_lag_seconds(the provider-map/namedrop globs already match).--measure-dup-lagto the unit template (canary NRT first).histogram_quantile(0.95, sum by (le, winner_provider, loser_provider) (rate(shredstream_proxy_dup_lag_seconds_bucket[$__rate_interval])))in plausible µs–ms range;_countrate ≈connection_metrics.duplicaterate.Note
Medium Risk
Adds per-packet work and bounded in-memory state on the forwarder hot path when enabled; default-off limits production impact.
Overview
Adds optional duplicate-arrival lag measurement so operators can see how much earlier one shred source wins vs another, not just who wins dedup races.
When
--measure-dup-lagis on, a newDupLagTrackerhashes raw packet bytes, keeps a TTL-bounded two-shard first-seen map, and records lag into lifetime-cumulative histograms keyed by(winner_addr, winner_port, loser_addr, loser_port). Observations run beforededup_packets_and_count_discardsso duplicate payloads are still visible viaPacket::data(). Forwarding and dedup behavior are unchanged; off by default the hot path only checksmetrics.dup_lag.CLI adds
--measure-dup-lag,--dup-lag-ttl-ms(default 1s), and--dup-lag-sample-rate(power-of-two, hash-based 1-in-N). Each metrics interval emits Prometheus-styleshredstream_proxy-dup_lag_secondsdatapoints (cumulative buckets, count, sum) with winner/loser device tags; dup-lag histograms are not cleared onShredMetrics::reset(). Seven unit tests cover pair attribution, overflow, rotation/eviction, idle gaps, and sampling.Reviewed by Cursor Bugbot for commit d9e75a6. Configure here.