Skip to content

feat(metrics): collect client stats periodically, not only on scrape#113

Draft
miotte wants to merge 1 commit into
mainfrom
miotte-pr11
Draft

feat(metrics): collect client stats periodically, not only on scrape#113
miotte wants to merge 1 commit into
mainfrom
miotte-pr11

Conversation

@miotte
Copy link
Copy Markdown
Contributor

@miotte miotte commented May 31, 2026

Fixes #111

Per-client gauges were refreshed only on /metrics scrape, so a client that connected and disconnected between two scrapes was never recorded. Adds a periodic collectClientStats (15s) started from Serve so a short-lived client is captured when a tick fires, independent of scrape timing.

@miotte miotte marked this pull request as draft May 31, 2026 18:58
@miotte miotte force-pushed the miotte-pr11 branch 2 times, most recently from c416010 to f5e0d2c Compare May 31, 2026 19:30
Per-client gauges were refreshed only when /metrics was scraped, so a
client that connected and disconnected between two scrapes never showed
up at all: the on-scrape gather runs at scrape time, finds the client
already gone, and so never sets its gauge.

Add a periodic collector (collectClientStats) started from Serve that
gathers connected-client stats every 15s, independent of scrapes. A
client connected when a tick fires has its gauge written to the sink;
because the sink expires entries only after 60s (not on disconnect),
that value survives to be emitted on the next scrape even though the
client has since disconnected. This is the only path by which a client
whose whole lifetime falls between two scrapes is recorded at all.

Note this does NOT improve freshness for connected clients: the
pre-existing on-scrape gather already re-sets every connected client
immediately before the sink is collected, so scraped values are already
current and the ticker's writes for them are overwritten before any
read.

The benefit is therefore conditional on the scrape interval. It narrows
the short-lived-client miss window from the scrape interval down to
min(15s, scrape_interval): it helps when Prometheus scrapes less often
than every 15s (e.g. 30s or 60s), and is effectively a no-op when the
scrape interval is <=15s, since the on-scrape gather then captures the
same clients at equal-or-finer granularity. The 15s constant is a fixed
guess at 'shorter than a typical scrape'; it cannot see the actual
scrape interval. Minor tradeoff: because ticks keep refreshing each
gauge's updatedAt, a disconnected client's last value can linger
slightly longer before expiring than under scrape-only collection,
bounded by the 60s Expiration either way.

Collection remains intentionally incomplete: a client whose entire
lifetime falls between two ticks is still never recorded. The per-client
gauges are a best-effort, cardinality-bounded snapshot of currently-
connected clients - the ClientIdentifier label embeds a per-connection
hash, so departed clients must age out via the sink Expiration rather
than accumulate forever. Promoting these to per-client counters to catch
short-lived clients would defeat that bound and grow cardinality without
limit. Where exact throughput is needed it is already captured per
message by the aggregate counters arke_recvmsg_total / arke_sendmsg_total
in the gRPC interceptors.

Replaces the FIXME with notes describing the sampling strategy, its known
gaps, and why they are acceptable.

Fixes #111

Signed-off-by: Michael Otteni <MichaelGOtteni@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Client stats only collected on metrics scrape; short-lived clients can be missed

1 participant