Netis/TokenScope

TokenScope

Agent observability from the network wire. A passive analyzer that watches LLM traffic on the wire and reconstructs what your agents are actually doing — tool calls, multi-step plans, where time is spent, where loops happen, who calls whom — without an SDK, sidecar, or proxy in the request path.

Agent turn detail — a 247-call agent run, every tool call ordered on the Timeline, drilling into one call's request/response on the right

What it does

Most agent code looks fine on paper and falls apart in production: a tool call stalls, the planner loops between two states, a downstream service silently substitutes a different model. TokenScope reconstructs that behavior from the bytes on the wire — packet capture → HTTP / SSE parse → wire-API decode → semantic extraction → agent-turn assembly — and serves the result through a console that's organized around turns and sessions, not raw HTTP calls.
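To make the "HTTP / SSE parse" stage concrete, here is a minimal sketch of pulling the `data:` payloads out of an already reassembled and decoded `text/event-stream` body. It follows the general server-sent events framing rules and is not taken from TokenScope's parser; the function name `iter_sse_data` is hypothetical.

```python
import re
from typing import Iterator


def iter_sse_data(body: str) -> Iterator[str]:
    """Yield the data payload of each server-sent event in a decoded body."""
    data_lines = []
    # SSE lines end with CRLF, LF, or CR.
    for line in re.split(r"\r\n|\n|\r", body):
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # One leading space after the colon is not part of the value.
            data_lines.append(value[1:] if value.startswith(" ") else value)
    # Assumption: be lenient and flush a trailing event that lacks a final
    # blank line, which is useful when a capture is cut off mid-stream.
    if data_lines:
        yield "\n".join(data_lines)
```

Each yielded string would then go to the wire-API decoder, for example as one JSON chunk of a streamed completion.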

It reads post-TLS traffic — on the inference host, behind a TLS terminator, or fed in from a SPAN/TAP point via cloud-probe. Multi-call agent interactions (planner → tool → planner → tool …) stitch into a single addressable turn. Multi-leg proxy hops (litellm in front of vLLM/SGLang/haproxy) fold automatically. The pipeline never sits in the request path, so the observer can fail without breaking the calls being observed.

NIC / .pcap file / cloud-probe (ZMQ)
        │
        ▼
   capture → flow dispatcher (hash by 5-tuple)
        │
        ▼
   N parallel workers: HTTP/SSE parse → wire-API detection → semantic extraction
        │
        ▼
   turn tracker  +  metrics aggregator  +  storage sink
        │
        ▼
       DuckDB ─── REST API ─── React console (localhost:3000)

Same connection's packets always land on the same worker, so parsing state is local and lock-free. Multiple independent pipelines can run side-by-side — e.g., low-latency local capture isolated from bursty cloud-probe ingress.
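The dispatch rule can be sketched as a deterministic hash over the 5-tuple. The snippet below is an illustration only: the `FiveTuple` type, the `worker_for` function, the CRC32 hash, and the endpoint ordering step are all assumptions, chosen so that both directions of one connection map to the same worker.

```python
import zlib
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: int  # IP protocol number, 6 = TCP


def worker_for(flow: FiveTuple, n_workers: int) -> int:
    """Map a packet's 5-tuple to a worker index, the same for both directions."""
    a = (flow.src_ip, flow.src_port)
    b = (flow.dst_ip, flow.dst_port)
    # Assumption: sort the two endpoints so A->B and B->A build the same key.
    lo, hi = sorted((a, b))
    key = f"{lo[0]}:{lo[1]}|{hi[0]}:{hi[1]}|{flow.proto}".encode()
    # CRC32 is stable across processes, unlike Python's salted str hash.
    return zlib.crc32(key) % n_workers
```

Because the mapping depends only on the connection's endpoints, a worker never needs to share per-connection parsing state with another worker.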

Why not an SDK / proxy / OpenTelemetry?

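| Approach                  | In request path | Needs client cooperation | Sees full bodies | Reconstructs agent turns |
|---------------------------|-----------------|--------------------------|------------------|--------------------------|
| SDK instrumentation       | yes             | every client must        | yes              | every client must emit   |
| Reverse proxy (LiteLLM …) | yes             | clients point at it      | yes              | per-call only            |
| OpenTelemetry from server | yes             | server must emit         | partial          | if the server tags it    |
| TokenScope                | no              | no                       | yes¹             | yes                      |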

¹ TLS-terminated traffic only — TokenScope sees plaintext HTTP. Install it where the traffic is already decrypted: on the inference host, behind the TLS terminator, or fed by cloud-probe from a SPAN/TAP point.

The trade-off is explicit: you give up cross-cluster client tracing. In return you get a single passive evidence chain that cannot break a call when the observer fails, requires zero cooperation from the workloads being observed, and assembles the agent narrative for you instead of leaving you to join calls into turns in your data warehouse.

What's in the box

Ingress

  • libpcap on a live interface
  • Replay from .pcap files (any speed)
  • ZMQ from cloud-probe for hosts you can't install on directly

Agent-turn reconstruction with named profiles for Claude CLI (Claude Code) and OpenAI Codex CLI, a generic profile for everything else, plus an experimental OpenClaw profile. Turns stitch multi-call agent interactions (tool call → tool result → planner → next tool, repeat) into a single addressable unit. The hero screenshot above is one such turn — 247 calls, ordered on the Timeline, drillable into the request/response of any single call.
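One way to picture turn stitching is the sketch below. It is a simplified assumption, not TokenScope's turn tracker: it supposes each call already carries a session key, and it keeps a turn open while the model's response asks for a tool. The `Call` and `Turn` types and the `assemble_turns` function are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Stop markers meaning "the model asked for a tool, so more calls follow".
# "tool_calls" is the Chat Completions finish_reason; "tool_use" is the
# Anthropic Messages stop_reason.
TOOL_STOPS = {"tool_calls", "tool_use"}


@dataclass
class Call:
    session: str      # hypothetical session key
    ts: float         # request start time in seconds
    stop_reason: str  # finish/stop reason taken from the response


@dataclass
class Turn:
    session: str
    calls: List[Call] = field(default_factory=list)


def assemble_turns(calls: List[Call]) -> List[Turn]:
    """Group calls into turns, in order of each turn's first call."""
    turns: List[Turn] = []
    open_turns: Dict[str, Turn] = {}
    for call in sorted(calls, key=lambda c: c.ts):
        turn = open_turns.get(call.session)
        if turn is None:
            turn = Turn(session=call.session)
            turns.append(turn)
            open_turns[call.session] = turn
        turn.calls.append(call)
        if call.stop_reason not in TOOL_STOPS:
            # Final answer: close the turn; the next call opens a new one.
            del open_turns[call.session]
    return turns
```

A real tracker also has to handle interleaved sub-agents, retries, and calls that never complete, which this sketch ignores.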

Agent Turns list — sorted by call count to surface the most complex agent runs first

Service topology: see the agent's call graph, not just the calls. The Services page's Path view shows your inference fleet as a directed graph (clients → litellm proxies → vLLM / SGLang backends), with edge thickness scaled by turn count. Proxy hops paired by the passive sweeper render as solid edges; heuristically inferred hops (when the inbound client_ip matches a known service) render as dashed edges; anonymous client traffic renders as dotted edges. The classifier names what each endpoint actually serves (vLLM, SGLang, Ollama, llama.cpp, LiteLLM) from the bytes on the wire, not from operator-supplied configuration.
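The three edge styles amount to a small decision rule, sketched here. The function and parameter names are hypothetical, and the precedence (a sweeper pairing wins over the client_ip heuristic) is an assumption.

```python
from typing import Set


def edge_style(paired_by_sweeper: bool, client_ip: str,
               known_service_ips: Set[str]) -> str:
    """Pick the rendering style for one hop in the Path view."""
    if paired_by_sweeper:
        return "solid"   # proxy hop confirmed by pairing both legs
    if client_ip in known_service_ips:
        return "dashed"  # inferred: the caller's IP is a known service
    return "dotted"      # anonymous client traffic
```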

Services Path view — service-to-service call graph with proxy / inferred / client edges, colored by app

Wire-API decoders

  • OpenAI Chat Completions (/v1/chat/completions)
  • OpenAI Responses (/v1/responses)
  • Anthropic Messages (/v1/messages)
  • Gemini AI Studio (generativelanguage.googleapis.com)

This covers OpenAI direct, Azure OpenAI, Anthropic direct, AWS Bedrock / GCP Vertex (Anthropic wire), Google Gemini, and any OpenAI-compatible local server — vLLM, SGLang, Ollama, llama.cpp's server, LM Studio, etc.
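As a rough illustration of wire-API detection, the sketch below classifies a request using only the paths and host listed above. The function name and the returned labels are hypothetical. A request line alone cannot cover every provider named in the previous paragraph, since some use other URL layouts, so a real detector would presumably need more than this.

```python
from typing import Optional


def detect_wire_api(host: str, path: str) -> Optional[str]:
    """Guess the wire API from the Host header and the request path."""
    path = path.split("?", 1)[0].rstrip("/")  # drop query string and trailing slash
    hostname = host.lower().split(":")[0]     # drop an optional port
    if hostname == "generativelanguage.googleapis.com":
        return "gemini-ai-studio"
    if path.endswith("/v1/chat/completions"):
        return "openai-chat-completions"
    if path.endswith("/v1/responses"):
        return "openai-responses"
    if path.endswith("/v1/messages"):
        return "anthropic-messages"
    return None  # unknown: leave it as a raw HTTP exchange
```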

Per-call drill-down when you need it — every LLM call is also captured with structured request/response and the raw body. Stalled tool calls, malformed prompts, unexpected token counts: the evidence is on the page, not behind a re-run.

Metrics are framed first at the agent layer (turn count and duration distribution per agent kind, call count per turn, tool-call success rate) and then at the call layer: TTFT · E2E latency · TPOT · token throughput · call rate · active calls · call error rate · prompt-cache hit ratio. The Overview page is built around both layers. See the glossary for what each metric means and why it matters.
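For orientation, the three latency metrics are commonly derived from a handful of timestamps, as in the sketch below. These are the widely used definitions and may differ in detail from TokenScope's; the glossary is the authority. The `CallTiming` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallTiming:
    request_sent: float  # seconds, when the request was fully sent
    first_token: float   # seconds, when the first streamed token arrived
    last_token: float    # seconds, when the final streamed chunk arrived
    output_tokens: int   # number of generated tokens


def ttft(c: CallTiming) -> float:
    """Time to first token."""
    return c.first_token - c.request_sent


def e2e_latency(c: CallTiming) -> float:
    """End-to-end latency from request to last token."""
    return c.last_token - c.request_sent


def tpot(c: CallTiming) -> Optional[float]:
    """Time per output token, averaged over tokens after the first."""
    if c.output_tokens < 2:
        return None  # undefined with fewer than two tokens
    return (c.last_token - c.first_token) / (c.output_tokens - 1)
```

For a call with a 0.5 s wait before the first token and 100 further tokens over the next 2.0 s, this gives a TTFT of 0.5 s, an E2E latency of 2.5 s, and a TPOT of 20 ms.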

Overview — agent activity timeseries + per-kind distribution at the top, call-rate / latency / error rate / per-model panels below

Storage in DuckDB (default, embedded, single-file) with per-table retention enabled out of the box. Pluggable backend trait — PostgreSQL and ClickHouse are designed but not yet wired.

Console at http://localhost:3000: overview · performance · usage · errors · services (table / path / model views) · agent turns · agent sessions · LLM calls (with full request/response body drill-down) · raw HTTP exchanges · pipeline-health debug views.

More console screenshots

Services Table view — per-endpoint metrics with auto-classified app type

Agent session — full transcript across turns

Traffic — call rate and token throughput over time

Pipeline health — internal queue depth, drop counters, leak canaries

Distribution: prebuilt static binaries for Linux musl (x86_64 + aarch64) and macOS (Intel + Apple Silicon). Web console is embedded in the binary — single artifact, no separate frontend deploy.

Who it's for

  • Agent developers — debug stalled tool calls, detect plan-loop / "no submit" failure modes, and see exactly which model+endpoint each turn hit, without modifying the agent or its SDK
  • AI platform / inference ops — see the real service-to-service topology your traffic flows through (clients → litellm → vLLM / SGLang), measure each hop independently, and catch silent model substitutions
  • FinOps & engineering managers — attribute spend across teams/repos/projects from real turns, not periodic SDK exports that can drift
  • Compliance & security — capture-once evidence chain of what crossed the wire, scoped per agent kind and per session

Quickstart

# Install (Linux/macOS, no sudo, user-local)
curl -fsSL https://raw.githubusercontent.com/Netis/TokenScope/main/install.sh \
  | INSTALL_DIR="$HOME/.local" sh

# Linux: grant capture privileges to the binary (no sudo at runtime)
sudo setcap cap_net_raw,cap_net_admin=eip ~/.local/bin/tokenscope

# Capture from a live interface
tokenscope -i eth0

# ...or replay a pcap (no privileges needed)
tokenscope --pcap-file capture.pcap --no-retention

Then open http://localhost:3000.

After a pcap finishes replaying, the process keeps the API and console available so you can browse the results. Press Ctrl+C to exit, or pass --exit-after-drain in batch/CI runs to exit as soon as the pipeline drains.

TokenScope sees plaintext HTTP. Install it where the traffic is already decrypted, such as on the inference host, behind a TLS terminator, or fed from a trusted packet source.

For systemd deployment, capability options, and uninstall, see docs/install.md.

Documentation

  • Install — one-line installer, systemd, capabilities
  • Configure — pipelines, sources, storage, retention
  • Glossary — what every metric means
  • Architecture — pipeline design and trade-offs
  • Mission — long-arc vision

Roadmap

The current surface is the foundation layer (Ops use cases). On the way:

  • Storage — PostgreSQL and ClickHouse backends (schemas already designed)
  • Wire APIs — more provider-specific extensions (Bedrock variants, Vertex non-Anthropic, etc.)

See docs/mission.md for the full ladder.

Contributing

Bug reports and PRs welcome. Before opening a PR, run:

just build all       # single binary with embedded console
just quality all     # rust fmt + clippy + ts lint + tsc
just test all        # cargo test (all crates)

Run just help for the full menu. Design docs under docs/design/ describe the per-module contract — read the relevant one before changing anything load-bearing.

License

Apache 2.0.

About

Agent and LLM API performance monitoring via network packet probe. Measures performance of OpenClaw, Claude, Codex, DeepAgents and more — deployed on the provider side, no SDK changes required.
