
feat(telemetry): add anonymous opt-out OpenShell usage telemetry #1433

Open
kirit93 wants to merge 2 commits into main from kirit93/telemetry

Conversation

@kirit93
Collaborator

@kirit93 kirit93 commented May 18, 2026

Summary

Add lightweight, opt-out telemetry for anonymous OpenShell usage metrics. This gives the project aggregate visibility into sandbox lifecycle activity, policy updates, provider operations, sandbox creation shape, and sandbox network-denial summaries.

Telemetry does not track individual users and does not collect personal data, sandbox IDs, sandbox names, hosts, paths, model names, provider names, credentials, prompts, request payloads, or user content.

The telemetry events captured are:

  • Sandbox creation: records whether a sandbox create request succeeded or failed. It also records whether GPU was requested as true/false, the number of providers attached as a count only, whether a custom policy was provided as true/false with no policy details collected, whether the template source was default or image-based without collecting the image name, and which compute driver was used (docker, kubernetes, podman, vm, or unknown).

  • Sandbox deletion: records whether a sandbox delete request succeeded or failed.

  • Sandbox policy updates: records whether a gateway-level sandbox policy update succeeded or failed. Policy contents and rule details are not collected.

  • Provider lifecycle: records provider create, update, and delete outcomes. Provider type is mapped to a broad profile bucket such as anthropic, claude, codex, github, gitlab, nvidia, openai, outlook, or custom. Provider names, credentials, endpoints, and configuration details are not collected.

  • Policy draft decisions: records whether approve, reject, approve-all, or undo actions succeeded or failed. It also records the number of rules affected as a count only. Rule contents are not collected.

  • Sandbox network activity summaries: records aggregate network activity counts, denied action counts, denial rate percentage, and denial counts grouped into coarse categories such as connect policy, forward policy, L7 policy, SSRF, bypass, stale policy, or unknown. Destinations, hostnames, URLs, request paths, payloads, binaries, and raw deny messages are not collected.
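The network activity summary described above can be sketched as a small aggregate that keeps only counts and coarse categories. This is a minimal illustration, not the PR's implementation: `NetworkActivitySummary`, `DenialCategory`, the field names, and the round-down percentage are all assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Coarse denial categories named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum DenialCategory {
    ConnectPolicy,
    ForwardPolicy,
    L7Policy,
    Ssrf,
    Bypass,
    StalePolicy,
    Unknown,
}

/// Hypothetical aggregate: holds counts only, never destinations or messages.
#[derive(Debug, Default)]
pub struct NetworkActivitySummary {
    pub activity_count: u64,
    pub denied_count: u64,
    pub denials_by_category: BTreeMap<DenialCategory, u64>,
}

impl NetworkActivitySummary {
    /// Record one network action. `denial` is `None` when the action was allowed.
    pub fn record(&mut self, denial: Option<DenialCategory>) {
        self.activity_count += 1;
        if let Some(category) = denial {
            self.denied_count += 1;
            *self.denials_by_category.entry(category).or_insert(0) += 1;
        }
    }

    /// Denial rate as an integer percentage, rounded down (assumed rounding).
    /// Returns 0 when no activity has been recorded.
    pub fn denial_rate_pct(&self) -> u64 {
        if self.activity_count == 0 {
            return 0;
        }
        self.denied_count.saturating_mul(100) / self.activity_count
    }
}
```

Because the aggregate stores only a category and a counter per action, hostnames, URLs, and raw deny messages never enter the structure in the first place.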

Telemetry can be disabled at deployment time by setting OPENSHELL_TELEMETRY_ENABLED=false on the gateway.
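A minimal sketch of how the opt-out check might look, assuming the gateway reads the variable from its process environment. Only the variable name and the `false` value come from the description; the function names, the case-insensitive match, and treating every other value as enabled are assumptions.

```rust
use std::env;

/// Name of the opt-out variable, as given in the PR description.
pub const TELEMETRY_ENV_VAR: &str = "OPENSHELL_TELEMETRY_ENABLED";

/// Pure helper so the parsing rule can be tested without touching the
/// process environment.
/// Assumption: only a case-insensitive "false" disables telemetry; an unset
/// variable or any other value leaves it enabled (opt-out semantics).
pub fn telemetry_enabled_from(value: Option<&str>) -> bool {
    match value {
        Some(v) => !v.trim().eq_ignore_ascii_case("false"),
        None => true,
    }
}

/// Reads the variable from the current process environment.
pub fn telemetry_enabled() -> bool {
    telemetry_enabled_from(env::var(TELEMETRY_ENV_VAR).ok().as_deref())
}
```

Splitting the parsing rule from the environment lookup keeps the opt-out path easy to unit test.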

Related Issue

Fixes #1054

Testing

  • mise run pre-commit passes
  • Unit tests added/updated

Checklist

@copy-pr-bot

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@kirit93 kirit93 requested a review from johntmyers May 18, 2026 19:59

@zredlined zredlined assigned zredlined and kirit93 and unassigned zredlined May 18, 2026
@zredlined zredlined added the topic:observability Logging, metrics, and observability work label May 18, 2026
@kirit93 kirit93 force-pushed the kirit93/telemetry branch from 4aa8e3c to fab5164 May 18, 2026 20:27
@zredlined
Collaborator

Looks like the Rust telemetry helper shells out to python3 and defaults to running scripts/publish_telemetry.py from the source tree. I’m worried this will likely no-op in packaged/runtime environments: the gateway image is distroless, the supervisor image is scratch, and neither appears to include Python or the telemetry script. Since failures are intentionally silent, we may not notice that telemetry never publishes outside local/dev setups.

Suggestion: we might want to implement the telemetry publisher fully in Rust for better portability. That would keep the runtime behavior inside the shipped binaries, avoid depending on Python/script packaging, and make the opt-out/publish path easier to test across gateway, supervisor, and package installs.
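To illustrate the in-process direction suggested above, here is a minimal sketch of a non-blocking event queue built only on the Rust standard library. It is not the PR's code: `TelemetryEvent`, `TelemetryQueue`, `try_enqueue`, and `drain_into` are hypothetical names, and the actual HTTP publish step is left to whatever sink the caller supplies.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical event type; the PR's real event schema is not shown here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TelemetryEvent {
    pub name: &'static str,
    pub success: bool,
}

/// Producer handle that never blocks the caller.
#[derive(Clone)]
pub struct TelemetryQueue {
    tx: SyncSender<TelemetryEvent>,
}

impl TelemetryQueue {
    /// Creates a bounded queue and the receiver a publisher task would own.
    pub fn new(capacity: usize) -> (Self, Receiver<TelemetryEvent>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx }, rx)
    }

    /// Enqueues without blocking. Returns `false` when the event was dropped
    /// because the queue is full or the receiver has gone away.
    pub fn try_enqueue(&self, event: TelemetryEvent) -> bool {
        match self.tx.try_send(event) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
        }
    }
}

/// Drains everything currently queued into `sink` and returns how many
/// events were handed over. A real publisher would send them over HTTP.
pub fn drain_into<F: FnMut(TelemetryEvent)>(rx: &Receiver<TelemetryEvent>, mut sink: F) -> usize {
    let mut drained = 0;
    while let Ok(event) = rx.try_recv() {
        sink(event);
        drained += 1;
    }
    drained
}
```

Returning a `bool` from `try_enqueue` keeps the request path silent on failure while still giving tests something to assert on, which addresses the concern that a silent no-op could go unnoticed.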

@zredlined
Collaborator

@kirit93 Can you add a short telemetry section to the primary README as part of this PR?

@kirit93 kirit93 force-pushed the kirit93/telemetry branch from 20df3af to 96c8730 May 19, 2026 16:07
@kirit93 kirit93 moved this from Todo to In progress in OpenShell Roadmap May 19, 2026
@kirit93 kirit93 force-pushed the kirit93/telemetry branch from dc0aa8a to dbab5e8 May 19, 2026 20:15
Contributor

@russellb russellb left a comment

Something seems to be wrong with your branch. It appears to be reverting a lot of changes from main.

@johntmyers
Collaborator

@russellb yeah i'm looking at this as well

const TELEMETRY_EVENT_QUEUE_CAPACITY: usize = 1024;
const MAX_TELEMETRY_INTEGER: u64 = 9_223_372_036_854_775_807;
const CLIENT_ID: &str = "415437562476676";
const DEFAULT_ENDPOINT: &str = "https://events.telemetry.data-uat.nvidia.com/v1.1/events/json";
Contributor

Is this something that could ever be visible to people outside of NVIDIA?

@kirit93 kirit93 force-pushed the kirit93/telemetry branch from dbab5e8 to 5b2e26f May 20, 2026 17:34
Comment thread crates/openshell-sandbox/src/proxy.rs Outdated

let effectively_denied = force_deny
    || (!allowed && l7_config.config.enforcement == crate::l7::EnforcementMode::Enforce);
emit_activity_simple(activity_tx, effectively_denied, "l7_policy");
Collaborator

[P2] This records an L7 activity event before the forward-proxy SSRF checks. If L7 allows but the later SSRF check denies the same request, the same request gets counted once as allowed L7 activity here and again as denied SSRF activity below. That inflates networkActivityCount and understates denialRatePct. Please defer the L7 allowed activity record until after SSRF succeeds, or otherwise ensure each forward request contributes exactly one activity event.
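One way to get exactly one activity event per forward request is to compute the final outcome first and emit once at the end. The sketch below is illustrative only: `ForwardDecision`, `ActivityEvent`, and `activity_for_forward_request` are hypothetical names, and the `"ssrf"` category string is assumed. Only `"l7_policy"` and the shape of the deny expression come from the quoted snippet.

```rust
/// Hypothetical bundle of the decisions made while handling one forward request.
#[derive(Debug, Clone, Copy)]
pub struct ForwardDecision {
    pub force_deny: bool,
    pub l7_allowed: bool,
    /// True when L7 enforcement mode is `Enforce` rather than audit-only.
    pub l7_enforcing: bool,
    /// Result of the later forward-proxy SSRF check.
    pub ssrf_denied: bool,
}

/// The single activity event that a request contributes to telemetry.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ActivityEvent {
    pub denied: bool,
    pub category: &'static str,
}

/// Resolves the final outcome before anything is recorded, so a request that
/// passes L7 but fails SSRF is counted once, as a denial.
pub fn activity_for_forward_request(d: ForwardDecision) -> ActivityEvent {
    // Same expression shape as the quoted snippet.
    let l7_denied = d.force_deny || (!d.l7_allowed && d.l7_enforcing);
    if l7_denied {
        return ActivityEvent { denied: true, category: "l7_policy" };
    }
    if d.ssrf_denied {
        // Assumed category label for SSRF denials.
        return ActivityEvent { denied: true, category: "ssrf" };
    }
    // Allowed by L7 and by the SSRF check: record the allowed activity now.
    ActivityEvent { denied: false, category: "l7_policy" }
}
```

With this shape the caller makes a single emit call per request using the returned event, so the activity count and denial rate are computed from one record per request.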

Signed-off-by: Kirit93 <kthadaka@nvidia.com>
@kirit93 kirit93 force-pushed the kirit93/telemetry branch from 5b2e26f to bdc660f May 20, 2026 19:39
Signed-off-by: Kirit Thadaka <kthadaka@nvidia.com>

Labels

topic:observability Logging, metrics, and observability work

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

OpenShell Telemetry

4 participants