Improving logging telemetry in the VMDistributed path#1996

Open
djluck wants to merge 2 commits into VictoriaMetrics:master from djluck:improve-telemetry

djluck commented Mar 23, 2026

Motivation

Due to the number of entities involved in our VMDistributed deployments and the long delays when deploying zones, we can often be either unsure about what is currently happening or overwhelmed by information. This change attempts to rectify this.

Changes

  • Common log entries are now marked as "debug" level to avoid overwhelming logs with noisy entries
  • New info-level log records describe the start and end of a persistent queue wait
  • waitForStatus now reports any errors periodically rather than waiting until the timeout occurs

Testing

  • Built and pushed a test version, ran it on our staging environment, and confirmed that the debug level is correctly propagated to zap
  • Verified that debug-level logs aren't present by default (must pass -zap-log-level=debug)
  • All existing tests pass

Summary by cubic

Reduces log noise and adds periodic status logging to VMDistributed readiness checks and queue draining, making rollouts and queue draining easier to monitor without flooding logs.

  • New Features
    • waitForStatus now logs current status at debug every 60s; it accepts a log interval and callers (VMAgent, VMAuth, VMCluster) are updated.
    • In VMDistributed queue draining, per-instance polling logs moved to debug; added info logs at start (“ensuring persistent queues are drained…”) and end (“all persistent queues were drained”).
    • Introduced internal/logging with LevelInfo and LevelDebug to standardize log verbosity.

Written for commit 81e37f9. Summary will update on new commits.

- Common log entries are now marked as "debug" level
- waitForStatus now reports any errors periodically rather than waiting until the timeout occurs

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 7 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="internal/controller/operator/factory/vmdistributed/zone.go">

<violation number="1" location="internal/controller/operator/factory/vmdistributed/zone.go:369">
P2: `waitForEmptyPQ` logs queue-drained success unconditionally, even when the operation ended due to context cancellation/timeout, causing misleading telemetry.</violation>
</file>
