fix: set ReportFullState flag in OpAMP responses on drift detection #6831

Draft
michel-laterman wants to merge 1 commit into elastic:main from michel-laterman:fix/opamp-full-status-flag

Conversation

michel-laterman commented Apr 13, 2026

What is the problem this PR solves?

Fleet-server does not use the OpAMP spec's ServerToAgent.flags.ReportFullState flag. When the server detects drift (sequence number gaps, incomplete first messages, or reconnects after disconnect), it has no way to ask the agent to resend its full status.

How does this PR solve the problem?

Adds two functions to the OpAMP handler:

  • hasFullStatus: capability-aware check that verifies the message contains all expected fields. AgentDescription and Health are always required. EffectiveConfig, RemoteConfigStatus, PackageStatuses, AvailableComponents, and ConnectionSettingsStatus are required only when the corresponding Reports* capability bit is set. A message whose capabilities are unset (0) always fails the check.
  • shouldRequestFullState: returns true when the ReportFullState flag should be set:
    1. Sequence gap: sequence_num is not exactly stored + 1
    2. New enrollment without full status
    3. Reconnect from disconnect with sequence_num == 0 without full status
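The two checks described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `agentMessage`, `agentRecord`, the `cap*` bit values, and the `present` marker are hypothetical stand-ins for the OpAMP message and the stored agent document, and the order in which the three conditions are evaluated is an assumption (enrollment and reconnect are checked before the sequence gap).

```go
package main

import "fmt"

// Illustrative capability bits. These are stand-ins, not the OpAMP enum values.
const (
	capReportsEffectiveConfig uint64 = 1 << iota
	capReportsRemoteConfig
	capReportsPackageStatuses
	capReportsAvailableComponents
	capReportsConnectionSettingsStatus
)

// field models an optional sub-message: nil means "not sent".
type field = *struct{}

var present field = &struct{}{}

// agentMessage is a hypothetical, reduced stand-in for AgentToServer.
type agentMessage struct {
	SequenceNum                                   uint64
	Capabilities                                  uint64
	AgentDescription, Health                      field
	EffectiveConfig, RemoteConfigStatus           field
	PackageStatuses, AvailableComponents          field
	ConnectionSettingsStatus                      field
}

// agentRecord is a hypothetical view of what the server has stored.
type agentRecord struct {
	Known        bool // false for a new enrollment
	Disconnected bool
	SequenceNum  uint64
}

// hasFullStatus reports whether every field the agent is expected to send is present.
func hasFullStatus(m *agentMessage) bool {
	if m == nil || m.Capabilities == 0 {
		return false // unset capabilities always fail
	}
	if m.AgentDescription == nil || m.Health == nil {
		return false // always required
	}
	optional := []struct {
		bit   uint64
		value field
	}{
		{capReportsEffectiveConfig, m.EffectiveConfig},
		{capReportsRemoteConfig, m.RemoteConfigStatus},
		{capReportsPackageStatuses, m.PackageStatuses},
		{capReportsAvailableComponents, m.AvailableComponents},
		{capReportsConnectionSettingsStatus, m.ConnectionSettingsStatus},
	}
	for _, o := range optional {
		// Required only when the matching capability bit is set.
		if m.Capabilities&o.bit != 0 && o.value == nil {
			return false
		}
	}
	return true
}

// shouldRequestFullState reports whether the response should carry ReportFullState.
func shouldRequestFullState(stored agentRecord, m *agentMessage) bool {
	switch {
	case !stored.Known: // new enrollment
		return !hasFullStatus(m)
	case stored.Disconnected && m.SequenceNum == 0: // reconnect
		return !hasFullStatus(m)
	default: // sequence gap
		return m.SequenceNum != stored.SequenceNum+1
	}
}

func main() {
	msg := &agentMessage{
		SequenceNum:      7,
		Capabilities:     capReportsEffectiveConfig,
		AgentDescription: present,
		Health:           present,
	}
	// EffectiveConfig is advertised but missing, so the status is not full.
	fmt.Println(hasFullStatus(msg))
	// Stored sequence is 5 and the message carries 7, so a gap is detected.
	fmt.Println(shouldRequestFullState(agentRecord{Known: true, SequenceNum: 5}, msg))
}
```

Driving the optional checks from a table keeps each capability bit next to the field it gates, which makes it harder to add a new Reports* capability without also adding its field check.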

The flag is computed before updateAgent and set on the ServerToAgent response.
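One plausible reason for this ordering is that the update overwrites the stored sequence number the drift check depends on. The sketch below illustrates that under heavy simplification: `store`, `serverResponse`, and `handle` are hypothetical names, the map write stands in for updateAgent, only the sequence-gap rule is modeled, unknown agents are always asked for full state, and `flagReportFullState` is a local stand-in for the spec's flag constant.

```go
package main

import "fmt"

// flagReportFullState is a local stand-in for the ReportFullState bit.
const flagReportFullState uint64 = 1

// serverResponse is a hypothetical, reduced stand-in for ServerToAgent.
type serverResponse struct {
	Flags uint64
}

// store is a hypothetical in-memory stand-in for the persisted agent records.
type store struct {
	seq map[string]uint64
}

// handle decides on drift using the stored state, then persists the new state.
func handle(s *store, agentID string, seq uint64) serverResponse {
	stored, known := s.seq[agentID]

	// Decide first: after the update, stored would equal seq and the
	// comparison below would no longer reflect what the server had seen.
	requestFull := !known || seq != stored+1

	// Stand-in for updateAgent: persist the latest sequence number.
	s.seq[agentID] = seq

	var resp serverResponse
	if requestFull {
		resp.Flags |= flagReportFullState
	}
	return resp
}

func main() {
	s := &store{seq: map[string]uint64{"agent-1": 4}}
	in := handle(s, "agent-1", 5)  // 5 follows 4
	gap := handle(s, "agent-1", 9) // 9 does not follow 5
	fmt.Println(in.Flags&flagReportFullState != 0, gap.Flags&flagReportFullState != 0)
}
```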

How to test this PR locally

go test ./internal/pkg/api/ -run "TestHasFullStatus|TestShouldRequestFullState|TestHandleMessageReportFullState" -v

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Fleet-server now sets the ReportFullState flag in ServerToAgent responses
when it detects sequence number gaps, new enrollments without full status,
or reconnects from disconnected agents that don't report full state.
The full status check is capability-aware per the OpAMP spec.

Closes elastic#6783

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
michel-laterman added the bug, backport-9.4, and Team:Elastic-Agent-Control-Plane labels Apr 13, 2026

mergify bot commented Apr 13, 2026

This pull request now has conflicts. Could you fix it @michel-laterman? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix/opamp-full-status-flag upstream/fix/opamp-full-status-flag
git merge upstream/main
git push upstream fix/opamp-full-status-flag

Development

Successfully merging this pull request may close these issues.

[OpAMP] fleet-server should request a status report if drift is detected