michel-laterman
Contributor

@michel-laterman michel-laterman commented Sep 11, 2025

What is the problem this PR solves?

Agent policy details (id + revision) may go out of sync with what fleet-server records.
This occurs when a VM running an agent with policy X on revision N is restored to an earlier point in time, when it was running policy X on revision N-1 or even a different policy. It can also occur if the ES cluster is restored to an earlier snapshot, in which case the agent is running a revision (N+1) that is newer than the one ES records.

How does this PR solve the problem?

Allow agents to include their currently running policy_id and
revision_idx attributes in the checkin request body. If these attributes
are present and differ from the agent doc, they are used when updating
the agent doc in the pre-poll checkin. If the agent's policy id does not
match the policy id the server expects, a reassign is detected and a new
policy change action is sent. If the revision differs, a policy change
action is also sent. If an agent checks in with a different policy or
revision, its API keys may be managed. A feature flag is added to disable
this behaviour and use only Acks and the fleet-agents doc as the source
of truth.
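
The decision described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the `CheckinPolicy`, `AgentDoc`, and `Decision` types, the `Reconcile` function, and the `ignoreCheckinPolicy` parameter (standing in for the feature flag) are all hypothetical names. Only `policy_id` and `revision_idx` come from the description.

```go
package main

// CheckinPolicy holds the optional policy details an agent may report at
// checkin. The fields mirror the policy_id and revision_idx attributes;
// the type itself is hypothetical.
type CheckinPolicy struct {
	PolicyID    string
	RevisionIdx int64
}

// AgentDoc is a hypothetical, simplified stand-in for the agent document
// that fleet-server keeps.
type AgentDoc struct {
	PolicyID    string // policy the server expects the agent to run
	RevisionIdx int64  // revision the server last recorded for the agent
}

// Decision summarises what the pre-poll checkin would do (hypothetical).
type Decision struct {
	UpdateDoc        bool // write the reported details to the agent doc
	SendPolicyChange bool // dispatch a new policy change action
	Reassigned       bool // reported policy id differs from the expected one
}

// Reconcile compares the reported details with the agent doc.
// ignoreCheckinPolicy models the feature flag: when set, the reported
// details are ignored and Acks plus the agent doc remain the source of truth.
func Reconcile(doc AgentDoc, reported *CheckinPolicy, ignoreCheckinPolicy bool) Decision {
	// Nothing to reconcile if the flag is set or the agent sent no details.
	if ignoreCheckinPolicy || reported == nil {
		return Decision{}
	}
	// A different policy id is treated as a reassign.
	if reported.PolicyID != doc.PolicyID {
		return Decision{UpdateDoc: true, SendPolicyChange: true, Reassigned: true}
	}
	// Same policy, different revision (older or newer than recorded).
	if reported.RevisionIdx != doc.RevisionIdx {
		return Decision{UpdateDoc: true, SendPolicyChange: true}
	}
	// Agent and server agree; no action needed.
	return Decision{}
}
```

For example, under these assumptions an agent restored from a VM snapshot that reports policy X at revision N-1 while the doc records revision N would get `UpdateDoc` and `SendPolicyChange` set but not `Reassigned`. API key handling is left out of the sketch because the description does not spell it out.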

How to test this PR locally

Tests were added to the e2e API test suite. Run:

mage test:e2e

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@michel-laterman michel-laterman added the enhancement, backport-skip, and Team:Elastic-Agent-Control-Plane labels Sep 11, 2025
@prodsecmachine

prodsecmachine commented Sep 11, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

@michel-laterman michel-laterman force-pushed the feat/checkin-policy-details branch from 3c342b8 to f0dcf41 Compare September 11, 2025 23:43
@michel-laterman
Contributor Author

Integration or E2E tests can be added after elastic/elasticsearch#134517 has been merged and a snapshot with the changes is available.

Allow the agents to add their currently running policy_id and
revision_idx attributes to the checkin request bodies. These attributes,
if included and different from the agent doc will be used when updating
the agent doc in the pre-poll checkin. If the agent's policy id does not
match the expected policy id from the server a reassign is detected and
a new policy change action will be sent. If the checkin ID is greater
than what was previously recorded or the policy id changes from what was
previously recorded, then the api keys will be managed.
@michel-laterman michel-laterman force-pushed the feat/checkin-policy-details branch from f0dcf41 to 9aa2ba2 Compare September 16, 2025 20:39
Handle the scenario when the agent checks in with a revision_idx value
that is greater than the latest available policy in ES. Add E2E tests
when using policy_id and revision_idx values in checkin.
@michel-laterman michel-laterman force-pushed the feat/checkin-policy-details branch from 9aa2ba2 to 557ffd0 Compare September 16, 2025 21:40
@michel-laterman michel-laterman force-pushed the feat/checkin-policy-details branch from b59e45c to a5c7f64 Compare September 17, 2025 16:58
@michel-laterman michel-laterman marked this pull request as ready for review September 17, 2025 20:16
@michel-laterman michel-laterman requested a review from a team as a code owner September 17, 2025 20:16
ycombinator
ycombinator previously approved these changes Sep 22, 2025
Contributor

@blakerouse blakerouse left a comment

Thanks for adding the fix for the revision being higher. Looks good.

@michel-laterman michel-laterman merged commit 22f1f7a into elastic:main Sep 23, 2025
13 checks passed
@michel-laterman michel-laterman deleted the feat/checkin-policy-details branch September 23, 2025 16:24
4 participants