@pkoutsovasilis pkoutsovasilis commented Sep 17, 2025

What does this PR do?

This PR introduces support for emitting system resource metrics for the EDOT (Elastic Distribution of OpenTelemetry) collector when it runs as a subprocess of the Elastic Agent.

Key changes:

  • Added a new features.agent.otel.subprocess_execution feature flag to control whether the OTel collector runs in subprocess execution mode.
    • This flag defaults to false for now (maintaining existing behaviour), but is expected to default to true in the near future.
  • Extended the monitoring configuration generation logic to create a dedicated HTTP metrics stream for the EDOT subprocess, going by the name elastic-agent/collector, capturing only its system resources usage.
  • Updated OTelManager construction to honor the execution mode parsed from the feature flags, rather than always running in embedded mode.
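
As a rough illustration of the last bullet, the mapping from the feature flag to an execution mode might look like the sketch below. The type and function names (ExecutionMode, featureFlags, executionModeFor) are hypothetical and are not taken from the PR's code; only the flag semantics (false keeps embedded mode, true selects subprocess mode) come from the description above.

```go
package main

import "fmt"

// ExecutionMode is a hypothetical type used only for this sketch.
type ExecutionMode string

const (
	ModeEmbedded   ExecutionMode = "embedded"
	ModeSubprocess ExecutionMode = "subprocess"
)

// featureFlags mirrors only the single flag discussed in this PR.
type featureFlags struct {
	SubprocessExecution bool
}

// executionModeFor maps the flag to a mode. A false (default) flag keeps
// the existing embedded behaviour.
func executionModeFor(f featureFlags) ExecutionMode {
	if f.SubprocessExecution {
		return ModeSubprocess
	}
	return ModeEmbedded
}

func main() {
	fmt.Println(executionModeFor(featureFlags{}))
	fmt.Println(executionModeFor(featureFlags{SubprocessExecution: true}))
}
```

Keeping the zero value of the flag on the embedded path is what makes the change non-disruptive by default.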

Why is it important?

Running the collector as a subprocess improves resilience by isolating the control plane from the data plane. Emitting metrics for the EDOT process ensures operational visibility, allowing users to observe and troubleshoot its resource usage independently from the main Elastic Agent process.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

No disruptive impact expected.

  • When features.agent.otel.subprocess_execution is false (default), behavior is unchanged.
  • When the flag is enabled, users will see an additional monitoring stream for elastic-agent/collector.

How to test this PR locally

  1. Build Elastic Agent from this branch.
  2. Enable the subprocess execution mode in your elastic-agent.yml:
    agent:
      features:
        otel:
          subprocess_execution: true
  3. Install elastic-agent.
  4. Verify in Kibana’s Agent Metrics dashboard that a separate metrics stream for elastic-agent/collector appears. Note: you might have to install the elastic-agent integration if it's not already installed.
[Image: Screenshot 2025-09-17 at 2 39 39 PM]

Related issues

N/A

@pkoutsovasilis pkoutsovasilis self-assigned this Sep 17, 2025
@pkoutsovasilis pkoutsovasilis added the Team:Elastic-Agent-Control-Plane, skip-changelog, and backport-8.19 labels Sep 17, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the feat/add_edot_subprocess_monitoring branch from e05ba2f to 004abb2 Compare September 17, 2025 10:21
@pkoutsovasilis pkoutsovasilis force-pushed the feat/add_edot_subprocess_monitoring branch from 004abb2 to a4da5e0 Compare September 18, 2025 11:05
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review September 18, 2025 11:16
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner September 18, 2025 11:16
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz
Member

cmacknz commented Sep 18, 2025

Extended the monitoring configuration generation logic to create a dedicated HTTP metrics stream for the EDOT subprocess, going by the name elastic-agent/collector, capturing only its system resources usage.

From what I know/remember, I think this metrics.elastic-agent/collector-* index will have to be in the list of well known monitoring indices in Fleet defined here in Kibana, and it'll also have to be defined as a data stream in the elastic_agent package.

I do not have the entire history on why the monitoring indices need to be defined in Kibana, but I don't think we'll have Fleet managed output API keys generated with permissions to write to the new data stream without that.

Edit: The test that would prove the need for this is to make sure this data stream can be written to for a Fleet managed agent.

@cmacknz
Member

cmacknz commented Sep 18, 2025

There is an existing test that I think could be easily updated to prove this works e2e for Fleet managed agents, by expecting the elastic-agent/collector component to have system metrics populated.

componentIds := []string{
	fmt.Sprintf("system/metrics-%s", UnitOutputName),
	fmt.Sprintf("log-%s", UnitOutputName),
	"beat/metrics-monitoring",
	"elastic-agent",
	"http/metrics-monitoring",
	"filestream-monitoring",
}
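
A minimal sketch of the kind of check this suggestion implies: given the component IDs a test expects and the IDs actually observed with system metrics, report which ones are missing. The helper missingComponents and the sample data are hypothetical; the real test in the repository may be structured quite differently.

```go
package main

import "fmt"

// missingComponents returns the expected IDs that do not appear in observed.
func missingComponents(expected, observed []string) []string {
	seen := make(map[string]struct{}, len(observed))
	for _, id := range observed {
		seen[id] = struct{}{}
	}
	var missing []string
	for _, id := range expected {
		if _, ok := seen[id]; !ok {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	// "elastic-agent/collector" is the component name used in this PR; the
	// observed list is made-up sample data.
	expected := []string{"elastic-agent", "http/metrics-monitoring", "elastic-agent/collector"}
	observed := []string{"elastic-agent", "http/metrics-monitoring"}
	fmt.Println(missingComponents(expected, observed)) // [elastic-agent/collector]
}
```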

@pkoutsovasilis pkoutsovasilis force-pushed the feat/add_edot_subprocess_monitoring branch from cf3c623 to 4cd10aa Compare September 23, 2025 14:23
@pkoutsovasilis
Contributor Author

From what I know/remember, I think this metrics.elastic-agent/collector-* index will have to be in the list of well known monitoring indices in Fleet defined here in Kibana, and it'll also have to be defined as a data stream in the elastic_agent package.

I do not have the entire history on why the monitoring indices need to be defined in Kibana, but I don't think we'll have Fleet managed output API keys generated with permissions to write to the new data stream without that.

Edit: The test that would prove the need for this is to make sure this data stream can be written to for a Fleet managed agent.

Look, the resource metrics for the EDOT subprocess are being written to the existing elastic-agent.elastic-agent dataset, so we don’t need to make any changes on the Kibana side. I didn’t do anything special, and the metrics show up correctly in the Agent Metrics dashboard.
[Image: Screenshot 2025-09-24 at 6 00 01 PM]

There is an existing test that I think could be easily updated to prove this works e2e for Fleet managed agents, by expecting the elastic-agent/collector component to have system metrics populated.

As discussed offline with @cmacknz, since this setting is short-lived, we’ll handle the end-to-end testing in a follow-up PR when subprocess execution mode becomes the default.

Contributor

@swiatekm swiatekm left a comment

LGTM overall, one comment about monitoring config reloading and a question about the new monitoring server.

Contributor

@kaanyalti kaanyalti left a comment

Changes look good to me, waiting for ci to be green

@pkoutsovasilis
Contributor Author

Changes look good to me, waiting for ci to be green

@kaanyalti the CI won't get green as the Merge coverage reports step is failing with this error

2025/09/23 17:44:57 OVERLAP MERGE: github.com/elastic/elastic-agent/internal/pkg/otel/components.go {88 43 118 17 6 68} {88 43 122 17 7 5}

I am not quite sure what causes that, but I am fairly confident it is not an error caused by the functionality introduced by this PR.

@swiatekm swiatekm self-requested a review September 24, 2025 17:37
swiatekm
swiatekm previously approved these changes Sep 24, 2025
kaanyalti
kaanyalti previously approved these changes Sep 24, 2025
Contributor

@blakerouse blakerouse left a comment

Not going to block this PR from merging as it looks good, other than my 1 inline comment.

@cmacknz
Member

cmacknz commented Sep 24, 2025

Built locally and can see it's shipping metrics 👍

I did notice when inspecting diagnostics that we don't get profiles from the collector sub-process, so I created #10135 to track including these separately from this PR and added it to the parent issue for the sub-process conversion.

@pkoutsovasilis
Contributor Author

Built locally and can see it's shipping metrics 👍

I did notice when inspecting diagnostics that we don't get profiles from the collector sub-process, so I created #10135 to track including these separately from this PR and added it to the parent issue for the sub-process conversion.

Thanks for creating the tracking issue. In my mind, this could/should be covered by the EDOT diagnostics extension PR here.

@cmacknz
Member

cmacknz commented Sep 24, 2025

OK, I'm not as concerned with which piece of work it is part of, more with ensuring it gets implemented.

@pkoutsovasilis pkoutsovasilis dismissed stale reviews from kaanyalti and swiatekm via c181a17 September 24, 2025 20:59
@cmacknz
Member

cmacknz commented Sep 24, 2025

Out of curiosity I tried to hit the unix socket myself, and I'm getting a 404 on /stats but not on /debug/stats. The metricbeat modules hit /stats and seem to generate data, what am I missing here?

❯ ps aux | rg otel
root             96349   0.0  0.3 412492384 154320   ??  S     4:46PM   0:07.76 /Library/Elastic/Agent-Development/elastic-agent otel --supervised --supervised.logging.level=info --supervised.monitoring.url=unix:///Library/Elastic/Agent-Development/data/tmp/jTgjFSdPDnYoazUu541NazzsJSv6daDn.sock

❯ sudo curl --unix-socket /Library/Elastic/Agent-Development/data/tmp/jTgjFSdPDnYoazUu541NazzsJSv6daDn.sock http:/stats
404 page not found

❯ sudo curl --unix-socket /Library/Elastic/Agent-Development/data/tmp/jTgjFSdPDnYoazUu541NazzsJSv6daDn.sock http:/debug/stats
{"beat":{"cpu":{"system":{"ticks":3405,"time":{"ms":3405}},"total":{"ticks":7885,"time":{"ms":7885},"value":7885.641791666666},"user":{"ticks":4480,"time":{"ms":4480}}},"info":{"ephemeral_id":"f42f9139-32de-4226-b9cd-b4b214fa053f","name":"elastic-agent/c","uptime":{"ms":3267401},"version":"9.2.0"},"memstats":{"gc_next":52060306,"memory_alloc":26579376,"memory_sys":67912968,"memory_total":462099568,"rss":158072832},"runtime":{"goroutines":173}},"filebeat":{"harvester":{"closed":1,"gzip_closed":0,"gzip_open_files":0,"gzip_running":0,"gzip_started":0,"open_files":1,"running":1,"skipped":0,"started":2},"input":{"log":{"files":{"renamed":0,"truncated":0}}}},"libbeat":{"config":{"module":{"running":0,"starts":0,"stops":0},"reloads":0,"scans":0}},"processor":{"add_host_metadata":{"fqdn_lookup_failed":0}},"registrar":{"states":{"cleanup":0,"current":0,"update":0},"writes":{"fail":0,"success":0,"total":0}},"system":{"cpu":{"cores":14},"load":{"1":2.3857,"15":2.1997,"5":2.2944,"norm":{"1":0.1704,"15":0.1571,"5":0.1639}}}}%
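
The payload above nests the process resource figures under the beat key. A small sketch, assuming only the field names visible in that output, of how a consumer might pull out the RSS and total CPU ticks; the stats struct and parseStats helper are hypothetical and deliberately decode just a few fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stats models only a handful of the fields visible in the payload above.
type stats struct {
	Beat struct {
		CPU struct {
			Total struct {
				Ticks int64 `json:"ticks"`
			} `json:"total"`
		} `json:"cpu"`
		Info struct {
			Name string `json:"name"`
		} `json:"info"`
		Memstats struct {
			RSS int64 `json:"rss"`
		} `json:"memstats"`
	} `json:"beat"`
}

// parseStats decodes a stats payload; unknown fields are ignored.
func parseStats(raw []byte) (stats, error) {
	var s stats
	err := json.Unmarshal(raw, &s)
	return s, err
}

func main() {
	// Trimmed copy of the response shown above.
	raw := []byte(`{"beat":{"cpu":{"total":{"ticks":7885}},"info":{"name":"elastic-agent/c"},"memstats":{"rss":158072832}}}`)
	s, err := parseStats(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s rss=%d bytes cpu_ticks=%d\n", s.Beat.Info.Name, s.Beat.Memstats.RSS, s.Beat.CPU.Total.Ticks)
}
```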

@pkoutsovasilis
Contributor Author

@cmacknz I think this is a quirk of curl and unix sockets. Invoking it as curl --unix-socket /usr/share/elastic-agent/data/tmp/jTgjFSdPDnYoazUu541NazzsJSv6daDn.sock http://localhost/stats works as expected:

# curl --unix-socket /usr/share/elastic-agent/data/tmp/jTgjFSdPDnYoazUu541NazzsJSv6daDn.sock http://localhost/stats
{"beat":{"cgroup":{"cpu":{"id":"/","stats":{"periods":0,"throttled":{"ns":0,"periods":0}}},"memory":{"id":"/","mem":{"usage":{"bytes":4232740864}}}},"cpu":{"system":{"ticks":6780,"time":{"ms":6780}},"total":{"ticks":34140,"time":{"ms":34140},"value":34140},"user":{"ticks":27360,"time":{"ms":27360}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":33},"info":{"ephemeral_id":"893ffa32-6758-4393-9e32-4f627378b4a1","name":"elastic-agent/c","uptime":{"ms":3293315},"version":"9.2.0"},"memstats":{"gc_next":57802386,"memory_alloc":32886168,"memory_sys":84828440,"memory_total":826911448,"rss":285278208},"runtime":{"goroutines":265}},"filebeat":{"harvester":{"closed":0,"gzip_closed":0,"gzip_open_files":0,"gzip_running":0,"gzip_started":0,"open_files":1,"running":1,"skipped":0,"started":1},"input":{"log":{"files":{"renamed":0,"truncated":0}}}},"libbeat":{"config":{"module":{"running":0,"starts":0,"stops":0},"reloads":0,"scans":0}},"processor":{"add_host_metadata":{"fqdn_lookup_failed":0}},"registrar":{"states":{"cleanup":0,"current":0,"update":0},"writes":{"fail":0,"success":0,"total":0}},"system":{"cpu":{"cores":12},"load":{"1":0.19,"15":0.29,"5":0.19,"norm":{"1":0.0158,"15":0.0242,"5":0.0158}}}}

So it seems that the host part of the URL is always extracted, even if it is an arbitrary value such as debug in your case or localhost in mine.
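
The same behaviour can be reproduced from Go, which may help when scripting checks against the monitoring socket. The sketch below is an assumption-laden stand-in, not the agent's code: startStubServer serves a fake /stats on a temporary unix socket, and newUnixClient dials that socket regardless of the host in the URL. Any host name works as long as the path is /stats, while a URL whose only component is the host (path /) gets a 404.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"path/filepath"
)

// newUnixClient returns an HTTP client that always dials socketPath,
// ignoring the host and port in the request URL.
func newUnixClient(socketPath string) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
}

// get fetches url through client and returns the status code and body.
func get(client *http.Client, url string) (int, string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

// startStubServer serves /stats on a temporary unix socket. It is a
// hypothetical stand-in for the collector's monitoring endpoint.
func startStubServer() (string, func(), error) {
	dir, err := os.MkdirTemp("", "sock")
	if err != nil {
		return "", nil, err
	}
	socketPath := filepath.Join(dir, "s.sock")
	ln, err := net.Listen("unix", socketPath)
	if err != nil {
		os.RemoveAll(dir)
		return "", nil, err
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/stats", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, `{"ok":true}`)
	})
	srv := &http.Server{Handler: mux}
	go srv.Serve(ln)
	return socketPath, func() { srv.Close(); os.RemoveAll(dir) }, nil
}

func main() {
	sock, stop, err := startStubServer()
	if err != nil {
		panic(err)
	}
	defer stop()
	client := newUnixClient(sock)
	// The host is arbitrary; only the path decides the outcome.
	for _, u := range []string{"http://localhost/stats", "http://debug/stats", "http://stats/"} {
		code, _, err := get(client, u)
		if err != nil {
			panic(err)
		}
		fmt.Println(u, code)
	}
}
```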

Member

@cmacknz cmacknz left a comment

Besides my inability to hit /stats locally despite the agent generating data, this LGTM. I did notice we are probably unnecessarily registering to process Windows service control events as a sub-process, but I'm not sure whether that is harmful.

@cmacknz
Member

cmacknz commented Sep 24, 2025

curl --unix-socket /usr/share/elastic-agent/data/tmp/jTgjFSdPDnYoazUu541NazzsJSv6daDn.sock http://localhost/stats

OK yes I remember this now, thanks that does what I expected.

Quality Gate failed

Failed conditions
36.3% Coverage on New Code (required ≥ 40%)

See analysis details on SonarQube

@elasticmachine
Collaborator

💚 Build Succeeded

History

cc @pkoutsovasilis

Contributor

@blakerouse blakerouse left a comment

Looks good.

@pkoutsovasilis pkoutsovasilis merged commit 9f15088 into elastic:main Sep 25, 2025
22 of 23 checks passed
mergify bot pushed a commit that referenced this pull request Sep 25, 2025
* feat: emit system resource metrics for EDOT subprocess

* ci: extend unit-tests to cover for the edot subprocess resource metrics stream

* feat: add standalone monitoring server in supervised EDOT

* feat: move otel execution mode feature flag to a separate package

* feat: rework otel config package to avoid globals

(cherry picked from commit 9f15088)

# Conflicts:
#	internal/pkg/agent/application/monitoring/v1_monitor.go
pkoutsovasilis added a commit that referenced this pull request Sep 25, 2025
…subprocess (#10142)

* feat: emit system resource metrics for EDOT subprocess (#10003)

* feat: emit system resource metrics for EDOT subprocess

* ci: extend unit-tests to cover for the edot subprocess resource metrics stream

* feat: add standalone monitoring server in supervised EDOT

* feat: move otel execution mode feature flag to a separate package

* feat: rework otel config package to avoid globals

(cherry picked from commit 9f15088)

# Conflicts:
#	internal/pkg/agent/application/monitoring/v1_monitor.go

* fix: resolve conflicts

---------

Co-authored-by: Panos Koutsovasilis <[email protected]>
v1v added a commit that referenced this pull request Sep 26, 2025
* upstream: (505 commits)
  Update journald tests now that Filebeat supports watching folders (#10131)
  [deploy/kubernetes]: add info about hostPID for Universal Profiling (#10173)
  Fall back to process runtime if otel runtime is unsupported (#10087)
  Conditionall check for ms_tls13kdf build tag (#10160)
  [docs][edot] add entry for profiles (#10163)
  edot/docs: add support for profiles (#10146)
  Add Logstash exporter (#10137)
  Add back publish to serverless. (#10159)
  Improve Integration test documentation (#10155)
  Fix multiarch service image push from main to serverless (#10129)
  Forward migrate action to endpoint (#9801)
  Comment out check for ms_tls13kdf tag for FIPS-capable binaries (#10148)
  [otel] add receivers: apache, iis, mysql, postgresql, sqlserver v0.135.0 (#9344)
  Add k8sevents receiver in kube-stack (#10086)
  feat: emit system resource metrics for EDOT subprocess (#10003)
  [AutoOps] Configure OTel Exporter to Send Maximum-sized Batches (#10126)
  keep enrollment token when replacing data with signed (#10115)
  Revert "Publish `elastic-agent-service` container directly to serverless from main (#9583)" (#10127)
  Add agent_policy_id and policy_revision_idx to checkin requests (#9931)
  remove resource/k8s processor and use k8sattributes processor for service attributes (#10108)
  ...
Labels
backport-8.19 Automated backport to the 8.19 branch skip-changelog Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
6 participants