feat: emit system resource metrics for EDOT subprocess #10003
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
From what I know/remember, I think this metrics.elastic-agent/collector-* index will have to be in the list of well-known monitoring indices in Fleet defined here in Kibana, and it'll also have to be defined as a data stream in the elastic_agent package. I do not have the entire history on why the monitoring indices need to be defined in Kibana, but I don't think we'll have Fleet-managed output API keys generated with permissions to write to the new data stream without that.
Edit: The test that would prove the need for this is to make sure this data stream can be written to for a Fleet-managed agent.
There is an existing test that I think could be easily updated to prove this works e2e for Fleet-managed agents, by expecting the elastic-agent/collector component to have system metrics populated: elastic-agent/testing/integration/ess/metrics_monitoring_test.go, lines 95 to 103 in 58460e7.
Look, the resource metrics for the EDOT subprocess are being written to the existing elastic-agent.elastic-agent dataset, so we don’t need to make any changes on the Kibana side. I didn’t do anything special, and the metrics show up correctly in the Agent Metrics dashboard.
As discussed offline with @cmacknz, since this setting is short-lived, we’ll handle the end-to-end testing in a follow-up PR when subprocess execution mode becomes the default.
LGTM overall, one comment about monitoring config reloading and a question about the new monitoring server.
Changes look good to me, waiting for ci to be green
@kaanyalti the CI won't get green as the
which I am not quite sure what causes, but I am fairly confident it isn't an error due to the functionality introduced by this PR
Not going to block this PR from merging as it looks good, other than my 1 inline comment.
Built locally and can see it's shipping metrics 👍 I did notice when inspecting diagnostics that we don't get profiles from the collector sub-process, so I created #10135 to track including these separately from this PR and added it to the parent issue for the sub-process conversion.
Thanks for creating the tracking issue. In my mind, this could/should be covered by the EDOT diagnostics extension PR here.
OK, I'm not as concerned about which piece of work it is part of, more with ensuring it gets implemented.
Out of curiosity I tried to hit the unix socket myself, and I'm getting a 404 on
@cmacknz I think that this is a quirk of
# curl --unix-socket /usr/share/elastic-agent/data/tmp/jTgjFSdPDnYoazUu541NazzsJSv6daDn.sock http://localhost/stats
{"beat":{"cgroup":{"cpu":{"id":"/","stats":{"periods":0,"throttled":{"ns":0,"periods":0}}},"memory":{"id":"/","mem":{"usage":{"bytes":4232740864}}}},"cpu":{"system":{"ticks":6780,"time":{"ms":6780}},"total":{"ticks":34140,"time":{"ms":34140},"value":34140},"user":{"ticks":27360,"time":{"ms":27360}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":33},"info":{"ephemeral_id":"893ffa32-6758-4393-9e32-4f627378b4a1","name":"elastic-agent/c","uptime":{"ms":3293315},"version":"9.2.0"},"memstats":{"gc_next":57802386,"memory_alloc":32886168,"memory_sys":84828440,"memory_total":826911448,"rss":285278208},"runtime":{"goroutines":265}},"filebeat":{"harvester":{"closed":0,"gzip_closed":0,"gzip_open_files":0,"gzip_running":0,"gzip_started":0,"open_files":1,"running":1,"skipped":0,"started":1},"input":{"log":{"files":{"renamed":0,"truncated":0}}}},"libbeat":{"config":{"module":{"running":0,"starts":0,"stops":0},"reloads":0,"scans":0}},"processor":{"add_host_metadata":{"fqdn_lookup_failed":0}},"registrar":{"states":{"cleanup":0,"current":0,"update":0},"writes":{"fail":0,"success":0,"total":0}},"system":{"cpu":{"cores":12},"load":{"1":0.19,"15":0.29,"5":0.19,"norm":{"1":0.0158,"15":0.0242,"5":0.0158}}}}
so it seems that the domain part is extracted even if it is a random value, such as debug in your case or localhost in mine.
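For anyone who wants to reproduce the observation above without curl, here is a minimal Go sketch. It is not taken from the PR: the helper names (newUnixClient, fetchRSS, demo) are hypothetical, and a throwaway server on a temporary socket stands in for the agent's monitoring socket. It only illustrates that when the connection is dialed to a unix socket, the host name in the URL does not decide where the request goes, and it reads beat.memstats.rss from a /stats-shaped payload.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"path/filepath"
)

// newUnixClient (hypothetical helper) returns an HTTP client whose connections
// always go to the unix socket at socketPath, whatever host the URL names.
func newUnixClient(socketPath string) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
}

// fetchRSS requests /stats using the given placeholder host and returns
// the beat.memstats.rss field of the JSON response.
func fetchRSS(c *http.Client, host string) (uint64, error) {
	resp, err := c.Get("http://" + host + "/stats")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var stats struct {
		Beat struct {
			Memstats struct {
				RSS uint64 `json:"rss"`
			} `json:"memstats"`
		} `json:"beat"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		return 0, err
	}
	return stats.Beat.Memstats.RSS, nil
}

// demo serves a canned /stats payload on a temporary unix socket (a stand-in
// for the real monitoring socket) and queries it under two host names.
func demo() (uint64, uint64, error) {
	dir, err := os.MkdirTemp("", "stats")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	ln, err := net.Listen("unix", filepath.Join(dir, "s.sock"))
	if err != nil {
		return 0, 0, err
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/stats", func(w http.ResponseWriter, _ *http.Request) {
		io.WriteString(w, `{"beat":{"memstats":{"rss":285278208}}}`)
	})
	srv := &http.Server{Handler: mux}
	go srv.Serve(ln)
	defer srv.Close()

	client := newUnixClient(ln.Addr().String())
	a, err := fetchRSS(client, "localhost")
	if err != nil {
		return 0, 0, err
	}
	b, err := fetchRSS(client, "debug")
	return a, b, err
}

func main() {
	a, b, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(a, b)
}
```

Both calls reach the same handler and return the same value, since the dialer ignores the host and port it is handed. Pointing newUnixClient at a real agent socket path should behave the same way, though that has not been verified here.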
Besides my inability to hit /stats locally despite the agent generating data, this LGTM. I did notice we are probably unnecessarily registering to process Windows service control events as a sub-process, but I'm not sure whether that is harmful.
OK yes I remember this now, thanks that does what I expected. |
💚 Build Succeeded
Looks good.
* feat: emit system resource metrics for EDOT subprocess
* ci: extend unit-tests to cover for the edot subprocess resource metrics stream
* feat: add standalone monitoring server in supervised EDOT
* feat: move otel execution mode feature flag to a separate package
* feat: rework otel config package to avoid globals
(cherry picked from commit 9f15088)
# Conflicts:
# internal/pkg/agent/application/monitoring/v1_monitor.go
…subprocess (#10142)
* feat: emit system resource metrics for EDOT subprocess (#10003)
* feat: emit system resource metrics for EDOT subprocess
* ci: extend unit-tests to cover for the edot subprocess resource metrics stream
* feat: add standalone monitoring server in supervised EDOT
* feat: move otel execution mode feature flag to a separate package
* feat: rework otel config package to avoid globals
(cherry picked from commit 9f15088)
# Conflicts:
# internal/pkg/agent/application/monitoring/v1_monitor.go
* fix: resolve conflicts
---------
Co-authored-by: Panos Koutsovasilis <[email protected]>
What does this PR do?
This PR introduces support for emitting system resource metrics for the EDOT (Elastic Distribution of OpenTelemetry) collector when it runs as a subprocess of the Elastic Agent.
Key changes:
- Adds the features.agent.otel.subprocess_execution feature flag to control whether the OTel collector runs in subprocess execution mode. It is false for now (maintaining existing behaviour), but is expected to default to true in the imminent future.
- Adds a metrics stream for elastic-agent/collector, capturing only its system resources usage.
- Updates OTelManager construction to honor the execution mode parsed from the feature flags, rather than always running in embedded mode.
Why is it important?
Running the collector as a subprocess improves resilience by isolating the control plane from the data plane. Emitting metrics for the EDOT process ensures operational visibility, allowing users to observe and troubleshoot its resource usage independently from the main Elastic Agent process.
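The flag-to-mode selection described under the key changes can be sketched roughly as follows. This is an illustration only, not the agent's actual code: ExecutionMode, lookupBool and modeFromFlags are hypothetical names, and the configuration is modelled as a plain nested map keyed by the dotted path given in the description.

```go
package main

import "fmt"

// ExecutionMode is a hypothetical type naming how the collector is run.
type ExecutionMode string

const (
	ModeEmbedded   ExecutionMode = "embedded"
	ModeSubprocess ExecutionMode = "subprocess"
)

// lookupBool walks a nested map along path and returns the boolean stored
// there, or def when any key is missing or the value is not a bool.
func lookupBool(cfg map[string]any, def bool, path ...string) bool {
	var cur any = cfg
	for _, key := range path {
		m, ok := cur.(map[string]any)
		if !ok {
			return def
		}
		cur, ok = m[key]
		if !ok {
			return def
		}
	}
	if b, ok := cur.(bool); ok {
		return b
	}
	return def
}

// modeFromFlags picks the execution mode from the feature flag, keeping the
// embedded mode as the default when the flag is absent or false.
func modeFromFlags(cfg map[string]any) ExecutionMode {
	if lookupBool(cfg, false, "features", "agent", "otel", "subprocess_execution") {
		return ModeSubprocess
	}
	return ModeEmbedded
}

func main() {
	enabled := map[string]any{
		"features": map[string]any{
			"agent": map[string]any{
				"otel": map[string]any{"subprocess_execution": true},
			},
		},
	}
	fmt.Println(modeFromFlags(map[string]any{})) // flag absent: embedded
	fmt.Println(modeFromFlags(enabled))          // flag set: subprocess
}
```

Defaulting to the embedded mode on any missing or malformed value mirrors the stated intent of keeping existing behaviour until the flag's default is flipped.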
Checklist
- Entry in ./changelog/fragments using the changelog tool
Disruptive User Impact
No disruptive impact expected.
- When features.agent.otel.subprocess_execution is false (default), behavior is unchanged.
- When it is enabled, metrics are additionally emitted for elastic-agent/collector.
How to test this PR locally
elastic-agent.yml:
Related issues
N/A