[Feature] Self Protection for Java Agent #13855
Comments
To avoid significant performance degradation, we applied 1/1000 sampling to the data-collection logic above, so the collected data is not completely accurate, but it does reflect the approximate trend of resource consumption. For example, if there is a memory leak, the memory-related self-monitoring data grows continuously.
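The 1/1000 sampling mentioned above could be implemented with a simple shared counter; this is only an illustrative sketch (the class and method names are hypothetical, not the actual implementation):

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: sample roughly one in every N executions of the monitoring logic. */
final class SelfMonitorSampler {
  private final int rate;
  private final AtomicLong counter = new AtomicLong();

  SelfMonitorSampler(int rate) {
    this.rate = rate;
  }

  /** Returns true once every {@code rate} calls; the expensive measurement runs only then. */
  boolean shouldSample() {
    return counter.incrementAndGet() % rate == 0;
  }
}
```

A counter-based scheme is deterministic and cheap (one atomic increment per call); a random-based scheme would avoid any periodic bias at the cost of a random-number draw.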
Relevant code for this issue is being sorted out, and we may share more performance issues with the community after we analyze the self-monitoring data.
From experience it is quite challenging to effectively measure the overhead of the instrumentation agent for the following reasons:
I would really be happy if we could find a way to effectively measure and limit overhead at runtime. In practice, however, an external measurement with a reproducible synthetic load often yields sufficient results to estimate the overhead on a given application, and that is not something we can easily ask of every user of the instrumentation. I think that in practice there are two complementary aspects of performance overhead that users are often (and for very good reasons) worried about:
On the topic of minimizing overhead: in the Elastic APM Java agent we implemented object accounting and recycling as an attempt to minimize memory allocation. While it does provide some benefits on the allocation and GC overhead side, the extra management required is definitely tricky and error-prone to maintain. For example, doing the accounting and measurement of each enter/exit advice execution here could easily become non-trivial to implement. Also in the Elastic APM Java agent, we implemented a rather simple but effective "circuit breaker" that makes the agent self-disable tracing and reclaim any memory used by internal queues at runtime when CPU usage or GC execution time exceeds a certain threshold. Here, high resource usage is used as a proxy for an "application is under heavy load" scenario. While that is not strictly equivalent to what you suggest with strict resource-usage limits, implementing it can provide some extra safety against agent overhead when the application is under load. The underlying assumption is that using available resources when the application is not under load is generally not an issue, much like Linux aggressively uses available memory for the filesystem cache.
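The circuit-breaker idea described above can be sketched as a periodically polled state machine with hysteresis, so tracing does not flap on and off around the threshold. This is an illustrative sketch, not the Elastic agent's actual implementation; the class name, thresholds, and metric supplier are assumptions:

```java
import java.util.function.DoubleSupplier;

/** Hypothetical circuit breaker: pauses tracing while a resource metric is above a threshold. */
final class TracingCircuitBreaker {
  private final DoubleSupplier cpuUsage;   // e.g. process CPU fraction in [0, 1]
  private final double openThreshold;      // stop tracing at or above this value
  private final double closeThreshold;     // resume tracing at or below this value (hysteresis)
  private volatile boolean open;

  TracingCircuitBreaker(DoubleSupplier cpuUsage, double openThreshold, double closeThreshold) {
    this.cpuUsage = cpuUsage;
    this.openThreshold = openThreshold;
    this.closeThreshold = closeThreshold;
  }

  /** Called periodically by a scheduler; flips state with hysteresis to avoid flapping. */
  void poll() {
    double usage = cpuUsage.getAsDouble();
    if (!open && usage >= openThreshold) {
      open = true;   // a real agent would also drain its internal queues here
    } else if (open && usage <= closeThreshold) {
      open = false;  // resume tracing
    }
  }

  boolean tracingEnabled() {
    return !open;
  }
}
```

The two-threshold design matters: with a single threshold, a load hovering near the limit would toggle tracing on every poll.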
Thanks for your detailed response. As you mentioned, accurately calculating the resource overhead of an agent is difficult and can itself introduce significant overhead. So we essentially just calculate a resource-usage trend (approximated via sampling and other mechanisms), and only trigger the self-protection degradation of the probe if we find that resource usage is very far from expectations (e.g., the probe's memory keeps going up, or CPU usage exceeds a very large value, which suggests there may be a problem with the logic of the probe code itself). I've also looked at the Elastic javaagent implementation before, and there's a problem: with an object pool, many objects may be promoted to the tenured generation, and full GC may then bring more problems. How did the Elastic Java agent solve this?
The amount of memory used is usually proportional to the number of concurrent transactions/spans, so if the load on the application is stable, it is usually a more or less fixed amount. With a "strict" notion of object pooling, keeping strong references can effectively create what you describe: objects promoted to tenured space, and once that amount gets large, every GC execution might take more time. This is especially problematic for a batch application with a very spiky load profile. For those object pools, we always put a hard limit on the number of objects, and the pooled objects are usually very simple, without a complex object-reference graph, which probably explains why we don't see this issue. I think we could also use weak references for objects while they sit idle in the pool, so the GC would not be prevented from collecting them when needed. With that, the memory used for in-flight spans would still be directly referenced, and the available objects in the pool would act as an "available objects cache" that is automatically cleared by the GC under memory pressure; a mix of hard and weak references could also be relevant here.
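The idea above (a hard size limit plus weak references for idle pooled objects) can be sketched as follows. This is a minimal illustration of the technique, not the Elastic agent's actual pool; all names are hypothetical:

```java
import java.lang.ref.WeakReference;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/**
 * Hypothetical bounded pool that holds idle objects only through weak references,
 * so the GC can reclaim them under memory pressure. In-flight objects handed out
 * by acquire() are strongly referenced by their users, as described above.
 */
final class WeaklyPooled<T> {
  private final Deque<WeakReference<T>> idle = new ArrayDeque<>();
  private final Supplier<T> factory;
  private final int maxIdle;

  WeaklyPooled(Supplier<T> factory, int maxIdle) {
    this.factory = factory;
    this.maxIdle = maxIdle;
  }

  synchronized T acquire() {
    WeakReference<T> ref;
    while ((ref = idle.poll()) != null) {
      T obj = ref.get();
      if (obj != null) {
        return obj;         // reuse an idle object the GC has not reclaimed
      }
      // referent was collected; drop the stale reference and keep looking
    }
    return factory.get();   // pool empty or fully collected: allocate a new object
  }

  synchronized void release(T obj) {
    if (idle.size() < maxIdle) {  // hard limit on idle entries
      idle.add(new WeakReference<>(obj));
    }
    // above the limit, the object is simply dropped and left to the GC
  }
}
```

Under memory pressure the GC clears the idle `WeakReference`s, so the pool degrades gracefully to plain allocation instead of pinning objects in the tenured generation.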
Instrumenting the instrumentation isn't going to make things better. 🙃
One problem with the "things are running too hot, let's taper down the instrumentation..." (or even the killswitch approach) is that it is EXACTLY those moments when you often need observability the most. Losing visibility when things get interesting is a surefire way to frustrate o11y practitioners.
While this is still instrumentation of instrumentation, the high-ratio sampling is super interesting to me. I think the same challenges of keeping overhead low, fine-tuning, and avoiding memory leaks in the instrumentation of the instrumentation apply... but it seems novel and I am interested in seeing it in practice (even as a prototype!).
Is your feature request related to a problem? Please describe.
Related to some performance issues like:
#13720
Describe the solution you'd like
Self Protection of Java Agent
Background
The Java agent's code is woven into the user's code, so it will inevitably affect the user's business logic in some scenarios. To control and minimize the impact of the Java agent on the user's business code, we need to collect the resources occupied by the Java agent and limit the resources it consumes.
Self-monitoring of resource consumption
CPU usage of instrumentation code
For the `Instrumenter` start/end methods: `(ot2 - ot1) / (t2 - t1)`
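One way to obtain such a ratio is to bracket the instrumentation code with thread CPU time readings and relate the accumulated agent CPU time to a wall-clock window. The sketch below is an assumption about how this could look (class and method names are illustrative), using the standard `ThreadMXBean` API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch: accumulate the CPU time spent in (sampled) executions of
 * instrumentation code, then report it as a fraction of a wall-clock window.
 */
final class AgentCpuAccounting {
  private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();
  private final AtomicLong agentCpuNanos = new AtomicLong();

  /** Wraps one sampled execution of instrumentation code; returns the CPU nanos it used. */
  long runAccounted(Runnable instrumentationCode) {
    long ot1 = THREADS.getCurrentThreadCpuTime();  // CPU time before agent code
    instrumentationCode.run();
    long ot2 = THREADS.getCurrentThreadCpuTime();  // CPU time after agent code
    long spent = ot2 - ot1;
    agentCpuNanos.addAndGet(spent);
    return spent;
  }

  /** Fraction of the wall-clock window [t1, t2] (nanos) spent in agent code. */
  double usageRatio(long t1, long t2) {
    return (double) agentCpuNanos.get() / (t2 - t1);
  }
}
```

Note that `getCurrentThreadCpuTime()` itself has a cost (typically a syscall on HotSpot), which is exactly why the 1/1000 sampling mentioned earlier is needed.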
Number of instrumentation code executions
Here we simply count the number of times the start and end methods of the `Instrumenter` are executed.
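This counting step could be as simple as a pair of `LongAdder`s (chosen over `AtomicLong` to keep contention low on hot paths); the class and method names below are illustrative:

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical counters for Instrumenter start/end executions. */
final class InstrumentationCounters {
  static final LongAdder STARTS = new LongAdder();
  static final LongAdder ENDS = new LongAdder();

  static void onStart() {
    STARTS.increment();  // called from the Instrumenter start path
  }

  static void onEnd() {
    ENDS.increment();    // called from the Instrumenter end path
  }
}
```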
Memory
To estimate the memory occupied by the Java agent, we approximate it as the memory occupied by the Context (the Context can be thought of as the root of a request) and the objects it references, plus the memory occupied by the various caches.
Calculate the memory consumed by a Java Object
We use a BFS over the object graph to calculate the memory consumed by the Context.
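A BFS over an object graph can be sketched with reflection and an identity-based visited set (to handle shared references and cycles). This is only an illustration of the traversal: the per-object cost here is a rough constant, whereas a real implementation would use actual JVM object layout, e.g. `Instrumentation.getObjectSize`:

```java
import java.lang.reflect.Array;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Queue;
import java.util.Set;

/** Hypothetical BFS deep-size estimator; SHALLOW_ESTIMATE is an assumed constant. */
final class DeepSizeEstimator {
  private static final long SHALLOW_ESTIMATE = 16; // rough bytes per object (assumption)

  static long estimate(Object root) {
    Set<Object> visited = Collections.newSetFromMap(new IdentityHashMap<>());
    Queue<Object> queue = new ArrayDeque<>();
    visited.add(root);
    queue.add(root);
    long total = 0;
    while (!queue.isEmpty()) {
      Object current = queue.poll();
      total += SHALLOW_ESTIMATE;
      if (current.getClass().isArray()) {
        if (!current.getClass().getComponentType().isPrimitive()) {
          for (int i = 0; i < Array.getLength(current); i++) {
            enqueue(Array.get(current, i), visited, queue);
          }
        }
        continue;
      }
      // walk all non-static reference fields, including inherited ones
      for (Class<?> c = current.getClass(); c != null; c = c.getSuperclass()) {
        for (Field f : c.getDeclaredFields()) {
          if (f.getType().isPrimitive() || Modifier.isStatic(f.getModifiers())) {
            continue;
          }
          try {
            f.setAccessible(true);
            enqueue(f.get(current), visited, queue);
          } catch (Exception e) {
            // fields made inaccessible by the module system are skipped in this sketch
          }
        }
      }
    }
    return total;
  }

  private static void enqueue(Object o, Set<Object> visited, Queue<Object> queue) {
    if (o != null && visited.add(o)) {
      queue.add(o);
    }
  }
}
```

The identity set is essential: without it, cycles (common in context/span structures) would make the BFS loop forever, and shared objects would be double-counted.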
Update the memory size when the object is destroyed
Use a ReferenceQueue with a WeakReference to update the memory size when the object is destroyed. (We do not use the JDK Cleaner or PhantomReference because they have a bug in JDK 8 that may affect GC timing, which may lead to a memory leak.)
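The ReferenceQueue approach could look like the sketch below: each tracked object gets a `WeakReference` subclass that carries the estimated size, and a periodic drain subtracts the sizes of collected objects from the running total. The names and structure are assumptions for illustration:

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: maintain a live-bytes total that shrinks when the GC collects objects. */
final class TrackedMemory {
  /** WeakReference subclass carrying the size to subtract when the referent is collected. */
  static final class SizedRef extends WeakReference<Object> {
    final long bytes;

    SizedRef(Object referent, long bytes, ReferenceQueue<Object> q) {
      super(referent, q);
      this.bytes = bytes;
    }
  }

  private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
  private final AtomicLong liveBytes = new AtomicLong();

  /** Registers an object's estimated size and adds it to the live total. */
  SizedRef track(Object obj, long bytes) {
    liveBytes.addAndGet(bytes);
    return new SizedRef(obj, bytes, queue);
  }

  /** Drains references the GC has enqueued, subtracting their sizes; call periodically. */
  long drainAndGet() {
    SizedRef ref;
    while ((ref = (SizedRef) queue.poll()) != null) {
      liveBytes.addAndGet(-ref.bytes);
    }
    return liveBytes.get();
  }
}
```

Because the size is stored on the reference object itself, no separate map from object to size is needed, and nothing in this scheme strongly references the tracked Context.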
Self Protection Approach
With the data we collected above, we can implement self-protection by changing the behavior of `Instrumenter#shouldStart`:
Limit the maximum CPU usage percentage of the Java Agent
Limit the maximum Memory usage percentage of the Java Agent
Limit the maximum QPS of the Java Agent (we can limit the rate of span generation, either overall or at the entry point (`PropagatingFromUpstreamInstrumenter`))
When a self-protection condition is triggered, data may be lost, but the user's business will not be greatly affected.
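The three limits above could be combined into a single gate consulted from `shouldStart`. The sketch below is an assumption about the shape of such a gate (thresholds, window logic, and names are illustrative, not an existing OpenTelemetry API); the QPS limit uses a simple fixed one-second window:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical self-protection gate: refuses new spans when any limit is exceeded. */
final class SelfProtection {
  private final double maxCpuRatio;
  private final long maxMemoryBytes;
  private final long maxSpansPerSecond;

  private final AtomicLong windowStartNanos = new AtomicLong(System.nanoTime());
  private final AtomicLong spansInWindow = new AtomicLong();

  SelfProtection(double maxCpuRatio, long maxMemoryBytes, long maxSpansPerSecond) {
    this.maxCpuRatio = maxCpuRatio;
    this.maxMemoryBytes = maxMemoryBytes;
    this.maxSpansPerSecond = maxSpansPerSecond;
  }

  /** Consulted from shouldStart with the latest self-monitoring readings. */
  boolean allowStart(double agentCpuRatio, long agentMemoryBytes, long nowNanos) {
    if (agentCpuRatio > maxCpuRatio || agentMemoryBytes > maxMemoryBytes) {
      return false;  // CPU or memory limit exceeded: stop producing spans
    }
    // reset the QPS counter when the 1-second window rolls over
    long start = windowStartNanos.get();
    if (nowNanos - start >= 1_000_000_000L
        && windowStartNanos.compareAndSet(start, nowNanos)) {
      spansInWindow.set(0);
    }
    return spansInWindow.incrementAndGet() <= maxSpansPerSecond;
  }
}
```

Because `shouldStart` already exists as a decision point in the `Instrumenter` pipeline, gating there means a refused request simply skips span creation without otherwise disturbing the business code path.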
Describe alternatives you've considered
No response
Additional context
This issue was discussed with @trask on May 15, 2025 (UTC) - APAC