
[Feature] Self Protection for Java Agent #13855


Open
123liuziming opened this issue May 15, 2025 · 6 comments
Labels
enhancement (New feature or request), needs triage (New issue that requires triage)

Comments

@123liuziming
Contributor

Is your feature request related to a problem? Please describe.

Related to performance issues such as #13720.

Describe the solution you'd like

Self Protection of Java Agent

Background

The Java agent's code is woven into the user's code, so it will inevitably affect the user's business logic in some scenarios. To control and minimize the impact of the Java agent on the user's business code, we need to collect the resources occupied by the Java agent and limit the resources it consumes.

Self-monitoring of resource consumption

CPU usage of instrumentation code

  1. Collect the CPU time consumed by the instrumenter start/end methods:

private final ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();

long t1 = threadMXBean.getCurrentThreadCpuTime();
// instrumenter start/end
long t2 = threadMXBean.getCurrentThreadCpuTime();
// t2 - t1 is the CPU time consumed by the instrumenter code

  2. Collect the CPU time consumed by the whole JVM process:

private final OperatingSystemMXBean osBean = (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

long ot1 = osBean.getProcessCpuTime();
long ot2 = osBean.getProcessCpuTime();
// ot2 - ot1 is the CPU time consumed by the JVM process

  3. Compute the Java agent's CPU usage as (t2 - t1) / (ot2 - ot1).
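
Putting the two measurements together might look roughly like the sketch below; the class and method names (AgentCpuAccounting, measure, agentCpuShare) are hypothetical, not existing agent code, and the accounting would of course be subject to the sampling discussed later in this issue.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

import com.sun.management.OperatingSystemMXBean;

// Hypothetical helper: accumulates the CPU time spent inside instrumenter
// start/end calls and compares it against the total process CPU time.
final class AgentCpuAccounting {
  private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();
  private static final OperatingSystemMXBean OS =
      (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

  private static final LongAdder agentCpuNanos = new LongAdder();

  // Wrap an instrumenter start/end call and record the thread CPU time it used.
  static <T> T measure(Supplier<T> instrumenterCall) {
    long before = THREADS.getCurrentThreadCpuTime();
    try {
      return instrumenterCall.get();
    } finally {
      agentCpuNanos.add(THREADS.getCurrentThreadCpuTime() - before);
    }
  }

  // Fraction of process CPU time attributable to the agent over an interval:
  // (agent CPU delta) / (process CPU delta).
  static double agentCpuShare(long agentNanosBefore, long processNanosBefore) {
    long agentDelta = agentCpuNanos.sum() - agentNanosBefore;
    long processDelta = OS.getProcessCpuTime() - processNanosBefore;
    return processDelta <= 0 ? 0.0 : (double) agentDelta / processDelta;
  }

  static long agentCpuNanos() { return agentCpuNanos.sum(); }

  static long processCpuNanos() { return OS.getProcessCpuTime(); }
}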

Number of instrumentation code executions

Here we simply count the number of times the start and end methods of the instrumenter are executed.
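
As a minimal sketch (names hypothetical), the counters could simply be LongAdders incremented from the start/end call sites:

import java.util.concurrent.atomic.LongAdder;

// Hypothetical counters for instrumenter start/end invocations.
final class InstrumenterInvocationCounter {
  static final LongAdder starts = new LongAdder();
  static final LongAdder ends = new LongAdder();

  static void onStart() { starts.increment(); }
  static void onEnd() { ends.increment(); }
}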

Memory

To estimate the memory occupied by the Java agent, we approximate it as the memory occupied by the Context (which can be thought of as the root of the request) and the objects it references, plus the memory occupied by the various caches.

Calculate the memory consumed by a Java Object


We use a BFS over the object graph to calculate the memory consumed by the Context.

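A rough sketch of such a traversal, assuming the agent can use the Instrumentation instance from premain for per-object sizes (all names are hypothetical; a real implementation would need cut-offs and would only run on sampled requests):

import java.lang.instrument.Instrumentation;
import java.lang.reflect.Array;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Queue;
import java.util.Set;

// Hypothetical deep-size estimate: BFS over the reference graph starting
// from a Context, summing Instrumentation#getObjectSize for each object
// visited exactly once.
final class DeepSizeEstimator {
  static long deepSize(Object root, Instrumentation inst) {
    if (root == null) {
      return 0;
    }
    Set<Object> visited = Collections.newSetFromMap(new IdentityHashMap<>());
    Queue<Object> queue = new ArrayDeque<>();
    queue.add(root);
    long total = 0;
    while (!queue.isEmpty()) {
      Object current = queue.poll();
      if (!visited.add(current)) {
        continue;  // already counted this object
      }
      total += inst.getObjectSize(current);
      Class<?> type = current.getClass();
      if (type.isArray()) {
        if (!type.getComponentType().isPrimitive()) {
          for (int i = 0; i < Array.getLength(current); i++) {
            Object element = Array.get(current, i);
            if (element != null) {
              queue.add(element);
            }
          }
        }
        continue;
      }
      for (Class<?> c = type; c != null; c = c.getSuperclass()) {
        for (Field field : c.getDeclaredFields()) {
          if (Modifier.isStatic(field.getModifiers()) || field.getType().isPrimitive()) {
            continue;
          }
          try {
            field.setAccessible(true);
            Object value = field.get(current);
            if (value != null) {
              queue.add(value);
            }
          } catch (Exception e) {
            // skip fields we cannot read (e.g. module access restrictions)
          }
        }
      }
    }
    return total;
  }
}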

Update the memory size when the object is destroyed

Use a ReferenceQueue with a WeakReference to update the memory size when the object is destroyed (we do not use the JDK Cleaner or PhantomReference because they have a bug in JDK 8 that can affect GC timing and lead to a memory leak).

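A minimal sketch of that bookkeeping (names hypothetical): the WeakReference subclass remembers the estimated size of its referent, and the size is subtracted once the reference shows up on the queue.

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical tracker: each tracked Context gets a WeakReference that
// remembers its estimated size; when the referent is collected, the
// reference is enqueued and the size is subtracted again.
final class TrackedMemory {
  private static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
  private static final Set<SizedReference> PENDING = ConcurrentHashMap.newKeySet();
  private static final AtomicLong trackedBytes = new AtomicLong();

  private static final class SizedReference extends WeakReference<Object> {
    final long size;
    SizedReference(Object referent, long size) {
      super(referent, QUEUE);
      this.size = size;
    }
  }

  static void track(Object contextRoot, long estimatedSize) {
    // keep the reference itself strongly reachable until it is drained
    PENDING.add(new SizedReference(contextRoot, estimatedSize));
    trackedBytes.addAndGet(estimatedSize);
  }

  // Called periodically (or on each sampled measurement) to drain the queue.
  static void expungeCollected() {
    SizedReference ref;
    while ((ref = (SizedReference) QUEUE.poll()) != null) {
      PENDING.remove(ref);
      trackedBytes.addAndGet(-ref.size);
    }
  }

  static long currentTrackedBytes() {
    return trackedBytes.get();
  }
}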

Self Protection Approach

With the data collected above, we can implement self-protection by changing the behavior of Instrumenter#shouldStart:

  1. Limit the maximum CPU usage percentage of the Java Agent

  2. Limit the maximum Memory usage percentage of the Java Agent

  3. Limit the maximum QPS of the Java agent (we can limit the rate of span generation, either overall or at the entry point (PropagatingFromUpstreamInstrumenter))

When a self-protection condition triggers, some telemetry data may be lost, but the user's business will not be greatly affected.
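
As an illustration only (thresholds and names are hypothetical, not a concrete proposal), the guard consulted from Instrumenter#shouldStart could look like this:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical guard consulted from Instrumenter#shouldStart.
final class SelfProtectionGuard {
  private final double maxAgentCpuShare;   // e.g. 0.05 == 5% of process CPU
  private final long maxAgentMemoryBytes;  // e.g. 64 MiB of tracked objects
  private final long maxStartsPerSecond;   // crude QPS cap on span creation

  private final AtomicLong windowStartMillis = new AtomicLong(System.currentTimeMillis());
  private final AtomicLong startsInWindow = new AtomicLong();

  SelfProtectionGuard(double maxAgentCpuShare, long maxAgentMemoryBytes, long maxStartsPerSecond) {
    this.maxAgentCpuShare = maxAgentCpuShare;
    this.maxAgentMemoryBytes = maxAgentMemoryBytes;
    this.maxStartsPerSecond = maxStartsPerSecond;
  }

  boolean allowStart(double currentAgentCpuShare, long currentAgentMemoryBytes) {
    if (currentAgentCpuShare > maxAgentCpuShare) {
      return false;  // CPU limit exceeded
    }
    if (currentAgentMemoryBytes > maxAgentMemoryBytes) {
      return false;  // memory limit exceeded
    }
    // simple one-second window as a stand-in for a real rate limiter
    long now = System.currentTimeMillis();
    long windowStart = windowStartMillis.get();
    if (now - windowStart >= 1000 && windowStartMillis.compareAndSet(windowStart, now)) {
      startsInWindow.set(0);
    }
    return startsInWindow.incrementAndGet() <= maxStartsPerSecond;
  }
}

shouldStart would then simply return false while the guard denies, so the corresponding spans are dropped instead of adding load.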

Describe alternatives you've considered

No response

Additional context

The issue was discussed with @trask on May 15, 2025 (UTC) - APAC.

@123liuziming added the enhancement (New feature or request) and needs triage (New issue that requires triage) labels on May 15, 2025
@123liuziming
Contributor Author

In order to avoid significant performance degradation, we apply 1/1000 sampling to the data collection logic above, so the collected data is not completely accurate, but it reflects the approximate trend of resource consumption. For example, if there is a memory leak, the memory-related self-monitoring data grows continuously.
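
For illustration, the sampling gate could be as simple as the following (hypothetical name):

import java.util.concurrent.ThreadLocalRandom;

// Hypothetical gate: only ~1 in 1000 instrumenter invocations pays the
// cost of the self-monitoring measurements described above.
final class SelfMonitoringSampler {
  static boolean shouldSample() {
    return ThreadLocalRandom.current().nextInt(1000) == 0;
  }
}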

@123liuziming
Contributor Author

The relevant code for this issue is being sorted out, and we may share more performance issues with the community after we analyze the self-monitoring data.

@SylvainJuge
Contributor

From experience it is quite challenging to effectively measure the overhead of the instrumentation agent for the following reasons:

  • extra memory allocation by the instrumentation incurs overhead on the GC which can't directly be measured; at best we could estimate the amount of extra memory allocated, but not how much CPU time it added.
  • there is often some transient in-memory storage outside of the context for the duration of the span, using either thread-locals or virtual fields on the application objects.
  • there isn't any easy way to limit the memory usage by quantity; the best we could probably do is use a proxy like the number of objects, and that would require a way to properly enforce the accounting.
  • injecting extra classes creates some fixed-size overhead in the compiled-code memory area, which can't be observed in detail beyond size and usage.
  • sampling the overhead measurement on traces implies that the overhead is linearly related to the number of traces captured or the application load. In practice, though, there can be non-trivial processing outside of the span execution, for example in the exporters, which work asynchronously.

I would really be happy if we could find a way to effectively measure and limit overhead at runtime; however, in practice an external measurement with a reproducible synthetic load often yields sufficient results to estimate the overhead on a given application, which is not something we can easily ask of every user of the instrumentation.

I think that in practice there are two complementary aspects to the performance overhead that users are often (and for very good reasons) worried about:

  • having the ability to measure or estimate it without doing in-depth experimentation in production
  • having the ability to put an upper limit on this overhead, which includes, for example, being able to defend against a memory leak in the instrumentation.

On the topic of minimizing overhead, in the Elastic APM Java agent we implemented object accounting and recycling as an attempt to minimize memory allocation, and while it does provide some benefits on the allocation and GC overhead side, the extra management required is definitely tricky and error-prone to maintain. For example, here, doing the accounting and measurement of each enter/exit advice execution could definitely become non-trivial to implement.

Also, still in the Elastic APM Java agent, we implemented a rather simple but effective "circuit breaker" that makes the agent disable its own tracing and reclaim any memory used by internal queues at runtime when CPU usage or GC execution time is over a certain threshold. Here high resource usage is used as a proxy for the "application is under heavy load" scenario; while that is not strictly equivalent to what you suggest with strict resource usage limits, implementing it can provide some extra safety against agent overhead when the application is under load. The underlying assumption is that using available resources when not under load is generally not an issue, in a similar way to how Linux aggressively uses available memory for the filesystem cache.
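
For readers unfamiliar with the idea, a very simplified sketch of such a circuit breaker might look like this (this is not the actual Elastic implementation; all names and thresholds are illustrative):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

import com.sun.management.OperatingSystemMXBean;

// Simplified sketch of a circuit-breaker poller: tracing is paused while
// process CPU load or GC time share is above a threshold, and resumed once
// it drops back.
final class CircuitBreaker implements Runnable {
  private final OperatingSystemMXBean os =
      (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
  private final double cpuLoadThreshold;      // e.g. 0.95
  private final double gcTimeShareThreshold;  // e.g. 0.25 of wall-clock time

  private volatile boolean open;              // open == tracing disabled
  private long lastGcTimeMillis;
  private long lastWallClockMillis = System.currentTimeMillis();

  CircuitBreaker(double cpuLoadThreshold, double gcTimeShareThreshold) {
    this.cpuLoadThreshold = cpuLoadThreshold;
    this.gcTimeShareThreshold = gcTimeShareThreshold;
  }

  boolean tracingAllowed() {
    return !open;
  }

  @Override
  public void run() {  // scheduled periodically, e.g. every few seconds
    long now = System.currentTimeMillis();
    long gcTime = totalGcTimeMillis();
    double gcShare = (double) (gcTime - lastGcTimeMillis) / Math.max(1, now - lastWallClockMillis);
    lastGcTimeMillis = gcTime;
    lastWallClockMillis = now;

    double cpuLoad = os.getProcessCpuLoad();  // may be negative if unavailable
    open = cpuLoad > cpuLoadThreshold || gcShare > gcTimeShareThreshold;
  }

  private static long totalGcTimeMillis() {
    long total = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      total += Math.max(0, gc.getCollectionTime());
    }
    return total;
  }
}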

@123liuziming
Contributor Author


Thanks for your detailed response. As you mentioned, accurately calculating the resource overhead of an agent is difficult and can itself introduce significant overhead. So we are essentially just calculating a resource usage trend (approximated with sampling and other mechanisms), and we only trigger the self-protection degradation of the probe if we find that the resource usage is very far from what is expected (e.g., the probe's memory keeps going up, or CPU usage exceeds a very large value, which suggests that there may be a problem with the logic of the probe's own code).

I've also looked at the Elastic javaagent implementation before, and there's a problem: if you use an object pool, a lot of objects may go into the tenured generation, and full GC may bring more problems. How did the Elastic Java agent solve this?

@SylvainJuge
Contributor

I've also looked at the Elastic javaagent implementation before, and there's a problem: if you use an object pool, a lot of objects may go into the tenured generation, and full GC may bring more problems. How did the Elastic Java agent solve this?

The amount of memory used is usually proportional to the number of concurrent transactions/spans, so if the load on the application is stable then it's usually a more or less fixed amount.

If we use a rather "strict" notion of object pooling, keeping strong references can effectively create what you describe here: objects get promoted to tenured space, and when that amount gets large it might make every GC execution take more time. This is especially problematic when you have a batch application with a very spiky load profile.

For those object pools, we always put a hard limit on the number of objects, and the objects that are pooled are usually very simple objects that do not have a very complex object reference graph, which probably explains why we don't have such an issue.

I think here we could probably also use weak references for objects that are sitting idle in the pool, so the pool would not prevent the GC from collecting them when needed. With that, the memory used for in-flight spans would still be directly referenced, and the idle objects in the pool would act as an "available objects" cache that is automatically cleared by the GC under memory pressure; using a mix of hard and weak references could also be relevant here.
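
A small sketch of that idea (not existing agent code): idle pooled objects are held only through weak references, so the GC can reclaim them under memory pressure, while the pool still enforces a hard capacity limit.

import java.lang.ref.WeakReference;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.Supplier;

// Idle objects are weakly referenced; in-flight objects stay strongly
// referenced by their callers as usual.
final class WeaklyPooled<T> {
  private final ArrayBlockingQueue<WeakReference<T>> idle;
  private final Supplier<T> factory;

  WeaklyPooled(int maxIdle, Supplier<T> factory) {
    this.idle = new ArrayBlockingQueue<>(maxIdle);
    this.factory = factory;
  }

  T acquire() {
    WeakReference<T> ref;
    while ((ref = idle.poll()) != null) {
      T pooled = ref.get();
      if (pooled != null) {
        return pooled;  // reuse an object the GC has not reclaimed yet
      }
    }
    return factory.get();  // pool empty or everything was collected
  }

  void release(T object) {
    // offer() silently drops the object when the pool is at its hard limit
    idle.offer(new WeakReference<>(object));
  }
}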

@breedx-splk
Contributor

Instrumenting the instrumentation isn't going to make things better. 🙃

having the ability to put an upper limit on this overhead, which includes, for example, being able to defend against a memory leak in the instrumentation.

One problem with the "things are running too hot, let's taper down the instrumentation..." approach (or even the kill-switch approach) is that it is EXACTLY those moments when you often need observability the most. Losing visibility when things get interesting is a surefire way to frustrate o11y practitioners.

In order to avoid significant performance degradation, we apply 1/1000 sampling to the data collection logic above, so the collected data is not completely accurate, but it reflects the approximate trend of resource consumption. For example, if there is a memory leak, the memory-related self-monitoring data grows continuously.

While this is still instrumentation of instrumentation, the 1/1000 sampling is super interesting to me. I think the same pitfalls of keeping overhead low, fine-tuning, and avoiding memory leaks in the instrumentation instrumentation apply... but it seems novel and I am interested in seeing it in practice (even as a prototype!).
