
[Feature] Self Protection for Java Agent #13855


Open
123liuziming opened this issue May 15, 2025 · 6 comments
Labels
enhancement (New feature or request), needs triage (New issue that requires triage)

Comments

@123liuziming
Contributor

Is your feature request related to a problem? Please describe.

Related to performance issues such as #13720.

Describe the solution you'd like

Self Protection of Java Agent

Background

The Java agent's code is woven into the user's code, so it will inevitably affect the user's business logic in some scenarios. To control and minimize the impact of the Java agent on the user's business code, we need to collect the resources occupied by the Java agent and limit the resources it consumes.

Self-monitoring of resource consumption

CPU usage of instrumentation code

  1. Collect the CPU time consumed by the instrumenter start/end methods:

private final ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();

long t1 = threadMXBean.getCurrentThreadCpuTime();
// instrumenter start/end
long t2 = threadMXBean.getCurrentThreadCpuTime();
// t2 - t1 is the CPU time consumed by the instrumenter code

  2. Collect the CPU time consumed by the whole JVM process:

private final OperatingSystemMXBean osBean = (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

long ot1 = osBean.getProcessCpuTime();
long ot2 = osBean.getProcessCpuTime();
// ot2 - ot1 is the CPU time consumed by the JVM process

  3. Compute the Java agent's CPU usage as (t2 - t1) / (ot2 - ot1).
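
Putting the two measurements together might look roughly like the sketch below; the class and method names (AgentCpuAccounting, measure, agentCpuShare) are hypothetical, not existing agent code, and the accounting would of course be subject to the sampling discussed later in this issue.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

import com.sun.management.OperatingSystemMXBean;

// Hypothetical helper: accumulates the CPU time spent inside instrumenter
// start/end calls and compares it against the total process CPU time.
final class AgentCpuAccounting {
  private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();
  private static final OperatingSystemMXBean OS =
      (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

  private static final LongAdder agentCpuNanos = new LongAdder();

  // Wrap an instrumenter start/end call and record the thread CPU time it used.
  static <T> T measure(Supplier<T> instrumenterCall) {
    long before = THREADS.getCurrentThreadCpuTime();
    try {
      return instrumenterCall.get();
    } finally {
      agentCpuNanos.add(THREADS.getCurrentThreadCpuTime() - before);
    }
  }

  // Fraction of process CPU time attributable to the agent over an interval:
  // (agent CPU delta) / (process CPU delta).
  static double agentCpuShare(long agentNanosBefore, long processNanosBefore) {
    long agentDelta = agentCpuNanos.sum() - agentNanosBefore;
    long processDelta = OS.getProcessCpuTime() - processNanosBefore;
    return processDelta <= 0 ? 0.0 : (double) agentDelta / processDelta;
  }

  static long agentCpuNanos() { return agentCpuNanos.sum(); }

  static long processCpuNanos() { return OS.getProcessCpuTime(); }
}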

Number of instrumentation code executions

Here we simply count the number of times the start and end methods of the instrumenter are executed.
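
As a minimal sketch (names hypothetical), the counters could simply be LongAdders incremented from the start/end call sites:

import java.util.concurrent.atomic.LongAdder;

// Hypothetical counters for instrumenter start/end invocations.
final class InstrumenterInvocationCounter {
  static final LongAdder starts = new LongAdder();
  static final LongAdder ends = new LongAdder();

  static void onStart() { starts.increment(); }
  static void onEnd() { ends.increment(); }
}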

Memory

To estimate the memory occupied by the Java agent, we approximate it as the memory occupied by the Context (which can be thought of as the root of the request) and the objects it references, plus the memory occupied by the various caches.

Calculate the memory consumed by a Java Object


We use a BFS over the object graph to calculate the memory consumed by the Context.

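A rough sketch of such a traversal, assuming the agent can use the Instrumentation instance from premain for per-object sizes (all names are hypothetical; a real implementation would need cut-offs and would only run on sampled requests):

import java.lang.instrument.Instrumentation;
import java.lang.reflect.Array;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Queue;
import java.util.Set;

// Hypothetical deep-size estimate: BFS over the reference graph starting
// from a Context, summing Instrumentation#getObjectSize for each object
// visited exactly once.
final class DeepSizeEstimator {
  static long deepSize(Object root, Instrumentation inst) {
    if (root == null) {
      return 0;
    }
    Set<Object> visited = Collections.newSetFromMap(new IdentityHashMap<>());
    Queue<Object> queue = new ArrayDeque<>();
    queue.add(root);
    long total = 0;
    while (!queue.isEmpty()) {
      Object current = queue.poll();
      if (!visited.add(current)) {
        continue;  // already counted this object
      }
      total += inst.getObjectSize(current);
      Class<?> type = current.getClass();
      if (type.isArray()) {
        if (!type.getComponentType().isPrimitive()) {
          for (int i = 0; i < Array.getLength(current); i++) {
            Object element = Array.get(current, i);
            if (element != null) {
              queue.add(element);
            }
          }
        }
        continue;
      }
      for (Class<?> c = type; c != null; c = c.getSuperclass()) {
        for (Field field : c.getDeclaredFields()) {
          if (Modifier.isStatic(field.getModifiers()) || field.getType().isPrimitive()) {
            continue;
          }
          try {
            field.setAccessible(true);
            Object value = field.get(current);
            if (value != null) {
              queue.add(value);
            }
          } catch (Exception e) {
            // skip fields we cannot read (e.g. module access restrictions)
          }
        }
      }
    }
    return total;
  }
}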

Update the memory size when the object is destroyed

Use a ReferenceQueue with a WeakReference to update the memory size when the object is destroyed (we do not use the JDK Cleaner or PhantomReference because they have a bug in JDK 8 that can affect GC timing and lead to a memory leak).

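A minimal sketch of that bookkeeping (names hypothetical): the WeakReference subclass remembers the estimated size of its referent, and the size is subtracted once the reference shows up on the queue.

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical tracker: each tracked Context gets a WeakReference that
// remembers its estimated size; when the referent is collected, the
// reference is enqueued and the size is subtracted again.
final class TrackedMemory {
  private static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
  private static final Set<SizedReference> PENDING = ConcurrentHashMap.newKeySet();
  private static final AtomicLong trackedBytes = new AtomicLong();

  private static final class SizedReference extends WeakReference<Object> {
    final long size;
    SizedReference(Object referent, long size) {
      super(referent, QUEUE);
      this.size = size;
    }
  }

  static void track(Object contextRoot, long estimatedSize) {
    // keep the reference itself strongly reachable until it is drained
    PENDING.add(new SizedReference(contextRoot, estimatedSize));
    trackedBytes.addAndGet(estimatedSize);
  }

  // Called periodically (or on each sampled measurement) to drain the queue.
  static void expungeCollected() {
    SizedReference ref;
    while ((ref = (SizedReference) QUEUE.poll()) != null) {
      PENDING.remove(ref);
      trackedBytes.addAndGet(-ref.size);
    }
  }

  static long currentTrackedBytes() {
    return trackedBytes.get();
  }
}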

Self Protection Approach

With the data collected above, we can implement self-protection by changing the behavior of Instrumenter#shouldStart:

  1. Limit the maximum CPU usage percentage of the Java Agent

  2. Limit the maximum Memory usage percentage of the Java Agent

  3. Limit the maximum QPS of the Java agent (we can limit the rate of span generation, either overall or at the entry point (PropagatingFromUpstreamInstrumenter))

When a self-protection condition triggers, some telemetry data may be lost, but the user's business will not be greatly affected.
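
As an illustration only (thresholds and names are hypothetical, not a concrete proposal), the guard consulted from Instrumenter#shouldStart could look like this:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical guard consulted from Instrumenter#shouldStart.
final class SelfProtectionGuard {
  private final double maxAgentCpuShare;   // e.g. 0.05 == 5% of process CPU
  private final long maxAgentMemoryBytes;  // e.g. 64 MiB of tracked objects
  private final long maxStartsPerSecond;   // crude QPS cap on span creation

  private final AtomicLong windowStartMillis = new AtomicLong(System.currentTimeMillis());
  private final AtomicLong startsInWindow = new AtomicLong();

  SelfProtectionGuard(double maxAgentCpuShare, long maxAgentMemoryBytes, long maxStartsPerSecond) {
    this.maxAgentCpuShare = maxAgentCpuShare;
    this.maxAgentMemoryBytes = maxAgentMemoryBytes;
    this.maxStartsPerSecond = maxStartsPerSecond;
  }

  boolean allowStart(double currentAgentCpuShare, long currentAgentMemoryBytes) {
    if (currentAgentCpuShare > maxAgentCpuShare) {
      return false;  // CPU limit exceeded
    }
    if (currentAgentMemoryBytes > maxAgentMemoryBytes) {
      return false;  // memory limit exceeded
    }
    // simple one-second window as a stand-in for a real rate limiter
    long now = System.currentTimeMillis();
    long windowStart = windowStartMillis.get();
    if (now - windowStart >= 1000 && windowStartMillis.compareAndSet(windowStart, now)) {
      startsInWindow.set(0);
    }
    return startsInWindow.incrementAndGet() <= maxStartsPerSecond;
  }
}

shouldStart would then simply return false while the guard denies, so the corresponding spans are dropped instead of adding load.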

Describe alternatives you've considered

No response

Additional context

The issue was discussed with @trask on May 15, 2025 (UTC) - APAC.

@123liuziming added the enhancement (New feature or request) and needs triage (New issue that requires triage) labels on May 15, 2025
@123liuziming
Contributor Author

In order to avoid significant performance degradation, we apply 1/1000 sampling to the data collection logic above, so the collected data is not completely accurate, but it reflects the approximate trend of resource consumption. For example, if there is a memory leak, the memory-related self-monitoring data grows continuously.
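
For illustration, the sampling gate could be as simple as the following (hypothetical name):

import java.util.concurrent.ThreadLocalRandom;

// Hypothetical gate: only ~1 in 1000 instrumenter invocations pays the
// cost of the self-monitoring measurements described above.
final class SelfMonitoringSampler {
  static boolean shouldSample() {
    return ThreadLocalRandom.current().nextInt(1000) == 0;
  }
}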

@123liuziming
Contributor Author

The relevant code for this issue is being sorted out, and we may share more performance issues with the community after we analyze the self-monitoring data.

@SylvainJuge
Contributor

From experience it is quite challenging to effectively measure the overhead of the instrumentation agent for the following reasons:

  • extra memory allocation by the instrumentation incurs overhead on the GC which can't directly be measured; at best we could estimate the amount of extra memory allocated, but not how much CPU time it added.
  • there is often some transient in-memory storage outside of the context for the duration of the span, using either thread-locals or virtual fields on the application objects.
  • there isn't any easy way to limit the memory usage by quantity; the best we could probably do is use a proxy like the number of objects, and that would require a way to properly enforce the accounting.
  • injecting extra classes creates some fixed-size overhead in the compiled-code memory area, which can't be observed in detail beyond size and usage.
  • sampling the overhead measurement on traces implies that the overhead is linearly related to the number of traces captured or the application load. In practice, though, there can be non-trivial processing outside of the span execution, for example in the exporters, which work asynchronously.

I would really be happy if we could find a way to effectively measure and limit overhead at runtime; however, in practice an external measurement with a reproducible synthetic load often yields sufficient results to estimate the overhead on a given application, which is not something we can easily ask of every user of the instrumentation.

I think that in practice there are two complementary aspects to the performance overhead that users are often (and for very good reasons) worried about:

  • having the ability to measure or estimate it without doing in-depth experimentation in production
  • having the ability to put an upper limit on this overhead, which includes, for example, being able to defend against a memory leak in the instrumentation.

On the topic of minimizing overhead, in the Elastic APM Java agent we implemented object accounting and recycling as an attempt to minimize memory allocation, and while it does provide some benefits on the allocation and GC overhead side, the extra management required is definitely tricky and error-prone to maintain. For example, here, doing the accounting and measurement of each enter/exit advice execution could definitely become non-trivial to implement.

Also, still in the Elastic APM Java agent, we implemented a rather simple but effective "circuit breaker" that makes the agent disable its own tracing and reclaim any memory used by internal queues at runtime when CPU usage or GC execution time is over a certain threshold. Here high resource usage is used as a proxy for the "application is under heavy load" scenario; while that is not strictly equivalent to what you suggest with strict resource usage limits, implementing it can provide some extra safety against agent overhead when the application is under load. The underlying assumption is that using available resources when not under load is generally not an issue, in a similar way to how Linux aggressively uses available memory for the filesystem cache.
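
For readers unfamiliar with the idea, a very simplified sketch of such a circuit breaker might look like this (this is not the actual Elastic implementation; all names and thresholds are illustrative):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

import com.sun.management.OperatingSystemMXBean;

// Simplified sketch of a circuit-breaker poller: tracing is paused while
// process CPU load or GC time share is above a threshold, and resumed once
// it drops back.
final class CircuitBreaker implements Runnable {
  private final OperatingSystemMXBean os =
      (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
  private final double cpuLoadThreshold;      // e.g. 0.95
  private final double gcTimeShareThreshold;  // e.g. 0.25 of wall-clock time

  private volatile boolean open;              // open == tracing disabled
  private long lastGcTimeMillis;
  private long lastWallClockMillis = System.currentTimeMillis();

  CircuitBreaker(double cpuLoadThreshold, double gcTimeShareThreshold) {
    this.cpuLoadThreshold = cpuLoadThreshold;
    this.gcTimeShareThreshold = gcTimeShareThreshold;
  }

  boolean tracingAllowed() {
    return !open;
  }

  @Override
  public void run() {  // scheduled periodically, e.g. every few seconds
    long now = System.currentTimeMillis();
    long gcTime = totalGcTimeMillis();
    double gcShare = (double) (gcTime - lastGcTimeMillis) / Math.max(1, now - lastWallClockMillis);
    lastGcTimeMillis = gcTime;
    lastWallClockMillis = now;

    double cpuLoad = os.getProcessCpuLoad();  // may be negative if unavailable
    open = cpuLoad > cpuLoadThreshold || gcShare > gcTimeShareThreshold;
  }

  private static long totalGcTimeMillis() {
    long total = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      total += Math.max(0, gc.getCollectionTime());
    }
    return total;
  }
}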

@123liuziming
Contributor Author


Thanks for your detailed response. As you mentioned, accurately calculating the resource overhead of an agent is difficult and can itself introduce significant overhead. So we are essentially just calculating a resource usage trend (approximated with sampling and other mechanisms), and we only trigger the self-protection degradation of the probe if we find that the resource usage is very far from what is expected (e.g., the probe's memory keeps going up, or CPU usage exceeds a very large value, which suggests that there may be a problem with the logic of the probe's own code).

I've also looked at the Elastic javaagent implementation before, and there's a problem: if you use an object pool, a lot of objects may go into the tenured generation, and full GC may bring more problems. How did the Elastic Java agent solve this?

@SylvainJuge
Contributor

I've also looked at the Elastic javaagent implementation before, and there's a problem: if you use an object pool, a lot of objects may go into the tenured generation, and full GC may bring more problems. How did the Elastic Java agent solve this?

The amount of memory used is usually proportional to the number of concurrent transactions/spans, so if the load on the application is stable then it's usually a more or less fixed amount.

If we use a rather "strict" notion of object pooling, keeping strong references can effectively create what you describe here: objects get promoted to tenured space, and when that amount gets large it might make every GC execution take more time. This is especially problematic when you have a batch application with a very spiky load profile.

For those object pools, we always put a hard limit on the number of objects, and the objects that are pooled are usually very simple objects that do not have a very complex object reference graph, which probably explains why we don't have such an issue.

I think here we could probably also use weak references for objects that are sitting idle in the pool, so the pool would not prevent the GC from collecting them when needed. With that, the memory used for in-flight spans would still be directly referenced, and the idle objects in the pool would act as an "available objects" cache that is automatically cleared by the GC under memory pressure; using a mix of hard and weak references could also be relevant here.
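
A small sketch of that idea (not existing agent code): idle pooled objects are held only through weak references, so the GC can reclaim them under memory pressure, while the pool still enforces a hard capacity limit.

import java.lang.ref.WeakReference;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.Supplier;

// Idle objects are weakly referenced; in-flight objects stay strongly
// referenced by their callers as usual.
final class WeaklyPooled<T> {
  private final ArrayBlockingQueue<WeakReference<T>> idle;
  private final Supplier<T> factory;

  WeaklyPooled(int maxIdle, Supplier<T> factory) {
    this.idle = new ArrayBlockingQueue<>(maxIdle);
    this.factory = factory;
  }

  T acquire() {
    WeakReference<T> ref;
    while ((ref = idle.poll()) != null) {
      T pooled = ref.get();
      if (pooled != null) {
        return pooled;  // reuse an object the GC has not reclaimed yet
      }
    }
    return factory.get();  // pool empty or everything was collected
  }

  void release(T object) {
    // offer() silently drops the object when the pool is at its hard limit
    idle.offer(new WeakReference<>(object));
  }
}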

@breedx-splk
Contributor

Instrumenting the instrumentation isn't going to make things better. 🙃

having the ability to put an upper limit on this overhead, which includes, for example, being able to defend against a memory leak in the instrumentation.

One problem with the "things are running too hot, let's taper down the instrumentation..." approach (or even the kill-switch approach) is that it is EXACTLY those moments when you often need observability the most. Losing visibility when things get interesting is a surefire way to frustrate o11y practitioners.

In order to avoid significant performance degradation, we apply 1/1000 sampling to the data collection logic above, so the collected data is not completely accurate, but it reflects the approximate trend of resource consumption. For example, if there is a memory leak, the memory-related self-monitoring data grows continuously.

While this is still instrumentation of instrumentation, the 1/1000 sampling is super interesting to me. I think the same pitfalls of keeping overhead low, fine-tuning, and avoiding memory leaks in the instrumentation instrumentation apply... but it seems novel and I am interested in seeing it in practice (even as a prototype!).
