Skip to content

Conversation

jiangzho
Copy link
Contributor

What changes were proposed in this pull request?

This PR adds metrics tracking state transition latency for Spark Application in format of

sparkapp.latency.from.<fromState>.to.<toState>

Why are the changes needed?

Latency measuring would be useful to analyze the performance from scheduling / orchestration perspective. For example, to analyze which state causes significant overhad and therefore optimize at cluster/app level.

Does this PR introduce any user-facing change?

More metrics becomes available.

How was this patch tested?

CIs. New unit test added.

Was this patch authored or co-authored using generative AI tooling?

No

…ansition

### What changes were proposed in this pull request?

This PR adds metrics tracking state transition latency for Spark Application in format of

```
sparkapp.latency.from.<fromState>.to.<toState>
```

### Why are the changes needed?

Latency measuring would be useful to analyze the performance from scheduling / orchestration perspective. For example, to analyze which state causes significant overhad and therefore optimize at cluster/app level.

### Does this PR introduce _any_ user-facing change?

More metrics becomes available.

### How was this patch tested?

CIs. New unit test added.

### Was this patch authored or co-authored using generative AI tooling?

No
@jiangzho
Copy link
Contributor Author

cc @peter-toth for review

@peter-toth
Copy link

Thanks @jiangzho for the PR. I am travelling for the rest of the week, but I can review this PR once I'm back.

@@ -104,14 +98,17 @@ public void receivedEvent(Event event, Map<String, Object> metadata) {
getResourceClass(metadata);
final Optional<String> namespaceOptional = event.getRelatedCustomResourceID().getNamespace();
resource.ifPresent(
aClass -> getCounter(aClass, action.name().toLowerCase(), RESOURCE, EVENT).inc());
aClass ->
getCounter(getMetricName(aClass, action.name().toLowerCase(), RESOURCE, EVENT))
Copy link

@peter-toth peter-toth Sep 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you extracted BaseOperatorSource and refactored getCounter() / getGauge() / getTimer() to have only one metricName argument, you need to call getMetricName() almost as many times as you call the above APIs.
I wonder if you could keep an old methods to avoid the changes at these callsites.

If you really need the new, one argument versions, then you can add them as overloaded methods.

@peter-toth
Copy link

The really like the idea, but let's try polish the implementation a bit.

@peter-toth peter-toth closed this in cdcef65 Sep 4, 2025
@peter-toth
Copy link

Thanks @jiangzho for the PR.

Merged to main (0.5.0).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants