
Conversation

@andsel
Contributor

@andsel andsel commented Dec 4, 2025

Release notes

Exposes batch size metrics for last 1, 5 and 15 minutes.

What does this PR do?

Updates the stats API response to also expose 1m, 5m and 15m average batch metrics.

Changed the response map returned by the refine_batch_metrics method (the result of an API query to _node/stats) so that it contains the average values over the last 1, 5 and 15 minutes for event_count and batch_size. This data is published once it becomes available from the metric collector.
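
The publishing behavior described above can be sketched as follows. This is a hedged illustration only: the helper name and input shape are assumptions for the sketch, not the actual refine_batch_metrics implementation in Logstash.

```ruby
# Windows in the order they appear in the stats response.
WINDOW_KEYS = %w[lifetime last_1_minute last_5_minutes last_15_minutes]

# Keep only the windows the metric collector has already produced,
# so e.g. last_15_minutes is absent until its window has filled.
def publish_available_windows(flow_values)
  WINDOW_KEYS.select { |w| flow_values.key?(w) }
             .to_h { |w| [w, flow_values[w]] }
end

# Right after startup only the lifetime average exists:
publish_available_windows({ "lifetime" => 125.0 })
# => {"lifetime"=>125.0}

# After a minute of uptime the 1m window appears alongside it:
publish_available_windows({ "lifetime" => 125.0, "last_1_minute" => 130.0 })
# => {"lifetime"=>125.0, "last_1_minute"=>130.0}
```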

Why is it important/What is the impact to the user?

This feature lets Logstash users meter average batch values over recent time windows.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files (and/or docker env variables)
  • [ ] I have added tests that prove my fix is effective or that my feature works. This feature relies on ExtendedFlowMetric, whose time-window management is extensively tested. To create a test at the API level we would need to generate load for at least the time-window duration and then check the API response; tests that run for minutes are not feasible.

Author's Checklist

  • [ ]

How to test this PR locally

Use the same test harness proposed in #18000, switch pipeline.batch.metrics.sampling_mode to full, and monitor the result of _node/stats for 1, 5, and 15 minutes with:

curl http://localhost:9600/_node/stats | jq .pipelines.main.batch

Related issues

Use cases

Screenshots

Logs

@andsel andsel self-assigned this Dec 4, 2025
@github-actions
Contributor

github-actions bot commented Dec 4, 2025

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)
  • /run exhaustive tests : Run the exhaustive tests Buildkite pipeline.

@mergify
Contributor

mergify bot commented Dec 4, 2025

This pull request does not have a backport label. Could you fix it @andsel? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • If no backport is necessary, please add the backport-skip label

@andsel andsel changed the title Updates stats API response to expose also 1m, 5m and 15m average batc… Exposes average batch metrics at 1, 5 and 15 minutes time window. Dec 4, 2025
@andsel andsel marked this pull request as draft December 4, 2025 16:51
@andsel
Contributor Author

andsel commented Dec 4, 2025

run exhaustive test

…can't be created (1 minute and more intervals)
@andsel andsel marked this pull request as ready for review December 5, 2025 10:53
Member

@donoghuc donoghuc left a comment


Should we add a few cases in the integration test? Seems like we have a place it would be easy to add

Stud.try(max_retry.times, [StandardError, RSpec::Expectations::ExpectationNotMetError]) do
# node_stats can fail if the stats subsystem isn't ready
result = logstash_service.monitoring_api.node_stats rescue nil
expect(result).not_to be_nil
# we use fetch here since we want failed fetches to raise an exception
# and trigger the retry block
batch_stats = result.fetch("pipelines").fetch(pipeline_id).fetch("batch")
expect(batch_stats).not_to be_nil
expect(batch_stats["event_count"]).not_to be_nil
expect(batch_stats["event_count"]["average"]).not_to be_nil
expect(batch_stats["event_count"]["average"]["lifetime"]).not_to be_nil
expect(batch_stats["event_count"]["average"]["lifetime"]).to be_a_kind_of(Numeric)
expect(batch_stats["event_count"]["average"]["lifetime"]).to be > 0
expect(batch_stats["event_count"]["current"]).not_to be_nil
expect(batch_stats["event_count"]["current"]).to be >= 0
expect(batch_stats["byte_size"]).not_to be_nil
expect(batch_stats["byte_size"]["average"]).not_to be_nil
expect(batch_stats["byte_size"]["average"]["lifetime"]).not_to be_nil
expect(batch_stats["byte_size"]["average"]["lifetime"]).to be_a_kind_of(Numeric)
expect(batch_stats["byte_size"]["average"]["lifetime"]).to be > 0
expect(batch_stats["byte_size"]["current"]).not_to be_nil
expect(batch_stats["byte_size"]["current"]).to be >= 0
end
end
end

@andsel
Contributor Author

andsel commented Dec 9, 2025

That would be the right place, but verifying even just the 1 minute average flow metric would mean the test has to sleep and wait for 1 minute before the value appears in the response. Waiting 5 or 15 minutes to execute a test would be a great waste, WDYT?

@andsel andsel requested a review from donoghuc December 9, 2025 15:49
# reading and retrieve unrelated values
current_data_point = stats[:batch][:current]
{
# average returns a FlowMetric, and we need to invoke getValue to obtain the map with metric details.
Member


Having a hard time parsing this comment, not sure exactly what the message is supposed to convey.

Contributor Author


It's more of a reminder for the reader of why, for this metric, we need to invoke the getValue method.
The reason is that the flow metric nested in batch/[event_count | byte_size]/average must be explicitly queried to return the sub-document containing the lifetime, last_1_minute, etc. metric values:

Usually the paths lead to single-value metric objects, but this one is a composite that needs an explicit query to grab the contained values.
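
The scalar-vs-composite distinction can be illustrated with stand-in classes. These are hypothetical types for the sketch, not the real Logstash/Java metric classes; the real FlowMetric lives on the Java side and exposes getValue.

```ruby
# A scalar metric exposes its value directly.
class ScalarMetric
  attr_reader :value

  def initialize(value)
    @value = value
  end
end

# A composite flow metric must be asked explicitly for its map of
# window values, analogous to FlowMetric#getValue in the thread above.
class CompositeFlowMetric
  def initialize(windows)
    @windows = windows
  end

  def get_value
    @windows
  end
end

current = ScalarMetric.new(0)
average = CompositeFlowMetric.new("lifetime" => 1428, "last_1_minute" => 1683)

current.value      # => 0
average.get_value  # => {"lifetime"=>1428, "last_1_minute"=>1683}
```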

Contributor Author


@donoghuc let me know if you want me to reword the sentence so that it sounds more meaningful (a good suggestion from a native speaker is very welcome :-) )

Member

@donoghuc donoghuc left a comment


ah, i did not consider the time cost of that test 😅

Local testing shows it is working as advertised

➜  ~ curl -s http://localhost:9600/_node/stats/pipelines | jq '.pipelines.main.batch'
{
  "event_count": {
    "current": 0,
    "average": {
      "lifetime": 1,
      "last_1_minute": 1,
      "last_5_minutes": 1,
      "last_15_minutes": 1
    }
  },
  "byte_size": {
    "current": 0,
    "average": {
      "lifetime": 1428,
      "last_1_minute": 1683,
      "last_5_minutes": 1448,
      "last_15_minutes": 1445
    }
  }
}

@elasticmachine

💚 Build Succeeded

History

cc @andsel

@andsel andsel merged commit e0acfe7 into elastic:main Dec 11, 2025
11 checks passed

Development

Successfully merging this pull request may close these issues.

Expose the meter of average value of batch's byte size and event count for 1m, 5m 15m windows
