Exposes average batch metrics at 1, 5 and 15 minutes time window. #18460
Conversation
This pull request does not have a backport label. Could you fix it @andsel? 🙏
run exhaustive test
…can't be created (1 minute and more intervals)
donoghuc
left a comment
Should we add a few cases in the integration test? Seems like we have a place it would be easy to add
logstash/qa/integration/specs/monitoring_api_spec.rb
Lines 311 to 339 in 3659b6f
Stud.try(max_retry.times, [StandardError, RSpec::Expectations::ExpectationNotMetError]) do
  # node_stats can fail if the stats subsystem isn't ready
  result = logstash_service.monitoring_api.node_stats rescue nil
  expect(result).not_to be_nil
  # we use fetch here since we want failed fetches to raise an exception
  # and trigger the retry block
  batch_stats = result.fetch("pipelines").fetch(pipeline_id).fetch("batch")
  expect(batch_stats).not_to be_nil
  expect(batch_stats["event_count"]).not_to be_nil
  expect(batch_stats["event_count"]["average"]).not_to be_nil
  expect(batch_stats["event_count"]["average"]["lifetime"]).not_to be_nil
  expect(batch_stats["event_count"]["average"]["lifetime"]).to be_a_kind_of(Numeric)
  expect(batch_stats["event_count"]["average"]["lifetime"]).to be > 0
  expect(batch_stats["event_count"]["current"]).not_to be_nil
  expect(batch_stats["event_count"]["current"]).to be >= 0
  expect(batch_stats["byte_size"]).not_to be_nil
  expect(batch_stats["byte_size"]["average"]).not_to be_nil
  expect(batch_stats["byte_size"]["average"]["lifetime"]).not_to be_nil
  expect(batch_stats["byte_size"]["average"]["lifetime"]).to be_a_kind_of(Numeric)
  expect(batch_stats["byte_size"]["average"]["lifetime"]).to be > 0
  expect(batch_stats["byte_size"]["current"]).not_to be_nil
  expect(batch_stats["byte_size"]["current"]).to be >= 0
end
end
end
That would be the right place, but verifying even just the 1-minute average flow metric would require the test to sleep for at least 1 minute so that the value appears in the response. Waiting 5 or 15 minutes to execute a test would be a great waste, WDYT?
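That trade-off could be sidestepped for the 1-minute window alone by polling with a deadline slightly over a minute, keeping the worst case bounded. A minimal Ruby sketch of that idea (`wait_for_window` and the stubbed responses below are hypothetical, not existing test helpers):

```ruby
# Hypothetical polling helper: call the block until the "last_1_minute"
# window appears in the batch stats, or raise after the deadline.
def wait_for_window(timeout: 70, interval: 5)
  deadline = Time.now + timeout
  loop do
    stats = yield
    window = stats && stats.dig("batch", "event_count", "average", "last_1_minute")
    return window if window
    raise "last_1_minute never appeared" if Time.now > deadline
    sleep interval
  end
end

# Stubbed fetcher: the first response predates the 1-minute window,
# the second includes it. A real test would instead call
# logstash_service.monitoring_api.node_stats and dig into the pipeline.
responses = [
  { "batch" => { "event_count" => { "average" => { "lifetime" => 1 } } } },
  { "batch" => { "event_count" => { "average" => { "lifetime" => 1, "last_1_minute" => 1 } } } }
]
value = wait_for_window(timeout: 70, interval: 0) { responses.shift }
```

With a real fetcher this caps the test at roughly 70 seconds for the 1-minute window; the 5- and 15-minute windows would still be impractical to cover this way.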
# reading and retrieve unrelated values
current_data_point = stats[:batch][:current]
{
# average return a FlowMetric which and we need to invoke getValue to obtain the map with metric details.
Having a hard time parsing this comment, not sure exactly what the message is supposed to convey.
It's more of a reminder for the reader of why we need to invoke the getValue method for this metric.
The reason is that the flow metric nested at batch/[event_count | byte_size]/average must be explicitly queried to return the sub-document containing the lifetime, last_1_minute, etc. metric values:
logstash/logstash-core/src/main/java/org/logstash/instrument/metrics/ExtendedFlowMetric.java
Line 89 in 279171b
public Map<String, Double> getValue() {
Usually these paths lead to single-value metric objects, but this one is a composite that must be explicitly queried to grab the contained values.
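The distinction can be illustrated with a small Ruby sketch (the class names are hypothetical, not Logstash's actual classes): a leaf metric hands back one number, while the flow metric's value is itself the map of window averages, mirroring why ExtendedFlowMetric#getValue returns a Map.

```ruby
# Leaf metric: its value is a single number, usable directly in the
# response map.
class ScalarMetricSketch
  def initialize(number)
    @number = number
  end

  def value
    @number
  end
end

# Composite flow metric: its value is a whole sub-document of
# time-window averages, so callers must query it explicitly.
class FlowMetricSketch
  def initialize(windows)
    @windows = windows
  end

  def value
    @windows.dup
  end
end

current = ScalarMetricSketch.new(0)
average = FlowMetricSketch.new("lifetime" => 1.0, "last_1_minute" => 1.0)

batch = {
  "event_count" => {
    "current" => current.value,  # plain number
    "average" => average.value   # nested map of window averages
  }
}
```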
@donoghuc let me know if you want me to reword the sentence so that it sounds more meaningful (a good suggestion from a native speaker is very welcome :-) )
donoghuc
left a comment
ah, i did not consider the time cost of that test 😅
Local testing shows it is working as advertised
➜ ~ curl -s http://localhost:9600/_node/stats/pipelines | jq '.pipelines.main.batch'
{
"event_count": {
"current": 0,
"average": {
"lifetime": 1,
"last_1_minute": 1,
"last_5_minutes": 1,
"last_15_minutes": 1
}
},
"byte_size": {
"current": 0,
"average": {
"lifetime": 1428,
"last_1_minute": 1683,
"last_5_minutes": 1448,
"last_15_minutes": 1445
}
}
}
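For scripted checks, the same sub-document can be consumed programmatically. A minimal Ruby sketch, using the sample payload from the local test output above:

```ruby
require 'json'

# Sample "batch" sub-document as returned by _node/stats (values copied
# from the curl output in this thread).
raw = <<~JSON
  {
    "event_count": {
      "current": 0,
      "average": { "lifetime": 1, "last_1_minute": 1, "last_5_minutes": 1, "last_15_minutes": 1 }
    },
    "byte_size": {
      "current": 0,
      "average": { "lifetime": 1428, "last_1_minute": 1683, "last_5_minutes": 1448, "last_15_minutes": 1445 }
    }
  }
JSON

batch = JSON.parse(raw)
# dig stays nil-safe while the longer windows are still warming up and
# their keys are absent from the response
last_minute_bytes = batch.dig("byte_size", "average", "last_1_minute")
```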
Co-authored-by: Cas Donoghue <[email protected]>
💚 Build Succeeded
cc @andsel
Release notes
Exposes batch size metrics for last 1, 5 and 15 minutes.
What does this PR do?
Updates the stats API response to also expose 1m, 5m and 15m average batch metrics.
Changed the response map returned by the refine_batch_metrics method as the result of an API query to _node/stats so that it contains the average values of the last 1, 5 and 15 minutes for event_count and batch_size. These data are published once they are available from the metric collector.
Why is it important/What is the impact to the user?
This feature permits Logstash users to meter average batch values over recent time windows.
Checklist
[ ] I have made corresponding changes to the default configuration files (and/or docker env variables)
[ ] I have added tests that prove my fix is effective or that my feature works
This feature relies on ExtendedFlowMetric, which is extensively tested for this time-window management. To create a test at the API level we would have to implement something that loads for at least the time-window duration and then checks the API response. Tests that run for minutes are not feasible.
Author's Checklist
How to test this PR locally
Use the same test harness proposed in #18000, switch pipeline.batch.metrics.sampling_mode to full, and monitor the result of _node/stats for 1, 5, and 15 minutes with:
curl http://localhost:9600/_node/stats | jq .pipelines.main.batch
Related issues