Skip to content

stats: support custom histogram bucket #2539

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

zirain
Copy link
Member

@zirain zirain commented Nov 2, 2022

after istio/istio#41441, add first class api to support custom histogram bucket

cc @kyessenov

@zirain zirain requested a review from a team as a code owner November 2, 2022 07:55
@istio-testing istio-testing added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 2, 2022
@zirain zirain added the release-notes-none Indicates a PR that does not require release notes. label Nov 2, 2022
message ProxyStatsHistogramBucketSetting {
// Specifies a matcher for stats and the buckets that matching stats should use.
// The match is applied to the original stat name before tag-extraction,
// for example `cluster.exampleclustername.upstream_cx_length_ms`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this apply for istio_request_total type metrics? Or only envoy ones?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

istio/istio#41441 change the implement of istio.stats (with Telemetry API), after that this work for all stats.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Telemetry already has a way of selecting metrics. I am a bit worried about using a different mechanism..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can it be applied to counter metric?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it only works for Histogram metric

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to re-iterate my concern that we are creating a new mechanism for selecting metrics when we already have one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unlike Telemetry API, The match is applied to the original stat name before tag-extraction

@zirain
Copy link
Member Author

zirain commented Nov 3, 2022

@hzxuzhonghu @howardjohn should support this on ProxyConfig?

@hzxuzhonghu
Copy link
Member

ProxyConfig should only include xds client metadata, but later it is extended to include some other data plane config, which is not covered by API. For this case, telemetry api can apply to a specific proxy, so I feel proxy-related metric setting should be there.

@zirain
Copy link
Member Author

zirain commented Nov 4, 2022

ProxyConfig should only include xds client metadata, but later it is extended to include some other data plane config, which is not covered by API. For this case, telemetry api can apply to a specific proxy, so I feel proxy-related metric setting should be there.

this's actually part of statsconfig in bootstrap, similar to ProxyStatsMatcher.InclusionPrefixes

option go_package="istio.io/api/type/v1alpha1";

// Describes how to match a given string. Match is case-sensitive.
message StringMatch {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is already such a API, which is used by VS

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is two message StringMatch in this repo, they may share same API just like WorkloadSelector?

@hzxuzhonghu
Copy link
Member

this's actually part of statsconfig in bootstrap, similar to ProxyStatsMatcher.InclusionPrefixes

IC,

@istio-testing istio-testing added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 9, 2022
@kyessenov
Copy link
Contributor

LGTM. I think it's useful for customizing standard Envoy metrics, too, so it can't be in just the stats filter. Looks like it's mostly verbatim copied into bootstrap, so it should be noted that it requires a proxy restart.

@zirain
Copy link
Member Author

zirain commented Nov 10, 2022

LGTM. I think it's useful for customizing standard Envoy metrics, too, so it can't be in just the stats filter. Looks like it's mostly verbatim copied into bootstrap, so it should be noted that it requires a proxy restart.

update notes

@nrjpoddar
Copy link
Member

This is a bit confusing - We were trying to get away from ProxyConfig options for Telemetry in lieu of Telemetry API. Why is the new API not used?

Secondly, it's unclear what all metrics does it effect. What happens if you update a histogram metrics buckets? Can you make sense of old and new data? Does it use the same name?

@zirain
Copy link
Member Author

zirain commented Nov 14, 2022

This is a bit confusing - We were trying to get away from ProxyConfig options for Telemetry in lieu of Telemetry API. Why is the new API not used?

this's actually part of bootstrap, Temeletry API cannot achieve

Secondly, it's unclear what all metrics does it effect. What happens if you update a histogram metrics buckets? Can you make sense of old and new data? Does it use the same name?

it will not change metrics name, only change the default buckets of histogram metric.

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Nov 22, 2022
@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Nov 23, 2022
@zirain
Copy link
Member Author

zirain commented Dec 1, 2022

kindly ping @nrjpoddar

@zirain
Copy link
Member Author

zirain commented Dec 9, 2022

Telemetry API cannot change configuration in bootstrap, BDS(bootstrap discovery request) also will not help in this scenario.

Copy link
Member

@nrjpoddar nrjpoddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM but would need Telemetry WG experts to give their approval.

// Specifies a matcher for stats and the buckets that matching stats should use.
// The match is applied to the original stat name before tag-extraction,
// for example `cluster.exampleclustername.upstream_cx_length_ms`.
istio.type.v1alpha1.StringMatch match = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a new type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is two protoo StringMatch in this repo, they may share same API just like WorkloadSelector?

I can reuse the one with VS.

Copy link
Contributor

@costinm costinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few concerns here:

  • ProxyConfig vs Telemetry - not sure there is a clear answer on why it's kept in ProxyConfig
  • the regex matching - I don't think it's a good idea in general, but for metrics in particular.
  • the shape of the API - I think it's reasonable to customize metrics, but I would expect an
    API where for each metric (by name - we don't have that many) we would have some settings associated to that metric - buckets, etc.

But most important concern: we are in the middle of few major changes, with
OpenTelemetry maturing, Ambient, K8S Gateway happening. We also know that
Istio telemetry 'cardinality' and miss-match with other telemetry standards is a problem.
One of the options to improve and better integrate with OTel - for example generate the of the high-cardinality metrics from otel access logs ( historically istio metrics were designed with Mixer - which was access log based, so less of a problem ).

I would really like to see a doc or have a discussion about the future of telemetry and
how it fits with the evolving world around us - and which APIs belong in Istio and which
in open telemetry.

@costinm
Copy link
Contributor

costinm commented Dec 9, 2022

Regarding the bootstrap - I think this is an implementation detail of Envoy that leaks into API without any good reason.

We can use any of the Istio APIs in bootstrap - just need to inject them into the pod like we do for ProxyConfig, or get them over XDS before starting envoy ( agent already gets the root CAs from Istiod ). We can get all the config in agent
already.

Ambient on the other hand is likely to not have any injection at all - and a lot of the changes that depend on injecting
to a pod will no longer work ( Waypoint is shared ). I think any new API should take into account ambient, given
the major simplification and perf improvements it brings.

@zirain
Copy link
Member Author

zirain commented Dec 10, 2022

ProxyConfig vs Telemetry - not sure there is a clear answer on why it's kept in ProxyConfig

AFAIK, Telemetry used to generate configuration for stats filter, and as a part of bootstrap config, that's why I put this on ProxyConfig.

the regex matching - I don't think it's a good idea in general, but for metrics in particular.

I agree that regex it's not a good idea, but it's widely used in envoy statistic system.

I would really like to see a doc or have a discussion about the future of telemetry and
how it fits with the evolving world around us - and which APIs belong in Istio and which
in open telemetry.

that's another topic, which is too large for this PR (IMO, this's minor improvement)
Before v1.17, this won't work on a wasm filter, it's not a bad good idea to provide an api from the beginning of switching to native filter.

@costinm
Copy link
Contributor

costinm commented Dec 10, 2022 via email

@kyessenov
Copy link
Contributor

kyessenov commented Dec 10, 2022

To resolve some ambiguities:

  1. Histogram buckets affect all sinks (OTel in the future, prometheus now), except Wasm ones (due to ABI). There's definitive user interest, so whatever chooses to implement Telemetry API must do it (e.g. if ztunnel chooses to use telemetry API, which is not given).
  2. Telemetry API should define them for Istio metrics, and we need to ask users to use BDS or set them before injection. @costinm this would require capability to read telemetry APIs in injection for latter.
  3. Leave Envoy metrics alone, continue using BDS or custom bootstrap which works already. I'd argue to actually get rid of most of ProxyConfig and just use BDS with EnvoyFilters but we haven't made BDS the default.

@lei-tang
Copy link
Contributor

LGTM.

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Dec 19, 2022
@istio-testing
Copy link
Collaborator

@zirain: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@zirain
Copy link
Member Author

zirain commented Dec 23, 2022

cannot make an agreement, close this first.

@zirain zirain closed this Dec 23, 2022
@zirain zirain deleted the histogram_bucket branch September 15, 2023 01:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-rebase Indicates a PR needs to be rebased before being merged release-notes-none Indicates a PR that does not require release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants