feat(outbound): Add response metrics to policy router #3086
Merged
The outbound policy router includes a requests counter that measures the number
of requests dispatched to each route-backend, but this does not provide
visibility into success rate or response time. Before introducing timeouts and
retries on outbound routes, this change introduces visibility into per-route
response metrics.
The route_request_statuses counters measure responses from the application's
point of view. Once retries are introduced, this will provide visibility into
the _effective_ success rate of each route.
outbound_http_route_request_statuses_total{parent...,route...,http_status="200",error="TIMEOUT"} 0
outbound_grpc_route_request_statuses_total{parent...,route...,grpc_status="NOT_FOUND",error="TIMEOUT"} 0
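As a rough illustration of how these counters could be turned into a success rate, the sketch below parses a few exposition lines and divides the successful count by the total. The sample values, the abbreviated label sets, and the helper names (`parse_sample`, `is_success`, `effective_success_rate`) are hypothetical. Treating an empty `error` label together with a non-5xx `http_status` as success is an assumption made for the example, not something this change defines.

```python
import re

# One exposition line has the shape: name{label="value",...} value
SAMPLE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split an exposition line into (name, labels dict, float value)."""
    match = SAMPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labeled sample: {line!r}")
    name, raw_labels, value = match.groups()
    return name, dict(LABEL.findall(raw_labels)), float(value)


def is_success(labels):
    # Assumption: success means an empty error label and a non-5xx status.
    no_error = labels.get("error", "") == ""
    return no_error and not labels.get("http_status", "").startswith("5")


def effective_success_rate(lines, family="outbound_http_route_request_statuses_total"):
    """Return successes / total for one metric family, or None if it is empty."""
    ok = total = 0.0
    for line in lines:
        name, labels, value = parse_sample(line)
        if name != family:
            continue
        total += value
        if is_success(labels):
            ok += value
    return ok / total if total else None


# Hypothetical samples with abbreviated label sets.
samples = [
    'outbound_http_route_request_statuses_total{route_name="http",http_status="200",error=""} 8',
    'outbound_http_route_request_statuses_total{route_name="http",http_status="503",error=""} 1',
    'outbound_http_route_request_statuses_total{route_name="http",http_status="200",error="TIMEOUT"} 1',
]
print(effective_success_rate(samples))  # 0.8
```

In practice the same ratio would more likely be computed by the metrics backend over a time window rather than from raw counter values.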
A coarse histogram is introduced at this scope to track the total duration of
requests dispatched to each route, covering all retries and all response stream
processing:
outbound_http_route_request_duration_seconds_sum{parent...,route...} 0
outbound_http_route_request_duration_seconds_count{parent...,route...} 0
outbound_http_route_request_duration_seconds_bucket{le="0.05",parent...,route...} 0
outbound_http_route_request_duration_seconds_bucket{le="0.5",parent...,route...} 0
outbound_http_route_request_duration_seconds_bucket{le="1.0",parent...,route...} 0
outbound_http_route_request_duration_seconds_bucket{le="10.0",parent...,route...} 0
outbound_http_route_request_duration_seconds_bucket{le="+Inf",parent...,route...} 0
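To make the bucket semantics concrete, here is a minimal model of how observations land in the cumulative `le` buckets listed above. The `Histogram` class and the observed durations are hypothetical and only model the exposition semantics; this is not the proxy's implementation.

```python
import math

# Upper bounds taken from the `le` labels of the route-level histogram.
ROUTE_BUCKETS = [0.05, 0.5, 1.0, 10.0, math.inf]


class Histogram:
    """Toy cumulative histogram mirroring the _bucket/_sum/_count series."""

    def __init__(self, bounds):
        self.bounds = list(bounds)
        self.bucket_counts = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, seconds):
        self.sum += seconds
        self.count += 1
        # Buckets are cumulative: every bucket whose upper bound is at
        # least the observed value is incremented.
        for i, bound in enumerate(self.bounds):
            if seconds <= bound:
                self.bucket_counts[i] += 1


hist = Histogram(ROUTE_BUCKETS)
for duration in (0.02, 0.3, 0.3, 4.0, 12.0):  # made-up request durations
    hist.observe(duration)

print(hist.bucket_counts)  # [1, 3, 3, 4, 5]
print(hist.count)          # 5
```

With only five buckets, a request that takes 2 seconds and one that takes 9 seconds are indistinguishable, which is the trade-off of a coarse histogram.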
The route_backend_response_statuses counters measure the responses from
individual backends. This reflects the _actual_ success rate of each route as
served by the backend services.
outbound_http_route_backend_response_statuses_total{parent...,route...,backend...,http_status="...",error="..."} 0
outbound_grpc_route_backend_response_statuses_total{parent...,route...,backend...,grpc_status="...",error="..."} 0
A slightly more detailed histogram is introduced at this scope to track the time
spent processing responses from each backend (i.e. after the request has been
fully dispatched):
outbound_http_route_backend_response_duration_seconds_sum{parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_count{parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="0.025",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="0.05",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="0.1",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="0.25",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="0.5",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="1.0",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="10.0",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="+Inf",parent...,route...,backend...} 0
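One way to see what the extra buckets buy is to estimate a latency quantile from the cumulative counts. The sketch below uses linear interpolation within a bucket, similar in spirit to PromQL's `histogram_quantile`; the function name `estimate_quantile` and the bucket counts are hypothetical.

```python
import math

# Upper bounds taken from the `le` labels of the backend-level histogram.
BACKEND_BUCKETS = [0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 10.0, math.inf]


def estimate_quantile(q, bounds, cumulative_counts):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative bucket counts."""
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, c in enumerate(cumulative_counts) if c >= rank)
    if math.isinf(bounds[index]):
        # Nothing useful can be interpolated inside the +Inf bucket,
        # so fall back to the largest finite bound.
        return bounds[index - 1]
    lower = bounds[index - 1] if index > 0 else 0.0
    below = cumulative_counts[index - 1] if index > 0 else 0
    in_bucket = cumulative_counts[index] - below
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly within the bucket.
    return lower + (bounds[index] - lower) * (rank - below) / in_bucket


# Made-up cumulative counts, one per bucket above.
counts = [10, 40, 70, 90, 96, 99, 100, 100]
median = estimate_quantile(0.5, BACKEND_BUCKETS, counts)
print(round(median, 4))  # 0.0667
```

The estimate can only be as precise as the bucket containing the rank, so finer buckets at low latencies give tighter answers for fast backends.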
Note that duration histograms omit status code labels, as they needlessly
inflate metrics cardinality. The histograms that we have introduced here are
generally much more constrained, as we must choose broadly applicable buckets
and want to avoid cardinality explosion when many routes are used.
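The cardinality concern can be put in numbers with a back-of-the-envelope calculation. The route count and the number of distinct status values below are hypothetical; the bucket counts come from the `le` labels listed earlier.

```python
def histogram_series(routes, buckets, status_values=1):
    """Time series for one histogram family.

    Each label combination exports one _bucket series per `le` bound
    plus one _sum and one _count series.
    """
    return routes * status_values * (buckets + 2)


ROUTE_BUCKETS = 5    # le = 0.05, 0.5, 1.0, 10.0, +Inf
BACKEND_BUCKETS = 8  # le = 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 10.0, +Inf

routes = 200  # hypothetical number of routes seen by one proxy

print(histogram_series(routes, ROUTE_BUCKETS))                    # 1400
print(histogram_series(routes, ROUTE_BUCKETS, status_values=10))  # 14000
```

Under these assumptions, adding a status label with ten observed values multiplies the series count tenfold, before accounting for backends.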
cratelyn added a commit that referenced this pull request on Apr 9, 2025
this commit fixes a bug discovered by @alpeb, which was introduced in proxy v2.288.0.

> The associated metric is `outbound_http_route_request_statuses_total`:
>
> ```
> $ linkerd dg proxy-metrics -n booksapp deploy/webapp|rg outbound_http_route_request_statuses_total.*authors
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="204",error=""} 5
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="201",error="UNKNOWN"} 5
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="200",error="UNKNOWN"} 10
> ```
>
> The problem was introduced in `edge-25.3.4`, with the proxy `v2.288.0`.
> Before that the metrics looked like:
>
> ```
> $ linkerd dg proxy-metrics -n booksapp deploy/webapp|rg outbound_http_route_request_statuses_total.*authors
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="200",error=""} 193
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="204",error=""} 96
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="201",error=""} 96
> ```
>
> So the difference is the non-empty value for `error=UNKNOWN` even
> when `http_status` is 2xx, which `linkerd viz stat-outbound`
> interprets as failed requests.

in #3086 we introduced a suite of route- and backend-level metrics. that subsystem contains a body middleware that will report itself as having reached the end-of-stream by delegating directly down to its inner body's `is_end_stream()` hint.

this is roughly correct, but is slightly distinct from the actual invariant: a `linkerd_http_prom::record_response::ResponseBody<B>` must call its `end_stream` helper to classify the outcome and increment the corresponding time series in the `outbound_http_route_request_statuses_total` metric family.

in #3504 we upgraded our hyper dependency. while doing so, we neglected to include a call to `end_stream` if a data frame is yielded and the inner body reports itself as having reached the end-of-stream. this meant that instrumented bodies would be polled until the end is reached, but were being dropped before a `None` was encountered.

this commit fixes this issue in two ways, to be defensive:

* invoke `end_stream()` if a non-trailers frame is yielded, and the inner body now reports itself as having ended. this restores the behavior in place prior to #3504. see the relevant component of that diff, here: <https://github.com/linkerd/linkerd2-proxy/pull/3504/files#diff-45d0bc344f76c111551a8eaf5d3f0e0c22ee6e6836a626e46402a6ae3cbc0035L262-R274>
* rather than delegating to the inner `<B as Body>::is_end_stream()` method, report the end-of-stream being reached by inspecting whether or not the inner response state has been taken. this is the state that directly indicates whether or not the `ResponseBody<B>` middleware is finished.

X-ref: #3504
X-ref: #3086
X-ref: linkerd/linkerd2#8733

Signed-off-by: katelyn martin <[email protected]>
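The two-part fix described in this commit message can be modeled with a small toy. The sketch below is a language-neutral illustration written in Python, not the proxy's Rust code: `InnerBody`, `RecordingBody`, and the callback are hypothetical stand-ins, and trailers frames are left out for brevity.

```python
class InnerBody:
    """Hypothetical inner body that yields data frames, then None."""

    def __init__(self, frames):
        self._frames = list(frames)

    def poll_frame(self):
        return self._frames.pop(0) if self._frames else None

    def is_end_stream(self):
        return not self._frames


class RecordingBody:
    """Toy middleware that records the outcome exactly once."""

    def __init__(self, inner, on_end):
        self._inner = inner
        self._state = on_end  # taken (set to None) once recorded

    def _end_stream(self):
        state, self._state = self._state, None
        if state is not None:
            state()

    def poll_frame(self):
        frame = self._inner.poll_frame()
        # First fix: record when the inner body is exhausted, including
        # when a data frame is yielded and the inner body reports its end.
        if frame is None or self._inner.is_end_stream():
            self._end_stream()
        return frame

    def is_end_stream(self):
        # Second fix: report the end only once the state has been taken,
        # rather than delegating to the inner body's hint.
        return self._state is None


recorded = []
body = RecordingBody(InnerBody([b"hello", b"world"]), lambda: recorded.append("ok"))

# A caller that stops polling as soon as the body reports end-of-stream,
# so it never observes a trailing None.
frames = []
while not body.is_end_stream():
    frames.append(body.poll_frame())

print(frames, recorded)  # [b'hello', b'world'] ['ok']
```

Without the first check, this caller would drop the body after the last data frame and the outcome would never be recorded as a success.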
cratelyn added a commit that referenced this pull request on Apr 9, 2025
* chore(app/outbound): `linkerd-mock-http-body` test dependency

this adds a development dependency, so we can use this mock body type in the outbound proxy's unit tests.

Signed-off-by: katelyn martin <[email protected]>

* chore(app/outbound): additional http route metrics tests

Signed-off-by: katelyn martin <[email protected]>

* chore(app/outbound): additional grpc route metrics tests

Signed-off-by: katelyn martin <[email protected]>

* fix(http/prom): record bodies when eos reached

this commit fixes a bug discovered by @alpeb, which was introduced in proxy v2.288.0.

> The associated metric is `outbound_http_route_request_statuses_total`:
>
> ```
> $ linkerd dg proxy-metrics -n booksapp deploy/webapp|rg outbound_http_route_request_statuses_total.*authors
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="204",error=""} 5
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="201",error="UNKNOWN"} 5
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="200",error="UNKNOWN"} 10
> ```
>
> The problem was introduced in `edge-25.3.4`, with the proxy `v2.288.0`.
> Before that the metrics looked like:
>
> ```
> $ linkerd dg proxy-metrics -n booksapp deploy/webapp|rg outbound_http_route_request_statuses_total.*authors
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="200",error=""} 193
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="204",error=""} 96
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="201",error=""} 96
> ```
>
> So the difference is the non-empty value for `error=UNKNOWN` even
> when `http_status` is 2xx, which `linkerd viz stat-outbound`
> interprets as failed requests.

in #3086 we introduced a suite of route- and backend-level metrics. that subsystem contains a body middleware that will report itself as having reached the end-of-stream by delegating directly down to its inner body's `is_end_stream()` hint.

this is roughly correct, but is slightly distinct from the actual invariant: a `linkerd_http_prom::record_response::ResponseBody<B>` must call its `end_stream` helper to classify the outcome and increment the corresponding time series in the `outbound_http_route_request_statuses_total` metric family.

in #3504 we upgraded our hyper dependency. while doing so, we neglected to include a call to `end_stream` if a data frame is yielded and the inner body reports itself as having reached the end-of-stream. this meant that instrumented bodies would be polled until the end is reached, but were being dropped before a `None` was encountered.

this commit fixes this issue in two ways, to be defensive:

* invoke `end_stream()` if a non-trailers frame is yielded, and the inner body now reports itself as having ended. this restores the behavior in place prior to #3504. see the relevant component of that diff, here: <https://github.com/linkerd/linkerd2-proxy/pull/3504/files#diff-45d0bc344f76c111551a8eaf5d3f0e0c22ee6e6836a626e46402a6ae3cbc0035L262-R274>
* rather than delegating to the inner `<B as Body>::is_end_stream()` method, report the end-of-stream being reached by inspecting whether or not the inner response state has been taken. this is the state that directly indicates whether or not the `ResponseBody<B>` middleware is finished.

X-ref: #3504
X-ref: #3086
X-ref: linkerd/linkerd2#8733

Signed-off-by: katelyn martin <[email protected]>

---------

Signed-off-by: katelyn martin <[email protected]>
cratelyn added a commit to linkerd/website that referenced this pull request on Nov 14, 2025
the documentation of our proxy metrics has not kept pace with all of the exciting telemetry that has been introduced since #1599 documented the state of our authorization policy metrics.

this commit reworks the documentation to exhaustively document the families of metrics exported by the proxy.

this commit does not introduce mention of the _inbound_ metrics that have been added, but does rename this section to "_Endpoint Metrics_" in order to be compatible with the future addition of `inbound_http_route_*`, `inbound_http_route_backend_*`, `inbound_grpc_route*`, and `inbound_grpc_route_backend_*` metrics.

see:

* linkerd/linkerd2-proxy#2377
* linkerd/linkerd2-proxy#2380
* linkerd/linkerd2-proxy#3086
* linkerd/linkerd2-proxy#3308
* linkerd/linkerd2-proxy#3334

Signed-off-by: katelyn martin <[email protected]>
kflynn pushed a commit to linkerd/website that referenced this pull request on Nov 20, 2025
* feat(proxy-metrics): document outbound policy routing metrics

the documentation of our proxy metrics has not kept pace with all of the exciting telemetry that has been introduced since #1599 documented the state of our authorization policy metrics.

this commit reworks the documentation to exhaustively document the families of metrics exported by the proxy.

this commit does not introduce mention of the _inbound_ metrics that have been added, but does rename this section to "_Endpoint Metrics_" in order to be compatible with the future addition of `inbound_http_route_*`, `inbound_http_route_backend_*`, `inbound_grpc_route*`, and `inbound_grpc_route_backend_*` metrics.

see:

* linkerd/linkerd2-proxy#2377
* linkerd/linkerd2-proxy#2380
* linkerd/linkerd2-proxy#3086
* linkerd/linkerd2-proxy#3308
* linkerd/linkerd2-proxy#3334

Signed-off-by: katelyn martin <[email protected]>

* chore(markdownlint): allow duplicate "labels" headers

Signed-off-by: katelyn martin <[email protected]>

---------

Signed-off-by: katelyn martin <[email protected]>