Conversation

@olix0r (Member) commented Jul 23, 2024

The outbound policy router includes a requests counter that measures the number of requests dispatched to each route-backend, but this does not provide visibility into success rate or response time. Before introducing timeouts and retries on outbound routes, this change introduces visibility into per-route response metrics.

The route_request_statuses counters measure responses from the application's point of view. Once retries are introduced, this will provide visibility into the _effective_ success rate of each route.

    outbound_http_route_request_statuses_total{parent...,route...,http_status="200",error="TIMEOUT"} 0
    outbound_grpc_route_request_statuses_total{parent...,route...,grpc_status="NOT_FOUND",error="TIMEOUT"} 0
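
As a rough illustration (not an official recipe), the _effective_ success rate of a route could be read from these counters as the share of responses that completed without an error:

    Σ outbound_http_route_request_statuses_total{error=""} / Σ outbound_http_route_request_statuses_total

A stricter definition would also exclude failure statuses (e.g. 5xx) from the numerator.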

A coarse histogram is introduced at this scope to track the total duration of requests dispatched to each route, covering all retries and all response stream processing:

    outbound_http_route_request_duration_seconds_sum{parent...,route...} 0
    outbound_http_route_request_duration_seconds_count{parent...,route...} 0
    outbound_http_route_request_duration_seconds_bucket{le="0.05",parent...,route...} 0
    outbound_http_route_request_duration_seconds_bucket{le="0.5",parent...,route...} 0
    outbound_http_route_request_duration_seconds_bucket{le="1.0",parent...,route...} 0
    outbound_http_route_request_duration_seconds_bucket{le="10.0",parent...,route...} 0
    outbound_http_route_request_duration_seconds_bucket{le="+Inf",parent...,route...} 0

The route_backend_response_statuses counters measure the responses from individual backends. This reflects the _actual_ success rate of each route as served by the backend services.

    outbound_http_route_backend_response_statuses_total{parent...,route...,backend...,http_status="...",error="..."} 0
    outbound_grpc_route_backend_response_statuses_total{parent...,route...,backend...,grpc_status="...",error="..."} 0

A slightly more detailed histogram is introduced at this scope to track the time spent processing responses from each backend (i.e. after the request has been fully dispatched):

    outbound_http_route_backend_response_duration_seconds_sum{parent...,route...,backend...} 0
    outbound_http_route_backend_response_duration_seconds_count{parent...,route...,backend...} 0
    outbound_http_route_backend_response_duration_seconds_bucket{le="0.025",parent...,route...,backend...} 0
    outbound_http_route_backend_response_duration_seconds_bucket{le="0.05",parent...,route...,backend...} 0
    outbound_http_route_backend_response_duration_seconds_bucket{le="0.1",parent...,route...,backend...} 0
    outbound_http_route_backend_response_duration_seconds_bucket{le="0.25",parent...,route...,backend...} 0
    outbound_http_route_backend_response_duration_seconds_bucket{le="0.5",parent...,route...,backend...} 0
    outbound_http_route_backend_response_duration_seconds_bucket{le="1.0",parent...,route...,backend...} 0
    outbound_http_route_backend_response_duration_seconds_bucket{le="10.0",parent...,route...,backend...} 0
    outbound_http_route_backend_response_duration_seconds_bucket{le="+Inf",parent...,route...,backend...} 0

Note that duration histograms omit status code labels, as they would needlessly inflate metrics cardinality. The bucket sets introduced here are also deliberately coarse, as we must choose broadly applicable buckets and want to avoid a cardinality explosion when many routes are in use.
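
To make the tradeoff concrete with purely illustrative numbers: keeping a status label across 100 routes × 10 observed status codes × 8 buckets would yield 8,000 bucket series for a single histogram family, whereas omitting the label leaves 100 × 8 = 800.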

@olix0r requested a review from a team as a code owner July 23, 2024 17:49
@olix0r merged commit 7c99d15 into main Jul 23, 2024
@olix0r deleted the ver/http-prom branch July 23, 2024 18:16
cratelyn added a commit that referenced this pull request Apr 9, 2025
this commit fixes a bug discovered by @alpeb, which was introduced in
proxy v2.288.0.

> The associated metric is `outbound_http_route_request_statuses_total`:
>
> ```
> $ linkerd dg proxy-metrics -n booksapp deploy/webapp|rg outbound_http_route_request_statuses_total.*authors
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="204",error=""} 5
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="201",error="UNKNOWN"} 5
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="200",error="UNKNOWN"} 10
> ```
>
> The problem was introduced in `edge-25.3.4`, with the proxy `v2.288.0`.
> Before that the metrics looked like:
>
> ```
> $ linkerd dg proxy-metrics -n booksapp deploy/webapp|rg outbound_http_route_request_statuses_total.*authors
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="200",error=""} 193
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="204",error=""} 96
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="201",error=""} 96
> ```
>
> So the difference is the non-empty value for `error=UNKNOWN` even
> when `http_status` is 2xx, which `linkerd viz stat-outbound`
> interprets as failed requests.

in #3086 we introduced a suite of route- and backend-level metrics. that
subsystem contains a body middleware that will report itself as having
reached the end-of-stream by delegating directly down to its inner
body's `is_end_stream()` hint.

this is roughly correct, but is slightly distinct from the actual
invariant: a `linkerd_http_prom::record_response::ResponseBody<B>` must
call its `end_stream` helper to classify the outcome and increment the
corresponding time series in the
`outbound_http_route_request_statuses_total` metric family.

in #3504 we upgraded our hyper dependency. while doing so, we neglected
to include a call to `end_stream` if a data frame is yielded and the
inner body reports itself as having reached the end-of-stream.

this meant that instrumented bodies could be polled to the end of the
stream, but dropped before a final `None` was ever encountered, so the
outcome was never classified.

this commit fixes this issue in two ways, to be defensive:

* invoke `end_stream()` if a non-trailers frame is yielded, and the
  inner body now reports itself as having ended. this restores the
  behavior in place prior to #3504. see the relevant component of that
  diff, here:
  <https://github.com/linkerd/linkerd2-proxy/pull/3504/files#diff-45d0bc344f76c111551a8eaf5d3f0e0c22ee6e6836a626e46402a6ae3cbc0035L262-R274>

* rather than delegating to the inner `<B as Body>::is_end_stream()`
  method, report the end-of-stream being reached by inspecting whether
  or not the inner response state has been taken. this is the state that
  directly indicates whether or not the `ResponseBody<B>` middleware is
  finished.
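
a minimal sketch of the resulting shape (simplified from the real
`record_response` middleware: pin projection and error/trailers
classification are elided, and `ResponseState` is a stand-in):

```rust
use std::{
    pin::Pin,
    task::{ready, Context, Poll},
};

use http_body::{Body, Frame};

// stand-in for the middleware's classification state; the real type
// lives in `linkerd_http_prom::record_response`.
struct ResponseState;

impl ResponseState {
    fn end_stream(self) {
        // classify the outcome and increment the matching
        // `..._statuses_total` time series (elided).
    }
}

struct RecordBody<B> {
    inner: B,
    // `Some` until the outcome has been classified, `None` afterwards.
    state: Option<ResponseState>,
}

impl<B: Body + Unpin> Body for RecordBody<B> {
    type Data = B::Data;
    type Error = B::Error;

    fn poll_frame(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
    ) -> Poll<Option<Result<Frame<Self::Data>, Self::Error>>> {
        let this = self.as_mut().get_mut();
        let frame = ready!(Pin::new(&mut this.inner).poll_frame(cx));
        match &frame {
            // a `None` marks the end of the stream: classify exactly once.
            None => {
                if let Some(state) = this.state.take() {
                    state.end_stream();
                }
            }
            // fix 1: a non-trailers frame was yielded and the inner body
            // now reports end-of-stream, so classify immediately rather
            // than waiting on a final poll the caller may never make.
            Some(Ok(f)) if f.trailers_ref().is_none() && this.inner.is_end_stream() => {
                if let Some(state) = this.state.take() {
                    state.end_stream();
                }
            }
            _ => {}
        }
        Poll::Ready(frame)
    }

    // fix 2: report end-of-stream from our own state rather than the
    // inner body's hint; `state` records whether this middleware has
    // finished its classification.
    fn is_end_stream(&self) -> bool {
        self.state.is_none()
    }
}
```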

X-ref: #3504
X-ref: #3086
X-ref: linkerd/linkerd2#8733
Signed-off-by: katelyn martin <[email protected]>
cratelyn added a commit that referenced this pull request Apr 9, 2025
* chore(app/outbound): `linkerd-mock-http-body` test dependency

this adds a development dependency, so we can use this mock body type in
the outbound proxy's unit tests.

Signed-off-by: katelyn martin <[email protected]>

* chore(app/outbound): additional http route metrics tests

Signed-off-by: katelyn martin <[email protected]>

* chore(app/outbound): additional grpc route metrics tests

Signed-off-by: katelyn martin <[email protected]>

* fix(http/prom): record bodies when eos reached

Signed-off-by: katelyn martin <[email protected]>

---------

Signed-off-by: katelyn martin <[email protected]>
cratelyn added a commit to linkerd/website that referenced this pull request Nov 14, 2025
the documentation of our proxy metrics has not kept pace with all of the
exciting telemetry that has been introduced since
#1599 documented the state of our
authorization policy metrics.

this commit reworks the documentation to exhaustively document the
families of metrics exported by the proxy. this commit does not
introduce mention of the _inbound_ metrics that have been added, but
does rename this section to "_Endpoint Metrics_" in order to be
compatible with the future addition of `inbound_http_route_*`,
`inbound_http_route_backend_*`, `inbound_grpc_route_*`, and
`inbound_grpc_route_backend_*` metrics.

see:
* linkerd/linkerd2-proxy#2377
* linkerd/linkerd2-proxy#2380
* linkerd/linkerd2-proxy#3086
* linkerd/linkerd2-proxy#3308
* linkerd/linkerd2-proxy#3334

Signed-off-by: katelyn martin <[email protected]>
kflynn pushed a commit to linkerd/website that referenced this pull request Nov 20, 2025
* feat(proxy-metrics): document outbound policy routing metrics

Signed-off-by: katelyn martin <[email protected]>

* chore(markdownlint): allow duplicate "labels" headers

Signed-off-by: katelyn martin <[email protected]>

---------

Signed-off-by: katelyn martin <[email protected]>