RFC: Health Report API

As: a person responsible for ensuring Logstash pipelines are running without issue
I want: a health report similar to the one provided by Elasticsearch
So that: I can easily identify issues with the health of a Logstash process and its pipelines

## Phase 1: API & Initial Indicators

This is a plan to deliver a `GET /_health_report` endpoint that is _heavily_ inspired by the shape of Elasticsearch's endpoint of the same name, adhering to its prior art wherever possible.

In Elasticsearch, the health report endpoint presents a flat collection of `indicators` and a `status` that reflects the status of the least-healthy indicator in the collection. Each indicator has its own `status` and an optional `symptom`. It may also provide the `impacts` of a less-than-healthy status, one or more `diagnosis` entries (each with a cause, an action to remediate, and a link to help), and/or `details` relevant to that particular indicator.

Because many aspects of Logstash processing and the actionable insights required by operational administrators are _pipeline_-centric, we will introduce a top-level indicator called `pipelines` that will have a mapping of sub-`indicators`, one for each pipeline. Like the top-level `#/status`, the value of `#/indicators/pipelines/status` will bubble up the least-healthy status of its component indicators.

The Logstash agent will maintain a pipeline-indicator for each pipeline that the agent knows about, and each pipeline-indicator will have one or more probes that are capable of marking the indicator as unhealthy with a `diagnosis` and optional `impacts`.
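
The roll-up rule above can be sketched in a few lines. This is an illustration only, not Logstash code: the function names are hypothetical, and the severity ordering (`green` < `unknown` < `yellow` < `red`) is an assumption taken from the order in which the proposed schema lists the status values.

```python
# Assumed ordering, from least to most severe; the RFC does not state
# where `unknown` ranks, so this follows the schema's listing order.
SEVERITY = {"green": 0, "unknown": 1, "yellow": 2, "red": 3}


def least_healthy(statuses):
    """Return the least-healthy status among the given statuses."""
    return max(statuses, key=SEVERITY.__getitem__)


def rolled_up_status(indicator):
    """A leaf reports its own status; a composite bubbles up its children's worst."""
    children = indicator.get("indicators")
    if not children:
        return indicator["status"]
    return least_healthy(rolled_up_status(child) for child in children.values())


report = {
    "indicators": {
        "resources": {"status": "green"},
        "pipelines": {
            "indicators": {
                "pipeline-one": {"status": "green"},
                "pipeline-two": {"status": "yellow"},
            }
        },
    }
}
print(rolled_up_status(report))  # yellow
```

Here one degraded pipeline is enough to turn both `#/indicators/pipelines/status` and the top-level `#/status` yellow.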

### Proposed Schema:

<details>
<summary>
Click to expand Schema
</summary>

> ~~~ yaml
> ---
> $schema: https://json-schema.org/draft/2020-12/schema
> 
> $id: health_report
> type: object
> title: Logstash Health Report
> description: >-
>   Information about Logstash, its health status, and the indicators that
>   were used to determine the current status.
> properties:
>   host:
>     type: string
>   version:
>     type: string
>   snapshot:
>     type: boolean
>   name:
>     type: string
>   ephemeral_id:
>     type: string
> allOf:
>   - $ref: 'schema:has_status'
>     required: ['status']
>   - $ref: 'schema:has_indicators'
>     required: ['indicators']
> unevaluatedProperties: false
> 
> $defs:
>   indicator:
>     $id: schema:indicator
>     description: >-
>       An indicator has a `status` and a `symptom`, and may provide `details`,
>       `impacts`, and/or `diagnosis`, or may be composed of internal `indicators`
>     allOf:
>       - $ref: 'schema:has_status'
>         required: ['status']
>       - $ref: 'schema:has_symptom'
>       - $ref: 'schema:has_details'
>       - $ref: 'schema:has_impacts'
>       - $ref: 'schema:has_diagnosis'
>       - $ref: 'schema:has_indicators'
>         # NOTE: indicator/has_indicators is the only addition to the
> 
>     unevaluatedProperties: false
>   has_status:
>     $id: schema:has_status
>     type: object
>     properties:
>       status:
>         type: string
>         description: >-
>           Health status
>         oneOf:
>           - const: green
>             description: everything is okay
>           - const: unknown
>             description: status could not be determined
>           - const: yellow
>             description: functionality is in a degraded state and may need remediation
>           - const: red
>             description: functionality is in a critical state and remediation is urgent
>   has_symptom:
>     $id: schema:has_symptom
>     type: object
>     properties:
>       symptom:
>         type: string
>         description: >-
>           A brief message providing information about the current health status
>         maxLength: 1024
>   has_indicators:
>     $id: schema:has_indicators
>     type: object
>     properties:
>       indicators:
>         type: object
>         description: >-
>           A key/value map of one or more named component indicators that were used
>           to determine this indicator's health status
>         minProperties: 1
>         additionalProperties:
>           $ref: 'schema:indicator'
>   has_diagnosis:
>     $id: schema:has_diagnosis
>     type: object
>     properties:
>       diagnosis:
>         type: array
>         items:
>           $ref: 'schema:diagnosis'
>   has_impacts:
>     $id: 'schema:has_impacts'
>     type: object
>     properties:
>       impacts:
>         type: array
>         items:
>           $ref: 'schema:impact'
>   has_details:
>     $id: 'schema:has_details'
>     type: object
>     properties:
>       details:
>         type: object
>   diagnosis:
>     $id: 'schema:diagnosis'
>     type: object
>     properties:
>       cause:
>         type: string
>         description: >-
>           A brief description of a root cause of this health problem
>       action:
>         type: string
>         description: >-
>           A brief description of the steps that should be taken to remediate the
>           problem. A more detailed step-by-step guide to remediate the problem
>           is provided by the `help_url` field.
>       affected_resources:
>         # Supported shape and values TBD; this matches the corresponding entry
>         # in Elasticsearch, but its shape is ambiguous as docs claim it to be
>         # a list of strings while it has been observed to be a map of resource
>         # type to list of strings.
>         # see: https://github.com/elastic/elasticsearch/issues/106925
>         description: >-
>           If the root cause pertains to multiple resources in Logstash, this
>           will hold all resources that this diagnosis is applicable for.
>       help_url:
>         type: string
>         description: >-
>           A link to the troubleshooting guide that’ll fix the health problem.
>     required:
>       - cause
>       - action
>       - help_url
>     unevaluatedProperties: false
>   impact:
>     $id: 'schema:impact'
>     type: object
>     properties:
>       severity:
>         type: integer
>         description: >-
>           How important this impact is to functionality. A value of 1
>           is the highest severity, with larger values indicating lower severity.
>       description:
>         type: string
>         description: >-
>           A description of the impact on the subject of the indicator.
>       impact_areas:
>         type: array
>         description: >-
>           The areas of functionality that this impact affects.
>         items:
>           # Supported enum values TBD; this matches the corresponding shape in
>           # Elasticsearch, but we have yet to determine a semantic match.
>           type: string
>           oneOf:
>             - const: unknown
>               description: the area of impact is unknown.
>     required:
>       - severity
>       - description
>     unevaluatedProperties: false
> ~~~

</details>

<details>
<summary>
Click to expand Example
</summary>

> ~~~ json
> {
>   "status": "yellow",
>   "host": "logstash-742.internal",
>   "version": "8.14.1",
>   "snapshot": false,
>   "ephemeral_id": "0f4ac35f-5d5a-4067-9533-7893197cf5f9",
>   "indicators": {
>     "resources": {
>       "status": "yellow",
>       "diagnosis": [
>         {
>           "cause": "JVM garbage collection is spending significant time",
>           "action": "Tune memory to reflect your workload",
>           "help_url": "https://ela.st/logstash-memory-pressure"
>         }
>       ]
>     },
>     "pipelines": {
>       "status": "yellow",
>       "indicators": {
>         "pipeline-one": {
>           "status": "green",
>           "symptom": "everything is okay"
>         },
>         "pipeline-two": {
>           "status": "yellow",
>           "symptom": "multiple probes report degraded processing",
>           "diagnosis": [
>             {
>               "cause": "persistent queue shows net growth over the last 5 minutes",
>               "action": "look downstream for backpressure",
>               "help_url": "https://ela.st/logstash-pq-growth"
>             },{
>               "cause": "workers fully utilized",
>               "action": "increase worker capacity",
>               "help_url": "https://ela.st/logstash-worker-allocation"
>             }
>           ],
>           "impacts": [
>              {
>                "severity": 10,
>                "description": "Growth of the Persisted Queue means increased lag"
>              },{
>                "severity": 10,
>                "description": "When workers are fully utilized their throughput is limited"
>              }
>           ]
>         }
>       }
>     }
>   }
> }
> ~~~

</details>
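
As a rough illustration of how a client might consume a response of this shape, the sketch below walks the nested `indicators` and lists every non-green one with its diagnoses. The function name and the trimmed sample payload are illustrative assumptions; only the field names come from the proposed schema.

```python
import json


def unhealthy_indicators(node, path="#"):
    """Yield (pointer, status, diagnosis list) for every non-green indicator."""
    for name, indicator in node.get("indicators", {}).items():
        pointer = f"{path}/indicators/{name}"
        if indicator["status"] != "green":
            yield pointer, indicator["status"], indicator.get("diagnosis", [])
        # Composite indicators (such as `pipelines`) nest further indicators.
        yield from unhealthy_indicators(indicator, pointer)


# Trimmed copy of the example report, for illustration only.
report = json.loads("""
{
  "status": "yellow",
  "indicators": {
    "resources": {
      "status": "yellow",
      "diagnosis": [{
        "cause": "JVM garbage collection is spending significant time",
        "action": "Tune memory to reflect your workload",
        "help_url": "https://ela.st/logstash-memory-pressure"
      }]
    },
    "pipelines": {
      "status": "yellow",
      "indicators": {
        "pipeline-one": {"status": "green", "symptom": "everything is okay"},
        "pipeline-two": {
          "status": "yellow",
          "diagnosis": [{
            "cause": "workers fully utilized",
            "action": "increase worker capacity",
            "help_url": "https://ela.st/logstash-worker-allocation"
          }]
        }
      }
    }
  }
}
""")

for pointer, status, diagnoses in unhealthy_indicators(report):
    print(pointer, status)
    for diagnosis in diagnoses:
        print("  cause:", diagnosis["cause"], "| action:", diagnosis["action"])
```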

Internally:
 - an indicator either _has at least one_ probe XOR _has at least one inner indicator_, and is only as healthy as its least-healthy component.
 - probes themselves aren't exposed via the API, rather they are the internal component that can add a diagnosis and impacts to the indicator and degrade its status.
 - the top-level `status` of _any_ API response that includes it reflects the same value as the one returned by `GET /_health_report`; this includes `GET /_node` and `GET /_node/stats`.
 - the health report itself will initially be computed on demand when the API request is made, but may later be cached or run on a schedule.
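
A minimal sketch of the internal model described in the list above, under stated assumptions: the class names, the callable-probe convention, and the severity ordering are all hypothetical, chosen only to show the probes-XOR-indicators rule and how probes contribute `diagnosis` and `impacts` without appearing in the output themselves.

```python
from dataclasses import dataclass, field
from typing import Optional

SEVERITY = {"green": 0, "unknown": 1, "yellow": 2, "red": 3}  # assumed ordering


@dataclass
class ProbeResult:
    status: str = "green"
    diagnosis: Optional[dict] = None
    impacts: list = field(default_factory=list)


@dataclass
class Indicator:
    probes: list = field(default_factory=list)      # callables returning ProbeResult
    indicators: dict = field(default_factory=dict)  # name -> Indicator

    def __post_init__(self):
        # Exactly one kind of component must be present (XOR).
        if bool(self.probes) == bool(self.indicators):
            raise ValueError("indicator needs probes XOR inner indicators")

    def report(self):
        if self.indicators:
            inner = {name: child.report() for name, child in self.indicators.items()}
            worst = max((r["status"] for r in inner.values()), key=SEVERITY.get)
            return {"status": worst, "indicators": inner}
        results = [probe() for probe in self.probes]
        out = {"status": max((r.status for r in results), key=SEVERITY.get)}
        diagnosis = [r.diagnosis for r in results if r.diagnosis]
        impacts = [impact for r in results for impact in r.impacts]
        if diagnosis:
            out["diagnosis"] = diagnosis
        if impacts:
            out["impacts"] = impacts
        return out  # probes themselves never appear in the output


def workers_probe():
    return ProbeResult("yellow", {
        "cause": "workers fully utilized",
        "action": "increase worker capacity",
        "help_url": "https://ela.st/logstash-worker-allocation",
    })


pipelines = Indicator(indicators={
    "pipeline-one": Indicator(probes=[lambda: ProbeResult()]),
    "pipeline-two": Indicator(probes=[workers_probe]),
})
print(pipelines.report()["status"])  # yellow
```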

***

In the first stage we will introduce the `GET /_health_report` endpoint itself with the following indicators and probes:

 - `#/indicators/resources`
   - probe: `resources:memory_pressure`:
     - degraded when GC pause > 4% of wall-clock
     - critical when GC pause > 8% of wall-clock
     - first-pass: lifetime cumulative metric
     - stretch goal: flow metric using `last_1_minute` window
 - `#/indicators/pipelines/indicators/<pipeline-id>`:
   - probe: `pipelines:up`:
     - critical when stopped
     - degraded when starting or stopping
     - okay when running
     - stretch goal: track transition states, including restarts, so that we can keep green through a quick restart.
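
The two first-stage probes can be sketched as plain threshold functions. This is a hedged illustration: the function names and inputs are hypothetical, "degraded" and "critical" are mapped to `yellow` and `red` per the schema's status descriptions, and returning `unknown` for missing or unrecognized input is an assumption rather than something the RFC specifies.

```python
def memory_pressure_status(gc_pause_millis, uptime_millis):
    """Classify cumulative GC pause time as a share of wall-clock time."""
    if uptime_millis <= 0:
        return "unknown"  # assumption: no wall-clock time to compare against yet
    share = gc_pause_millis / uptime_millis
    if share > 0.08:
        return "red"      # critical: GC pause > 8% of wall-clock
    if share > 0.04:
        return "yellow"   # degraded: GC pause > 4% of wall-clock
    return "green"


# State names are taken from the probe description above.
PIPELINE_STATE_STATUS = {
    "running": "green",
    "starting": "yellow",
    "stopping": "yellow",
    "stopped": "red",
}


def pipeline_up_status(state):
    """Map a pipeline lifecycle state to a health status."""
    return PIPELINE_STATE_STATUS.get(state, "unknown")  # assumption for other states


print(memory_pressure_status(gc_pause_millis=50, uptime_millis=1000))  # yellow
print(pipeline_up_status("stopped"))  # red
```

With the first-pass lifetime metric, a long-running process dilutes recent GC pressure across its whole uptime, which is why the stretch goal moves to a `last_1_minute` flow window.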

## Phase 2: Additional pipeline probes

In subsequent stages we will introduce additional probes to the pipeline indicators to allow them to diagnose the pipeline's behavior from its _flow_ state. Each probe will feed off of the flow metrics for the pipeline, and will present pipeline-specific settings in the `pipeline.health.probe.<probe-name>` namespace for configuration.

For example, a probe `queue_persisted_growth_events` that inspects the `queue_persisted_growth_events` flow metric would have default settings like:

~~~ yaml
pipeline.health.probe.queue_persisted_growth_events:
  enabled: auto # true if queue.type == persisted
  degraded: last_5_minutes > 0
  critical: last_15_minutes > 0
~~~

Or a `worker_utilization` probe that inspects the `worker_utilization` flow metric to report issues if the workers are fully-utilized:

~~~ yaml
pipeline.health.probe.worker_utilization:
  enabled: true
  degraded: last_1_minute > 99
  critical: last_15_minutes > 99
~~~

> Implementation note: the `dentaku` calculator library for Ruby would allow us to parse a math expression with variables into an AST, to query that AST for the named variables it needs, and to efficiently evaluate that AST with a set of provided variables. This allows us to (1) support an explicit subset of possible AST node types (notably `Arithmetic`, `Comparator`, `Grouping`, `Numeric`, and `Identifier`, possibly `Function(type=logical)`), (2) reject configurations that reference variables we know will never exist, and (3) avoid evaluating a `trigger` or `recover` expression when the flow metric window it needs is not _yet_ available. It is a prime candidate for accelerating development, but care should be taken to _avoid_ using its automatic AST caching (which has no cache invalidation), and to limit expressions to the minimum-necessary allowlist of node types to ensure portability (likely with a validation visitor).
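
To make the three properties in the note concrete, here is a sketch of the same approach using Python's standard `ast` module as a stand-in. It does not use or imitate `dentaku`'s API; the function names, the allowlist contents, and the set of known window names (limited to those used in the examples above) are assumptions for illustration.

```python
import ast
import operator

# (1) Explicit allowlist of node types: comparisons, arithmetic, names, numbers.
OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Lt: operator.lt, ast.LtE: operator.le,
}
ALLOWED_NODES = (ast.Expression, ast.Compare, ast.BinOp, ast.Name,
                 ast.Load, ast.Constant) + tuple(OPS)
KNOWN_WINDOWS = {"last_1_minute", "last_5_minutes", "last_15_minutes"}


def compile_expression(source):
    """Parse and validate an expression; return (tree, required window names)."""
    tree = ast.parse(source, mode="eval")
    names = set()
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"unsupported syntax: {type(node).__name__}")
        if isinstance(node, ast.Constant):
            if isinstance(node.value, bool) or not isinstance(node.value, (int, float)):
                raise ValueError("only numeric literals are allowed")
        if isinstance(node, ast.Name):
            # (2) Reject variables that can never exist.
            if node.id not in KNOWN_WINDOWS:
                raise ValueError(f"unknown variable: {node.id}")
            names.add(node.id)
    return tree, names


def _eval(node, windows):
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        return windows[node.id]
    if isinstance(node, ast.BinOp):
        return OPS[type(node.op)](_eval(node.left, windows), _eval(node.right, windows))
    # Remaining allowed expression type: ast.Compare (possibly chained).
    left = _eval(node.left, windows)
    for op, comparator in zip(node.ops, node.comparators):
        right = _eval(comparator, windows)
        if not OPS[type(op)](left, right):
            return False
        left = right
    return True


def evaluate(compiled, windows):
    """Return True/False, or None when a required window is not yet available."""
    tree, names = compiled
    if not names.issubset(windows):
        return None  # (3) skip evaluation until the window exists
    return _eval(tree.body, windows)


degraded = compile_expression("last_1_minute > 99")
print(evaluate(degraded, {"last_1_minute": 99.7}))  # True
print(evaluate(degraded, {}))                       # None
```

Compiling once at configuration time and keeping the resulting tree per probe sidesteps any library-level expression cache, which is the concern the note raises about cache invalidation.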

### Split options:

If making these probes configurable adds substantial delay, then we can ship them hard-coded with only the `enabled` option, and split the configurability off into a separate effort.

## Phase 3: Observing recovery in critical probes

With flow metrics, it is possible to differentiate active-critical situations from ones in active recovery. For example, a PQ having net-growth over the last 15 minutes may be a critical situation, but if we observe that we also have net-shrink over the last 5 minutes the situation isn't as dire, so it (a) shouldn't push the indicator into the red and (b) is capable of producing different diagnostic output.

At a future point we can add the concept of `recovery` to the flow-based probe prototype. When a probe tests positive for `critical`, we could also test its `recovery` to present an appropriate result.

~~~ yaml
pipeline.health.probe.queue_persisted_growth_events:
  enabled: auto # true if queue.type == persisted
  degraded: last_5_minutes > 0
  critical: last_15_minutes > 0
  recovery: last_1_minute <= 0
~~~
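
One way the three expressions could combine is sketched below. The names are hypothetical, the lambdas simply mirror the settings above, and reporting a recovering-critical probe as `yellow` is an assumption: the RFC only says such a probe should not push the indicator into the red.

```python
# Expressions mirroring the example settings above, as plain callables.
PROBE = {
    "degraded": lambda w: w["last_5_minutes"] > 0,
    "critical": lambda w: w["last_15_minutes"] > 0,
    "recovery": lambda w: w["last_1_minute"] <= 0,
}


def probe_status(windows, probe=PROBE):
    """Combine degraded/critical/recovery results into a single status."""
    if probe["critical"](windows):
        # Critical but actively recovering: assumed to report as degraded.
        return "yellow" if probe["recovery"](windows) else "red"
    if probe["degraded"](windows):
        return "yellow"
    return "green"


# Net growth over 15 minutes, but the queue has been draining recently.
recovering = {"last_1_minute": -20, "last_5_minutes": -5, "last_15_minutes": 120}
# Net growth across every window.
growing = {"last_1_minute": 8, "last_5_minutes": 30, "last_15_minutes": 120}

print(probe_status(recovering))  # yellow
print(probe_status(growing))     # red
```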
