RFC: Health Report API #16056

@yaauie

As: a person responsible for ensuring Logstash pipelines are running without issue
I want: a health report similar to the one provided by Elasticsearch
So that: I can easily identify issues with the health of a Logstash process and its pipelines

Phase 1: API & Initial Indicators

This is a plan to deliver a GET /_health_report endpoint that is heavily inspired by the shape of Elasticsearch's endpoint of the same name, adhering to its prior art wherever possible.

In Elasticsearch, the health report endpoint presents a flat collection of indicators and a top-level status that reflects the least-healthy indicator in the collection. Each indicator has its own status and an optional symptom, and may also provide impacts describing the consequences of a less-than-healthy status, one or more diagnosis entries (each with a cause, an action to remediate, and a link to help), and details relevant to that particular indicator.

Because many aspects of Logstash processing and the actionable insights required by operational administrators are pipeline-centric, we will introduce a top-level indicator called pipelines that will have a mapping of sub-indicators, one for each pipeline. Like the top-level #/status, the value of #/indicators/pipelines/status will bubble up the least-healthy status of its component indicators.

The Logstash agent will maintain a pipeline-indicator for each pipeline that the agent knows about, and each pipeline-indicator will have one or more probes that are capable of marking the indicator as unhealthy with a diagnosis and optional impacts.
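
To make the "bubble up" rule concrete, here is a minimal sketch in Python (not Logstash code). The severity ordering is an assumption: it follows the order in which the proposed schema lists the status values, but nothing in this RFC fixes where `unknown` ranks. The names `SEVERITY`, `least_healthy`, and `rolled_up_status` are hypothetical.

```python
# Assumed ordering, least severe first. It mirrors the order in which the
# proposed schema lists the status values; the RFC itself does not fix it.
SEVERITY = {"green": 0, "unknown": 1, "yellow": 2, "red": 3}


def least_healthy(statuses):
    """Return the least-healthy status from a non-empty iterable of statuses."""
    return max(statuses, key=SEVERITY.__getitem__)


def rolled_up_status(indicator):
    """Status of an indicator: its own, or the worst of its inner indicators."""
    inner = indicator.get("indicators")
    if inner:
        return least_healthy(rolled_up_status(child) for child in inner.values())
    return indicator["status"]


pipelines = {
    "indicators": {
        "pipeline-one": {"status": "green"},
        "pipeline-two": {"status": "yellow"},
    }
}
report = {"indicators": {"resources": {"status": "green"}, "pipelines": pipelines}}

print(rolled_up_status(pipelines))  # value for #/indicators/pipelines/status
print(rolled_up_status(report))     # value for #/status
```

Both calls yield `yellow` here, because `pipeline-two` is the least-healthy component at every level above it.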

Proposed Schema:

---
$schema: https://json-schema.org/draft/2020-12/schema

$id: health_report
type: object
title: Logstash Health Report
description: >-
  Information about Logstash, its health status, and the indicators that
  were used to determine the current status.
properties:
  host:
    type: string
  version:
    type: string
  snapshot:
    type: boolean
  name:
    type: string
  ephemeral_id:
    type: string
allOf:
  - $ref: 'schema:has_status'
    required: ['status']
  - $ref: 'schema:has_indicators'
    required: ['indicators']
unevaluatedProperties: false

$defs:
  indicator:
    $id: schema:indicator
    description: >-
      An indicator has a `status` and a `symptom`, and may provide `details`,
      `impacts`, and/or `diagnosis`, or may be composed of internal `indicators`
    allOf:
      - $ref: 'schema:has_status'
        required: ['status']
      - $ref: 'schema:has_symptom'
      - $ref: 'schema:has_details'
      - $ref: 'schema:has_impacts'
      - $ref: 'schema:has_diagnosis'
      - $ref: 'schema:has_indicators'
        # NOTE: indicator/has_indicators is the only addition to the

    unevaluatedProperties: false
  has_status:
    $id: schema:has_status
    type: object
    properties:
      status:
        type: string
        description: >-
          Health status
        oneOf:
          - const: green
            description: everything is okay
          - const: unknown
            description: status could not be determined
          - const: yellow
            description: functionality is in a degraded state and may need remediation
          - const: red
            description: functionality is in a critical state and remediation is urgent
  has_symptom:
    $id: schema:has_symptom
    type: object
    properties:
      symptom:
        type: string
        description: >-
          A brief message providing information about the current health status
        maxLength: 1024
  has_indicators:
    $id: schema:has_indicators
    type: object
    properties:
      indicators:
        type: object
        description: >-
          A key/value map of one or more named component indicators that were used
          to determine this indicator's health status
        minProperties: 1
        additionalProperties:
          $ref: 'schema:indicator'
  has_diagnosis:
    $id: schema:has_diagnosis
    type: object
    properties:
      diagnosis:
        type: array
        items:
          $ref: 'schema:diagnosis'
  has_impacts:
    $id: 'schema:has_impacts'
    type: object
    properties:
      impacts:
        type: array
        items:
          $ref: 'schema:impact'
  has_details:
    $id: 'schema:has_details'
    type: object
    properties:
      details:
        type: object
  diagnosis:
    $id: 'schema:diagnosis'
    type: object
    properties:
      cause:
        type: string
        description: >-
          A brief description of a root cause of this health problem
      action:
        type: string
        description: >-
          A brief description of the steps that should be taken to remediate the
          problem. A more detailed step-by-step guide to remediate the problem
          is provided by the `help_url` field.
      affected_resources:
        # Supported shape and values TBD; this matches the corresponding entry
        # in Elasticsearch, but its shape is ambiguous as docs claim it to be
        # a list of strings while it has been observed to be a map of resource
        # type to list of strings.
        # see: https://github.com/elastic/elasticsearch/issues/106925
        description: >-
          If the root cause pertains to multiple resources in Logstash, this
          will hold all resources that this diagnosis is applicable for.
      help_url:
        type: string
        description: >-
          A link to the troubleshooting guide that’ll fix the health problem.
    required:
      - cause
      - action
      - help_url
    unevaluatedProperties: false
  impact:
    $id: 'schema:impact'
    type: object
    properties:
      severity:
        type: integer
        description: >-
          How important this impact is to functionality. A value of 1
          is the highest severity, with larger values indicating lower severity.
      description:
        type: string
        description: >-
          A description of the impact on the subject of the indicator.
      impact_areas:
        type: array
        description: >-
          The areas of functionality that this impact affects.
        items:
          # Supported enum values TBD; this matches the corresponding shape in
          # Elasticsearch, but we have yet to determine a semantic match.
          type: string
          oneOf:
            - const: unknown
              description: the area of impact is unknown.
    required:
      - severity
      - description
    unevaluatedProperties: false
Example:
{
  "status": "yellow",
  "host": "logstash-742.internal",
  "version": "8.14.1",
  "snapshot": false,
  "ephemeral_id": "0f4ac35f-5d5a-4067-9533-7893197cf5f9",
  "indicators": {
    "resources": {
      "status": "yellow",
      "diagnosis": [
        {
          "cause": "JVM garbage collection is spending significant time",
          "action": "Tune memory to reflect your workload",
          "help_url": "https://ela.st/logstash-memory-pressure"
        }
      ]
    },
    "pipelines": {
      "status": "yellow",
      "indicators": {
        "pipeline-one": {
          "status": "green",
          "symptom": "everything is okay"
        },
        "pipeline-two": {
          "status": "yellow",
          "symptom": "multiple probes report degraded processing",
          "diagnosis": [
            {
              "cause": "persistent queue shows net growth over the last 5 minutes",
              "action": "look downstream for backpressure",
              "help_url": "https://ela.st/logstash-pq-growth"
            },{
              "cause": "workers fully utilized",
              "action": "increase worker capacity",
              "help_url": "https://ela.st/logstash-worker-allocation"
            }
          ],
          "impacts": [
             {
               "severity": 10,
               "description": "Growth of the Persisted Queue means increased lag"
             },{
               "severity": 10,
               "description": "When workers are fully utilized their throughput is limited"
             }
          ]
        }
      }
    }
  }
}
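
From a consumer's point of view, a response shaped like the example above can be walked recursively to find the indicators that need attention. The following is a rough Python sketch under that assumption; the helper name `unhealthy` is hypothetical, and the embedded report is a trimmed copy of the example (a real client would fetch it from GET /_health_report).

```python
import json


def unhealthy(indicator, path="#"):
    """Yield (path, status, causes) for each leaf indicator that is not green."""
    inner = indicator.get("indicators")
    if inner:
        for name, child in inner.items():
            yield from unhealthy(child, f"{path}/indicators/{name}")
    elif indicator.get("status") != "green":
        causes = [d["cause"] for d in indicator.get("diagnosis", [])]
        yield path, indicator["status"], causes


# Trimmed copy of the example response.
report = json.loads("""
{
  "status": "yellow",
  "indicators": {
    "resources": {
      "status": "yellow",
      "diagnosis": [
        {"cause": "JVM garbage collection is spending significant time",
         "action": "Tune memory to reflect your workload",
         "help_url": "https://ela.st/logstash-memory-pressure"}
      ]
    },
    "pipelines": {
      "status": "yellow",
      "indicators": {
        "pipeline-one": {"status": "green", "symptom": "everything is okay"},
        "pipeline-two": {
          "status": "yellow",
          "diagnosis": [
            {"cause": "workers fully utilized",
             "action": "increase worker capacity",
             "help_url": "https://ela.st/logstash-worker-allocation"}
          ]
        }
      }
    }
  }
}
""")

for path, status, causes in unhealthy(report):
    print(path, status, causes)
```

The paths use the same `#/indicators/...` notation as the rest of this proposal, so each line of output points at the place in the response where the diagnosis lives.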

Internally:

  • an indicator has either at least one probe or at least one inner indicator (never both), and is only as healthy as its least-healthy component.
  • probes themselves aren't exposed via the API; they are the internal components that can add a diagnosis and impacts to the indicator and degrade its status.
  • the top-level status of any API response that includes one, including GET /_node and GET /_node_stats, reflects the same value that GET /_health_report would return.
  • the health report will initially be computed on demand when the API request is made, but may later be run on a schedule or cached.
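
One way to picture the internal model described in the list above is the following Python sketch. It is illustrative only, not the proposed Logstash implementation: `ProbeResult`, `Indicator`, and the probe callables are hypothetical names, and the severity ordering is assumed from the order of values in the schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed ordering, least severe first (taken from the schema's listing order).
SEVERITY = {"green": 0, "unknown": 1, "yellow": 2, "red": 3}


@dataclass
class ProbeResult:
    status: str
    diagnosis: Optional[dict] = None
    impacts: list = field(default_factory=list)


@dataclass
class Indicator:
    probes: list = field(default_factory=list)      # callables returning ProbeResult
    indicators: dict = field(default_factory=dict)  # name -> Indicator

    def __post_init__(self):
        # Exactly one kind of component must be present: probes XOR indicators.
        if bool(self.probes) == bool(self.indicators):
            raise ValueError("need probes XOR inner indicators")

    def report(self):
        if self.indicators:
            inner = {name: ind.report() for name, ind in self.indicators.items()}
            worst = max((r["status"] for r in inner.values()), key=SEVERITY.__getitem__)
            return {"status": worst, "indicators": inner}
        # Probes run on demand and never appear in the output themselves.
        results = [probe() for probe in self.probes]
        out = {"status": max((r.status for r in results), key=SEVERITY.__getitem__)}
        diagnosis = [r.diagnosis for r in results if r.diagnosis]
        impacts = [impact for r in results for impact in r.impacts]
        if diagnosis:
            out["diagnosis"] = diagnosis
        if impacts:
            out["impacts"] = impacts
        return out


def up_probe():
    return ProbeResult("green")


def worker_probe():
    return ProbeResult(
        "yellow",
        diagnosis={"cause": "workers fully utilized",
                   "action": "increase worker capacity",
                   "help_url": "https://ela.st/logstash-worker-allocation"},
        impacts=[{"severity": 10,
                  "description": "When workers are fully utilized their throughput is limited"}],
    )


pipelines = Indicator(indicators={
    "pipeline-one": Indicator(probes=[up_probe]),
    "pipeline-two": Indicator(probes=[up_probe, worker_probe]),
})
print(pipelines.report()["status"])
```

Because `report()` calls the probes each time, this sketch matches the on-demand behaviour described above; caching or scheduling would wrap it rather than change it.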

In the first stage we will introduce the GET /_health_report endpoint itself with the following indicators and probes:

  • #/indicators/resources
    • probe: resources:memory_pressure:
      • degraded when GC pause > 4% of wall-clock
      • critical when GC pause > 8% of wall-clock
      • first-pass: lifetime cumulative metric
      • stretch goal: flow metric using last_1_minute window
  • #/indicators/pipelines/indicators/<pipeline-id>:
    • probe: pipelines:up:
      • critical when stopped
      • degraded when starting or stopping
      • okay when running
      • stretch goal: track transition states, including restarts, so that we can keep green through a quick restart.
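
The thresholds for the two first-stage probes listed above can be sketched as follows (Python, illustrative only). It maps "degraded" to yellow and "critical" to red, as in the schema's status descriptions. The function names, argument names, and the `unknown` fallbacks for zero uptime or an unrecognized state are assumptions, not part of the proposal.

```python
def memory_pressure_status(gc_pause_millis, uptime_millis):
    """First-pass resources:memory_pressure probe using lifetime cumulative values."""
    if uptime_millis <= 0:
        return "unknown"  # assumption: nothing to measure against yet
    pause_percent = 100.0 * gc_pause_millis / uptime_millis
    if pause_percent > 8:
        return "red"      # critical when GC pause > 8% of wall-clock
    if pause_percent > 4:
        return "yellow"   # degraded when GC pause > 4% of wall-clock
    return "green"


# pipelines:up probe: pipeline state -> status.
PIPELINE_STATE_STATUS = {
    "running": "green",
    "starting": "yellow",
    "stopping": "yellow",
    "stopped": "red",
}


def pipeline_up_status(state):
    # Assumption: states not listed in the RFC are reported as unknown.
    return PIPELINE_STATE_STATUS.get(state, "unknown")


print(memory_pressure_status(gc_pause_millis=50_000, uptime_millis=1_000_000))
print(pipeline_up_status("starting"))
```

With a lifetime metric, a long-running process needs a sustained period of heavy GC before the percentage moves, which is why the stretch goal prefers the last_1_minute flow window.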

Phase 2: Additional pipeline probes

In subsequent stages we will introduce additional probes to the pipeline indicators to allow them to diagnose the pipeline's behavior from its flow state. Each probe will feed off of the flow metrics for the pipeline, and will present pipeline-specific settings in the pipeline.health.probe.<probe-name> namespace for configuration.

For example, a probe queue_persisted_growth_events that inspects the queue_persisted_growth_events flow metric would have default settings like:

pipeline.health.probe.queue_persisted_growth_events:
  enabled: auto # true if queue.type == persisted
  degraded: last_5_minutes > 0
  critical: last_15_minutes > 0

Or a worker_utilization probe that inspects the worker_utilization flow metric to report issues if the workers are fully-utilized:

pipeline.health.probe.worker_utilization:
  enabled: true
  degraded: last_1_minute > 99
  critical: last_15_minutes > 99

Implementation note: the dentaku calculator library for Ruby would allow us to parse a math expression with variables into an AST, to query that AST for the named variables it needs, and to efficiently evaluate that AST with a set of provided variables. This allows us to (1) support an explicit subset of possible ASTs (notably Arithmetic, Comparator, Grouping, Numeric, and Identifier, possibly Function(type=logical)), (2) reject configurations that reference variables we know will never exist, and (3) avoid evaluating a trigger or recovery expression when the flow metric window it needs is not yet available. It is a prime candidate for accelerating development, but care should be taken to avoid its automatic AST caching (which has no cache invalidation), and to limit expressions to the minimum-necessary allowlist of node types to ensure portability (likely with a validation visitor).
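
The three properties in the implementation note (allowlisted node types, rejecting unknown variables, skipping evaluation when a window is unavailable) can be illustrated without dentaku. The sketch below swaps in Python's standard `ast` module purely as a stand-in; it says nothing about dentaku's actual API. `KNOWN_WINDOWS`, `compile_condition`, and `evaluate` are hypothetical names, and the window set only lists the windows mentioned in this issue.

```python
import ast

# Minimum-necessary allowlist: comparison, arithmetic, numbers, identifiers.
ALLOWED_NODES = (
    ast.Expression, ast.Compare, ast.BinOp, ast.UnaryOp, ast.Name, ast.Load,
    ast.Constant, ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub,
    ast.Gt, ast.GtE, ast.Lt, ast.LtE, ast.Eq, ast.NotEq,
)

# Hypothetical set of variables a probe expression may reference.
KNOWN_WINDOWS = {"last_1_minute", "last_5_minutes", "last_15_minutes"}


def compile_condition(source):
    """Parse and validate an expression; return (code, variable names used)."""
    tree = ast.parse(source, mode="eval")
    names = set()
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"unsupported expression element: {type(node).__name__}")
        if isinstance(node, ast.Constant):
            value = node.value
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError("only numeric constants are allowed")
        if isinstance(node, ast.Name):
            names.add(node.id)
    unknown = names - KNOWN_WINDOWS
    if unknown:
        raise ValueError(f"unknown variables: {sorted(unknown)}")
    return compile(tree, "<probe-condition>", "eval"), names


def evaluate(compiled, windows):
    """Return True/False, or None when a needed window is not yet available."""
    code, names = compiled
    if not names.issubset(windows):
        return None
    return bool(eval(code, {"__builtins__": {}}, dict(windows)))


degraded = compile_condition("last_1_minute > 99")
print(evaluate(degraded, {"last_1_minute": 99.5}))
print(evaluate(degraded, {}))
```

Validation happens once at configuration time, so a bad expression fails fast instead of at the first health report. The `None` result lets a probe stay silent until its window has filled.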

Split options:

If making these probes configurable adds substantial delay, then we can ship them hard-coded with only the enabled option, and split the configurability off into a separate effort.

Phase 3: Observing recovery in critical probes

With flow metrics, it is possible to differentiate active-critical situations from ones in active recovery. For example, a PQ having net growth over the last 15 minutes may be a critical situation, but if we observe that we also have net shrink over the last 5 minutes the situation isn't as dire, so it (a) shouldn't push the indicator into the red and (b) is capable of producing different diagnostic output.

At a future point we can add the concept of recovery to the flow-based probe prototype. When a probe tests positive for critical, we could also test its recovery to present an appropriate result.

pipeline.health.probe.queue_persisted_growth_events:
  enabled: auto # true if queue.type == persisted
  degraded: last_5_minutes > 0
  critical: last_15_minutes > 0
  recovery: last_1_minute <= 0
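
How the three conditions might combine is sketched below in Python. The RFC only says that a recovering probe should not push the indicator into the red; reporting it as yellow is an assumption, as are the function names. The threshold expressions are hard-coded from the settings above rather than parsed.

```python
def probe_status(degraded, critical, recovery=False):
    """Combine the evaluated degraded/critical/recovery conditions into a status."""
    if critical:
        # Assumption: an actively recovering critical condition reports yellow.
        return "yellow" if recovery else "red"
    if degraded:
        return "yellow"
    return "green"


def pq_growth_status(windows):
    """Apply the queue_persisted_growth_events defaults to a set of flow windows."""
    return probe_status(
        degraded=windows["last_5_minutes"] > 0,
        critical=windows["last_15_minutes"] > 0,
        recovery=windows["last_1_minute"] <= 0,
    )


# Net growth over 15 and 5 minutes, but the queue is shrinking right now.
print(pq_growth_status({"last_15_minutes": 120, "last_5_minutes": 40, "last_1_minute": -15}))
# Still growing across every window.
print(pq_growth_status({"last_15_minutes": 120, "last_5_minutes": 40, "last_1_minute": 8}))
```

A fuller version would also select a different diagnosis when `recovery` holds, per point (b) above.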
