CI/CD conventions for metrics #1111

@christophe-kamphaus-jemmic

Description

Area(s)

area:cicd

Is your change request related to a problem? Please describe.

This issue is for discussing attributes specific to metrics, as part of the CI/CD Working Group and the Semantic Conventions WG.
A challenge specific to metrics is time series cardinality when the CICD system reports metrics for individual builds.

Describe the solution you'd like

Following #1075 (and adjusting the vocabulary below to align with #1075), we should define metric attributes for:

  • duration of pipelineRuns (by status, pipeline)
  • count of pipelineRuns (by status, pipeline)
  • count of agents
  • queue length of pending pipelineRuns
  • time a pipelineRun spends in the queue before starting execution
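
To make the list above concrete, here is a minimal sketch in plain Python (no OTel SDK) of how most of these values could be derived from a set of run records. The `PipelineRun` fields, the status values, and the `summarize` helper are all hypothetical and only serve as illustration; the actual names are what this issue is meant to settle. The agent count is left out because it does not depend on run records.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipelineRun:
    # All field names and status values are hypothetical placeholders.
    pipeline: str
    status: str  # e.g. "pending", "running", "success", "failure"
    queued_at: float
    started_at: Optional[float] = None
    finished_at: Optional[float] = None


def summarize(runs):
    """Derive count, queue length, queue wait and duration from run records."""
    durations = {}
    for r in runs:
        # Only finished runs have a duration.
        if r.started_at is not None and r.finished_at is not None:
            key = (r.pipeline, r.status)
            durations.setdefault(key, []).append(r.finished_at - r.started_at)
    return {
        # Count of pipelineRuns by (pipeline, status).
        "count": Counter((r.pipeline, r.status) for r in runs),
        # Queue length: runs that have not started yet.
        "queue_length": sum(1 for r in runs if r.status == "pending"),
        # Time spent in the queue, for every run that has started.
        "queue_waits": [
            r.started_at - r.queued_at for r in runs if r.started_at is not None
        ],
        # Duration of pipelineRuns by (pipeline, status).
        "durations": durations,
    }


runs = [
    PipelineRun("build", "success", queued_at=0.0, started_at=5.0, finished_at=65.0),
    PipelineRun("build", "failure", queued_at=10.0, started_at=12.0, finished_at=42.0),
    PipelineRun("build", "running", queued_at=20.0, started_at=30.0),
    PipelineRun("deploy", "pending", queued_at=40.0),
]
summary = summarize(runs)
```

Note that the running and pending runs still contribute to the count and queue values, which is something span-based metrics cannot provide until the run completes.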

Additionally, it should be possible to opt in to metrics specific to a particular pipelineRun.
These could be metrics about the agent that executes the pipelineRun, the OS, network, JVM, the number of failed/total tests …
We need to specify the attribute that links these metrics to the pipelineRun, e.g. pipeline.run.id

Metrics specific to a pipelineRun are of high cardinality. We should document this as a warning and give guidance on how these metrics can be efficiently encoded in the OTel protocol, i.e. by using resource attributes instead of metric attributes wherever possible.
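
As a rough illustration of the encoding point, the sketch below models an export payload as nested Python dicts (a simplification, not the real OTLP schema) and counts how often the run ID is repeated. The metric names are hypothetical, and `pipeline.run.id` is only the example attribute name used above.

```python
RUN_ID_KEY = "pipeline.run.id"  # example attribute name from this issue

# Hypothetical per-run metrics; names are placeholders, not proposed conventions.
points = {"agent.cpu.utilization": 0.7, "tests.failed": 2, "tests.total": 120}


def encode_as_metric_attribute(run_id, points):
    # The run ID is repeated on every data point.
    return {
        "resource": {"attributes": {}},
        "metrics": [
            {"name": name, "data_points": [{"attributes": {RUN_ID_KEY: run_id}, "value": v}]}
            for name, v in points.items()
        ],
    }


def encode_as_resource_attribute(run_id, points):
    # The run ID is stated once on the resource shared by all metrics.
    return {
        "resource": {"attributes": {RUN_ID_KEY: run_id}},
        "metrics": [
            {"name": name, "data_points": [{"attributes": {}, "value": v}]}
            for name, v in points.items()
        ],
    }


def count_attribute(payload, key):
    """Count how many times an attribute key appears in the payload."""
    n = int(key in payload["resource"]["attributes"])
    for metric in payload["metrics"]:
        for point in metric["data_points"]:
            n += int(key in point["attributes"])
    return n


as_metric = count_attribute(encode_as_metric_attribute("run-42", points), RUN_ID_KEY)
as_resource = count_attribute(encode_as_resource_attribute("run-42", points), RUN_ID_KEY)
```

With three metrics, the run ID appears three times as a metric attribute and once as a resource attribute. The gap widens with every additional per-run metric.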

Describe alternatives you've considered

Span metrics could be used for the duration and count of pipelineRuns; however, this relies on the pipelineRuns having completed.
This is due to a limitation inherent in using traces to represent pipelineRuns: a span can only be sent when it is complete.
Because of this limitation, it could be preferable for the CICD system to directly expose metrics about the duration, count and status of pipelineRuns. These metrics could also account for in-progress builds.

Additional context

CICD metrics were discussed at the KubeCon March 2024 SemConv users meeting.
High cardinality was highlighted as an issue for per-build metrics.
Notes on how to deal with cardinality:

  • Could we use Exemplars? We could link to the build trace from some metrics.
    This added information might make it easier to identify pipelineRuns that need investigation.
  • Using the resource attribute for the build ID is fine for the OTel protocol,
    but backends (e.g. Prometheus) would still have the cardinality issue when storing the time series
    (metric / resource attributes would be flattened into time series).
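
The flattening concern in the second note can be sketched as follows. This is plain Python and assumes a backend that simply merges resource and metric attributes into one label set per series; real backends differ in the details. The metric name `pipeline.run.duration` and the label values are hypothetical.

```python
def flatten(resource_attrs, metric_name, point_attrs):
    """Merge resource and metric attributes into one series identity."""
    labels = {**resource_attrs, **point_attrs}
    return (metric_name, tuple(sorted(labels.items())))


def count_series(num_runs, per_run):
    """Count distinct series after flattening, with or without a per-run ID."""
    series = set()
    for i in range(num_runs):
        # Assumption: the run ID lives on the resource when per_run is enabled.
        resource_attrs = {"pipeline.run.id": f"run-{i}"} if per_run else {}
        point_attrs = {
            "pipeline": "build",
            "status": "success" if i % 2 == 0 else "failure",
        }
        series.add(flatten(resource_attrs, "pipeline.run.duration", point_attrs))
    return len(series)


aggregated = count_series(1000, per_run=False)
per_run = count_series(1000, per_run=True)
```

Under these assumptions, 1000 runs collapse into 2 series when aggregated by pipeline and status, but produce 1000 series once the run ID is part of the flattened label set, regardless of whether it was sent as a resource or metric attribute.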
