[docs/component-stability.md] Add criteria for graduating between stability levels #11864

Merged
merged 12 commits into from
Apr 14, 2025

Conversation

Member

@mx-psi mx-psi commented Dec 12, 2024

Description

Code ownership and maintenance of components continues to be an issue, with varying levels of support across contrib. As we approach 1.0 and the ability to mark components as stable, we want to make sure that components that we deem as 'stable' have a healthy community around them. We have three datapoints that we can leverage here: how many codeowners a component has, how diverse these are in terms of employers and how actively the codeowners have been responding to issues/PRs in the recent past.
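
As a rough illustration of how these three datapoints could be combined into a single check, here is a minimal sketch. The `CodeOwner` record, the `meets_owner_criteria` helper and the `min_employers=2` default are hypothetical and exist only for this example; `min_active=3` mirrors the "at least three active code owners" rule proposed for beta to stable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeOwner:
    handle: str
    employer: str
    active: bool  # assumed to be derived elsewhere from recent issue/PR activity


def meets_owner_criteria(owners, min_active=3, min_employers=2):
    """Return True if enough active owners from enough distinct employers exist.

    min_active=3 mirrors the proposed beta-to-stable rule; min_employers=2 is
    an assumption made only for this example.
    """
    active = [o for o in owners if o.active]
    employers = {o.employer for o in active}
    return len(active) >= min_active and len(employers) >= min_employers


owners = [
    CodeOwner("alice", "VendorA", True),
    CodeOwner("bob", "VendorA", True),
    CodeOwner("carol", "VendorB", False),
]
print(meets_owner_criteria(owners))  # only two owners are active, so False
```

Keeping the thresholds as parameters makes it easy to try different cut-offs against existing components before settling on a rule.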

We need criteria that

  1. Are reasonable predictors of the component health over the short/medium term
  2. Are not too onerous on the code owners

Some notes:

  1. Some beta components do not meet the criteria listed in this PR. This will be the case even after the transition for some components. This PR makes no claim as to what should happen to these components' stability (so, de facto, they will stay as is).
  2. The OTLP receiver and exporters do not meet these criteria today because they don't have listed code owners. We can solve this either by carving out an exception or by listing code owners.
  3. We need automation and templates to enforce this.

Link to tracking issue

Fixes #11850

@mx-psi mx-psi added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Dec 12, 2024

codecov bot commented Dec 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.44%. Comparing base (3068122) to head (a95cdcb).
Report is 9 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #11864   +/-   ##
=======================================
  Coverage   91.44%   91.44%           
=======================================
  Files         487      487           
  Lines       26848    26848           
=======================================
  Hits        24551    24551           
  Misses       1814     1814           
  Partials      483      483           

## Beta to stable

To graduate any signal from beta to stable on a component:
1. The component MUST have at least three active code owners.
Contributor

I think codeowners should commit to an SLA for first response on bug issues.
The commitment should be measured in days.

Member Author

This raises some questions for me including:

  • What happens when people go on vacation/have a kid/[insert activity here that leads to a prolonged period of absence]?
  • What happens if people don't follow this SLA? Typically an SLA means that you pay if you don't meet a certain standard; how do you "pay" here?

Member

It also raises the question of how we measure that. Do we have any automation in place for it?
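
One way such a measurement could work is sketched below, assuming the issue timestamps and comment authors have already been exported from the issue tracker. The `first_response_days` helper and the sample handles are hypothetical; this does not call any real API or reflect existing automation.

```python
from datetime import datetime


def first_response_days(opened_at, comments, owners):
    """Days between issue creation and the first comment by a code owner.

    comments is a list of (author, timestamp) tuples; returns None if no
    code owner has commented yet.
    """
    owner_times = [ts for author, ts in comments if author in owners and ts >= opened_at]
    if not owner_times:
        return None
    return (min(owner_times) - opened_at).total_seconds() / 86400


opened = datetime(2025, 1, 6, 9, 0)
comments = [
    ("reporter", datetime(2025, 1, 6, 10, 0)),
    ("owner-a", datetime(2025, 1, 8, 9, 0)),
    ("owner-b", datetime(2025, 1, 7, 21, 0)),
]
print(first_response_days(opened, comments, {"owner-a", "owner-b"}))  # 1.5
```

Returning `None` for unanswered issues keeps "no response yet" distinct from "slow response", which matters if absences such as vacations need to be handled separately.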

Member

@julianocosta89 julianocosta89 left a comment

I think we shouldn't make it too hard to have community components.
I think vendor components and widely used components will have no trouble following the guidelines.

When we think about components that add value to the overall project, but may not be interesting/priority to vendors, they may struggle to get folks involved, and that would disqualify them from being moved to stable.

I do understand that we need to provide a way to ensure maintainability of stable components, but maybe we could build something around ideas like:

  • Being active
  • Replying to issues related to their components in a timely manner
  • Fixing reported bugs in a timely manner
  • ...

If we have a component that is not vendor related, but maintained by 2 folks from a single employer, that wouldn't allow them to move on.

Also, let's imagine the following scenario:

  • 3 folks are codeowners of a component, 2 from one company and one from another company.
  • They graduate to stable.
  • A couple of months later the person that was from the other company moves to the same company as the other 2. Would the component be demoted?

I don't think it should be, if they are active and responsive in issues related to that component.
I know it is a corner case, and may never happen, but it still can.

My main point here is that we shouldn't make it too hard to have community components.
I know a couple of companies that develop internal components to solve their customers' issues. It would be awesome to have a couple of those contributed back to upstream, and let the community grow together.

Member Author

mx-psi commented Dec 17, 2024

@julianocosta89

> When we think about components that add value to the overall project, but may not be interesting/priority to vendors, they may struggle to get folks involved, and that would disqualify them from being moved to stable.

There is a trade-off between having more components and having fewer components that are more actively maintained. We need to be mindful of where we draw the line, but my feeling is that right now we have too many components that are not well maintained.

> I do understand that we need to provide a way to ensure maintainability of stable components, but maybe we could build something around ideas like:
>
>   • Being active
>   • Replying to issues related to their components in a timely manner
>   • Fixing reported bugs in a timely manner
>   • ...

There are two questions to consider here:

  1. When do we make a decision to move a component to stable?
  2. When do we make a decision to move a component to unmaintained?

On this PR I am focusing on (1). For doing (1), we need to focus on things that we can measure/check at the time of marking as stable. Some of the things you mention are, I feel like, important criteria for deciding if a component should be moved to unmaintained, but not to move to stable.

> If we have a component that is not vendor related, but maintained by 2 folks from a single employer, that wouldn't allow them to move on.

The point with the 'vendor diversity' is that I think it is a good predictor of component quality (more than one vendor means more focus on a wide number of use cases) and maintainability (we don't depend on a single company). Maybe it would help to do an analysis of existing components to see how difficult this is to achieve?

> Also, let's imagine the following scenario:
>
>   • 3 folks are codeowners of a component, 2 from one company and one from another company.
>   • They graduate to stable.
>   • A couple of months later the person that was from the other company moves to the same company as the other 2. Would the component be demoted?

This PR makes no claims about when a component should be 'demoted'. Currently the only way to be demoted is to be moved to unmaintained, and we have some rules about when that can happen. I personally don't think we should move from stable->beta; I think that would be confusing for end users.

The way I see it, this is a bit like the CNCF project status: there is no moving from graduated to incubating, only from graduated to deprecated.

Member Author

mx-psi commented Dec 17, 2024

I split off part of this PR in #11937. PTAL at that one first

@julianocosta89
Member

> There is a trade-off between having more components and having fewer components that are more actively maintained. We need to be mindful of where we draw the line, but my feeling is that right now we have too many components that are not well maintained.

Agree.

> For doing (1), we need to focus on things that we can measure/check at the time of marking as stable. Some of the things you mention are, I feel like, important criteria for deciding if a component should be moved to unmaintained, but not to move to stable.

Makes sense.

> Maybe it would help to do an analysis of existing components to see how difficult this is to achieve?

Let's take the connectors first: none of them have 3 ACTIVE codeowners. sumconnector is the only component with 3 codeowners, but the 3 of them do not seem very active.

  • Connectors:
    • countconnector: 2 codeowners from different vendors
    • exceptionsconnector: 1 codeowner
    • failoverconnector: 2 codeowners (from different vendors?)
    • otlpjsonconnector: 2 codeowners from different vendors
    • roundrobinconnector: 1 codeowner
    • routingconnector: 2 codeowners from different vendors
    • servicegraphconnector: 2 codeowners from different vendors
    • signaltometricsconnector: 2 codeowners (from different vendors?)
    • sumconnector: 3 codeowners (from different vendors?)
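
The tally above can be checked mechanically. The sketch below copies the counts from the list as-is and only counts listed code owners; whether those owners are active or come from different vendors is not modelled here.

```python
from collections import Counter

# Number of listed code owners per connector, copied from the list above.
codeowners = {
    "countconnector": 2,
    "exceptionsconnector": 1,
    "failoverconnector": 2,
    "otlpjsonconnector": 2,
    "roundrobinconnector": 1,
    "routingconnector": 2,
    "servicegraphconnector": 2,
    "signaltometricsconnector": 2,
    "sumconnector": 3,
}

by_count = Counter(codeowners.values())
meets_three = sorted(name for name, n in codeowners.items() if n >= 3)
meets_two = sorted(name for name, n in codeowners.items() if n >= 2)

print(dict(sorted(by_count.items())))  # {1: 2, 2: 6, 3: 1}
print(meets_three)                     # ['sumconnector']
print(len(meets_two))                  # 7
```

With a threshold of three, one of the nine connectors qualifies on headcount alone; with a threshold of two, seven do.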

On the other hand, I think this "rule" would bring more awareness to the components, and maybe it would also bring more codeowners to "important" components.

Still not sure about the number of codeowners.
Most of the connectors (and I had a quick look at processors) have just 1 or 2.
I agree 1 is not enough, but wouldn't 2 (from different companies) be enough?

@julianocosta89
Member

> Still not sure about the number of codeowners. Most of the connectors (and I had a quick look at processors) have just 1 or 2. I agree 1 is not enough, but wouldn't 2 (from different companies) be enough?

After thinking further and discussing with @mx-psi, I believe these criteria are going to be the foundation for users to further engage and assume codeowner responsibility in the components they would like to move to stable.
I'd say that at least 3 codeowners is a good number to keep the Collector components maintainable and not just pile up responsibilities on Collector maintainers.

github-merge-queue bot pushed a commit that referenced this pull request Dec 19, 2024
…levels' section (#11937)

#### Description

Split off from #11864, describes how the graduation would work without
any additional criteria.

Rendered diagram:


```mermaid
stateDiagram-v2
    state Maintained {
    InDevelopment --> Alpha
    Alpha --> Beta
    Beta --> Stable
    }
    InDevelopment: In Development
    Maintained --> Unmaintained
    Unmaintained --> Maintained
    Maintained --> Deprecated
    Deprecated --> Maintained: (should be rare)
```
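
The transitions in the diagram can also be written down as a small lookup, sketched here in Python. The `is_allowed` helper is hypothetical, and treating a return from Unmaintained or Deprecated as re-entering any maintained level is an assumption, since the diagram does not say which level is re-entered.

```python
MAINTAINED = {"In Development", "Alpha", "Beta", "Stable"}

# Transitions inside the Maintained group, as drawn in the diagram.
GRADUATIONS = {
    "In Development": "Alpha",
    "Alpha": "Beta",
    "Beta": "Stable",
}


def is_allowed(src, dst):
    """Return True if the diagram has an edge from src to dst."""
    if GRADUATIONS.get(src) == dst:
        return True
    if src in MAINTAINED and dst in {"Unmaintained", "Deprecated"}:
        return True
    # Assumption: leaving Unmaintained or Deprecated may re-enter any
    # maintained level; the diagram only shows the group as the target.
    if src in {"Unmaintained", "Deprecated"} and dst in MAINTAINED:
        return True
    return False


print(is_allowed("Beta", "Stable"))  # True
print(is_allowed("Stable", "Beta"))  # False: no edge goes back down a level
```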

---------

Co-authored-by: Christos Markou <[email protected]>
@mx-psi mx-psi marked this pull request as ready for review December 19, 2024 12:14
@mx-psi mx-psi requested a review from a team as a code owner December 19, 2024 12:14
@mx-psi mx-psi removed the Stale label Jan 23, 2025
Contributor

github-actions bot commented Feb 7, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Feb 7, 2025
@mx-psi mx-psi removed the Stale label Feb 7, 2025
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Mar 11, 2025
atoulme pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Mar 11, 2025
#### Description

I would like to become a codeowner of the awsfirehosereceiver. Elastic
(where I work) intends to use this receiver heavily, and I would like
to:
 - help spread out the maintenance load
- ensure we have a path to progressing beyond the current Alpha
stability

Assuming
open-telemetry/opentelemetry-collector#11864
goes through, it will be necessary to have multiple codeowners to
progress to Beta and preferable to have codeowners from multiple
employers, which is the case for @Aneurysm9 (AWS) and me (Elastic).

Relevant changes:
- #37111 (superseded by #37361)
- #37262
- #37361
- #38388 (will extract from awsfirehosereceiver)
- #38445

#### Link to tracking issue

N/A

#### Testing

N/A

#### Documentation

N/A
@mx-psi mx-psi reopened this Mar 24, 2025
Contributor

@atoulme atoulme left a comment

I agree with the addition of those criteria.

Member

@dmitryax dmitryax left a comment

LGTM

Member

@ArthurSens ArthurSens left a comment

The criteria look reasonable; my only concern is whether our current automation can support approvers when reviewing Graduation requests.

For the most used components, 60 days is enough time for dozens of PRs/Issues to be opened, and code owners may have changed during this period. Asking approvers to look at all of them and consider code owner changes might be too difficult if done manually. If a process is too difficult to follow, people tend to bypass it 😅
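
To give a sense of what automating this could look like, here is a minimal sketch that reports which code owners had any recorded activity inside a 60-day window. The `active_owners` helper, the sample handles and the definition of "active" as at least one recorded action are assumptions for illustration; the activity records are assumed to have been exported from the issue tracker beforehand.

```python
from datetime import date, timedelta


def active_owners(owners, activity, today, window_days=60):
    """Return the owners with at least one recorded action inside the window.

    activity is a list of (handle, date) pairs, e.g. review or issue comments.
    """
    cutoff = today - timedelta(days=window_days)
    recent = {handle for handle, day in activity if cutoff <= day <= today}
    return sorted(set(owners) & recent)


today = date(2025, 4, 14)
activity = [
    ("owner-a", date(2025, 4, 1)),
    ("owner-b", date(2025, 1, 10)),   # outside the 60-day window
    ("owner-c", date(2025, 3, 2)),
    ("drive-by", date(2025, 4, 10)),  # not a code owner
]
print(active_owners(["owner-a", "owner-b", "owner-c"], activity, today))
# ['owner-a', 'owner-c']
```

This takes the current owner list as input, so code owner changes during the window would still need to be resolved before calling it.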

@github-actions github-actions bot removed the Stale label Mar 27, 2025
Member Author

mx-psi commented Mar 27, 2025

@ArthurSens I think those concerns are reasonable; we definitely need more automation to be able to do this at scale and over time. I think the focus at this stage should be on whether these requirements are automatable. We will automate things once we have the rules defined.

Member Author

mx-psi commented Apr 8, 2025

I intend to merge this in a couple of days if there are no further comments

@douglascamata
Member

Are there any plans regarding "technical stability/maturity"? For instance, a threshold of test coverage and/or linting issues.

I'm thinking here from a "user" point of view. What does stability mean to users? Number and activity of maintainers is a point, but I also believe there are some technical aspects to it that could be considered (or not).

Member Author

mx-psi commented Apr 8, 2025

> Are there any plans regarding "technical stability/maturity"? For instance, a threshold of test coverage and/or linting issues.

@douglascamata Yes! See #11553 for the full list of things I want to cover in this document. If you think the list is incomplete or you would like to help me put this into words, feel free to comment on the issues/file a PR :)

@mx-psi mx-psi requested a review from djaglowski April 9, 2025 09:31
Member Author

mx-psi commented Apr 11, 2025

@djaglowski I will wait a bit more in case you have further comments and merge this on Monday morning

@mx-psi mx-psi added this pull request to the merge queue Apr 14, 2025
Merged via the queue into open-telemetry:main with commit 28ca163 Apr 14, 2025
56 checks passed
@mx-psi mx-psi deleted the mx-psi/codeowner-requirements branch April 14, 2025 09:25

Successfully merging this pull request may close these issues.

Establish codeowners minimum criteria for moving up through the stability ladder
8 participants