Skip to content

Re-try failed upgrades #3893

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Apr 24, 2025

Conversation

andreasgerstmayr
Copy link
Contributor

Description:
Currently, the operator upgrades all OpenTelemetryCollector CRs on startup of the operator. If any upgrade fails (e.g. intermittent errors of the Kubernetes API server), it won't be re-tried until the operator is restarted.

This commit moves the upgrade step to the reconcile loop. The operator reconciles all managed instances on startup, and in case of an error, re-tries the upgrade with exponential backoff.

Additionally, I changed the event type from Error to Warning, because "Error" is not a valid event type.

Link to tracking Issue(s):

Testing:

Manual testing, existing upgrade tests.

Documentation:

Not required

Currently, the operator upgrades all OpenTelemetryCollector CRs on
startup of the operator. If any upgrade fails (e.g. intermittent errors
of the Kubernetes API server), it won't be re-tried until the operator
is restarted.

This commit moves the upgrade step to the reconcile loop. The operator
reconciles all managed instances on startup, and in case of an error,
re-tries the upgrade with exponential backoff.

Additionally, I changed the event type from Error to Warning, because
"Error" is not a valid event type.

Signed-off-by: Andreas Gerstmayr <[email protected]>
@andreasgerstmayr andreasgerstmayr requested a review from a team as a code owner April 10, 2025 11:10
@@ -509,23 +508,9 @@ func main() {
}
}

func addDependencies(_ context.Context, mgr ctrl.Manager, cfg config.Config, v version.Version) error {
func addDependencies(_ context.Context, mgr ctrl.Manager, cfg config.Config) error {
// adds the upgrade mechanism to be executed once the manager is ready
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this line can be removed now, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't change the upgrade process for the Instrumentation CR.
The Instrumentation CR upgrade is a bit different, it doesn't use versions (I can't skip the upgrade because I don't know if it's up-to-date), therefore I left it as-is for now.

Signed-off-by: Andreas Gerstmayr <[email protected]>
Copy link
Contributor

@swiatekm swiatekm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic looks good to me, but I'm not fully convinced doing this during reconciliation is a good idea. Right now, we can make the very strong assumption that reconciliation does not modify the source CR, which this PR removes. Not sure that's worth it. @open-telemetry/operator-approvers I'd like to get more opinions about this change.

On a separate note, we definitely need more tests for this change.

@andreasgerstmayr
Copy link
Contributor Author

The logic looks good to me, but I'm not fully convinced doing this during reconciliation is a good idea. Right now, we can make the very strong assumption that reconciliation does not modify the source CR, which this PR removes. Not sure that's worth it. @open-telemetry/operator-approvers I'd like to get more opinions about this change.

#3515 has more context why it's required in the reconcile loop. tl;dr

  • outdated instances should be upgraded when the management state is switched from unmanaged back to managed
  • re-trying failed upgrades with exponential backoff comes for free when it's in the reconcile loop
  • I'd argue the reconcile should never run before the CR is up-to-date, because the manifest generation code should be able to assume that the CR is up-to-date.

On a separate note, we definitely need more tests for this change.

Ok, I'll add some next week.

Signed-off-by: Andreas Gerstmayr <[email protected]>
@swiatekm
Copy link
Contributor

The logic looks good to me, but I'm not fully convinced doing this during reconciliation is a good idea. Right now, we can make the very strong assumption that reconciliation does not modify the source CR, which this PR removes. Not sure that's worth it. @open-telemetry/operator-approvers I'd like to get more opinions about this change.

#3515 has more context why it's required in the reconcile loop. tl;dr

* outdated instances should be upgraded when the management state is switched from unmanaged back to managed

* re-trying failed upgrades with exponential backoff comes for free when it's in the reconcile loop

* I'd argue the reconcile should never run before the CR is up-to-date, because the manifest generation code should be able to assume that the CR is up-to-date.

On a separate note, we definitely need more tests for this change.

Ok, I'll add some next week.

Allright, that makes sense. In that case, the next best thing is for a single reconciliation to only do one thing. Either upgrade, or generate and apply new manifests, never both. This will also make testing much easier.

When it comes to testing, we do have some reconciliation tests running against a real API Server via envtest, but they're painful to write due to needing to manually wait for each condition to be reflected in controller-runtime's caching K8s client. So I won't blame you if you only add a chainsaw e2e test.

Signed-off-by: Andreas Gerstmayr <[email protected]>
@andreasgerstmayr
Copy link
Contributor Author

When it comes to testing, we do have some reconciliation tests running against a real API Server via envtest, but they're painful to write due to needing to manually wait for each condition to be reflected in controller-runtime's caching K8s client. So I won't blame you if you only add a chainsaw e2e test.

I gave it a try with envtest, and hit one error: the object has been modified; please apply your changes to the latest version and try again thrown at line

after the first reconciliation.

I think it has to do with caching, as you mentioned. Fetching the object again resolved the (intermittent) issue.

I added three tests, one for upgrade, one already-up-to-date and one with an empty version in the status field.
Let me know if I should add more test cases.

Client: params.Client,
Recorder: params.Recorder,
}
upgraded, upgradeErr := up.ManagedInstance(ctx, *changed)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why the upgrade was called in the HandleReconcileStatus() function?

My best guess is to update the Status.Version field? But that's already set in updateCollectorStatus(), which is called in HandleReconcileStatus().
Also, the upgrade ran here, but it did not update the (modified) CR in the cluster.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was to set the status field, yes. @jaronoff97 you refactored this back in 2023, do you recall if this is correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this was to set the status. I believe this was also to keep the functionality the same during my refactor.

@swiatekm swiatekm requested review from swiatekm and jaronoff97 April 22, 2025 14:06
Copy link
Contributor

@swiatekm swiatekm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This is a pretty fundamental change to this feature though, so I'd like more reviews before we merge. @open-telemetry/operator-approvers please have a look, especially if you worked on upgrades in the past.

Copy link
Contributor

@jaronoff97 jaronoff97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes sense to me, thanks for updating the logic 🙇 Were you seeing any of the the object has been modified; please apply your changes to the latest version and try again messages after this change?

@andreasgerstmayr
Copy link
Contributor Author

this makes sense to me, thanks for updating the logic 🙇 Were you seeing any of the the object has been modified; please apply your changes to the latest version and try again messages after this change?

I only saw this error message in the reconcile unit test, I didn't see it when I did manual testing using minikube.

@swiatekm swiatekm requested a review from iblancasa April 23, 2025 15:16
@swiatekm swiatekm merged commit 0f45702 into open-telemetry:main Apr 24, 2025
43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Operator does not re-try failed upgrades
5 participants