Re-try failed upgrades #3893

andreasgerstmayr · 2025-04-10T11:10:17Z

Description:
Currently, the operator upgrades all OpenTelemetryCollector CRs when it starts. If any upgrade fails (e.g. because of intermittent errors of the Kubernetes API server), it won't be re-tried until the operator is restarted.

This commit moves the upgrade step to the reconcile loop. The operator reconciles all managed instances on startup, and in case of an error, re-tries the upgrade with exponential backoff.

Additionally, I changed the event type from Error to Warning, because "Error" is not a valid event type.
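For context, the Kubernetes core/v1 API defines two event types, Normal and Warning. A minimal sketch of a guard that rejects anything else follows; the constants mirror the upstream values, while `validEventType` is a hypothetical helper and not part of client-go or the operator.

```go
package main

import "fmt"

// These mirror the values of corev1.EventTypeNormal and corev1.EventTypeWarning.
const (
	EventTypeNormal  = "Normal"
	EventTypeWarning = "Warning"
)

// validEventType reports whether t is one of the two defined event types.
// Hypothetical helper, for illustration only.
func validEventType(t string) bool {
	return t == EventTypeNormal || t == EventTypeWarning
}

func main() {
	for _, t := range []string{"Normal", "Warning", "Error"} {
		fmt.Printf("%-8s valid=%v\n", t, validEventType(t))
	}
}
```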

Link to tracking Issue(s):

Resolves: Operator does not re-try failed upgrades #3515

Testing:

Manual testing, existing upgrade tests.

Documentation:

Not required

Signed-off-by: Andreas Gerstmayr <[email protected]>

iblancasa · 2025-04-10T11:48:20Z

main.go

@@ -509,23 +508,9 @@ func main() {
 	}
 }

-func addDependencies(_ context.Context, mgr ctrl.Manager, cfg config.Config, v version.Version) error {
+func addDependencies(_ context.Context, mgr ctrl.Manager, cfg config.Config) error {
 	// adds the upgrade mechanism to be executed once the manager is ready


I think this line can be removed now, right?

I didn't change the upgrade process for the Instrumentation CR.
The Instrumentation CR upgrade is a bit different: it doesn't use versions (I can't skip the upgrade because I don't know whether it's up-to-date), so I left it as-is for now.


swiatekm

The logic looks good to me, but I'm not fully convinced doing this during reconciliation is a good idea. Right now, we can make the very strong assumption that reconciliation does not modify the source CR, which this PR removes. Not sure that's worth it. @open-telemetry/operator-approvers I'd like to get more opinions about this change.

On a separate note, we definitely need more tests for this change.


andreasgerstmayr · 2025-04-11T17:12:15Z

> The logic looks good to me, but I'm not fully convinced doing this during reconciliation is a good idea. Right now, we can make the very strong assumption that reconciliation does not modify the source CR, which this PR removes. Not sure that's worth it. @open-telemetry/operator-approvers I'd like to get more opinions about this change.

#3515 has more context why it's required in the reconcile loop. tl;dr

* outdated instances should be upgraded when the management state is switched from unmanaged back to managed
* re-trying failed upgrades with exponential backoff comes for free when it's in the reconcile loop
* I'd argue the reconcile should never run before the CR is up-to-date, because the manifest generation code should be able to assume that the CR is up-to-date.

> On a separate note, we definitely need more tests for this change.

Ok, I'll add some next week.


swiatekm · 2025-04-14T17:46:22Z

> > The logic looks good to me, but I'm not fully convinced doing this during reconciliation is a good idea. Right now, we can make the very strong assumption that reconciliation does not modify the source CR, which this PR removes. Not sure that's worth it. @open-telemetry/operator-approvers I'd like to get more opinions about this change.
>
> #3515 has more context why it's required in the reconcile loop. tl;dr
>
> * outdated instances should be upgraded when the management state is switched from unmanaged back to managed
> * re-trying failed upgrades with exponential backoff comes for free when it's in the reconcile loop
> * I'd argue the reconcile should never run before the CR is up-to-date, because the manifest generation code should be able to assume that the CR is up-to-date.
>
> > On a separate note, we definitely need more tests for this change.
>
> Ok, I'll add some next week.

All right, that makes sense. In that case, the next best thing is for a single reconciliation to do only one thing: either upgrade, or generate and apply new manifests, never both. This will also make testing much easier.

When it comes to testing, we do have some reconciliation tests running against a real API Server via envtest, but they're painful to write due to needing to manually wait for each condition to be reflected in controller-runtime's caching K8s client. So I won't blame you if you only add a chainsaw e2e test.
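The "one thing per reconciliation" idea can be sketched roughly as follows. This is a simplified, self-contained model and not the operator's code: the `Collector` type, the `Action` values, `reconcileOnce`, and the plain string comparison used to decide whether an upgrade is needed are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Collector is a simplified stand-in for the OpenTelemetryCollector CR (hypothetical).
type Collector struct {
	Managed bool   // management state: managed or unmanaged
	Version string // version recorded in the status
}

// Action names the single thing a reconcile pass did.
type Action string

const (
	ActionSkip     Action = "skip"
	ActionUpgrade  Action = "upgrade"
	ActionManifest Action = "apply-manifests"
)

// errUnavailable simulates an intermittent API server failure.
var errUnavailable = errors.New("api server unavailable")

// reconcileOnce performs at most one action per pass and reports whether
// the caller should requeue. The version check is a deliberate simplification.
func reconcileOnce(c *Collector, target string, upgrade func(*Collector, string) error) (Action, bool, error) {
	if !c.Managed {
		return ActionSkip, false, nil
	}
	if c.Version != target {
		if err := upgrade(c, target); err != nil {
			// Returning the error lets the caller retry with backoff.
			return ActionUpgrade, false, err
		}
		// Stop here and requeue: manifests are generated in the next pass,
		// from the already upgraded CR.
		return ActionUpgrade, true, nil
	}
	return ActionManifest, false, nil
}

func main() {
	c := &Collector{Managed: true, Version: "0.1.0"}
	setVersion := func(c *Collector, v string) error { c.Version = v; return nil }
	for pass := 1; pass <= 2; pass++ {
		action, requeue, err := reconcileOnce(c, "0.2.0", setVersion)
		fmt.Printf("pass %d: action=%s requeue=%v err=%v\n", pass, action, requeue, err)
	}

	failing := func(*Collector, string) error { return errUnavailable }
	_, _, err := reconcileOnce(&Collector{Managed: true, Version: "0.1.0"}, "0.2.0", failing)
	fmt.Println("failed upgrade:", err)
}
```

Keeping the two branches mutually exclusive means a test can assert on exactly one outcome per pass, which is the testing benefit mentioned above.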


andreasgerstmayr · 2025-04-18T14:44:14Z

> When it comes to testing, we do have some reconciliation tests running against a real API Server via envtest, but they're painful to write due to needing to manually wait for each condition to be reflected in controller-runtime's caching K8s client. So I won't blame you if you only add a chainsaw e2e test.

I gave it a try with envtest, and hit one error: "the object has been modified; please apply your changes to the latest version and try again", thrown at line 269 of controllers/opentelemetrycollector_controller.go (in 2a07501):

err = r.Update(ctx, &instance)

after the first reconciliation.

I think it has to do with caching, as you mentioned. Fetching the object again resolved the (intermittent) issue.

I added three tests: one for an upgrade, one for an already up-to-date instance, and one with an empty version in the status field.
Let me know if I should add more test cases.

andreasgerstmayr · 2025-04-18T14:49:00Z

internal/status/collector/handle.go

-		Client:   params.Client,
-		Recorder: params.Recorder,
-	}
-	upgraded, upgradeErr := up.ManagedInstance(ctx, *changed)


Not sure why the upgrade was called in the HandleReconcileStatus() function?

My best guess is to update the Status.Version field? But that's already set in updateCollectorStatus(), which is called in HandleReconcileStatus().
Also, the upgrade ran here, but it did not update the (modified) CR in the cluster.

I think it was to set the status field, yes. @jaronoff97 you refactored this back in 2023, do you recall if this is correct?

yes, this was to set the status. I believe this was also to keep the functionality the same during my refactor.

swiatekm

LGTM. This is a pretty fundamental change to this feature though, so I'd like more reviews before we merge. @open-telemetry/operator-approvers please have a look, especially if you worked on upgrades in the past.

jaronoff97

this makes sense to me, thanks for updating the logic 🙇 Were you seeing any of the "the object has been modified; please apply your changes to the latest version and try again" messages after this change?

andreasgerstmayr · 2025-04-23T15:04:11Z

> this makes sense to me, thanks for updating the logic 🙇 Were you seeing any of the "the object has been modified; please apply your changes to the latest version and try again" messages after this change?

I only saw this error message in the reconcile unit test; I didn't see it during manual testing with minikube.

andreasgerstmayr requested a review from a team as a code owner April 10, 2025 11:10

Merge remote-tracking branch 'upstream/main' into retry-failed-upgrades (8144157)

iblancasa reviewed Apr 10, 2025

add description to changelog (aee6da8)

swiatekm reviewed Apr 11, 2025

pavolloffay approved these changes Apr 11, 2025

andreasgerstmayr added 4 commits April 11, 2025 18:24

extract upgrade to separate methods, return and re-queue reconciliation on upgrade (55ecd80)
move NeedsUpgrade to upgrade package (abef62b)
re-add ManagementState and UpgradeStrategy check to Upgrade function (f942c0d)
Merge remote-tracking branch 'upstream/main' into retry-failed-upgrades (6145d11)

check NeedsUpgrade() in Upgrade() fn (04467b5)

pavolloffay reviewed Apr 14, 2025

andreasgerstmayr added 2 commits April 18, 2025 16:27

add unit tests, remove upgrade from status handling (919fdd9)
fix tests (eb47299)

andreasgerstmayr commented Apr 18, 2025

swiatekm requested review from swiatekm and jaronoff97 April 22, 2025 14:06

swiatekm approved these changes Apr 22, 2025

jaronoff97 approved these changes Apr 23, 2025

swiatekm requested a review from iblancasa April 23, 2025 15:16

swiatekm merged commit 0f45702 into open-telemetry:main Apr 24, 2025
43 checks passed
