Re-try failed upgrades #3893
Conversation
Currently, the operator upgrades all OpenTelemetryCollector CRs on startup of the operator. If any upgrade fails (e.g. intermittent errors of the Kubernetes API server), it won't be re-tried until the operator is restarted.

This commit moves the upgrade step to the reconcile loop. The operator reconciles all managed instances on startup, and in case of an error, re-tries the upgrade with exponential backoff.

Additionally, I changed the event type from Error to Warning, because "Error" is not a valid event type.

Signed-off-by: Andreas Gerstmayr <[email protected]>
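For context, a minimal sketch (not this PR's actual code) of what running the upgrade from the reconcile loop can look like: a returned error makes controller-runtime requeue the request with exponential backoff, and failures are reported as Warning events, since Kubernetes only accepts the Normal and Warning event types. The reconciler type is hypothetical, and the upgrade.VersionUpgrade usage is an assumption modelled on the snippet reviewed later in this thread.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/open-telemetry/opentelemetry-operator/apis/v1beta1"
	"github.com/open-telemetry/opentelemetry-operator/pkg/collector/upgrade"
)

// CollectorUpgradeReconciler is a hypothetical reconciler used only to
// illustrate the retry behaviour; the operator's real reconciler has more fields.
type CollectorUpgradeReconciler struct {
	client.Client
	Recorder record.EventRecorder
}

func (r *CollectorUpgradeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var instance v1beta1.OpenTelemetryCollector
	if err := r.Get(ctx, req.NamespacedName, &instance); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Field names mirror the reviewed snippet; the struct shape is assumed.
	up := &upgrade.VersionUpgrade{
		Client:   r.Client,
		Recorder: r.Recorder,
	}
	if _, err := up.ManagedInstance(ctx, instance); err != nil {
		// "Warning" is used instead of "Error": Kubernetes only accepts the
		// Normal and Warning event types.
		r.Recorder.Event(&instance, corev1.EventTypeWarning, "UpgradeFailed", err.Error())
		// Returning the error makes controller-runtime requeue this request
		// with exponential backoff, so transient API server errors are retried.
		return ctrl.Result{}, err
	}

	// ...continue with the regular manifest reconciliation for this instance...
	return ctrl.Result{}, nil
}
```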
@@ -509,23 +508,9 @@ func main() {
	}
}

-func addDependencies(_ context.Context, mgr ctrl.Manager, cfg config.Config, v version.Version) error {
+func addDependencies(_ context.Context, mgr ctrl.Manager, cfg config.Config) error {
	// adds the upgrade mechanism to be executed once the manager is ready
I think this line can be removed now, right?
I didn't change the upgrade process for the Instrumentation CR. The Instrumentation CR upgrade is a bit different: it doesn't use versions (I can't skip the upgrade because I don't know whether it's up-to-date), therefore I left it as-is for now.
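For reference, a rough sketch of the startup-only pattern that the Instrumentation CR upgrade keeps using (hypothetical helper; the real addDependencies wiring may differ): the upgrade runs once as a manager Runnable when the manager is ready, so a failure is not retried until the operator restarts.

```go
package operator

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// registerStartupUpgrade shows the startup-only pattern: the upgrade runs
// exactly once when the manager becomes ready. There is no requeue, so a
// failed upgrade is not retried until the operator restarts.
func registerStartupUpgrade(mgr ctrl.Manager, upgradeAll func(context.Context) error) error {
	return mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		if err := upgradeAll(ctx); err != nil {
			// Only logged here in the startup-only approach; nothing retries it.
			ctrl.Log.WithName("upgrade").Error(err, "failed to upgrade managed instances")
		}
		return nil
	}))
}
```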
The logic looks good to me, but I'm not fully convinced doing this during reconciliation is a good idea. Right now, we can make the very strong assumption that reconciliation does not modify the source CR, which this PR removes. Not sure that's worth it. @open-telemetry/operator-approvers I'd like to get more opinions about this change.
On a separate note, we definitely need more tests for this change.
#3515 has more context on why it's required in the reconcile loop. tl;dr
Ok, I'll add some next week.
Alright, that makes sense. In that case, the next best thing is for a single reconciliation to only do one thing: either upgrade, or generate and apply new manifests, never both. This will also make testing much easier. When it comes to testing, we do have some reconciliation tests running against a real API server via envtest, but they're painful to write due to needing to manually wait for each condition to be reflected in controller-runtime's caching K8s client. So I won't blame you if you only add a chainsaw e2e test.
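A sketch of the suggested control flow (hypothetical helper and parameter names): each reconciliation either performs the upgrade and requeues, or generates and applies manifests, never both in the same pass.

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// reconcileOneThing illustrates the "one thing per reconciliation" suggestion:
// either run the upgrade or apply manifests, never both.
func reconcileOneThing(
	ctx context.Context,
	needsUpgrade bool,
	runUpgrade func(context.Context) error,
	applyManifests func(context.Context) error,
) (ctrl.Result, error) {
	if needsUpgrade {
		if err := runUpgrade(ctx); err != nil {
			// Returned errors are requeued by controller-runtime with exponential backoff.
			return ctrl.Result{}, err
		}
		// Requeue so manifest generation happens in a separate, later
		// reconciliation against the freshly upgraded CR.
		return ctrl.Result{Requeue: true}, nil
	}
	return ctrl.Result{}, applyManifests(ctx)
}
```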
I gave it a try with envtest, and hit one error:
I think it has to do with caching, as you mentioned. Fetching the object again resolved the (intermittent) issue. I added three tests: one for the upgrade, one for an already-up-to-date instance, and one with an empty version in the status field.
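For reference, a rough sketch of that re-fetch pattern, with made-up names and values rather than the PR's actual test code: read the object again immediately before writing, so the update carries the latest resourceVersion instead of a stale copy from the caching client.

```go
package controllers

import (
	"context"
	"testing"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/open-telemetry/opentelemetry-operator/apis/v1beta1"
)

// refetchThenUpdateStatus re-reads the CR right before mutating it, so the
// status update is based on the latest resourceVersion.
func refetchThenUpdateStatus(ctx context.Context, t *testing.T, k8sClient client.Client, key types.NamespacedName) {
	var latest v1beta1.OpenTelemetryCollector
	if err := k8sClient.Get(ctx, key, &latest); err != nil {
		t.Fatal(err)
	}
	latest.Status.Version = "0.100.0" // example value only
	if err := k8sClient.Status().Update(ctx, &latest); err != nil {
		t.Fatal(err)
	}
}
```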
	Client:   params.Client,
	Recorder: params.Recorder,
}
upgraded, upgradeErr := up.ManagedInstance(ctx, *changed)
I'm not sure why the upgrade was called in the HandleReconcileStatus() function. My best guess is that it was to update the Status.Version field? But that's already set in updateCollectorStatus(), which is called from HandleReconcileStatus().
Also, the upgrade ran here, but it did not update the (modified) CR in the cluster.
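To make that point concrete, a hypothetical sketch (not the operator's code): the upgrade returns a modified copy of the CR, and unless that copy is written back, the spec changes exist only in memory.

```go
package controllers

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/open-telemetry/opentelemetry-operator/apis/v1beta1"
)

// upgradeAndPersist runs an upgrade function and then writes the modified CR
// back to the cluster; without the Update call, only the status update would
// ever reach the API server.
func upgradeAndPersist(
	ctx context.Context,
	c client.Client,
	run func(context.Context, v1beta1.OpenTelemetryCollector) (v1beta1.OpenTelemetryCollector, error),
	instance v1beta1.OpenTelemetryCollector,
) error {
	upgraded, err := run(ctx, instance)
	if err != nil {
		return err
	}
	// This write-back is what the original status-handling code did not do.
	return c.Update(ctx, &upgraded)
}
```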
I think it was to set the status field, yes. @jaronoff97 you refactored this back in 2023, do you recall if this is correct?
Yes, this was to set the status. I believe this was also to keep the functionality the same during my refactor.
LGTM. This is a pretty fundamental change to this feature though, so I'd like more reviews before we merge. @open-telemetry/operator-approvers please have a look, especially if you worked on upgrades in the past.
This makes sense to me, thanks for updating the logic 🙇 Were you seeing any of the "the object has been modified; please apply your changes to the latest version and try again" messages after this change?

I only saw this error message in the reconcile unit test; I didn't see it when I did manual testing using minikube.
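For completeness, one common way to guard against that conflict error (not necessarily what this PR does) is client-go's retry.RetryOnConflict, which re-reads the object and retries the write when the resourceVersion is stale; the helper below is illustrative.

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/open-telemetry/opentelemetry-operator/apis/v1beta1"
)

// updateStatusWithRetry re-fetches the CR and retries the status update
// whenever the API server reports a resourceVersion conflict.
func updateStatusWithRetry(ctx context.Context, c client.Client, key types.NamespacedName, newVersion string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var latest v1beta1.OpenTelemetryCollector
		if err := c.Get(ctx, key, &latest); err != nil {
			return err
		}
		latest.Status.Version = newVersion
		return c.Status().Update(ctx, &latest)
	})
}
```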
Link to tracking Issue(s):
Testing:
Manual testing, existing upgrade tests.
Documentation:
Not required