Operator should trigger the error state of the CR when deserialization fails #1170
Comments
@andreaTP given the type-safe nature of CRDs, this should basically never happen. I assume it happens when there are multiple resource versions and no conversion hooks in place; I think we had an issue about how to deal with this using labels. Consider also the situation where an Operator manages the CRs for the whole cluster. Let's say there are no proper conversion hooks and/or validation in place, and such an error happens in one namespace because the owner of that namespace manages to create a CR that the operator is not able to handle. The operator should still be able to manage the other custom resources on the cluster (for different namespaces / teams, or any other custom resource), so it should not stop working in general when such an error happens. I think this is rather a bug in the operator-related setup than an issue with the operator itself, so ideally an operator should never see this. But if such an error is present, IMO the cluster should have proper log aggregation (think ELK) and related alerting that notifies the platform engineer. So I'm not even sure about the status update, since this is not a problem with reconciliation. Also, I'm not sure how we would do an update on a POJO that cannot be deserialized. I do see the value of notifying users through the status in this case too, but this happens outside the reconciliation loop, so handling such an error would require a quite specific approach. Will think about that part and see how it can be done.
Correct, the situation described happens in case of "bugs" or misalignments between implementations (
In this case, the issue is reproducible with a single version of a single CRD.
I understand this technical limitation, but we can think about triggering a synthetic
This is what I meant by this:
In my experience this is what you have (or should have) on clusters anyway. Again, if there is an issue with de-serialization / serialization, even the updates might not work in general using the POJOs; maybe they would with the raw API and patching. But that again would probably need a specific error-handling mechanism for this case.
Could you please create an issue for the fabric8 client? If there is a bug in the generator, that should be fixed there. But anyway, thanks for this bug report!! It's definitely worth discussing whether we should handle such errors, and if so, how. I will think about it and try to come up with a solution, probably, as mentioned, with the raw API.
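For illustration, a minimal sketch of what reporting the error through the raw/generic API could look like, assuming a recent (6.x) fabric8 client; the group/version/plural coordinates and the status fields are made up for the example, and a CRD with a status subresource would need the patch sent to that subresource instead:

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.dsl.base.PatchContext;
import io.fabric8.kubernetes.client.dsl.base.PatchType;
import io.fabric8.kubernetes.client.dsl.base.ResourceDefinitionContext;

// Hypothetical error reporting that bypasses the typed POJO entirely:
// address the custom resource through the generic API and merge-patch
// an error message into its status.
public class RawStatusPatcher {

    public static void reportDeserializationError(String namespace, String name, String message) {
        // Coordinates of the CRD; these values are illustrative only.
        ResourceDefinitionContext context = new ResourceDefinitionContext.Builder()
            .withGroup("sample.javaoperatorsdk.io")
            .withVersion("v1")
            .withPlural("webpages")
            .withNamespaced(true)
            .build();

        // Illustrative status shape; real CRDs define their own status fields.
        String patch = String.format(
            "{\"status\":{\"error\":true,\"message\":\"%s\"}}", message);

        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            client.genericKubernetesResources(context)
                .inNamespace(namespace)
                .withName(name)
                .patch(PatchContext.of(PatchType.JSON_MERGE), patch);
        }
    }
}
```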
I understand; still, not surfacing the issue in any evident way makes the problem hard to identify and debug (e.g. when a user reports this kind of issue).
I would say that we can refer to this: fabric8io/kubernetes-client#3681. Happy to hear more feedback / opinions! 🙂
I'm not sure what the proper solution is in this case but I'm definitely against crashing the operator because that would leave the door open for malicious actors to craft invalid custom resources to take down the operator.
What do you mean by that?
Fair, but we should find a way to notify that a problem occurred IMHO.
We could emit an event, possibly on the user CR or, worst case, on the operator.
That's an interesting idea. I've never used events, so I don't have experience with how they're used. However, they do seem short-lived, so they may be more easily missed than log inspection or alerting via monitoring?
Ideally, events are also persisted.
Sorry for the late reply. @csviri, do you have any link regarding the usage of events solely for "cluster state" events? I'm super interested in understanding this more! For this specific case, I think this is a decent UX:
In this way, people checking the CR itself will have the information about why the status is not getting updated.
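A minimal sketch of what attaching such an event to the failing CR could look like, assuming a fabric8 6.x client; the apiVersion/kind of the involved object and the reason string are placeholders:

```java
import io.fabric8.kubernetes.api.model.Event;
import io.fabric8.kubernetes.api.model.EventBuilder;
import io.fabric8.kubernetes.api.model.ObjectReferenceBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;

// Hypothetical sketch: attach a Warning event to the custom resource whose
// payload failed to deserialize, so `kubectl describe` on the CR surfaces it.
public class DeserializationEventEmitter {

    public static void emit(KubernetesClient client,
                            String namespace, String name, String uid, String message) {
        Event event = new EventBuilder()
            .withNewMetadata()
                .withGenerateName("deserialization-error-")
                .withNamespace(namespace)
            .endMetadata()
            .withType("Warning")
            .withReason("DeserializationFailed")
            .withMessage(message)
            // Reference the failing CR; kind and apiVersion are illustrative.
            .withInvolvedObject(new ObjectReferenceBuilder()
                .withApiVersion("sample.javaoperatorsdk.io/v1")
                .withKind("WebPage")
                .withNamespace(namespace)
                .withName(name)
                .withUid(uid)
                .build())
            .build();

        client.v1().events().inNamespace(namespace).resource(event).create();
    }
}
```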
I think there is no single best resource or definition, but for example here:
So I agree that this is useful to support. The problem is that if we are not able to deserialize, we don't even know the resource ID (name + namespace). But I'm pretty sure there is a way around this too.
@csviri instantiating an "untyped" (e.g. using
Yes, that is one way to approach it.
Why an informer, though? Couldn't we just deserialise the failed CR with
The failed CR doesn't reach back to the "user" code when an exception is thrown.
How would a generic informer work, though? Would that mean having a constantly running informer watching all the resources?
I think what @andreaTP means is that, when an error occurs during de-serialization of a resource, we could try to de-serialize it to
That's what I meant by:
Though I guess I'm not sure how that would work because, indeed, we don't have access to the deserialisation that the informer does.
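One hedged reading of the fallback discussed above, assuming fabric8's `Serialization` utilities and its `GenericKubernetesResource` type; the sample payload is made up for the example:

```java
import io.fabric8.kubernetes.api.model.GenericKubernetesResource;
import io.fabric8.kubernetes.client.utils.Serialization;

// Hypothetical fallback: when typed deserialization fails, re-parse the raw
// payload as a GenericKubernetesResource to recover the resource identity
// (name + namespace) needed for an event or status update.
public class GenericFallback {

    public static GenericKubernetesResource parseLeniently(String rawJson) {
        // GenericKubernetesResource keeps arbitrary fields in a map instead of
        // binding them to POJO properties, so unrecognized fields do not fail.
        return Serialization.unmarshal(rawJson, GenericKubernetesResource.class);
    }

    public static void main(String[] args) {
        String rawJson = "{"
            + "\"apiVersion\":\"sample.javaoperatorsdk.io/v1\","
            + "\"kind\":\"WebPage\","
            + "\"metadata\":{\"name\":\"broken-page\",\"namespace\":\"team-a\"},"
            + "\"spec\":{\"unknownField\":true}"
            + "}";
        GenericKubernetesResource cr = parseLeniently(rawJson);
        // With the identity recovered, an event or status update could be issued.
        System.out.printf("Failed CR: %s/%s%n",
            cr.getMetadata().getNamespace(), cr.getMetadata().getName());
    }
}
```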
Implementation-wise there might be a few challenges; last time I picked up something along those lines, I ended up with this: But, at the moment, using that mechanism would require instantiating 2 informers per resource.
That's what I was afraid of… 😭
This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days.
This is expected to be tackled as part of: #1422
It's not, IMO; this is a separate issue.
While I agree that we should try to implement this with a callback, for the other issue we now have an agreed design for the first iteration:
👍 I just tried to disable the
Bug Report
When deserialization of the CR fails, the operator should go into the error state (eventually retry the reconcile loop and possibly update the status with the error).
What did you do?
An unrecognized field in a CR causes deserialization to fail, but the operator stays in the running state.
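For illustration only, a minimal Jackson-level reproduction of this failure mode, using a hypothetical spec POJO and strict unknown-field handling (not the SDK's actual code path):

```java
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

// Illustration only: a strict mapper rejects a payload carrying a field the
// POJO does not declare, which is the failure mode reported in this issue.
public class UnknownFieldDemo {

    // Hypothetical spec class standing in for the real custom resource POJO.
    static class WebPageSpec {
        public String html;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper()
            .enable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);
        String json = "{\"html\":\"<h1>hi</h1>\",\"unknownField\":true}";
        // Throws UnrecognizedPropertyException because of "unknownField".
        mapper.readValue(json, WebPageSpec.class);
    }
}
```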
What did you expect to see?
The operator would update the error status of the CR or, at minimum, crash, since an uncaught exception has been thrown.
What did you see instead? Under which circumstances?
The operator should at least go into CrashLoopBackOff.
Environment
Kubernetes cluster type:
minikube
$ Mention java-operator-sdk version from pom.xml file
Quarkus SDK 3.0.7
$ java -version
Java 11
Reproduction
and `kubectl apply` this resource:

Resulting StackTrace:

but the operator is still running.