Operator should trigger the error state of the CR when deserialization fails #1170
Comments
@andreaTP given the type-safe nature of CRDs, this should basically never happen. I assume it happens when there are multiple resource versions and no conversion hooks in place; I think we had an issue about how to deal with this using labels. Consider also the situation where an Operator manages the CRs for the whole cluster. Let's say there are no proper conversion hooks and/or validation in place, and such an error happens in one namespace because the owner of that namespace manages to create a CR that the operator is not able to handle. The operator should still be able to manage the other custom resources on the cluster (for different namespaces / teams, or any other custom resource), so it should not stop working in general when such an error happens. I think this is rather a bug in the operator-related setup than an issue with the operator itself, so ideally an operator should never see this. But if such an error is present, IMO the cluster should have proper log aggregation (think ELK) and related alerting that notifies the platform engineer. So I'm not even sure about the status update, since this is not a problem with reconciliation. Also, I'm not sure how we would do an update on a POJO that cannot be deserialized. I do see the value of notifying users through the status in this case too, but this happens outside the reconciliation loop, so handling such an error would require a quite specific approach. Will think about that part and see how it can be done.
Correct, the situation described happens in case of "bugs" or misalignments between implementations (
In this case, the issue is reproducible with a single version of a single CRD.
I understand this technical limitation, but we can think about triggering a synthetic
This is what I meant by this:
In my experience this is what you have (or should have) on clusters anyway. Again, if there is an issue with de-serialization / serialization, even the updates might not work in general using the POJOs; maybe they would with the raw API and patching. But that again would probably need a specific error-handling mechanism for this case.
Could you please create an issue for the fabric8 client? If there is a bug in the generator, that should be fixed there. But anyway, thanks for this bug report!! It's definitely worth discussing whether we should handle such errors, and if so, how. I will think about it and try to come up with a solution, probably, as mentioned, with the raw API.
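For illustration, a minimal sketch of what reporting the error through the raw/generic API could look like, assuming a recent (6.x) fabric8 client; the group/version/plural coordinates and the status fields are made up for the example, and a CRD with a status subresource would need the patch sent to that subresource instead:

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.dsl.base.PatchContext;
import io.fabric8.kubernetes.client.dsl.base.PatchType;
import io.fabric8.kubernetes.client.dsl.base.ResourceDefinitionContext;

// Hypothetical error reporting that bypasses the typed POJO entirely:
// address the custom resource through the generic API and merge-patch
// an error message into its status.
public class RawStatusPatcher {

    public static void reportDeserializationError(String namespace, String name, String message) {
        // Coordinates of the CRD; these values are illustrative only.
        ResourceDefinitionContext context = new ResourceDefinitionContext.Builder()
            .withGroup("sample.javaoperatorsdk.io")
            .withVersion("v1")
            .withPlural("webpages")
            .withNamespaced(true)
            .build();

        // Illustrative status shape; real CRDs define their own status fields.
        String patch = String.format(
            "{\"status\":{\"error\":true,\"message\":\"%s\"}}", message);

        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            client.genericKubernetesResources(context)
                .inNamespace(namespace)
                .withName(name)
                .patch(PatchContext.of(PatchType.JSON_MERGE), patch);
        }
    }
}
```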
I understand; still, not surfacing the issue in any evident way makes the problem hard to identify and debug (e.g. when a user reports this kind of issue).
I would say that we can refer to this: fabric8io/kubernetes-client#3681. Happy to hear more feedback / opinions! 🙂
I'm not sure what the proper solution is in this case but I'm definitely against crashing the operator because that would leave the door open for malicious actors to craft invalid custom resources to take down the operator.
What do you mean by that?
Fair, but we should find a way to notify that a problem occurred IMHO.
We could emit an event, possibly on the user CR or, worst case, on the operator.
That's an interesting idea. I've never used events, so I don't have experience with how they're used. However, they do seem short-lived, so they may be more easily missed than log inspection or alerting via monitoring?
Ideally, events are also persisted.
Sorry for the late reply. @csviri, do you have any link regarding the usage of events solely for "cluster state" events? I'm super interested in understanding this more! For this specific case, I think this is a decent UX:
In this way, people checking the CR itself will have the information about why the status is not getting updated.
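A minimal sketch of what attaching such an event to the failing CR could look like, assuming a fabric8 6.x client; the apiVersion/kind of the involved object and the reason string are placeholders:

```java
import io.fabric8.kubernetes.api.model.Event;
import io.fabric8.kubernetes.api.model.EventBuilder;
import io.fabric8.kubernetes.api.model.ObjectReferenceBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;

// Hypothetical sketch: attach a Warning event to the custom resource whose
// payload failed to deserialize, so `kubectl describe` on the CR surfaces it.
public class DeserializationEventEmitter {

    public static void emit(KubernetesClient client,
                            String namespace, String name, String uid, String message) {
        Event event = new EventBuilder()
            .withNewMetadata()
                .withGenerateName("deserialization-error-")
                .withNamespace(namespace)
            .endMetadata()
            .withType("Warning")
            .withReason("DeserializationFailed")
            .withMessage(message)
            // Reference the failing CR; kind and apiVersion are illustrative.
            .withInvolvedObject(new ObjectReferenceBuilder()
                .withApiVersion("sample.javaoperatorsdk.io/v1")
                .withKind("WebPage")
                .withNamespace(namespace)
                .withName(name)
                .withUid(uid)
                .build())
            .build();

        client.v1().events().inNamespace(namespace).resource(event).create();
    }
}
```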
I think there is no single best resource or definition, but for example here:
So I agree that this is useful to support. The problem is that if we are not able to deserialize, we don't even know the resource ID (name + namespace). But I'm pretty sure there is a way around this too.
@csviri instantiating an "untyped" (e.g. using
Yes, that is one way to approach it.
Why an informer, though? Couldn't we just deserialise the failed CR with
The failed CR doesn't reach back to the "user" code when an exception is thrown.
How would a generic informer work, though? Would that mean having a constantly running informer watching all the resources?
I think what @andreaTP means is that, when an error occurs during de-serialization of a resource, we could try to de-serialize it to
That's what I meant by:
Though I guess I'm not sure how that would work because, indeed, we don't have access to the deserialisation that the informer does.
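One hedged reading of the fallback discussed above, assuming fabric8's `Serialization` utilities and its `GenericKubernetesResource` type; the sample payload is made up for the example:

```java
import io.fabric8.kubernetes.api.model.GenericKubernetesResource;
import io.fabric8.kubernetes.client.utils.Serialization;

// Hypothetical fallback: when typed deserialization fails, re-parse the raw
// payload as a GenericKubernetesResource to recover the resource identity
// (name + namespace) needed for an event or status update.
public class GenericFallback {

    public static GenericKubernetesResource parseLeniently(String rawJson) {
        // GenericKubernetesResource keeps arbitrary fields in a map instead of
        // binding them to POJO properties, so unrecognized fields do not fail.
        return Serialization.unmarshal(rawJson, GenericKubernetesResource.class);
    }

    public static void main(String[] args) {
        String rawJson = "{"
            + "\"apiVersion\":\"sample.javaoperatorsdk.io/v1\","
            + "\"kind\":\"WebPage\","
            + "\"metadata\":{\"name\":\"broken-page\",\"namespace\":\"team-a\"},"
            + "\"spec\":{\"unknownField\":true}"
            + "}";
        GenericKubernetesResource cr = parseLeniently(rawJson);
        // With the identity recovered, an event or status update could be issued.
        System.out.printf("Failed CR: %s/%s%n",
            cr.getMetadata().getNamespace(), cr.getMetadata().getName());
    }
}
```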
Implementation-wise there might be a few challenges; last time I picked up something along those lines, I ended up with this: But, at the moment, using that mechanism would require instantiating 2 informers per resource.
That's what I was afraid of… 😭
This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days.
This is expected to be tackled as part of: #1422
It's not, IMO; this is a separate issue.
While I agree that we should try to implement this with a callback, for the other issue we now have an agreed design for the first iteration:
👍 I just tried to disable the
Bug Report
When deserialization of the CR fails, the operator should go into the error state (eventually retry the reconcile loop and possibly update the status with the error).
What did you do?
An unrecognized field in a CR causes deserialization to fail, but the operator stays in the running state.
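For illustration only, a minimal Jackson-level reproduction of this failure mode, using a hypothetical spec POJO and strict unknown-field handling (not the SDK's actual code path):

```java
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

// Illustration only: a strict mapper rejects a payload carrying a field the
// POJO does not declare, which is the failure mode reported in this issue.
public class UnknownFieldDemo {

    // Hypothetical spec class standing in for the real custom resource POJO.
    static class WebPageSpec {
        public String html;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper()
            .enable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);
        String json = "{\"html\":\"<h1>hi</h1>\",\"unknownField\":true}";
        // Throws UnrecognizedPropertyException because of "unknownField".
        mapper.readValue(json, WebPageSpec.class);
    }
}
```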
What did you expect to see?
The operator would update the error status of the CR or, at minimum, crash, since an uncaught exception has been thrown.
What did you see instead? Under which circumstances?
The operator should at least go into CrashLoopBackOff.
Environment
Kubernetes cluster type:
minikube
$ Mention java-operator-sdk version from pom.xml file
Quarkus SDK 3.0.7
$ java -version
Java 11
Reproduction
and `kubectl apply` this resource:

Resulting StackTrace:

but the operator is still running.