Commit 968ce08

[chore] Expand operator troubleshooting docs (#1812)
* Expand operator troubleshooting docs
* Update docs/auto-instrumentation-install.md

  Co-authored-by: pszkamruk-splunk <[email protected]>
* patch
* patch

---------

Co-authored-by: pszkamruk-splunk <[email protected]>

1 parent db44b97 commit 968ce08

1 file changed: +126 -22 lines changed

docs/auto-instrumentation-install.md

Lines changed: 126 additions & 22 deletions
@@ -78,15 +78,15 @@ helm install splunk-otel-collector -f ./my_values.yaml --set operatorcrds.instal
 kubectl get pods
 # NAME READY STATUS
 # splunk-otel-collector-agent-lfthw 2/2 Running
-# splunk-otel-collector-cert-manager-6b9fb8b95f-2lmv4 1/1 Running
-# splunk-otel-collector-cert-manager-cainjector-6d65b6d4c-khcrc 1/1 Running
-# splunk-otel-collector-cert-manager-webhook-87b7ffffc-xp4sr 1/1 Running
 # splunk-otel-collector-k8s-cluster-receiver-856f5fbcf9-pqkwg 1/1 Running
 # splunk-otel-collector-opentelemetry-operator-56c4ddb4db-zcjgh 2/2 Running
 
-kubectl get mutatingwebhookconfiguration.admissionregistration.k8s.io
+kubectl get validatingwebhookconfiguration
+# NAME WEBHOOKS AGE
+# splunk-otel-collector-opentelemetry-operator-admission 3 14m
+
+kubectl get mutatingwebhookconfiguration
 # NAME WEBHOOKS AGE
-# splunk-otel-collector-cert-manager-webhook 1 14m
 # splunk-otel-collector-opentelemetry-operator-mutation 3 14m
 
 kubectl get otelinst
@@ -554,23 +554,127 @@ This method allows you to use a certificate that is trusted by external systems,
 
 For more advanced use cases, refer to the [official Helm chart documentation](https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-operator/values.yaml) for detailed configuration options and scenarios.
 
-### Troubleshooting the Operator and Cert Manager
-
-#### Check the logs for failures
-
-**Operator Logs:**
-
-```bash
-kubectl logs -l app.kubernetes.io/name=operator
-```
-
-**Cert-Manager Logs:**
-
-```bash
-kubectl logs -l app=certmanager
-kubectl logs -l app=cainjector
-kubectl logs -l app=webhook
-```
+### Troubleshooting the Operator
+
+#### General Debugging Steps
+In the following steps, the "operator namespace" refers to the namespace where the operator is deployed,
+which is the same namespace as the chart. The "API server namespace" usually defaults to `kube-system`,
+but this may vary depending on your Kubernetes distribution. If a namespace parameter is not explicitly
+provided, assume it refers to the operator or chart's namespace.
+
+- Check the logs for the operator to identify any issues:
+  ```bash
+  kubectl logs -l app.kubernetes.io/name=operator
+  ```
+- The operator webhooks must communicate with the Kubernetes API server. Errors related to webhook usage can often be found in the API server logs:
+  - For self-managed clusters, check logs directly:
+    ```bash
+    kubectl logs -n <apiserver-namespace> -l component=kube-apiserver
+    ```
+  - For managed clusters, follow the platform-specific steps to enable and view API server logs:
+    - [AKS: Monitor Logs](https://learn.microsoft.com/en-us/azure/aks/monitor-aks?tabs=cilium)
+    - [EKS: Enable or Disable Control Plane Logs](https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html)
+    - [GKE: View Logs](https://cloud.google.com/kubernetes-engine/docs/how-to/view-logs)
+    - [OpenShift: Logging and Monitoring](https://docs.openshift.com/container-platform/latest/logging/cluster-logging.html)
+- If using certmanager for TLS certificates, check its logs for issues:
+  ```bash
+  kubectl logs -l app=certmanager
+  kubectl logs -l app=cainjector
+  kubectl logs -l app=webhook
+  ```
+- **Verify Webhook Configurations**:
+  Check `MutatingWebhookConfiguration` and `ValidatingWebhookConfiguration`:
+  ```bash
+  kubectl get mutatingwebhookconfiguration
+  kubectl get validatingwebhookconfiguration
+  ```
+- **Inspect Network Policies**:
+  Ensure there are no network policies blocking communication between the namespace where the operator
+  resides and the namespace where the Kubernetes apiserver resides.
+  ```bash
+  kubectl get networkpolicy -n <operator-namespace>
+  kubectl get networkpolicy -n <apiserver-namespace>
+  ```
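
Note on the "Verify Webhook Configurations" step: listing the configurations shows that they exist, but not which Service each webhook calls. A minimal sketch that prints the `clientConfig.service` fields per webhook, assuming the configurations reference an in-cluster Service rather than a URL. `webhook_targets` and the `KUBECTL` override variable are hypothetical helper names introduced here, not part of kubectl or the chart:

```shell
# Sketch: list the Service each webhook in a configuration points at.
# KUBECTL is a hypothetical override hook (not a kubectl feature) so the
# function can be exercised without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# Usage: webhook_targets <kind> <name>
#   kind: mutatingwebhookconfiguration | validatingwebhookconfiguration
webhook_targets() {
  kind="$1"
  name="$2"
  # One output line per webhook: <webhook-name> TAB <namespace>/<service>:<port>
  "$KUBECTL" get "$kind" "$name" \
    -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.clientConfig.service.namespace}{"/"}{.clientConfig.service.name}{":"}{.clientConfig.service.port}{"\n"}{end}'
}
```

With the configuration name shown in the first hunk, a call might look like `webhook_targets mutatingwebhookconfiguration splunk-otel-collector-opentelemetry-operator-mutation`. The namespace and Service it prints are the ones the connectivity checks should target.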
+#### Checking Operator <-> API Server Connectivity steps
+Test Operator to API Server Connection
+1. Create a `busybox` pod in the Operator's namespace:
+   ```bash
+   kubectl run busybox-test --rm -it --restart=Never -n <operator-namespace> --image=busybox -- /bin/sh
+   ```
+2. Enter the `busybox` pod:
+   ```bash
+   kubectl exec -it busybox-test -n <operator-namespace> -- /bin/sh
+   ```
+3. Attempt to contact the API Server:
+   ```bash
+   wget --spider https://kubernetes.default.svc
+   ```
+4. If the connection fails, investigate:
+   - Network policies in the Operator's namespace.
+   - Service account permissions.
+
+Test API Server to Operator Webhook Connection
+1. Create a `busybox` pod in the API Server's namespace:
+   ```bash
+   kubectl run busybox-test --rm -it --restart=Never -n <apiserver-namespace> --image=busybox -- /bin/sh
+   ```
+2. Enter the `busybox` pod:
+   ```bash
+   kubectl exec -it busybox-test -n <apiserver-namespace> -- /bin/sh
+   ```
+3. Attempt to contact the Operator's webhook:
+   ```bash
+   wget --spider http://<operator-webhook-service>.<operator-namespace>.svc.cluster.local
+   ```
+4. If the connection fails, investigate:
+   - The `Service` and `Endpoints` for the Operator webhook.
+   - Network policies in the Operator's namespace.
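
Note on step 3 of the webhook test: the target follows the usual `<service>.<namespace>.svc.<cluster-domain>` pattern. The API server calls admission webhooks over TLS, so an `https://` URL may be closer to what it actually does than the plain `http://` form. A minimal sketch that builds both strings; the function names are hypothetical, and the `cluster.local` domain and port `443` defaults are assumptions to check against your cluster:

```shell
# Sketch: build the in-cluster DNS name and an HTTPS URL for a Service.
# Function names are hypothetical. "cluster.local" and port 443 are
# assumed defaults; override them if your cluster differs.

# Usage: svc_dns_name <service> <namespace> [cluster-domain]
svc_dns_name() {
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# Usage: webhook_url <service> <namespace> [port]
webhook_url() {
  printf 'https://%s:%s\n' "$(svc_dns_name "$1" "$2")" "${3:-443}"
}
```

Run `webhook_url <operator-webhook-service> <operator-namespace>` locally and paste the printed URL into the `wget --spider` call inside the test pod.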
+
+### Known Issues
+
+**Custom Network Policies or Security Layers**
+- **Cause:** Tools like Calico, Cilium, or custom firewalls may block communication between the API
+  server and the operator webhook.
+- **Resolution:**
+  - Before reaching out to Splunk Support, consult with your infrastructure or platform
+    team who set up your cluster. They may have implemented custom network policies or security layers
+    that could be affecting communication.
+  - If you are using networking or security solutions from a third-party Kubernetes solution provider,
+    be aware that these may include configurations or custom CRDs that can impact this operator's
+    functionality. Since these configurations vary widely per provider, we cannot provide specific
+    guidance for every product here. We recommend reviewing the providers configurations, CRD definitions,
+    and deployed CRD instances in your cluster to identify any settings related to networking or
+    security that might interfere with communication between the operator and the Kubernetes API server.
+    ```bash
+    kubectl get crds
+    kubectl get <crd-name> --all-namespaces
+    kubectl get <crd-name> -n <namespace> -o yaml
+    ```
+
+**[EKS/Cilium] API Server Error: "No endpoints available for service 'splunk-otel-collector-operator-webhook'"**
+- **Cause:** This is a general known issue in setups where the Kubernetes control plane cannot communicate
+  with admission webhooks, such as the operator's webhook, in other namespaces. This occurs because
+  the customer has deployed a custom networking solution (e.g., Cilium in overlay mode) that restricts
+  the expected communication between the control plane and webhooks that are not a part of the control
+  plane. The issue is not caused by the operator itself but by the limitations of the custom networking configuration.
+- **Resolution:**
+  - **Solution 1: Enable ENI Mode in Cilium**
+    - Update the AWS Cilium setup to use ENI mode. This configuration allows the control plane to communicate
+      with webhooks in other namespaces. Refer to the [Cilium ENI Documentation](https://docs.cilium.io/en/stable/gettingstarted/eni/).
+  - **Solution 2: Run the Operator in Host Network Mode**
+    - Modify the `splunk-otel-collector-chart` Helm chart values to enable host network mode for the operator:
+      ```yaml
+      operator:
+        hostNetwork: true
+      ```
+    - Apply the updated Helm chart configuration and redeploy the operator.
+    - **Note:** While this workaround resolves the issue, running the operator in host network mode is
+      considered a less secure practice and thus the 1st solution would be more favorable for security.
+
+- **Related Links:**
+  - [Cilium Issue #21959 How to use an admission webhook with Cilium?](https://github.com/cilium/cilium/issues/21959)
+  - [OpenTelemetry Operator Issue #2260 Webhook "address is not allowed" when creating an Instrumentation on EKS](https://github.com/open-telemetry/opentelemetry-operator/issues/2260)
+  - [Cilium Issue #30111 EKS Cilium in Overlay with ALB and webhooks: Address is not allowed](https://github.com/cilium/cilium/issues/30111)
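
Note on the EKS/Cilium entry: the two symptoms it names, "No endpoints available for service" in the heading and "Address is not allowed" in the related links, can be searched for in API server logs. A minimal sketch; `webhook_symptoms` is a hypothetical helper name, and the exact log wording may differ between Kubernetes versions, so an empty result does not rule the issue out:

```shell
# Sketch: filter log lines on stdin down to the webhook-connectivity
# symptoms quoted in the Known Issues section. The patterns are taken from
# that section; matching is case-insensitive because capitalization varies.
webhook_symptoms() {
  grep -i -E 'no endpoints available for service|address is not allowed'
}
```

On a self-managed cluster this can be fed from the command in the General Debugging Steps, for example `kubectl logs -n <apiserver-namespace> -l component=kube-apiserver | webhook_symptoms`.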
 
 ### Documentation Resources
 