[chore][deployment/databricks] Add debugging init script info (#6039)

crobert-1 · web-flow · commit 3bb6687befc4 · 2025-03-27T08:13:19.000-07:00
* [deployment/databricks] Add debugging init script info

* Update formatting and wording

* Changes requested by Josh

- Add information about org metric limits

* Add ending empty line
diff --git a/deployments/databricks/README.md b/deployments/databricks/README.md
@@ -89,3 +89,72 @@ The Databricks cluster provides a web terminal on the driver node. This is a BAS
 which can then be accessed to deploy the script.
 
 **Note: Investigation is ongoing to determine how to deploy the script on non-driver nodes.**
+
+## Debugging the Init Script
+
+Testing has shown that the init script can fail for a variety of reasons. This section is meant
+to help users investigate why metrics are not arriving from a Databricks cluster.
+
+### Init script setup
+
+As a first step, ensure the init script has been properly configured.
+
+1. Ensure all required environment variables are set in the script.
+1. Ensure all required environment variables are set in the cluster configuration.
+1. Ensure the cluster is configured to run the init script on startup.
+
+### Situation 1: Cluster fails to start due to init script failure
+
+- Enable [init script logging](https://learn.microsoft.com/en-us/azure/databricks/init-scripts/logs)
+- Read through the logs for errors that explain why the init script failed
+
+### Situation 2: Cluster is running but no data is seen in charts (enabled web terminal)
+
+#### Prerequisites
+
+- Enable the [web terminal](https://learn.microsoft.com/en-us/azure/databricks/admin/clusters/web-terminal)
+
+#### Investigate
+
+1. Access the [web terminal](https://learn.microsoft.com/en-us/azure/databricks/compute/web-terminal)
+
+1. Ensure the `splunk_otel_collector.service` is running
+
+    ```bash
+    $ systemctl status splunk_otel_collector.service # Check that the service is active (running)
+    ```
+
+1. Check contents of the Collector's configuration file
+
+    ```bash
+    $ cat /tmp/collector_download/config.yaml # This is the default location unless changed by the user.
+    ```
+
+1. Check syslogs for possible errors coming from the Collector
+
+    ```bash
+    $ tail -n 50 /var/log/syslog
+    ```
+
+1. If the service is running, the configuration looks right, and the syslogs show nothing
+   concerning, check the SignalFx backend for metrics to rule out a dashboard issue.
+   Note that at the time of writing, out-of-the-box (OOTB) content has not been updated for
+   OTel metrics. There is currently no OOTB content for Databricks, and the Apache Spark
+   dashboard is for Smart Agent metrics. The only charts that show data are a subset of
+   the host metric charts.
+
+1. Confirm metric time series (MTS) limits are not being hit for the organization.
+   - [MTS default limits per product](https://docs.splunk.com/observability/en/admin/references/per-product-limits.html#mts-limits-per-product)
+   - [Access organization metrics](https://docs.splunk.com/observability/en/admin/org-metrics.html#org-metrics)
+
+### Situation 3: Cluster is running but no data is seen in charts (disabled web terminal)
+
+1. Check the SignalFx backend for metrics to rule out a dashboard issue. Note that at the
+   time of writing, out-of-the-box (OOTB) content has not been updated for OTel metrics.
+   There is currently no OOTB content for Databricks, and the Apache Spark dashboard is
+   for Smart Agent metrics. The only charts that show data are a subset of the host
+   metric charts.
+
+1. Confirm metric time series (MTS) limits are not being hit for the organization.
+   - [MTS default limits per product](https://docs.splunk.com/observability/en/admin/references/per-product-limits.html#mts-limits-per-product)
+   - [Access organization metrics](https://docs.splunk.com/observability/en/admin/org-metrics.html#org-metrics)
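
Note (not part of the diff above): the syslog check in Situation 2 tails the last 50 lines, which can miss Collector errors on a busy driver node. A minimal sketch of a narrower filter follows. The function name `collector_log_errors` is hypothetical, and the sketch assumes that Collector log lines contain the string "otel"; adjust the pattern if the service logs under a different tag.

```bash
#!/usr/bin/env bash
# Sketch only: print recent Collector-related problem lines from a syslog file.
# Assumption: Collector entries contain "otel" (case-insensitive).

collector_log_errors() {
    local log_file="${1:-/var/log/syslog}"   # same file the README tails
    local max_lines="${2:-50}"               # how many matching lines to keep

    # Keep lines that mention the Collector, then keep only those that look
    # like problems, then show the most recent matches.
    grep -i 'otel' "$log_file" \
        | grep -iE 'error|warn|fail' \
        | tail -n "$max_lines"
}
```

From the web terminal this could be run as `collector_log_errors /var/log/syslog 20`. Empty output only means no matching lines were found, not that the Collector is healthy.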