Commit 3bb6687

[chore][deployment/databricks] Add debugging init script info (#6039)
* [deployment/databricks] Add debugging init script info
* Update formatting and wording
* Changes requested by Josh - Add information about org metric limits
* Add ending empty line
1 parent 8c4693f commit 3bb6687

1 file changed: +69 -0 lines changed

deployments/databricks/README.md

Lines changed: 69 additions & 0 deletions
@@ -89,3 +89,72 @@ The Databricks cluster provides a web terminal on the driver node. This is a BAS
which can then be accessed to deploy the script.

**Note: Investigation is ongoing to determine how to deploy the script on non-driver nodes.**

## Debugging the Init Script

In testing, the init script has failed for a variety of reasons. This section is meant
to help users investigate why metrics are not arriving from a Databricks cluster.

### Init script setup

As a first step, ensure that the init script has been properly configured.

1. Ensure all required environment variables are set in the script.
1. Ensure all required environment variables are set in the cluster configuration.
1. Ensure the cluster is configured to run the init script on startup.
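
The environment variable checks above can be partly automated. The following is a minimal sketch and is not part of the init script itself: `check_required_vars` is a hypothetical helper, and the variable names in the usage comment (`SPLUNK_ACCESS_TOKEN`, `SPLUNK_REALM`) are assumptions. Substitute whichever variables your copy of the script actually requires.

```bash
# Hypothetical helper: report which of the named environment variables are
# unset or empty. Returns 0 if all are set, 1 otherwise.
check_required_vars() {
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    # Only pass trusted, literal variable names, since eval is used here.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage (variable names are assumptions, adjust to your script):
#   check_required_vars SPLUNK_ACCESS_TOKEN SPLUNK_REALM || echo "fix the configuration"
```

Running the helper from a notebook `%sh` cell or the web terminal only shows the environment of that shell, so treat a "missing" result as a prompt to re-check the cluster configuration rather than as proof of a misconfiguration.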
105+
106+
### Situation 1: Cluster fails to start due to init script failure
107+
108+
- Enable [init script logging](https://learn.microsoft.com/en-us/azure/databricks/init-scripts/logs)
109+
- Read through logs to see if any relevant information can be found
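
Once logging is enabled, reading the logs can be narrowed to the files that actually contain errors. This is a minimal sketch under two assumptions: the log directory has been copied or mounted somewhere readable, and the stderr files end in `.stderr.log` as described in the linked documentation. `show_init_script_errors` is a hypothetical helper name.

```bash
# Hypothetical helper: print every non-empty init script stderr log under
# the given directory, followed by the last 20 lines of each.
show_init_script_errors() {
  log_dir=$1
  # -size +0c matches files larger than zero bytes, so empty logs are skipped.
  find "$log_dir" -type f -name '*.stderr.log' -size +0c | sort |
    while IFS= read -r log; do
      echo "== $log"
      tail -n 20 "$log"
    done
}

# Example usage (the path is a placeholder, not a real location):
#   show_init_script_errors "/path/to/copied/init_scripts"
```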

### Situation 2: Cluster is running but no data is seen in charts (enabled web terminal)

#### Pre-requisites

- Enable the [web terminal](https://learn.microsoft.com/en-us/azure/databricks/admin/clusters/web-terminal)

#### Investigate

1. Access the [web terminal](https://learn.microsoft.com/en-us/azure/databricks/compute/web-terminal)

1. Ensure the `splunk_otel_collector.service` is running

   ```bash
   $ systemctl # Check the output for splunk_otel_collector.service
   ```

1. Check the contents of the Collector's configuration file

   ```bash
   $ cat /tmp/collector_download/config.yaml # Default location unless changed by the user
   ```

1. Check the syslogs for possible errors coming from the Collector

   ```bash
   $ tail -n 50 /var/log/syslog
   ```

1. If the service is running, the configuration looks right, and nothing in the syslogs looks
   concerning, check the SignalFx backend for the metrics directly; the problem may be in the
   dashboards rather than in data collection. Note that, at the time of writing, out-of-the-box
   (OOTB) content has not been updated for OTel metrics. There is currently no OOTB content
   for Databricks, and the Apache Spark dashboard is built for Smart Agent metrics. The only
   charts that show data are a subset of the host metric charts.

1. Confirm that metric time series (MTS) limits are not being hit for the organization.
   - [MTS default limits per product](https://docs.splunk.com/observability/en/admin/references/per-product-limits.html#mts-limits-per-product)
   - [Access organization metrics](https://docs.splunk.com/observability/en/admin/org-metrics.html#org-metrics)
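
The service and syslog checks above can be combined into a couple of small helpers for the web terminal. This is a sketch, not an official tool: the function names are hypothetical, and the default pattern `otel` is an assumption about how the Collector's lines appear in the syslog, so adjust it if your log lines are tagged differently.

```bash
# Service name as given in the steps above.
SERVICE="splunk_otel_collector.service"

# Print the systemd state of the service (for example "active" or "inactive").
service_state() {
  systemctl is-active "$SERVICE" || true
}

# Print recent log lines that mention the Collector and look like problems.
# $1: log file (defaults to /var/log/syslog)
# $2: pattern identifying Collector lines (default "otel" is an assumption)
collector_problems() {
  log_file=${1:-/var/log/syslog}
  pattern=${2:-otel}
  # Look at a larger window than 50 lines, since most lines are filtered out.
  tail -n 500 "$log_file" | grep -i -- "$pattern" | grep -iE 'error|warn|fail' || true
}

# Example usage:
#   service_state
#   collector_problems
```

An empty result from `collector_problems` only means that nothing matched the pattern and keywords; it is still worth reading the raw syslog tail when in doubt.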

### Situation 3: Cluster is running but no data is seen in charts (disabled web terminal)

1. Check the SignalFx backend for the metrics directly; the problem may be in the dashboards
   rather than in data collection. Note that, at the time of writing, out-of-the-box (OOTB)
   content has not been updated for OTel metrics. There is currently no OOTB content for
   Databricks, and the Apache Spark dashboard is built for Smart Agent metrics. The only
   charts that show data are a subset of the host metric charts.

1. Confirm that metric time series (MTS) limits are not being hit for the organization.
   - [MTS default limits per product](https://docs.splunk.com/observability/en/admin/references/per-product-limits.html#mts-limits-per-product)
   - [Access organization metrics](https://docs.splunk.com/observability/en/admin/org-metrics.html#org-metrics)
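
Checking the backend for metrics does not have to go through a dashboard. The sketch below queries the metric time series metadata endpoint from any machine with `curl`. Treat the details as assumptions to verify against the API reference: the helper names and the `SFX_API_TOKEN` variable are hypothetical, the realm and metric name are whatever applies to your organization, and the token must be an API access token with permission to read metadata.

```bash
# Build the metadata endpoint URL for a realm (for example "us1").
mts_endpoint() {
  echo "https://api.$1.signalfx.com/v2/metrictimeseries"
}

# Query for time series of one metric name.
# $1: realm, $2: metric name
# SFX_API_TOKEN is a hypothetical variable holding an API access token.
find_mts() {
  realm=$1
  metric=$2
  curl -sS -G "$(mts_endpoint "$realm")" \
    -H "X-SF-TOKEN: ${SFX_API_TOKEN:?set SFX_API_TOKEN first}" \
    --data-urlencode "query=sf_metric:$metric" \
    --data-urlencode "limit=5"
}

# Example usage (realm and metric name are placeholders):
#   find_mts us1 some.metric.name
```

If the query returns an empty result set, the data is probably not reaching the backend; if it returns time series, the gap is more likely in the dashboards.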
