@@ -89,3 +89,72 @@ The Databricks cluster provides a web terminal on the driver node. This is a BAS
which can then be accessed to deploy the script.
**Note: Investigation is ongoing to determine how to deploy the script on non-driver nodes.**
+
+ ## Debugging the Init Script
+
+ In testing, the init script has failed for a variety of reasons. This section is meant
+ to help users investigate why metrics are not arriving from a Databricks cluster.
+
+ ### Init script setup
+
+ As a first step, confirm that the init script is configured correctly.
+
+ 1. Ensure all required environment variables are set in the script.
+ 1. Ensure all required environment variables are set in the cluster configuration.
+ 1. Ensure the cluster is configured to run the init script on startup.
+
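The first two checks above can be partly automated. The following is a minimal sketch, assuming a POSIX shell on the driver node. The function name `check_required_vars` is hypothetical, and this document does not list the variables the init script requires, so pass in the names your script actually uses.

```shell
# Hypothetical helper: print each named environment variable that is unset
# or empty, and return non-zero if any are missing.
check_required_vars() {
  missing=0
  for name in "$@"; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing: ${name}"
      missing=1
    fi
  done
  return "${missing}"
}
```

For example, `check_required_vars EXAMPLE_VAR_ONE EXAMPLE_VAR_TWO` (placeholder names) prints one `missing:` line per unset variable and returns a non-zero status if any are missing.
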
+ ### Situation 1: Cluster fails to start due to init script failure
+
+ - Enable [init script logging](https://learn.microsoft.com/en-us/azure/databricks/init-scripts/logs)
+ - Read through the logs for errors that explain why the init script failed
+
+ ### Situation 2: Cluster is running but no data is seen in charts (enabled web terminal)
+
+ #### Pre-requisites
+
+ - Enable the [web terminal](https://learn.microsoft.com/en-us/azure/databricks/admin/clusters/web-terminal)
+
+ #### Investigate
+
+ 1. Access the [web terminal](https://learn.microsoft.com/en-us/azure/databricks/compute/web-terminal)
+
+ 1. Ensure the `splunk_otel_collector.service` is running
+
+    ```bash
+    $ systemctl status splunk_otel_collector.service # Check that the service is active (running)
+    ```
+
+ 1. Check the contents of the Collector's configuration file
+
+    ```bash
+    $ cat /tmp/collector_download/config.yaml # Default location unless changed by the user
+    ```
+
+ 1. Check the syslog for possible errors coming from the Collector
+
+    ```bash
+    $ tail -n 50 /var/log/syslog
+    ```
+
+ 1. If the service is running, the configuration looks right, and nothing in the syslog
+    looks concerning, check the SignalFx backend for metrics to rule out a dashboard
+    issue. Note that at the time of writing, out-of-the-box (OOTB) content has not been
+    updated for OTel metrics. There is currently no OOTB content for Databricks, and the
+    Apache Spark dashboard is for Smart Agent metrics. The only charts that show data are
+    a subset of host metric charts.
+
146
+ 1. Confirm metric time series (MTS) limits are not being hit for the organization.
147
+ - [MTS default limits per product](https://docs.splunk.com/observability/en/admin/references/per-product-limits.html#mts-limits-per-product)
148
+ - [Access organization metrics](https://docs.splunk.com/observability/en/admin/org-metrics.html#org-metrics)
149
+
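The syslog check in the list above can be narrowed so that unrelated entries do not crowd out Collector messages. This is a minimal sketch, assuming a POSIX shell in the web terminal. `collector_log_lines` is a hypothetical helper name, and the default search pattern `otel` is an assumption about how the Collector's entries appear in syslog; adjust it if your entries are tagged differently.

```shell
# Hypothetical helper: print the last N lines (default 50) of a log file
# that contain the given pattern (default "otel"), ignoring case.
collector_log_lines() {
  log_file="$1"
  pattern="${2:-otel}"
  max_lines="${3:-50}"
  grep -i -- "${pattern}" "${log_file}" | tail -n "${max_lines}"
}
```

For example, run `collector_log_lines /var/log/syslog` after checking the service state with `systemctl is-active splunk_otel_collector.service`. Empty output only means that no line matched the pattern, so fall back to the plain `tail` command before concluding the Collector logged nothing.
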
+ ### Situation 3: Cluster is running but no data is seen in charts (disabled web terminal)
+
+ 1. Check the SignalFx backend for metrics to rule out a dashboard issue.
+    Note that at the time of writing, out-of-the-box (OOTB) content has not been updated
+    for OTel metrics. There is currently no OOTB content for Databricks, and the Apache
+    Spark dashboard is for Smart Agent metrics. The only charts that show data are a
+    subset of host metric charts.
+
+ 1. Confirm metric time series (MTS) limits are not being hit for the organization.
+    - [MTS default limits per product](https://docs.splunk.com/observability/en/admin/references/per-product-limits.html#mts-limits-per-product)
+    - [Access organization metrics](https://docs.splunk.com/observability/en/admin/org-metrics.html#org-metrics)