Commit 38f6006

feat: add support for databricks (#169)
1 parent 5003bc9 commit 38f6006

20 files changed, +2191 -198 lines changed

Makefile

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ init:
1010
[ -d $(VENV) ] || python3 -m venv $(VENV)
1111
$(BIN)/pip install -r requirements-dev.txt
1212
$(BIN)/pre-commit install
13-
$(BIN)/pip install -e .[snowflake,bigquery]
13+
$(BIN)/pip install -e .[snowflake,bigquery,databricks]
1414

1515
lint:
1616
$(BIN)/black raster_loader setup.py

README.md

Lines changed: 180 additions & 64 deletions
@@ -18,19 +18,76 @@ The Raster Loader documentation is available at [raster-loader.readthedocs.io](h
1818

1919
```bash
2020
pip install -U raster-loader
21-
22-
pip install -U raster-loader"[bigquery]"
23-
pip install -U raster-loader"[snowflake]"
2421
```
2522

26-
### Installing from source
23+
To install from source:
2724

2825
```bash
2926
git clone https://github.com/cartodb/raster-loader
3027
cd raster-loader
31-
pip install .
28+
pip install -U .
29+
```
30+
31+
> **Tip**: In most cases, it is recommended to install Raster Loader in a virtual environment. Use [venv](https://docs.python.org/3/library/venv.html) to create and manage your virtual environment.
32+
33+
The above will install the dependencies required to work with all cloud providers (BigQuery, Snowflake, Databricks). If you only want to work with one of them, you can install the dependencies for each separately:
34+
35+
```bash
36+
pip install -U "raster-loader[bigquery]"
37+
pip install -U "raster-loader[snowflake]"
38+
pip install -U "raster-loader[databricks]"
39+
```
40+
41+
For Databricks, you will also need to install the [databricks-connect](https://pypi.org/project/databricks-connect/) package corresponding to your Databricks Runtime Version. For example, if your cluster uses DBR 15.1, install:
42+
43+
```bash
44+
pip install databricks-connect==15.1
45+
```
46+
47+
You can find your cluster's DBR version in the Databricks UI under Compute > Your Cluster > Configuration > Databricks Runtime version.
48+
Or you can run the following SQL query from your cluster:
49+
50+
```sql
51+
SELECT current_version();
52+
```
53+
54+
To verify the installation was successful, run:
55+
56+
```bash
57+
carto info
3258
```
3359

60+
This command will display system information including the installed Raster Loader version.
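
As a complement to `carto info`, the installed versions can also be checked from Python. This is a minimal sketch using only the standard library; it assumes the distribution names match the `pip install` commands above (`raster-loader` and `databricks-connect`), and the helper name `installed_version` is made up for this example.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are taken from the pip commands in this README.
for name in ("raster-loader", "databricks-connect"):
    print(name, installed_version(name) or "not installed")
```

For Databricks, comparing the reported `databricks-connect` version against the cluster's DBR version is a quick way to spot a mismatch before uploading.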
61+
62+
## Prerequisites
63+
64+
Before using Raster Loader with each platform, you need to have the following set up:
65+
66+
**BigQuery:**
67+
- A [GCP project](https://cloud.google.com/resource-manager/docs/creating-managing-projects)
68+
- A [BigQuery dataset](https://cloud.google.com/bigquery/docs/datasets-intro)
69+
- The `GOOGLE_APPLICATION_CREDENTIALS` environment variable set to the path of a JSON file containing your BigQuery credentials. See the [GCP documentation](https://cloud.google.com/docs/authentication/provide-credentials-adc#local-key) for more information.
70+
71+
**Snowflake:**
72+
- A Snowflake account
73+
- A Snowflake database
74+
- A Snowflake schema
75+
76+
**Databricks:**
77+
- A [Databricks server hostname](https://docs.databricks.com/aws/en/integrations/compute-details)
78+
- A [Databricks cluster id](https://learn.microsoft.com/en-us/azure/databricks/workspace/workspace-details#cluster-url)
79+
- A [Databricks token](https://docs.databricks.com/aws/en/dev-tools/auth/pat)
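
The BigQuery prerequisite above can be checked before running an upload. The following is a hedged sketch, not part of Raster Loader: the function name `check_bigquery_credentials` and its return strings are invented for illustration. It only verifies that the variable is set and points to an existing file; it does not validate the key itself.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def check_bigquery_credentials(env: Optional[Mapping[str, str]] = None) -> str:
    """Report whether GOOGLE_APPLICATION_CREDENTIALS looks usable."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not Path(path).is_file():
        return f"GOOGLE_APPLICATION_CREDENTIALS points to a missing file: {path}"
    return "ok"


print(check_bigquery_credentials())
```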
80+
81+
**Raster files**
82+
83+
The input raster must be a `GoogleMapsCompatible` raster. You can make your raster compatible by converting it with the following GDAL command:
84+
85+
```bash
86+
gdalwarp -of COG -co TILING_SCHEME=GoogleMapsCompatible -co COMPRESS=DEFLATE -co OVERVIEWS=IGNORE_EXISTING -co ADD_ALPHA=NO -co RESAMPLING=NEAREST -co BLOCKSIZE=512 <input_raster>.tif <output_raster>.tif
87+
```
88+
89+
Your raster file must be in a format that can be [read by GDAL](https://gdal.org/drivers/raster/index.html) and processed with [rasterio](https://rasterio.readthedocs.io/en/latest/).
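
When converting many files, the `gdalwarp` call above can be assembled in Python. This sketch only builds the argument list with the same options shown in the command; the function name `gdalwarp_cog_command` is hypothetical, and actually running the command assumes GDAL is installed and on `PATH`.

```python
import shlex
from typing import List


def gdalwarp_cog_command(src: str, dst: str, blocksize: int = 512) -> List[str]:
    """Build the gdalwarp arguments for a GoogleMapsCompatible COG."""
    return [
        "gdalwarp",
        "-of", "COG",
        "-co", "TILING_SCHEME=GoogleMapsCompatible",
        "-co", "COMPRESS=DEFLATE",
        "-co", "OVERVIEWS=IGNORE_EXISTING",
        "-co", "ADD_ALPHA=NO",
        "-co", "RESAMPLING=NEAREST",
        "-co", f"BLOCKSIZE={blocksize}",
        src,
        dst,
    ]


cmd = gdalwarp_cog_command("input.tif", "output.tif")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems with file paths that contain spaces.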
90+
3491
## Usage
3592

3693
There are two ways you can use Raster Loader:
@@ -42,105 +99,164 @@ There are two ways you can use Raster Loader:
4299

43100
After installing Raster Loader, you can run the CLI by typing `carto` in your terminal.
44101

45-
Currently, Raster Loader supports uploading raster data to [BigQuery](https://cloud.google.com/bigquery).
46-
Accessing BigQuery with Raster Loader requires the
47-
`GOOGLE_APPLICATION_CREDENTIALS` environment variable to be set to the path of a JSON
48-
file containing your BigQuery credentials. See the
49-
[GCP documentation](https://cloud.google.com/docs/authentication/provide-credentials-adc#local-key)
50-
for more information.
51-
52-
Two commands are available:
102+
Currently, Raster Loader allows you to upload a local raster file to BigQuery, Snowflake, or Databricks tables. You can also download and inspect raster files from these platforms.
53103

54-
#### Uploading to BigQuery
104+
#### Uploading Raster Data
55105

56-
`carto bigquery upload` loads raster data from a local file to a BigQuery table.
57-
At a minimum, the `carto bigquery upload` command requires a `file_path` to a local
58-
raster file that can be [read by GDAL](https://gdal.org/drivers/raster/index.html) and processed with [rasterio](https://rasterio.readthedocs.io/en/latest/). It also requires
59-
the `project` (the [GCP project name](https://cloud.google.com/resource-manager/docs/creating-managing-projects))
60-
and `dataset` (the [BigQuery dataset name](https://cloud.google.com/bigquery/docs/datasets-intro))
61-
parameters. There are also additional parameters, such as `table` ([BigQuery table
62-
name](https://cloud.google.com/bigquery/docs/tables-intro)) and `overwrite` (to
63-
overwrite existing data).
64-
65-
For example:
66-
67-
``` bash
106+
Examples for each platform:
68107

108+
**BigQuery:**
109+
```bash
69110
carto bigquery upload \
70111
--file_path /path/to/my/raster/file.tif \
71112
--project my-gcp-project \
72113
--dataset my-bigquery-dataset \
73114
--table my-bigquery-table \
74115
--overwrite
116+
```
75117

118+
**Snowflake:**
119+
```bash
120+
carto snowflake upload \
121+
--file_path /path/to/my/raster/file.tif \
122+
--database my-snowflake-database \
123+
--schema my-snowflake-schema \
124+
--table my-snowflake-table \
125+
--account my-snowflake-account \
126+
--username my-snowflake-user \
127+
--password my-snowflake-password \
128+
--overwrite
76129
```
77130

78-
This command uploads the TIFF file from `/path/to/my/raster/file.tif` to a BigQuery
79-
project named `my-gcp-project`, a dataset named `my-bigquery-dataset`, and a table
80-
named `my-bigquery-table`. If the table already contains data, this data will be
81-
overwritten because the `--overwrite` flag is set.
131+
Note that authentication parameters are explicitly required since they are not set up in the environment.
82132

83-
#### Inspecting a raster file on BigQuery
133+
**Databricks:**
134+
```bash
135+
carto databricks upload \
136+
--file_path /path/to/my/raster/file.tif \
137+
--catalog my-databricks-catalog \
138+
--schema my-databricks-schema \
139+
--table my-databricks-table \
140+
--server-hostname my-databricks-server-hostname \
141+
--cluster-id my-databricks-cluster-id \
142+
--token my-databricks-token \
143+
--overwrite
144+
```
84145

85-
Use the `carto bigquery describe` command to retrieve information about a raster file
86-
stored in a BigQuery table.
146+
Note that authentication parameters are explicitly required since they are not set up in the environment.
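
Because the credentials must be passed explicitly, one option is to keep them out of scripts and fill them in at call time. The sketch below builds the `carto databricks upload` arguments shown above from a mapping of settings. The environment variable names (`DATABRICKS_SERVER_HOSTNAME`, `DATABRICKS_CLUSTER_ID`, `DATABRICKS_TOKEN`) are a convention chosen for this example, not something Raster Loader is known to read, and the function name is hypothetical.

```python
import os
from typing import List, Mapping


def databricks_upload_args(
    file_path: str,
    catalog: str,
    schema: str,
    table: str,
    env: Mapping[str, str],
    overwrite: bool = False,
) -> List[str]:
    """Assemble the CLI arguments for `carto databricks upload`."""
    args = [
        "carto", "databricks", "upload",
        "--file_path", file_path,
        "--catalog", catalog,
        "--schema", schema,
        "--table", table,
        # Hypothetical variable names; a KeyError signals a missing setting.
        "--server-hostname", env["DATABRICKS_SERVER_HOSTNAME"],
        "--cluster-id", env["DATABRICKS_CLUSTER_ID"],
        "--token", env["DATABRICKS_TOKEN"],
    ]
    if overwrite:
        args.append("--overwrite")
    return args


# Usage (requires the variables to be set and the CLI to be installed):
# subprocess.run(databricks_upload_args("file.tif", "cat", "sch", "tbl", os.environ), check=True)
```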
87147

88-
At a minimum, this command requires a
89-
[GCP project name](https://cloud.google.com/resource-manager/docs/creating-managing-projects),
90-
a [BigQuery dataset name](https://cloud.google.com/bigquery/docs/datasets-intro), and a
91-
[BigQuery table name](https://cloud.google.com/bigquery/docs/tables-intro).
148+
Additional features include:
149+
- Specifying bands with `--band` and `--band_name`
150+
- Enabling compression with `--compress` and `--compression-level`
151+
- Chunking large uploads with `--chunk_size`
92152

93-
For example:
153+
#### Inspecting Raster Data
94154

95-
``` bash
155+
To inspect a raster file stored in any platform, use the `describe` command:
156+
157+
**BigQuery:**
158+
```bash
96159
carto bigquery describe \
97160
--project my-gcp-project \
98161
--dataset my-bigquery-dataset \
99162
--table my-bigquery-table
100163
```
101164

165+
**Snowflake:**
166+
```bash
167+
carto snowflake describe \
168+
--database my-snowflake-database \
169+
--schema my-snowflake-schema \
170+
--table my-snowflake-table \
171+
--account my-snowflake-account \
172+
--username my-snowflake-user \
173+
--password my-snowflake-password
174+
```
175+
176+
Note that authentication parameters are explicitly required since they are not set up in the environment.
177+
178+
**Databricks:**
179+
```bash
180+
carto databricks describe \
181+
--catalog my-databricks-catalog \
182+
--schema my-databricks-schema \
183+
--table my-databricks-table \
184+
--server-hostname my-databricks-server-hostname \
185+
--cluster-id my-databricks-cluster-id \
186+
--token my-databricks-token
187+
```
188+
189+
Note that authentication parameters are explicitly required since they are not set up in the environment.
190+
191+
For a complete list of options and commands, run `carto --help` or see the [full documentation](https://raster-loader.readthedocs.io/en/latest/user_guide/cli.html).
192+
102193
### Using Raster Loader as a Python library
103194

104-
After installing Raster Loader, you can import the package into your Python project. For
105-
example:
195+
After installing Raster Loader, you can use it in your Python project.
196+
197+
First, import the corresponding connection class for your platform:
198+
199+
```python
200+
# For BigQuery
201+
from raster_loader import BigQueryConnection
202+
203+
# For Snowflake
204+
from raster_loader import SnowflakeConnection
205+
206+
# For Databricks
207+
from raster_loader import DatabricksConnection
208+
```
209+
210+
Then, create a connection object with the appropriate parameters:
211+
212+
```python
213+
# For BigQuery
214+
connection = BigQueryConnection('my-project')
106215

107-
``` python
108-
from raster_loader import rasterio_to_bigquery, bigquery_to_records
216+
# For Snowflake
217+
connection = SnowflakeConnection('my-user', 'my-password', 'my-account', 'my-database', 'my-schema')
218+
219+
# For Databricks
220+
connection = DatabricksConnection('my-server-hostname', 'my-token', 'my-cluster-id')
109221
```
110222

111-
Currently, Raster Loader supports uploading raster data to [BigQuery](https://cloud.google.com/bigquery). Accessing BigQuery with Raster Loader requires the
112-
`GOOGLE_APPLICATION_CREDENTIALS` environment variable to be set to the path of a JSON
113-
file containing your BigQuery credentials. See the
114-
[GCP documentation](https://cloud.google.com/docs/authentication/provide-credentials-adc#local-key)
115-
for more information.
223+
#### Uploading a raster file
116224

117-
You can use Raster Loader to upload a local raster file to an existing
118-
BigQuery table using the `rasterio_to_bigquery()` function:
225+
To upload a raster file, use the `upload_raster` function:
119226

120-
``` python
121-
rasterio_to_bigquery(
227+
```python
228+
connection.upload_raster(
122229
file_path = 'path/to/raster.tif',
123-
project_id = 'my-project',
124-
dataset_id = 'my_dataset',
125-
table_id = 'my_table',
230+
fqn = 'database.schema.tablename'
126231
)
127232
```
128233

129234
This function returns `True` if the upload was successful.
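
The `fqn` argument is a dotted, three-part name such as `database.schema.tablename`. A small helper can assemble it and catch empty parts early. This is an illustrative sketch: `build_fqn` is not part of Raster Loader, and rejecting dots inside a part is a simplification that may not suit every platform's naming rules.

```python
def build_fqn(database: str, schema: str, table: str) -> str:
    """Join three name parts into a dotted fully qualified name."""
    parts = (database, schema, table)
    for part in parts:
        if not part:
            raise ValueError("fqn parts must be non-empty")
        if "." in part:
            raise ValueError(f"fqn part must not contain a dot: {part!r}")
    return ".".join(parts)


print(build_fqn("database", "schema", "tablename"))
```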
130235

131-
You can also access and inspect a raster file from a BigQuery table using the
132-
`bigquery_to_records()` function:
236+
You can enable compression of the band data to reduce storage size:
237+
238+
```python
239+
connection.upload_raster(
240+
file_path = 'path/to/raster.tif',
241+
fqn = 'database.schema.tablename',
242+
compress = True, # Enable gzip compression of band data
243+
compression_level = 3 # Optional: Set compression level (1-9, default=6)
244+
)
245+
```
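
To get a feel for what the compression level trades off, the following standalone sketch compresses a synthetic band with Python's `gzip` module. It is only an illustration of gzip levels on band-like bytes; how Raster Loader applies compression internally is not shown in this README, and the synthetic data and the `compress_band` helper are assumptions made for the example.

```python
import gzip
import struct

# Synthetic 256x256 band of unsigned 16-bit values with smooth, repetitive rows.
values = [(i // 256) % 64 for i in range(256 * 256)]
raw = struct.pack(f"<{len(values)}H", *values)


def compress_band(data: bytes, level: int = 6) -> bytes:
    """Gzip-compress raw band bytes at the given level (1-9)."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    return gzip.compress(data, compresslevel=level)


for level in (1, 3, 6, 9):
    packed = compress_band(raw, level)
    # Round-trip check: decompression must restore the original bytes.
    assert gzip.decompress(packed) == raw
    print(f"level={level}: {len(raw)} -> {len(packed)} bytes")
```

Higher levels generally produce smaller output at the cost of more CPU time; real rasters with noisy values will compress less than this synthetic band.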
246+
247+
#### Inspecting a raster file
133248

134-
``` python
135-
records_df = bigquery_to_records(
136-
project_id = 'my-project',
137-
dataset_id = 'my_dataset',
138-
table_id = 'my_table',
249+
To access and inspect a raster file stored in any platform, use the `get_records` function:
250+
251+
```python
252+
records = connection.get_records(
253+
fqn = 'database.schema.tablename'
139254
)
140255
```
141256

142-
This function returns a DataFrame with some samples from the raster table on BigQuery
143-
(10 rows by default).
257+
This function returns a DataFrame with some samples from the raster table (10 rows by default).
258+
259+
For more details, see the [full documentation](https://raster-loader.readthedocs.io/en/latest/user_guide/use_with_python.html).
144260

145261
## Development
146262
