I'm using xarray to aggregate 38 GB worth of NetCDF data into a bunch of CSV reports.
I have two problems:
- The reports are 500,000 rows by 2,000 columns. Before somebody says "if you're using CSV for this size of data you're doing it wrong" - yes, I know, but it was the only way to make the data accessible to a bunch of people who only know how to use Excel and VBA. 😫
  The sheer size of the reports means that (1) it's unsavory to keep the whole thing in RAM and (2) pandas `to_csv` will take ages to complete, as it's single-threaded. The slowness is compounded by the fact that I have to compress everything with gzip.
- I have to produce up to 40 reports from the exact same NetCDF files. I use dask to perform the computation, and different reports share a large amount of intermediate graph nodes, so I need to do everything in a single invocation to `dask.compute()` to allow the dask scheduler to de-duplicate the nodes.
To solve both problems, I wrote a new function:
http://xarray-extras.readthedocs.io/en/latest/api/csv.html
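The core idea, very roughly, is to render and gzip-compress each row-chunk in its own dask task and then concatenate the resulting gzip members (a concatenation of gzip members is still a valid gzip stream). The sketch below is *not* the xarray-extras implementation, just a minimal illustration with made-up names (`chunked_to_csv`, `_fragment`); it assumes the DataArray is chunked only along the rows and has pandas indexes on both dimensions:

```python
import gzip

import dask
import numpy as np
import pandas as pd
import xarray


@dask.delayed
def _fragment(values, row_labels, col_labels, header):
    # Render one row-chunk to CSV text and gzip-compress it inside a worker task.
    df = pd.DataFrame(values, index=row_labels, columns=col_labels)
    return gzip.compress(df.to_csv(header=header).encode('utf-8'))


def chunked_to_csv(a: xarray.DataArray, path: str):
    """Return a dask.delayed that writes `a` to a gzipped CSV, chunk by chunk."""
    row_dim, col_dim = a.dims
    row_index = a.indexes[row_dim]
    col_index = a.indexes[col_dim]

    # One delayed numpy block per row-chunk (assumes no chunking along columns)
    blocks = a.data.to_delayed().ravel()
    offsets = np.cumsum((0,) + a.chunks[0])

    fragments = [
        _fragment(block, row_index[start:stop], col_index, header=(i == 0))
        for i, (block, start, stop) in enumerate(
            zip(blocks, offsets[:-1], offsets[1:])
        )
    ]

    @dask.delayed
    def _write(parts):
        # Concatenated gzip members form a single valid gzip stream.
        with open(path, 'wb') as fh:
            for part in parts:
                fh.write(part)

    return _write(fragments)
```

Because each fragment is its own task, both the CSV rendering and the gzip compression parallelise alongside the rest of the graph instead of happening in one single-threaded pass at the end.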
And now my high-level wrapper code looks like this:
```python
import dask
import xarray
from xarray_extras.csv import to_csv

# Dataset from 200 .nc files, with a total of 500000 points on the 'row' dimension
nc = xarray.open_mfdataset('inputs.*.nc')

reports = [
    # DataArrays with shape (500000, 2000), with the rows split in 200 chunks
    gen_report0(nc),
    gen_report1(nc),
    ...
    gen_report39(nc),
]

futures = [
    # dask.delayed objects
    to_csv(reports[0], 'report0.csv.gz', compression='gzip'),
    to_csv(reports[1], 'report1.csv.gz', compression='gzip'),
    ...
    to_csv(reports[39], 'report39.csv.gz', compression='gzip'),
]

dask.compute(*futures)
```
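As a side note on why everything has to go through a single `dask.compute()` call: when several outputs are passed to `compute()` together, dask merges their graphs and evaluates shared keys only once. A toy illustration with hypothetical arrays (much smaller than the real reports):

```python
import dask
import dask.array as da

# Both toy "reports" depend on the same expensive intermediate.
x = da.random.random((10_000, 1_000), chunks=(1_000, 1_000))
standardised = (x - x.mean(axis=0)) / x.std(axis=0)

report_a = standardised.sum(axis=1)
report_b = (standardised ** 2).sum(axis=1)

# One call: the shared nodes of `standardised` are computed once.
a_res, b_res = dask.compute(report_a, report_b)

# Two separate calls would redo the shared work:
# a_res = report_a.compute()
# b_res = report_b.compute()
```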
The function is currently production quality in xarray-extras, but it would be very easy to refactor it as a method of xarray.DataArray in the main library.
Opinions?