I'm using xarray to aggregate 38 GB worth of NetCDF data into a bunch of CSV reports.
I have two problems:
- The reports are 500,000 rows by 2,000 columns. Before somebody says "if you're using CSV for this size of data you're doing it wrong" - yes, I know, but it was the only way to make the data accessible to a bunch of people who only know how to use Excel and VBA. 😫
  The sheer size of the reports means that (1) it's unsavory to keep the whole thing in RAM and (2) pandas `to_csv` will take ages to complete, as it's single-threaded. The slowness is compounded by the fact that I have to compress everything with gzip.
- I have to produce up to 40 reports from the exact same NetCDF files. I use dask to perform the computation, and different reports share a large amount of intermediate graph nodes, so I need to do everything in a single invocation to `dask.compute()` to allow the dask scheduler to de-duplicate the nodes.
To solve both problems, I wrote a new function:
http://xarray-extras.readthedocs.io/en/latest/api/csv.html
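The core idea, very roughly, is to render and gzip-compress each row-chunk in its own dask task and then concatenate the resulting gzip members (a concatenation of gzip members is still a valid gzip stream). The sketch below is *not* the xarray-extras implementation, just a minimal illustration with made-up names (`chunked_to_csv`, `_fragment`); it assumes the DataArray is chunked only along the rows and has pandas indexes on both dimensions:

```python
import gzip

import dask
import numpy as np
import pandas as pd
import xarray


@dask.delayed
def _fragment(values, row_labels, col_labels, header):
    # Render one row-chunk to CSV text and gzip-compress it inside a worker task.
    df = pd.DataFrame(values, index=row_labels, columns=col_labels)
    return gzip.compress(df.to_csv(header=header).encode('utf-8'))


def chunked_to_csv(a: xarray.DataArray, path: str):
    """Return a dask.delayed that writes `a` to a gzipped CSV, chunk by chunk."""
    row_dim, col_dim = a.dims
    row_index = a.indexes[row_dim]
    col_index = a.indexes[col_dim]

    # One delayed numpy block per row-chunk (assumes no chunking along columns)
    blocks = a.data.to_delayed().ravel()
    offsets = np.cumsum((0,) + a.chunks[0])

    fragments = [
        _fragment(block, row_index[start:stop], col_index, header=(i == 0))
        for i, (block, start, stop) in enumerate(
            zip(blocks, offsets[:-1], offsets[1:])
        )
    ]

    @dask.delayed
    def _write(parts):
        # Concatenated gzip members form a single valid gzip stream.
        with open(path, 'wb') as fh:
            for part in parts:
                fh.write(part)

    return _write(fragments)
```

Because each fragment is its own task, both the CSV rendering and the gzip compression parallelise alongside the rest of the graph instead of happening in one single-threaded pass at the end.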
And now my high-level wrapper code looks like this:
```python
import dask
import xarray
from xarray_extras.csv import to_csv

# Dataset from 200 .nc files, with a total of 500000 points on the 'row' dimension
nc = xarray.open_mfdataset('inputs.*.nc')

reports = [
    # DataArrays with shape (500000, 2000), with the rows split in 200 chunks
    gen_report0(nc),
    gen_report1(nc),
    ...
    gen_report39(nc),
]

futures = [
    # dask.delayed objects
    to_csv(reports[0], 'report0.csv.gz', compression='gzip'),
    to_csv(reports[1], 'report1.csv.gz', compression='gzip'),
    ...
    to_csv(reports[39], 'report39.csv.gz', compression='gzip'),
]

dask.compute(*futures)
```
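As a side note on why everything has to go through a single `dask.compute()` call: when several outputs are passed to `compute()` together, dask merges their graphs and evaluates shared keys only once. A toy illustration with hypothetical arrays (much smaller than the real reports):

```python
import dask
import dask.array as da

# Both toy "reports" depend on the same expensive intermediate.
x = da.random.random((10_000, 1_000), chunks=(1_000, 1_000))
standardised = (x - x.mean(axis=0)) / x.std(axis=0)

report_a = standardised.sum(axis=1)
report_b = (standardised ** 2).sum(axis=1)

# One call: the shared nodes of `standardised` are computed once.
a_res, b_res = dask.compute(report_a, report_b)

# Two separate calls would redo the shared work:
# a_res = report_a.compute()
# b_res = report_b.compute()
```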
The function is currently production quality in xarray-extras, but it would be very easy to refactor it as a method of xarray.DataArray in the main library.
Opinions?