DataArray.to_csv() #2289

Closed
@crusaderky

Description

I'm using xarray to aggregate 38 GB worth of NetCDF data into a bunch of CSV reports.
I have two problems:

  1. The reports are 500,000 rows by 2,000 columns. Before somebody says "if you're using CSV for this size of data you're doing it wrong" - yes, I know, but it was the only way to make the data accessible to a bunch of people who only know how to use Excel and VBA. 😫
    The sheer size of the reports means that (1) it's unsavory to keep the whole thing in RAM and (2) pandas to_csv takes ages to complete, as it's single-threaded. The slowness is compounded by the fact that I have to compress everything with gzip.
  2. I have to produce up to 40 reports from the exact same NetCDF files. I use dask to perform the computation, and different reports share a large number of intermediate graph nodes. So I need to do everything in a single call to dask.compute() so that the dask scheduler can de-duplicate the nodes.
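To illustrate the second point, here is a minimal sketch of the de-duplication using plain dask.delayed. The functions load, report_sum and report_max and the load_calls counter are made up for the example and stand in for the real NetCDF pipeline; the synchronous scheduler is used only so that the counter lives in the same process.

```python
import dask

# Records every time the (hypothetical) expensive shared step actually runs.
load_calls = []


@dask.delayed
def load(name):
    load_calls.append(name)
    return list(range(5))


@dask.delayed
def report_sum(data):
    return sum(data)


@dask.delayed
def report_max(data):
    return max(data)


# Both reports depend on the very same intermediate graph node.
shared = load("inputs")
report_a = report_sum(shared)
report_b = report_max(shared)

# Computing the reports one by one re-runs the shared node each time.
report_a.compute(scheduler="synchronous")
report_b.compute(scheduler="synchronous")
separate_loads = len(load_calls)

# A single dask.compute() call merges the graphs, so the shared node runs once.
load_calls.clear()
total, biggest = dask.compute(report_a, report_b, scheduler="synchronous")
single_loads = len(load_calls)

print(separate_loads, single_loads, total, biggest)
```

With 40 reports over the same input files, the difference between the two call patterns is what makes the single invocation necessary.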

To solve both problems, I wrote a new function:
http://xarray-extras.readthedocs.io/en/latest/api/csv.html

And now my high level wrapper code looks like this:

import dask
import xarray
from xarray_extras.csv import to_csv

# Dataset from 200 .nc files, with a total of 500000 points on the 'row' dimension
nc = xarray.open_mfdataset('inputs.*.nc')
reports = [
    # DataArrays with shape (500000, 2000), with the rows split in 200 chunks
    gen_report0(nc),
    gen_report1(nc),
    # ....
    gen_report39(nc),
]
futures = [
    # dask.delayed objects; nothing is written until dask.compute() is called
    to_csv(reports[0], 'report0.csv.gz', compression='gzip'),
    to_csv(reports[1], 'report1.csv.gz', compression='gzip'),
    # ....
    to_csv(reports[39], 'report39.csv.gz', compression='gzip'),
]
# Single call, so that nodes shared between reports are computed only once
dask.compute(*futures)
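As a rough illustration of why writing chunk by chunk helps with the gzip cost: gzip streams can be concatenated, so each chunk of rows can be turned into CSV text and compressed on its own, and the pieces joined afterwards. The sketch below uses only the standard library; chunk_to_gzip_csv and the toy data are hypothetical, and this is not necessarily how the function in xarray-extras is implemented.

```python
import csv
import gzip
import io


def chunk_to_gzip_csv(rows, header=None):
    """Render one chunk of rows as CSV text and gzip it independently."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)
    return gzip.compress(buf.getvalue().encode("utf-8"))


# Toy stand-in for a report whose rows are split into chunks.
chunks = [
    [(0, 1.5), (1, 2.5)],
    [(2, 3.5)],
]

# Only the first chunk carries the header; each chunk is compressed separately,
# so this step could run in parallel, one task per chunk.
members = [chunk_to_gzip_csv(chunks[0], header=("row", "value"))]
members += [chunk_to_gzip_csv(chunk) for chunk in chunks[1:]]

# Concatenated gzip members still form a valid gzip stream.
blob = b"".join(members)
text = gzip.decompress(blob).decode("utf-8")
print(text)
```

Only one chunk at a time needs to be held as text, and the compression of different chunks does not have to happen on a single thread.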

The function is currently production quality in xarray-extras, but it would be very easy to refactor it as a method of xarray.DataArray in the main library.

Opinions?
