Dataset.encoding and unlimited dimensions for to_netcdf #1170

Merged: 19 commits, merged Jan 24, 2017. The diff below shows changes from 9 of the 19 commits.

Commits:
271a751 initial hack at enabling unlimited dims in to_netcdf (Dec 17, 2016)
bedad43 unlimited dims for netcdf4, still working on scipy (Dec 20, 2016)
c797511 fix two bugs in h5netcdf tests (Dec 20, 2016)
affac00 Merge branch 'master' of github.com:pydata/xarray into feature/unlimi… (Dec 23, 2016)
ca60729 Merge branch 'master' of github.com:pydata/xarray into feature/unlimi… (Dec 23, 2016)
3d24610 Merge branch 'master' of github.com:pydata/xarray into feature/unlimi… (Dec 27, 2016)
e794165 fix failing tests, try workaround for scipy/scipy#6880 (Dec 27, 2016)
2ba6688 cleanup (Dec 27, 2016)
b7bd0b8 simple slice in scipy workaround (Dec 27, 2016)
fdbd55d initial fixes after @shoyer's review (Dec 28, 2016)
47442e6 fix failing test by passing unlimited_dims through to in memory store (Dec 28, 2016)
2df224c remove encoding from dataset constructor (Dec 28, 2016)
eead3e4 more tests for unlimited_dims and update whats-new (Dec 28, 2016)
fac2f89 Merge branch 'master' of github.com:pydata/xarray into feature/unlimi… (Jan 19, 2017)
33dd062 refactor unlimited dimensions / dataset encoding to avoid using DataS… (Jan 19, 2017)
65df346 raise user warning if unlimited dims is used with h5netcdf (Jan 19, 2017)
b076c15 Merge branch 'master' of github.com:pydata/xarray into feature/unlimi… (Jan 22, 2017)
db964a1 cleanup backends after unlimited_dims changes (Jan 23, 2017)
cb22ba1 Merge branch 'master' of github.com:pydata/xarray into feature/unlimi… (Jan 23, 2017)
doc/api.rst (1 addition, 0 deletions)

@@ -46,6 +46,7 @@ Attributes
Dataset.data_vars
Dataset.coords
Dataset.attrs
Dataset.encoding
Dataset.indexes
Dataset.get_index

doc/whats-new.rst (4 additions, 0 deletions)

@@ -166,6 +166,10 @@ Enhancements
similar to what the command line utility ``ncdump -h`` produces (:issue:`1150`).
By `Joe Hamman <https://github.com/jhamman>`_.

- Added the ability to write unlimited netCDF dimensions with the ``netcdf4``
  backend.
By `Joe Hamman <https://github.com/jhamman>`_.

Bug fixes
~~~~~~~~~
- ``groupby_bins`` now restores empty bins by default (:issue:`1019`).
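
A minimal sketch of the usage this enhancement enables, assuming the unlimited_dims entry in Dataset.encoding that this PR wires through to the backends (the filename is illustrative):

    import xarray as xr

    ds = xr.Dataset({'temp': ('time', [280.0, 281.5, 283.1])})
    # Mark 'time' as unlimited before writing; the netcdf4 backend then
    # creates the dimension with size None, the netCDF unlimited convention.
    ds.encoding['unlimited_dims'] = {'time'}
    ds.to_netcdf('out.nc')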
xarray/backends/api.py (2 additions, 0 deletions)

@@ -565,6 +565,8 @@ def to_netcdf(dataset, path=None, mode='w', format=None, group=None,
sync = writer is None

store = store_cls(path, mode, format, group, writer)
# Copy dataset encoding to datastore
store.encoding = dataset.encoding
Review comment (Member):

Do we ever actually use this encoding state on the datastore? If not, let's not bother setting it. I think everything necessary ends up being passed on via set_variables.

Note that as much as possible, I've tried to make DataStore itself stateless, only storing state in the file-like object it points to.

Reply (Member Author):

We were using this but I've refactored to avoid it.

try:
dataset.dump_to_store(store, sync=sync, encoding=encoding)
if isinstance(path, BytesIO):
xarray/backends/common.py (15 additions, 6 deletions)

@@ -2,7 +2,6 @@
from __future__ import division
from __future__ import print_function
import numpy as np
import itertools
import logging
import time
import traceback
@@ -12,7 +11,7 @@

from ..conventions import cf_encoder
from ..core.utils import FrozenOrderedDict
from ..core.pycompat import iteritems, dask_array_type, OrderedDict
from ..core.pycompat import iteritems, dask_array_type

# Create a logger object, but don't add any handlers. Leave that to user code.
logger = logging.getLogger(__name__)
@@ -75,6 +74,9 @@ def get_attrs(self): # pragma: no cover
def get_variables(self): # pragma: no cover
raise NotImplementedError

def get_encoding(self):
return {}

def load(self):
"""
This loads the variables and attributes simultaneously.
@@ -96,8 +98,9 @@ def load(self)
This function will be called anytime variables or attributes
are requested, so care should be taken to make sure its fast.
"""
variables = FrozenOrderedDict((_decode_variable_name(k), v)
for k, v in iteritems(self.get_variables()))
self.encoding = self.get_encoding()
Review comment (Member):

This is a little dangerous -- .load() needs to be called in order to guarantee a consistent encoding state on a DataStore. I would rather we didn't set such state, and simply pulled this information out of the file linked to the DataStore as necessary.

Reply (Member Author):

Fair point, I've removed the encoding attribute on the DataStore.

variables = FrozenOrderedDict((_decode_variable_name(k), v) for k, v in
iteritems(self.get_variables()))
attributes = FrozenOrderedDict(self.get_attrs())
return variables, attributes

@@ -143,7 +146,11 @@ def add(self, source, target):
self.sources.append(source)
self.targets.append(target)
else:
target[...] = source
try:
target[...] = source
except TypeError:
# workaround for GH: scipy/scipy#6880
target[:] = source

def sync(self):
if self.sources:
@@ -197,9 +204,11 @@ def set_variables(self, variables, check_encoding_set):
target, source = self.prepare_variable(name, v, check)
self.writer.add(source, target)

def set_necessary_dimensions(self, variable):
def set_necessary_dimensions(self, variable, unlimited_dims=set()):
for d, l in zip(variable.dims, variable.shape):
if d not in self.dimensions:
if d in unlimited_dims:
l = None
self.set_dimension(d, l)


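
For context on set_necessary_dimensions above: netCDF libraries create an unlimited dimension by passing None as its size, which is why names listed in unlimited_dims have their length replaced with None before set_dimension is called. A standalone sketch with the netCDF4 library (filename illustrative):

    import netCDF4

    nc = netCDF4.Dataset('scratch.nc', 'w')
    nc.createDimension('time', None)  # unlimited: grows as data is appended
    nc.createDimension('x', 3)        # fixed size
    print(nc.dimensions['time'].isunlimited())  # True
    nc.close()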
xarray/backends/h5netcdf_.py (11 additions, 4 deletions)

@@ -2,15 +2,16 @@
from __future__ import division
from __future__ import print_function
import functools
import warnings

from .. import Variable
from ..core import indexing
from ..core.utils import FrozenOrderedDict, close_on_error, Frozen
from ..core.pycompat import iteritems, bytes_type, unicode_type, OrderedDict

from .common import WritableCFDataStore, DataStorePickleMixin
from .netCDF4_ import (_nc4_group, _nc4_values_and_dtype, _extract_nc4_encoding,
BaseNetCDF4Array)
from .netCDF4_ import (_nc4_group, _nc4_values_and_dtype,
_extract_nc4_variable_encoding, BaseNetCDF4Array)


def maybe_decode_bytes(txt):
@@ -33,7 +34,7 @@ def _read_attributes(h5netcdf_var):
return attrs


_extract_h5nc_encoding = functools.partial(_extract_nc4_encoding,
_extract_h5nc_encoding = functools.partial(_extract_nc4_variable_encoding,
lsd_okay=False, backend='h5netcdf')


@@ -58,6 +59,7 @@ def __init__(self, filename, mode='r', format=None, group=None,
self._opener = opener
self._filename = filename
self._mode = mode
self.encoding = {}
Review comment (Member):

This still should go away :)

super(H5NetCDFStore, self).__init__(writer)

def open_store_variable(self, name, var):
Expand Down Expand Up @@ -100,7 +102,12 @@ def prepare_variable(self, name, variable, check_encoding=False):
if dtype is str:
dtype = h5py.special_dtype(vlen=unicode_type)

self.set_necessary_dimensions(variable)
unlimited_dims = self.encoding.get('unlimited_dims', set())
if len(unlimited_dims) > 0:
warnings.warn('h5netcdf does not support unlimited dimensions',
Review comment (Member):

If check_encoding is True, this should raise an error, not just a warning.

Review comment (Member, follow-up):

Actually, check_encoding is specific to variable encoding. Raising an error would make sense if you set unlimited_dims via an argument in to_netcdf (which is not yet possible).

UserWarning)
unlimited_dims = set()
self.set_necessary_dimensions(variable, unlimited_dims=unlimited_dims)

fill_value = attrs.pop('_FillValue', None)
if fill_value in ['\x00']:
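
A sketch of the behavior discussed above, assuming the encoding-based interface at this stage of the PR and an installed h5netcdf (filename illustrative): requesting unlimited dimensions through the h5netcdf backend should emit a UserWarning and fall back to fixed-size dimensions.

    import warnings
    import xarray as xr

    ds = xr.Dataset({'temp': ('time', [280.0, 281.5])})
    ds.encoding['unlimited_dims'] = {'time'}

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        ds.to_netcdf('out.h5', engine='h5netcdf')
    # expect a warning like 'h5netcdf does not support unlimited dimensions'
    print([str(w.message) for w in caught])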
xarray/backends/memory.py (1 addition, 0 deletions)

@@ -21,6 +21,7 @@ class InMemoryDataStore(AbstractWritableDataStore):
def __init__(self, variables=None, attributes=None, writer=None):
self._variables = OrderedDict() if variables is None else variables
self._attributes = OrderedDict() if attributes is None else attributes
self.encoding = {}
Review comment (Member):

delete

super(InMemoryDataStore, self).__init__(writer)

def get_attrs(self):
xarray/backends/netCDF4_.py (18 additions, 11 deletions)

@@ -7,7 +7,7 @@
import numpy as np

from .. import Variable
from ..conventions import pop_to, cf_encoder
from ..conventions import pop_to
from ..core import indexing
from ..core.utils import (FrozenOrderedDict, NDArrayMixin,
close_on_error, is_remote_uri)
@@ -138,13 +138,13 @@ def _force_native_endianness(var):
# check to see if encoding has a value for endian its 'native'
if not var.encoding.get('endian', 'native') is 'native':
raise NotImplementedError("Attempt to write non-native endian type, "
"this is not supported by the netCDF4 python "
"library.")
"this is not supported by the netCDF4 "
"python library.")
return var


def _extract_nc4_encoding(variable, raise_on_invalid=False, lsd_okay=True,
backend='netCDF4'):
def _extract_nc4_variable_encoding(variable, raise_on_invalid=False,
lsd_okay=True, backend='netCDF4'):
encoding = variable.encoding.copy()

safe_to_drop = set(['source', 'original_shape'])
@@ -154,9 +154,8 @@ def _extract_nc4_encoding(variable, raise_on_invalid=False, lsd_okay=True,
valid_encodings.add('least_significant_digit')

if (encoding.get('chunksizes') is not None and
(encoding.get('original_shape', variable.shape)
!= variable.shape) and
not raise_on_invalid):
(encoding.get('original_shape', variable.shape) !=
variable.shape) and not raise_on_invalid):
del encoding['chunksizes']

for k in safe_to_drop:
@@ -209,6 +208,7 @@ def __init__(self, filename, mode='r', format='NETCDF4', group=None,
self._opener = opener
self._filename = filename
self._mode = 'a' if mode == 'w' else mode
self.encoding = {}
super(NetCDF4DataStore, self).__init__(writer)

def open_store_variable(self, name, var):
@@ -251,6 +251,12 @@ def get_dimensions(self):
return FrozenOrderedDict((k, len(v))
for k, v in iteritems(self.ds.dimensions))

def get_encoding(self):
Review comment (Member):

I would lean slightly toward just creating a get_unlimited_dims method rather than get_encoding, unless we can think of other Dataset wide encodings we might possibly add in the future.

Reply (Member Author):

The other encoding value that comes to mind is the dataset format (e.g. NETCDF4 vs. NETCDF3). Maybe there are others as well, but nothing else comes to mind.

encoding = {}
encoding['unlimited_dims'] = set(
[k for k, v in self.ds.dimensions.items() if v.isunlimited()])
Review comment (Member):

you can use a set comprehension here, e.g.,

encoding['unlimited_dims'] = {k for k, v in self.ds.dimensions.items()
                              if v.isunlimited()}

return encoding

def set_dimension(self, name, length):
self.ds.createDimension(name, size=length)

@@ -270,16 +276,17 @@ def prepare_variable(self, name, variable, check_encoding=False):
variable = encode_nc3_variable(variable)
datatype = variable.dtype

self.set_necessary_dimensions(variable)
unlimited_dims = self.encoding.get('unlimited_dims', set())
self.set_necessary_dimensions(variable, unlimited_dims=unlimited_dims)

fill_value = attrs.pop('_FillValue', None)
if fill_value in ['', '\x00']:
# these are equivalent to the default FillValue, but netCDF4
# doesn't like setting fill_value to an empty string
fill_value = None

encoding = _extract_nc4_encoding(variable,
raise_on_invalid=check_encoding)
encoding = _extract_nc4_variable_encoding(
variable, raise_on_invalid=check_encoding)
nc4_var = self.ds.createVariable(
varname=name,
datatype=datatype,
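
With get_encoding on the netCDF4 store, and the decode_cf change further down, an unlimited dimension should survive a round trip. A sketch under those assumptions (filenames illustrative; default netcdf4 engine):

    import xarray as xr

    ds = xr.Dataset({'temp': ('time', [1.0, 2.0, 3.0])})
    ds.encoding['unlimited_dims'] = {'time'}
    ds.to_netcdf('unlimited.nc')

    reopened = xr.open_dataset('unlimited.nc')
    print(reopened.encoding.get('unlimited_dims'))  # expected: {'time'}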
xarray/backends/pydap_.py (1 addition, 0 deletions)

@@ -62,6 +62,7 @@ class PydapDataStore(AbstractDataStore):
def __init__(self, url):
import pydap.client
self.ds = pydap.client.open_url(url)
self.encoding = {}
Review comment (Member):

delete


def open_store_variable(self, var):
data = indexing.LazilyIndexedArray(PydapArrayWrapper(var))
xarray/backends/pynio_.py (7 additions, 0 deletions)

@@ -42,6 +42,7 @@ def __init__(self, filename, mode='r'):
self.ds = opener()
self._opener = opener
self._mode = mode
self.encoding = {}
Review comment (Member):

delete


def open_store_variable(self, name, var):
data = indexing.LazilyIndexedArray(NioArrayWrapper(name, self))
@@ -57,5 +58,11 @@ def get_attrs(self):
def get_dimensions(self):
return Frozen(self.ds.dimensions)

def get_encoding(self):
encoding = {}
encoding['unlimited_dims'] = set(
[k for k in self.ds.dimensions if self.ds.unlimited(k)])
Review comment (Member):

I don't think dap can represent unlimited dimensions:
http://docs.opendap.org/index.php/DAP4:_Specification_Volume_1#Dimensions

Reply (Member Author):

Agreed, but this is pynio which does: https://www.pyngl.ucar.edu/whatsnew.shtml#Version1.4.1

return encoding

def close(self):
self.ds.close()
xarray/backends/scipy_.py (18 additions, 2 deletions)

@@ -8,7 +8,7 @@
import warnings

from .. import Variable
from ..core.pycompat import iteritems, basestring, OrderedDict
from ..core.pycompat import iteritems, OrderedDict
from ..core.utils import Frozen, FrozenOrderedDict
from ..core.indexing import NumpyIndexingAdapter

@@ -102,6 +102,7 @@ def __init__(self, filename_or_obj, mode='r', format=None, group=None,
self.ds = opener()
self._opener = opener
self._mode = mode
self.encoding = {}
Review comment (Member):

delete


super(ScipyDataStore, self).__init__(writer)

@@ -116,9 +117,19 @@ def get_variables(self):
def get_attrs(self):
return Frozen(_decode_attrs(self.ds._attributes))

def _get_unlimited_dimensions(self):
Review comment (Member):

I don't think you use this method anymore

return set(k for k, v in iteritems(self.ds.dimensions) if v is None)

def get_dimensions(self):
self._unlimited_dimensions = self._get_unlimited_dimensions()
Review comment (Member):

you don't use this currently

return Frozen(self.ds.dimensions)

def get_encoding(self):
encoding = {}
encoding['unlimited_dims'] = set(
[k for k, v in self.ds.dimensions.items() if v is None])
Review comment (Member):

can use the same set comprehension you switched to in netCDF4_.py

return encoding

def set_dimension(self, name, length):
if name in self.dimensions:
raise ValueError('%s does not support modifying dimensions'
Expand All @@ -140,7 +151,12 @@ def prepare_variable(self, name, variable, check_encoding=False):
raise ValueError('unexpected encoding for scipy backend: %r'
% list(variable.encoding))

self.set_necessary_dimensions(variable)
unlimited_dims = self.encoding.get('unlimited_dims', set())

if len(unlimited_dims) > 1:
raise ValueError('NETCDF3 only supports one unlimited dimension')
self.set_necessary_dimensions(variable, unlimited_dims=unlimited_dims)

data = variable.data
# nb. this still creates a numpy array in all memory, even though we
# don't write the data yet; scipy.io.netcdf does not support
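
The NETCDF3 check above means requesting more than one unlimited dimension through the scipy backend should fail loudly rather than write a corrupt file. A sketch under that assumption (filename illustrative):

    import xarray as xr

    ds = xr.Dataset({'v': (('x', 'y'), [[1.0, 2.0], [3.0, 4.0]])})
    ds.encoding['unlimited_dims'] = {'x', 'y'}
    try:
        ds.to_netcdf('bad.nc', engine='scipy')
    except ValueError as err:
        print(err)  # 'NETCDF3 only supports one unlimited dimension'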
xarray/conventions.py (2 additions, 0 deletions)

@@ -950,6 +950,8 @@ def decode_cf(obj, concat_characters=True, mask_and_scale=True,
ds = Dataset(vars, attrs=attrs)
ds = ds.set_coords(coord_names.union(extra_coords).intersection(vars))
ds._file_obj = file_obj
ds.encoding = obj.encoding

return ds


xarray/core/common.py (3 additions, 3 deletions)

@@ -4,8 +4,7 @@
import numpy as np
import pandas as pd

from .pycompat import (basestring, iteritems, suppress, dask_array_type,
OrderedDict)
from .pycompat import (basestring, suppress, dask_array_type, OrderedDict)
from . import formatting
from .utils import SortedKeysDict, not_implemented, Frozen

@@ -751,7 +750,8 @@ def full_like(other, fill_value, dtype=None):
elif isinstance(other, DataArray):
return DataArray(
_full_like_variable(other.variable, fill_value, dtype),
dims=other.dims, coords=other.coords, attrs=other.attrs, name=other.name)
dims=other.dims, coords=other.coords, attrs=other.attrs,
name=other.name)
elif isinstance(other, Variable):
return _full_like_variable(other, fill_value, dtype)
else: