Description
Code Sample :
Considering a netcdf file file with the following variable:
short agc_40hz(time, meas_ind) ;
agc_40hz:_FillValue = 32767s ;
agc_40hz:units = "dB" ;
agc_40hz:scale_factor = 0.01 ;
Code:
from netCDF4 import Dataset
import xarray as xr
d = Dataset("test.nc")
a = d.variables['agc_40hz'][:].flatten()[69] ## 21.940000000000001 'numpy.float64'
x = xr.open_dataset("test.nc")
b = x['agc_40hz'].values.flatten()[69] ## 21.939998626708984 'numpy.float32'
abs(a - b) # 0.000001373291017
Problem description :
Different behaviour of xarray comparing to netCDF4 Dataset
When reading the dataset with xarray we found that the decoded type was numpy.float32 instead of numpy.float64
This netcdf variable has an int16 dtype
when the variable is read with the netCDF4 library directly, it is automatically converted to numpy.float64.
in our case we loose on precision when using xarray.
We found two solutions for this:
First solution :
This solution aims to prevent auto_maskandscale
d = Dataset("test.nc")
a = d.variables['agc_40hz'][:].flatten()[69] ## 21.940000000000001 'numpy.float64'
x = xr.open_dataset("test.nc", mask_and_scale=False, decode_times=False)
b = x['agc_40hz'].values.flatten()[69] ## 21.940000000000001 'numpy.float64'
abs(a - b) # 0.000000000000000
Modification in xarray/backends/netCDF4_.py line 241
def _disable_auto_decode_variable(var):
"""Disable automatic decoding on a netCDF4.Variable.
We handle these types of decoding ourselves.
"""
pass
# var.set_auto_maskandscale(False)
# # only added in netCDF4-python v1.2.8
# with suppress(AttributeError):
# var.set_auto_chartostring(False)
Second solution :
This solution uses numpy.float64 whatever integer type provided.
d = Dataset("test.nc")
a = d.variables['agc_40hz'][:].flatten()[69] ## 21.940000000000001 'numpy.float64'
x = xr.open_dataset("test.nc")
b = x['agc_40hz'].values.flatten()[69] ## 21.940000000000001 'numpy.float64'
abs(a - b) # 0.000000000000000
Modification in xarray/core/dtypes.py line 85
def maybe_promote(dtype):
"""Simpler equivalent of pandas.core.common._maybe_promote
Parameters
----------
dtype : np.dtype
Returns
-------
dtype : Promoted dtype that can hold missing values.
fill_value : Valid missing value for the promoted dtype.
"""
# N.B. these casting rules should match pandas
if np.issubdtype(dtype, np.floating):
fill_value = np.nan
elif np.issubdtype(dtype, np.integer):
#########################
#OLD CODE BEGIN
#########################
# if dtype.itemsize <= 2:
# dtype = np.float32
# else:
# dtype = np.float64
#########################
#OLD CODE END
#########################
#########################
#NEW CODE BEGIN
#########################
dtype = np.float64 # whether it's int16 or int32 we use float64
#########################
#NEW CODE END
#########################
fill_value = np.nan
elif np.issubdtype(dtype, np.complexfloating):
fill_value = np.nan + np.nan * 1j
elif np.issubdtype(dtype, np.datetime64):
fill_value = np.datetime64('NaT')
elif np.issubdtype(dtype, np.timedelta64):
fill_value = np.timedelta64('NaT')
else:
dtype = object
fill_value = np.nan
return np.dtype(dtype), fill_value
Solution number 2 would be great for us.
At this point we don't know if this modification would introduce some side effects.
Is there another way to avoid this problem ?
Expected Output
In our case we expect the variable to be in numpy.float64 as it is done by netCDF4.
Output of xr.show_versions()
xarray: 0.10.8
pandas: 0.23.3
numpy: 1.15.0rc2
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: 0.18.1
distributed: 1.22.0
matplotlib: 2.2.2
cartopy: None
seaborn: None
setuptools: 40.0.0
pip: 10.0.1
conda: None
pytest: 3.6.3
IPython: 6.4.0
sphinx: None