
Inconsistent results when calculating sums on float32 arrays with bottleneck installed #2370

Closed
@agoodm

Description


Code Sample, a copy-pastable example if possible

Data file used is here: test.nc.zip
Output from each statement is commented out.

import xarray as xr
ds = xr.open_dataset('test.nc')
ds.cold_rad_cnts.min()
#13038.
ds.cold_rad_cnts.max()
#13143.
ds.cold_rad_cnts.mean()
#12640.583984
ds.cold_rad_cnts.std()
#455.035156
ds.cold_rad_cnts.sum()
#4.472997e+10

Problem description

As you can see above, the mean falls outside the range of the data, and the standard deviation is nearly two orders of magnitude larger than it should be. The cause is a significant loss of precision in bottleneck's nansum() when it runs on data with a float32 dtype. I demonstrated this effect in pydata/bottleneck#193.
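The effect is easy to reproduce with plain NumPy alone. This is a minimal sketch with synthetic ones, not the report's data: np.cumsum accumulates sequentially in the input dtype, much like a naive float32 nansum, while np.sum uses pairwise summation.

```python
import numpy as np

# 20 million ones in float32; the true sum is 20_000_000.
data = np.ones(20_000_000, dtype=np.float32)

# np.sum uses pairwise summation, which keeps the rounding error tiny:
pairwise = data.sum()                 # 20000000.0

# np.cumsum accumulates sequentially in float32, like a naive nansum;
# the running total stalls at 2**24 = 16777216, where adding 1.0 no
# longer changes the float32 accumulator:
naive = np.cumsum(data)[-1]           # 16777216.0

# Promoting to float64 before reducing avoids the loss entirely:
exact = data.sum(dtype=np.float64)    # 20000000.0
```

The stall at 2**24 is the same mechanism that drags the reported sum (and hence the mean and std) below the data's actual range.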

Naturally, converting the data to float64 (or any int dtype) gives the correct result, as does using numpy's built-in functions instead, or uninstalling bottleneck. An example is shown below.

Expected Output

In [8]: import numpy as np

In [9]: np.nansum(ds.cold_rad_cnts)
Out[9]: 46357123000.0

In [10]: np.nanmean(ds.cold_rad_cnts)
Out[10]: 13100.413

In [11]: np.nanstd(ds.cold_rad_cnts)
Out[11]: 8.158843
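Within xarray itself, the simplest workaround is to promote before reducing. A hedged sketch; the DataArray here is a synthetic stand-in for ds.cold_rad_cnts, not the attached file:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in: one million identical float32 values near the data's range.
da = xr.DataArray(np.full(1_000_000, 13100.0, dtype=np.float32))

# Promote to float64 first; the float64 accumulator has enough mantissa bits.
total = da.astype('float64').sum()     # 13100000000.0

# Passing dtype= should also sidestep bottleneck, since its reductions do not
# accept a dtype argument and xarray falls back to numpy in that case.
total2 = da.sum(dtype='float64')       # 13100000000.0
```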

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.0
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: None
cartopy: None
seaborn: None
setuptools: 40.0.0
pip: 10.0.1
conda: None
pytest: None
IPython: 6.5.0
sphinx: None

Unfortunately this will probably not be fixed in bottleneck anytime soon, so I think it would be nice if xarray provided some sort of automatic workaround rather than requiring users to remember to convert float32 data manually. Making float64 the default (as discussed in #2304) would be nice, but at a minimum it would be good to emit a warning whenever bottleneck's nansum() is used on float32 arrays.
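The warn-and-promote behavior suggested above could look something like the following. safe_nansum and the promotion policy are hypothetical, not existing xarray API:

```python
import warnings

import numpy as np


def safe_nansum(values):
    """Hypothetical guard: warn on float32 input and promote to float64
    before a naive-accumulation sum runs on it."""
    if values.dtype == np.float32:
        warnings.warn(
            "nansum on float32 data can lose significant precision with "
            "a naive accumulator; promoting to float64",
            RuntimeWarning,
        )
        values = values.astype(np.float64)
    return np.nansum(values)
```

In practice such a check would live inside xarray's bottleneck dispatch rather than in user code, so every affected reduction (sum, mean, std, ...) would be covered at once.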
