default fill_value not masked when read from file #7723

Closed
@kmuehlbauer

Description

What happened?

When reading a netCDF file which has been created with fill_value=None (the default), values equal to the netCDF default fill value are not masked. When the data is written back to disk, those unmasked values are persisted as regular data.

What did you expect to happen?

Values should be masked.

There seems to be a simple solution:

On read, if no _FillValue attribute is set, add the netCDF default fill_value to the variable's attributes before decoding. After decoding, we could change it to np.nan for floating point types.
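The idea above can be sketched with plain NumPy. This is only an illustration, not xarray's decoding code: `decode_with_default_fill` and `DEFAULT_FILLVALS` are hypothetical names, and the table hardcodes the netCDF default float fill value shown in the log output below (the netCDF4 package exposes a similar table as `netCDF4.default_fillvals`).

```python
import numpy as np

# Assumption: netCDF default fill values for floating point types,
# matching the value reported in the log output of this issue.
DEFAULT_FILLVALS = {"f4": 9.969209968386869e+36, "f8": 9.969209968386869e+36}


def decode_with_default_fill(data, attrs):
    """Hypothetical helper: replace fill values with NaN for float data.

    Uses attrs["_FillValue"] if present, otherwise falls back to the
    netCDF default fill value for the dtype.
    """
    data = np.asarray(data)
    if data.dtype.kind != "f":
        # Only floating point types can hold NaN; leave others untouched.
        return data
    key = f"{data.dtype.kind}{data.dtype.itemsize}"
    fill = attrs.get("_FillValue", DEFAULT_FILLVALS.get(key))
    if fill is None:
        return data
    fill = data.dtype.type(fill)
    if np.isnan(fill):
        # NaN already acts as the missing-value marker.
        return data.copy()
    return np.where(data == fill, data.dtype.type(np.nan), data)


raw = np.array([0.0, np.nan, 1.0, 8.0, 9.969209968386869e+36], dtype="f4")
decoded = decode_with_default_fill(raw, attrs={})
print(decoded)
```

With an empty attribute dict the last element, which was never written, is decoded to NaN, while an explicit `_FillValue` attribute still takes precedence over the default.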

Minimal Complete Verifiable Example

import numpy as np
import netCDF4 as nc
import xarray as xr

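# Create a file without an explicit _FillValue and write only 4 of 5 values.
with nc.Dataset("test-no-missing-01.nc", mode="w") as ds:
    x = ds.createDimension("x", 5)
    test = ds.createVariable("test", "f4", ("x",), fill_value=None)
    test[:4] = np.array([0.0, np.nan, 1.0, 8.0], dtype="f4")
# netCDF4 masks the unwritten value using the default fill value.
with nc.Dataset("test-no-missing-01.nc") as ds:
    print(ds["test"])
    print(ds["test"][:])

# xarray does not mask the default fill value and writes it back as data.
with xr.open_dataset("test-no-missing-01.nc").load() as roundtrip:
    print(roundtrip)
    print(roundtrip["test"].attrs)
    print(roundtrip["test"].encoding)
    roundtrip.to_netcdf("test-no-missing-02.nc")
# The roundtripped file has _FillValue nan; the unwritten value is now unmasked.
with nc.Dataset("test-no-missing-02.nc") as ds:
    print(ds["test"])
    print(ds["test"][:])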

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

<class 'netCDF4._netCDF4.Variable'>
float32 test(x)
unlimited dimensions: 
current shape = (5,)
filling on, default _FillValue of 9.969209968386869e+36 used
[0.0 nan 1.0 8.0 --]

<xarray.Dataset>
Dimensions:  (x: 5)
Dimensions without coordinates: x
Data variables:
    test     (x) float32 0.0 nan 1.0 8.0 9.969e+36
{}
{'zlib': False, 'szip': False, 'zstd': False, 'bzip2': False, 'blosc': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False, 'contiguous': True, 'chunksizes': None, 'source': 'test-no-missing-01.nc', 'original_shape': (5,), 'dtype': dtype('float32')}
<class 'netCDF4._netCDF4.Variable'>
float32 test(x)
    _FillValue: nan
unlimited dimensions: 
current shape = (5,)
filling on
[0.0 -- 1.0 8.0 9.969209968386869e+36]

Anything else we need to know?

The issue is similar to #7722 but is more intricate, because the status of certain data values changes from masked to a netCDF-specific default value.

This happens when only parts of the source dataset have been written to. The default fill_value is then delivered to the user, but it is not backed by a _FillValue attribute.
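The change in status can be imitated with NumPy masked arrays. This is a rough sketch that only mimics the masking seen in the two netCDF4 printouts above; it does not read or write any file, and the names `raw`, `first` and `second` are illustrative.

```python
import numpy as np

NC_FILL_FLOAT = np.float32(9.969209968386869e+36)

# Values as stored on disk: the last element was never written.
raw = np.array([0.0, np.nan, 1.0, 8.0, NC_FILL_FLOAT], dtype="f4")

# First file: no _FillValue attribute, the default fill value marks missing data.
first = np.ma.masked_equal(raw, NC_FILL_FLOAT)

# Second file: _FillValue is nan, so only NaN entries count as missing.
second = np.ma.masked_invalid(raw)

print(np.ma.getmaskarray(first).tolist())
print(np.ma.getmaskarray(second).tolist())
```

In the first case only the unwritten element is masked; in the second case the NaN is masked and the former default fill value appears as ordinary data.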

Environment

INSTALLED VERSIONS

commit: None
python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.14.21-150400.24.55-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.14.0
libnetcdf: 4.9.2

xarray: 2023.3.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.1
netCDF4: 1.6.3
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.14.2
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2023.3.1
distributed: 2023.3.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.3.0
cupy: 11.6.0
pint: 0.20.1
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.6.0
pip: 23.0.1
conda: None
pytest: 7.2.2
mypy: None
IPython: 8.11.0
sphinx: None
