CI Failure in Xarray test suite post-Dask tokenization update #8788

Closed
@andersy005

What is your issue?

Recent changes in Dask's tokenization process (dask/dask#10876) seem to have introduced unexpected behavior in Xarray's test suite. This has led to CI failures, specifically in tests related to tokenization.

---------- coverage: platform linux, python 3.12.2-final-0 -----------
Coverage XML written to file coverage.xml

=========================== short test summary info ============================
FAILED xarray/tests/test_dask.py::test_token_identical[obj0-<lambda>1] - AssertionError: assert 'bbd9679bdaf2...d3db65e29a72d' == '6352792990cf...e8004a9055314'
  
  - 6352792990cfe23adb7e8004a9055314
  + bbd9679bdaf284c371cd3db65e29a72d
FAILED xarray/tests/test_dask.py::test_token_identical[obj0-<lambda>2] - AssertionError: assert 'bbd9679bdaf2...d3db65e29a72d' == '6352792990cf...e8004a9055314'
  
  - 6352792990cfe23adb7e8004a9055314
  + bbd9679bdaf284c371cd3db65e29a72d
FAILED xarray/tests/test_dask.py::test_token_identical[obj1-<lambda>1] - AssertionError: assert 'c520b8516da8...0e9e0d02b79d0' == '9e2ab1c44990...6ac737226fa02'
  
  - 9e2ab1c44990adb4fb76ac737226fa02
  + c520b8516da8b6a98c10e9e0d02b79d0
FAILED xarray/tests/test_dask.py::test_token_identical[obj1-<lambda>2] - AssertionError: assert 'c520b8516da8...0e9e0d02b79d0' == '9e2ab1c44990...6ac737226fa02'
  
  - 9e2ab1c44990adb4fb76ac737226fa02
  + c520b8516da8b6a98c10e9e0d02b79d0
= 4 failed, 16293 passed, 628 skipped, 90 xfailed, 71 xpassed, 213 warnings in 472.07s (0:07:52) =
Error: Process completed with exit code 1.

Previously, the following code snippet passed, verifying that tokenization of Xarray objects is consistent across shallow and deep copies:

In [1]: import xarray as xr, numpy as np

In [2]: def make_da():
   ...:     da = xr.DataArray(
   ...:         np.ones((10, 20)),
   ...:         dims=["x", "y"],
   ...:         coords={"x": np.arange(10), "y": np.arange(100, 120)},
   ...:         name="a",
   ...:     ).chunk({"x": 4, "y": 5})
   ...:     da.x.attrs["long_name"] = "x"
   ...:     da.attrs["test"] = "test"
   ...:     da.coords["c2"] = 0.5
   ...:     da.coords["ndcoord"] = da.x * 2
   ...:     da.coords["cxy"] = (da.x * da.y).chunk({"x": 4, "y": 5})
   ...: 
   ...:     return da
   ...: 

In [3]: da = make_da()

In [4]: import dask.base

In [5]: assert dask.base.tokenize(da) == dask.base.tokenize(da.copy(deep=False))

In [6]: assert dask.base.tokenize(da) == dask.base.tokenize(da.copy(deep=True))

In [9]: dask.__version__
Out[9]: '2023.3.0'

However, after updating to Dask version '2024.2.1', the same code fails:

In [55]: 
    ...: def make_da():
    ...:     da = xr.DataArray(
    ...:         np.ones((10, 20)),
    ...:         dims=["x", "y"],
    ...:         coords={"x": np.arange(10), "y": np.arange(100, 120)},
    ...:         name="a",
    ...:     ).chunk({"x": 4, "y": 5})
    ...:     da.x.attrs["long_name"] = "x"
    ...:     da.attrs["test"] = "test"
    ...:     da.coords["c2"] = 0.5
    ...:     da.coords["ndcoord"] = da.x * 2
    ...:     da.coords["cxy"] = (da.x * da.y).chunk({"x": 4, "y": 5})
    ...: 
    ...:     return da
    ...: 

In [56]: da = make_da()
In [57]: assert dask.base.tokenize(da) == dask.base.tokenize(da.copy(deep=False))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[57], line 1
----> 1 assert dask.base.tokenize(da) == dask.base.tokenize(da.copy(deep=False))

AssertionError: 

In [58]: dask.base.tokenize(da)
Out[58]: 'bbd9679bdaf284c371cd3db65e29a72d'

In [59]: dask.base.tokenize(da.copy(deep=False))
Out[59]: '6352792990cfe23adb7e8004a9055314'

In [61]: dask.__version__
Out[61]: '2024.2.1'
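
One way to narrow down which part of the object makes the tokens diverge is to tokenize each component separately and compare the results between the original and the copy. The sketch below is only an illustration: `differing_components` and `repr_token` are hypothetical helpers, not part of Dask or Xarray, and the demo uses stand-in objects and a stand-in tokenizer so that it runs without either library.

```python
import hashlib
from types import SimpleNamespace


def differing_components(a, b, names, tokenize):
    """Return the attribute names whose tokens differ between ``a`` and ``b``."""
    return [
        name
        for name in names
        if tokenize(getattr(a, name)) != tokenize(getattr(b, name))
    ]


def repr_token(obj):
    # Stand-in tokenizer for this demo only: a hash of repr(obj).
    # This is NOT how Dask computes tokens.
    return hashlib.sha256(repr(obj).encode()).hexdigest()


# Stand-in objects that mimic the attribute names used in this issue.
original = SimpleNamespace(_variable=("x", "y"), _coords={"c2": 0.5}, _name="a")
changed = SimpleNamespace(_variable=("x", "y"), _coords={"c2": 0.25}, _name="a")

names = ["_variable", "_coords", "_name"]
print(differing_components(original, changed, names, repr_token))  # ['_coords']
```

With Dask and Xarray installed, the same helper could presumably be called as `differing_components(da, da.copy(deep=False), ["_variable", "_coords", "_name"], dask.base.tokenize)`.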

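Additionally, a closer look at the output of dask.base.normalize_token() under the two Dask versions shows that the newer version includes additional state or metadata in tokenization that was not present in earlier versions. In the dumps below, class objects are replaced by hash-like tuples, attrs are tagged 'dict' instead of 'list', and ('__seen', N) entries appear where the old output listed the variable class.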

  • old version
In [29]: dask.base.normalize_token((type(da), da._variable, da._coords, da._name))
Out[29]: 
('tuple',
 [xarray.core.dataarray.DataArray,
  ('tuple',
   [xarray.core.variable.Variable,
    ('tuple', ['x', 'y']),
    'xarray-<this-array>-14cc91345e4b75c769b9032d473f6f6e',
    ('list', [('tuple', ['test', 'test'])])]),
  ('list',
   [('tuple',
     ['c2',
      ('tuple',
       [xarray.core.variable.Variable,
        ('tuple', []),
        (0.5, dtype('float64')),
        ('list', [])])]),
    ('tuple',
     ['cxy',
      ('tuple',
       [xarray.core.variable.Variable,
        ('tuple', ['x', 'y']),
        'xarray-<this-array>-8e98950eca22c69d304f0a48bc6c2df9',
        ('list', [])])]),
    ('tuple',
     ['ndcoord',
      ('tuple',
       [xarray.core.variable.Variable,
        ('tuple', ['x']),
        'xarray-ndcoord-82411ea5e080aa9b9f554554befc2f39',
        ('list', [])])]),
    ('tuple',
     ['x',
      ('tuple',
       [xarray.core.variable.IndexVariable,
        ('tuple', ['x']),
        ['x',
         ('603944b9792513fa0c686bb494a66d96c667f879',
          dtype('int64'),
          (10,),
          (8,))],
        ('list', [('tuple', ['long_name', 'x'])])])]),
    ('tuple',
     ['y',
      ('tuple',
       [xarray.core.variable.IndexVariable,
        ('tuple', ['y']),
        ['y',
         ('fc411db876ae0f4734dac8b64152d5c6526a537a',
          dtype('int64'),
          (20,),
          (8,))],
        ('list', [])])])]),
  'a'])
  • most recent version
In [44]: dask.base.normalize_token((type(da), da._variable, da._coords, da._name))
Out[44]: 
('tuple',
 [('7b61e7593a274e48', []),
  ('tuple',
   [('215b115b265c420c', []),
    ('tuple', ['x', 'y']),
    'xarray-<this-array>-980383b18aab94069bdb02e9e0956184',
    ('dict', [('tuple', ['test', 'test'])])]),
  ('dict',
   [('tuple',
     ['c2',
      ('tuple',
       [('__seen', 2),
        ('tuple', []),
        ('6825817183edbca7', ['48cb5e118059da42']),
        ('dict', [])])]),
    ('tuple',
     ['cxy',
      ('tuple',
       [('__seen', 2),
        ('tuple', ['x', 'y']),
        'xarray-<this-array>-6babb4e95665a53f34a3e337129d54b5',
        ('dict', [])])]),
    ('tuple',
     ['ndcoord',
      ('tuple',
       [('__seen', 2),
        ('tuple', ['x']),
        'xarray-ndcoord-8636fac37e5e6f4401eab2aef399f402',
        ('dict', [])])]),
    ('tuple',
     ['x',
      ('tuple',
       [('abc1995cae8530ae', []),
        ('tuple', ['x']),
        ['x', ('99b2df4006e7d28a', ['04673d65c892b5ba'])],
        ('dict', [('tuple', ['long_name', 'x'])])])]),
    ('tuple',
     ['y',
      ('tuple',
       [('__seen', 25),
        ('tuple', ['y']),
        ['y', ('88974ea603e15c49', ['a6c0f2053e85c87e'])],
        ('dict', [])])])]),
  'a'])
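
Nested dumps like the ones above are hard to compare by eye. A small recursive diff can report the first position at which two normalized structures disagree, for instance when comparing the normalized form of `da` with that of `da.copy(deep=False)` under a single Dask version. This is a rough sketch: `first_difference` is a hypothetical helper, the sample inputs are trimmed fragments modeled on the dumps above, and whether a direct `normalize_token` call reproduces exactly what `tokenize` hashes is not verified here.

```python
def first_difference(left, right, path=()):
    """Return (path, left_value, right_value) for the first mismatch, or None."""
    both_sequences = isinstance(left, (list, tuple)) and isinstance(right, (list, tuple))
    if not both_sequences:
        # Leaves (strings, numbers, classes, ...) are compared directly.
        return None if left == right else (path, left, right)
    if type(left) is not type(right) or len(left) != len(right):
        return (path, left, right)
    for index, (l_item, r_item) in enumerate(zip(left, right)):
        found = first_difference(l_item, r_item, path + (index,))
        if found is not None:
            return found
    return None


# Trimmed fragments modeled on the 'c2' entries in the two dumps.
old = ("tuple", ["c2", ("tuple", []), ("list", [])])
new = ("tuple", ["c2", ("tuple", []), ("dict", [])])

print(first_difference(old, new))  # ((1, 2, 0), 'list', 'dict')
```

The returned path is a tuple of indices into the nested structure, so `(1, 2, 0)` points at the tag of the third element inside the outer list.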

Cc @dcherian / @crusaderky for visibility

Labels: CI (Continuous Integration tools), topic-dask
