Internal refactor: create a generic function for applying ufunc-like functions to xarray objects #770

Closed
@shoyer

It would be awesome to have a generic function for making functions that act like NumPy's generalized universal functions "xarray aware".

What would xarray.apply_ufunc(func, objs, join='inner', agg_dims=None, drop_dims=None, kwargs=None) do?

  1. If one or more of the provided objects are Dataset or GroupBy instances, dispatch to specialized loops that call the remainder of apply_ufunc repeatedly.
  2. Align all objects along shared labels using the indicated join (for some operations, e.g., where, a left join is more appropriate than an inner join).
  3. broadcast all objects against each other to expand dimensionality along all dimensions except (optionally) those listed in agg_dims/drop_dims. drop_dims should be moved to the end, for consistency with gufunc signatures.
  4. Transform agg_dims (if provided) into an axis argument using get_axis_num and insert it into kwargs.
  5. Apply func to the data attribute of each array to calculate the result using the provided kwargs. The result is expected to have all the same dimensions as the provided arrays, except any listed in the agg_dims and drop_dims arguments.
  6. Merge all coordinate data together (i.e., with an n-ary version of the Coordinate.merge method) and add these to the result array.
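
The steps above can be sketched for the simplest case. This is only an illustration under stated assumptions, not the proposed implementation: apply_ufunc_sketch and weighted_sum are hypothetical names, only DataArray inputs are handled (step 1 and drop_dims are left out), and step 3 is simplified to broadcast every array over every dimension, including the aggregated ones.

```python
import numpy as np
import xarray as xr


def apply_ufunc_sketch(func, objs, join="inner", agg_dims=None, kwargs=None):
    """Hypothetical helper covering steps 2-6 for DataArray inputs only."""
    kwargs = dict(kwargs or {})
    agg_dims = list(agg_dims or [])

    # Step 2: align along shared labels with the requested join.
    aligned = xr.align(*objs, join=join)
    # Step 3 (simplified): broadcast every array to the full set of dimensions.
    arrays = xr.broadcast(*aligned)
    first = arrays[0]

    # Step 4: translate dimension names into positional axes.
    if agg_dims:
        kwargs["axis"] = tuple(first.get_axis_num(d) for d in agg_dims)

    # Step 5: apply func to the underlying data of each array.
    data = func(*(arr.data for arr in arrays), **kwargs)

    # Step 6: reattach the coordinates of the dimensions that survive.
    out_dims = [d for d in first.dims if d not in agg_dims]
    coords = {d: first.coords[d] for d in out_dims if d in first.coords}
    return xr.DataArray(data, dims=out_dims, coords=coords)


def weighted_sum(values, weights, axis=None):
    # Illustrative function that accepts an axis keyword, like many NumPy reductions.
    return np.sum(values * weights, axis=axis)


a = xr.DataArray(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    dims=("x", "t"),
    coords={"x": [10, 20], "t": [0, 1, 2]},
)
w = xr.DataArray([0.5, 0.5, 9.0], dims="t", coords={"t": [1, 2, 3]})

# The inner join keeps only t = 1 and t = 2 before the reduction over "t".
result = apply_ufunc_sketch(weighted_sum, [a, w], agg_dims=["t"])
```

Here the label t = 3 exists only in w and t = 0 only in a, so neither contributes to the result, which keeps the x dimension and its coordinate.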

If any of objs are not xarray objects (e.g., they're NumPy or dask arrays), they should be skipped in operations that don't apply to them. xarray.Variable objects, for example, don't align or have coordinates.
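
One way to picture this skipping is an alignment step that touches only the DataArray arguments and passes everything else through unchanged. The helper name align_xarray_only is hypothetical, and the sketch assumes that plain arrays already have a shape compatible with the aligned result.

```python
import numpy as np
import xarray as xr


def align_xarray_only(objs, join="inner"):
    """Align DataArray arguments; leave all other arguments untouched."""
    positions = [i for i, obj in enumerate(objs) if isinstance(obj, xr.DataArray)]
    aligned = xr.align(*(objs[i] for i in positions), join=join)
    out = list(objs)
    for i, new in zip(positions, aligned):
        out[i] = new
    return out


left = xr.DataArray([1.0, 2.0, 3.0], dims="x", coords={"x": [0, 1, 2]})
right = xr.DataArray([10.0, 20.0, 30.0], dims="x", coords={"x": [1, 2, 3]})
raw = np.array([100.0, 200.0])  # no labels, so there is nothing to align

new_left, new_right, new_raw = align_xarray_only([left, right, raw])
```

The two labeled arrays are trimmed to the shared labels, while the NumPy array comes back as the very same object.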

A concrete example of similar functionality in dask.array is atop. The closest things we currently have in xarray are the _unary_op and _binary_op staticmethods (e.g., on DataArray), but these only handle one or two arguments, don't handle aggregated dimensions, and, most importantly, are difficult to apply to new operations.

Here are a few concrete examples of how this could work:

def average(array, weights, dim=None):
    # still needs a bit of work to make a NaN- and dask.array-safe
    # version of np.average
    return apply_ufunc(np.average, [array, weights], agg_dims=dim)

def where(cond, first, second=None):
    if second is None:
        # need to write where2, a function that looks at first.dtype
        # to infer the appropriate NA sentinel value
        return apply_ufunc(ops.where2, [cond, first])
    else:
        return apply_ufunc(ops.where, [cond, first, second])

def dot(self, other, dim=None):
    if dim is None:
        # by default, sum over the dimensions shared by both arrays
        dim = set(self.dims) & set(other.dims)
    return apply_ufunc(ops.tensordot, [self, other], agg_dims=dim)
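
For the dot case, the sketch below shows the name-to-axis translation directly with np.tensordot, which takes a separate list of axes for each operand rather than a single axis argument. It assumes the contraction runs over the dimensions the two arrays share when dim is not given. dot_sketch is a hypothetical name, and alignment and coordinates are left out to keep it short.

```python
import numpy as np
import xarray as xr


def dot_sketch(a, b, dim=None):
    """Contract two DataArrays over named dimensions (labels are ignored here)."""
    if dim is None:
        # Assumption: default to the dimensions present in both arrays.
        dims = [d for d in a.dims if d in b.dims]
    elif isinstance(dim, str):
        dims = [dim]
    else:
        dims = list(dim)

    # np.tensordot needs positional axes for each operand separately.
    axes_a = [a.get_axis_num(d) for d in dims]
    axes_b = [b.get_axis_num(d) for d in dims]
    data = np.tensordot(a.data, b.data, axes=(axes_a, axes_b))

    # tensordot returns the remaining axes of a followed by those of b.
    out_dims = [d for d in a.dims if d not in dims]
    out_dims += [d for d in b.dims if d not in dims]
    return xr.DataArray(data, dims=out_dims)


a = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("x", "y"))
b = xr.DataArray(np.arange(12.0).reshape(3, 4), dims=("y", "z"))
out = dot_sketch(a, b)  # contracts over "y"
```

With these inputs the result matches an ordinary matrix product. The sketch does not handle a dimension that both arrays share but that is excluded from dim, since that would produce duplicate dimension names.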
