API for reshaping DataArrays as 2D "data matrices" for use in machine learning #1317

Closed
@nbren12

Machine learning and linear algebra problems are often expressed in terms of operations on matrices rather than arrays of arbitrary dimension, and there is currently no convenient way to turn DataArrays (or combinations of DataArrays) into a single "data matrix".

As an example, I have needed to use scikit-learn lately with data from DataArray objects. Scikit-learn requires the data to be expressed as simple 2-dimensional matrices, where the rows are called samples and the columns are called features. It is annoying and error-prone to transpose and reshape a data array by hand to fit this format. For instance, this GitHub repo for xarray-aware sklearn-like objects devotes many lines of code to massaging data arrays into data matrices. I think this reshaping workflow might be common enough to warrant some kind of treatment in xarray.
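To make the "annoying and error-prone" part concrete, here is a minimal NumPy-only sketch of the manual reshaping. The array `A`, its sizes, and the variable names are made up for illustration; the point is only that the axis order has to be tracked by hand.

```python
import numpy as np

# Hypothetical field A(time, lat, lon) filled with consecutive numbers.
n_time, n_lat, n_lon = 4, 3, 2
A = np.arange(n_time * n_lat * n_lon, dtype=float).reshape(n_time, n_lat, n_lon)

# Samples are time steps (rows); features are (lat, lon) grid points (columns).
data_matrix = A.reshape(n_time, n_lat * n_lon)

# Going back requires remembering the original axis order and sizes.
restored = data_matrix.reshape(n_time, n_lat, n_lon)

# If the same data is stored as (lat, lon, time), the sample axis must be
# moved to the front first; a bare reshape silently mixes time and space.
A_llt = np.moveaxis(A, 0, -1)  # shape (lat, lon, time)
wrong = A_llt.reshape(n_time, n_lat * n_lon)
right = np.moveaxis(A_llt, -1, 0).reshape(n_time, n_lat * n_lon)
```

Both `wrong` and `right` have the shape scikit-learn expects, so nothing fails loudly when the transpose is forgotten.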

I have written some code in this gist that I have found pretty convenient for doing this. The gist has an XRReshaper class which can be used for reshaping data to and from a matrix format. An EOF analysis of a dataset A(lat, lon, time) can be done like this:

feature_dims = ['lat', 'lon']

rs = XRReshaper(A)
data_matrix, _ = rs.to(feature_dims)

# Some linear algebra or machine learning
_,_, eofs = svd(data_matrix)

eofs_datarray = rs.get(eofs[0], ['mode'] + feature_dims)
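For comparison, a similar round trip can be sketched with xarray's existing `stack` and `unstack` methods. This is not the XRReshaper implementation from the gist, only a rough illustration under assumptions: the random dataset, the stacked dimension name `features`, and `n_modes` are all invented for the example.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset A(lat, lon, time) with random values.
rng = np.random.default_rng(0)
A = xr.DataArray(
    rng.standard_normal((3, 4, 20)),
    dims=["lat", "lon", "time"],
    coords={
        "lat": [-10.0, 0.0, 10.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
        "time": np.arange(20),
    },
)

feature_dims = ["lat", "lon"]

# Collapse the feature dims into one stacked dim and put samples (time) first.
stacked = A.stack(features=feature_dims).transpose("time", "features")
data_matrix = stacked.values  # plain 2-D array, shape (20, 12)

# Some linear algebra: rows of vt are the spatial patterns (EOFs).
_, _, vt = np.linalg.svd(data_matrix, full_matrices=False)

# Restore the leading patterns to (mode, lat, lon) by reusing the stacked
# coordinate of one sample as a template and unstacking it.
n_modes = 2
template = stacked.isel(time=0, drop=True)
eofs = xr.concat(
    [template.copy(data=vt[k]).unstack("features") for k in range(n_modes)],
    dim="mode",
)
```

The template trick is one possible way to carry the coordinates back; a dedicated API could hide that bookkeeping, which is roughly what the reshaper class does.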

I am not sure this is the best API, but it seems to work pretty well and I have used it here to implement some xarray-aware sklearn-like objects for PCA, which can be used like

feature_dims = ['lat', 'lon']
pca = XPCA(feature_dims, n_components=10, weight=cos(A.lat))
pca.fit(A)
pca.transform(A)
eofs = pca.components_

Another syntax which might be helpful is some kind of context manager approach like

with XRReshaper(A) as (rs, data_matrix):
     # do some stuff with data_matrix
# use rs to restore output to a data array.
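A context-manager version could look roughly like the sketch below. The helper `as_data_matrix`, its arguments, and the `restore` callback are hypothetical names made up here, not part of xarray or of the gist, and the sketch only handles restoring a single 1-D feature vector.

```python
from contextlib import contextmanager

import numpy as np
import xarray as xr


@contextmanager
def as_data_matrix(da, feature_dims, sample_dim):
    """Hypothetical helper: yield (restore, data_matrix) for a DataArray."""
    stacked = da.stack(features=feature_dims).transpose(sample_dim, "features")
    # One sample, kept only for its stacked "features" coordinate.
    template = stacked.isel({sample_dim: 0}, drop=True)

    def restore(vector):
        # Map a 1-D feature vector back onto the original feature dims.
        return template.copy(data=np.asarray(vector)).unstack("features")

    yield restore, stacked.values


A = xr.DataArray(np.arange(24.0).reshape(2, 3, 4), dims=["lat", "lon", "time"])

with as_data_matrix(A, ["lat", "lon"], "time") as (restore, data_matrix):
    # Do some stuff with data_matrix, here a per-feature mean over samples.
    feature_mean = data_matrix.mean(axis=0)

# Use restore to turn the output back into a DataArray.
mean_map = restore(feature_mean)
```

Here the context manager has no cleanup to do, so it mostly serves to scope the matrix view; whether that is worth a `with` block is an open design question.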
