[WIP] Release 0.5 #54

Open · wants to merge 10 commits into master
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,9 +1,11 @@
.vscode/
.idea/
data/
docs/modules/
docs/_build/
docs/auto_examples/
coverage/
scratch

# So far, all html files are auto-generated
*.html
51 changes: 43 additions & 8 deletions docs/content/target.rst
@@ -11,7 +11,7 @@ for supervised learning. This is achieved with a :py:class:`Target` object:

>>> from sklearn_xarray import wrap, Target
>>> from sklearn_xarray.datasets import load_digits_dataarray
>>> from sklearn.linear_model.logistic import LogisticRegression
>>> from sklearn.linear_model import LogisticRegression
>>>
>>> X = load_digits_dataarray()
>>> y = Target(coord='digit')(X)
@@ -61,17 +61,52 @@ Pre-processing
--------------

In some cases, it is necessary to pre-process the coordinate before it can be
used as a target. For this, the constructor takes a ``transform_func`` parameter
which can be used with the ``fit_transform`` method of transformers in
``sklearn.preprocessing`` (and also any other object implementing the sklearn
transformer interface):
used as a target. For this, the constructor takes a ``transformer`` parameter
which can be used with transformers in ``sklearn.preprocessing`` (and also any
other object implementing the sklearn transformer interface):

.. doctest::

>>> from sklearn.neural_network import MLPClassifier
>>> from sklearn.preprocessing import LabelBinarizer
>>>
>>> y = Target(coord='digit', transform_func=LabelBinarizer().fit_transform)(X)
>>> y = Target(coord='digit', transformer=LabelBinarizer(), reshapes="feature")
>>> wrapper = wrap(MLPClassifier(), reshapes="feature")
>>> wrapper.fit(X, y) # doctest:+ELLIPSIS
EstimatorWrapper(...)

This approach makes it possible to reverse the pre-processing, e.g. after
calling ``wrapper.predict``:

.. doctest::

>>> yp = wrapper.predict(X)
>>> yp
<xarray.DataArray (sample: 1797, feature: 10)>
array([[1, 0, 0, ..., 0, 0, 0],
[0, 1, 0, ..., 0, 0, 0],
[0, 0, 1, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 1, 0],
[0, 0, 0, ..., 0, 0, 1],
[0, 0, 0, ..., 0, 1, 0]])
Coordinates:
* sample (sample) int64 0 1 2 3 4 5 6 ... 1790 1791 1792 1793 1794 1795 1796
digit (sample) int64 0 1 2 3 4 5 6 7 8 9 0 1 ... 7 9 5 4 8 8 4 9 0 8 9 8
Dimensions without coordinates: feature
>>> y.inverse_transform(yp)
<xarray.DataArray (sample: 1797)>
array([0, 1, 2, ..., 8, 9, 8])
Coordinates:
* sample (sample) int64 0 1 2 3 4 5 6 ... 1790 1791 1792 1793 1794 1795 1796
digit (sample) int64 0 1 2 3 4 5 6 7 8 9 0 1 ... 7 9 5 4 8 8 4 9 0 8 9 8
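
Under the hood, the round-trip shown above is plain ``LabelBinarizer``
behaviour. A minimal scikit-learn-only sketch, using made-up labels in place
of the ``digit`` coordinate:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical labels standing in for the ``digit`` coordinate.
labels = np.array([0, 1, 2, 9, 2])

binarizer = LabelBinarizer()
onehot = binarizer.fit_transform(labels)
# One column per distinct class (0, 1, 2 and 9), one row per sample.
print(onehot.shape)  # (5, 4)

# inverse_transform maps the one-hot rows back to the original labels,
# analogous to what ``Target.inverse_transform`` does for the wrapped data.
restored = binarizer.inverse_transform(onehot)
print(np.array_equal(restored, labels))  # True
```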


Alternatively, the constructor also accepts a ``transform_func`` parameter:

.. doctest::

>>> y = Target(coord='digit', transform_func=LabelBinarizer().fit_transform)
>>> wrapper = wrap(MLPClassifier())
>>> wrapper.fit(X, y) # doctest:+ELLIPSIS
EstimatorWrapper(...)
@@ -81,13 +116,13 @@ Indexing

A :py:class:`Target` object can be indexed in the same way as the underlying
coordinate and interfaces with ``numpy`` by providing an ``__array__``
attribute which returns ``numpy.array()`` of the (transformed) coordinate.
attribute that returns a ``numpy.ndarray`` of the (transformed) data.
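
The ``__array__`` protocol is what lets numpy consume a :py:class:`Target`
directly. A toy illustration of the mechanism only (this class is not the
actual ``Target`` implementation):

```python
import numpy as np

class CoordProxy:
    """Toy stand-in: exposes wrapped values through ``__array__``."""

    def __init__(self, values):
        self._values = values

    def __array__(self, dtype=None):
        # numpy calls this whenever the object is used as an array.
        return np.asarray(self._values, dtype=dtype)

proxy = CoordProxy([3, 1, 4, 1, 5])
# numpy functions accept the proxy as if it were an array:
print(np.asarray(proxy))  # [3 1 4 1 5]
print(np.mean(proxy))     # 2.8
```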


Multi-dimensional coordinates
-----------------------------

In some cases, the target coordinates span multiple dimensions, but the
In some cases, the target data spans multiple dimensions, but the
transformer expects a lower-dimensional input. With the ``dim`` parameter of
the :py:class:`Target` class you can specify which of the dimensions to keep.
You can also specify the callable ``reduce_func`` to perform the reduction of
104 changes: 87 additions & 17 deletions docs/content/transformers.rst
@@ -6,21 +6,91 @@ xarray's powerful array manipulation syntax. Refer to :ref:`API/Pre-processing`
for a full list.


Combining dimensions
--------------------

scikit-learn's estimators generally assume that data is two-dimensional:
the first dimension represents the samples, the second dimension the features
of your data. Since xarray is generally used for higher-dimensional data, it is
often necessary to perform pre-processing steps that combine multiple
dimensions into a sample and/or feature dimension, or even combine multiple
variables of a ``Dataset`` into a single ``DataArray``.

.. py:currentmodule:: sklearn_xarray.datasets

For example, the :py:func:`load_digits_dataarray` function loads a
three-dimensional array of 8-by-8-pixel grayscale images:

.. doctest::

>>> from sklearn_xarray.datasets import load_digits_dataarray
>>> X = load_digits_dataarray(load_images=True)
>>> X # doctest:+ELLIPSIS
<xarray.DataArray (sample: 1797, row: 8, col: 8)>
array([[[ 0., 0., 5., ..., 1., 0., 0.],
[ 0., 0., 13., ..., 15., 5., 0.],
[ 0., 3., 15., ..., 11., 8., 0.],
...,
[ 0., 4., 16., ..., 16., 6., 0.],
[ 0., 8., 16., ..., 16., 8., 0.],
[ 0., 1., 8., ..., 12., 1., 0.]]])
Coordinates:
* sample (sample) int64 0 1 2 3 4 5 6 ... 1790 1791 1792 1793 1794 1795 1796
* row (row) int64 0 1 2 3 4 5 6 7
* col (col) int64 0 1 2 3 4 5 6 7
digit (sample) int64 0 1 2 3 4 5 6 7 8 9 0 1 ... 7 9 5 4 8 8 4 9 0 8 9 8

.. py:currentmodule:: sklearn_xarray.preprocessing

In order to use the individual images as samples to fit an estimator, we need
to vectorize them first. The :py:class:`Featurizer` combines all dimensions
of the array except for the sample dimension:

.. doctest::

>>> from sklearn_xarray.preprocessing import Featurizer
>>> Featurizer().fit_transform(X)
<xarray.DataArray (sample: 1797, feature: 64)>
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
Coordinates:
* sample (sample) int64 0 1 2 3 4 5 6 ... 1790 1791 1792 1793 1794 1795 1796
digit (sample) int64 0 1 2 3 4 5 6 7 8 9 0 1 ... 7 9 5 4 8 8 4 9 0 8 9 8
* feature (feature) MultiIndex
- col (feature) int64 0 0 0 0 0 0 0 0 1 1 1 1 ... 6 6 6 6 7 7 7 7 7 7 7 7
- row (feature) int64 0 1 2 3 4 5 6 7 0 1 2 3 ... 4 5 6 7 0 1 2 3 4 5 6 7
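
Apart from the coordinate and MultiIndex bookkeeping that xarray provides,
the vectorization itself is an ordinary reshape that collapses every
non-sample dimension into one feature axis. A numpy-only sketch of that step,
on random data of the same shape:

```python
import numpy as np

# Random stand-in for the (sample, row, col) digits array above.
rng = np.random.default_rng(0)
images = rng.random((1797, 8, 8))

# Collapse all trailing dimensions into a single feature axis:
# (1797, 8, 8) -> (1797, 64), i.e. one 64-element vector per image.
features = images.reshape(images.shape[0], -1)
print(features.shape)  # (1797, 64)
```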

Other transformers for combining dimensions are:

.. autosummary::
:nosignatures:

Concatenator
Featurizer
Stacker

Check out the :ref:`examples<examples>` for more use cases.


Transformers changing the number of samples
-------------------------------------------

There are several transformers that change the number of samples in the data,
namely:

.. py:currentmodule:: sklearn_xarray.preprocessing

.. autosummary::
:nosignatures:

Resampler
Sanitizer
Segmenter
Splitter
Stacker

Transformers of this kind are usually disallowed by sklearn, because the
package does not provide any mechanism for also changing the number of samples
@@ -83,25 +153,25 @@ specify the ``groupby`` parameter:
>>>
>>> X = load_wisdm_dataarray()
>>> Xt = segmenter.fit_transform(X)
>>> Xt # doctest:+ELLIPSIS doctest:+NORMALIZE_WHITESPACE
>>> Xt # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
<xarray.DataArray 'tmptmp' (sample: 54813, axis: 3, timepoint: 20)>
array([[[ -0.15 , 0.11 , ..., -2.26 , -1.46 ],
[ 9.15 , 9.19 , ..., 9.72 , 9.81 ],
[ -0.34 , 2.76 , ..., 2.03 , 2.15 ]],
[[ 0.27 , -3.06 , ..., -2.56 , -2.6 ],
[ 12.57 , 13.18 , ..., 14.56 , 8.96 ],
[ 5.37 , 6.47 , ..., 0.31 , -3.3 ]],
...,
[[ -0.3 , 0.27 , ..., 0.42 , 3.17 ],
[ 8.08 , 6.63 , ..., 10.5 , 9.23 ],
[ 0.994285, 0.994285, ..., -5.175732, -4.671779]],
[[ 5.33 , 6.44 , ..., -4.14 , -4.9 ],
[ 8.39 , 9.04 , ..., 6.21 , 6.55 ],
[ -4.794363, -2.179256, ..., 5.938472, 3.827318]]])
array([[[-0.15 , 0.11 , ..., -2.26 , -1.46 ],
[ 9.15 , 9.19 , ..., 9.72 , 9.81 ],
[-0.34 , 2.76 , ..., 2.03 , 2.15 ]],
[[ 0.27 , -3.06 , ..., -2.56 , -2.6 ],
[12.57 , 13.18 , ..., 14.56 , 8.96 ],
[ 5.37 , 6.47 , ..., 0.31 , -3.3 ]],
...
[[-0.3 , 0.27 , ..., 0.42 , 3.17 ],
[ 8.08 , 6.63 , ..., 10.5 , 9.23 ],
[ 0.99... , 0.99... , ..., -5.17... , -4.67... ]],
[[ 5.33 , 6.44 , ..., -4.14 , -4.9 ],
[ 8.39 , 9.04 , ..., 6.21 , 6.55 ],
[-4.79... , -2.17... , ..., 5.93... , 3.82... ]]])
Coordinates:
* axis (axis) <U1 'x' 'y' 'z'
* timepoint (timepoint) int64 0 1 2 3 4 5 6 7 8 ... 12 13 14 15 16 17 18 19
* sample (sample) datetime64[ns] 1970-01-01T13:25:37.050000 ... 1970-01-01T03:12:42.100000
* sample (sample) datetime64[ns] 1970-01-01T13:25:37.050000 ... 1970-01...
subject (sample, timepoint) int64 1 1 1 1 1 1 1 ... 36 36 36 36 36 36 36
activity (sample, timepoint) object 'Downstairs' ... 'Walking'
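
Stripped of the coordinate handling, segmentation amounts to reshaping the
time axis into fixed-length windows. A numpy-only sketch with a toy
one-dimensional signal (segment length 20, matching the example above); the
``Segmenter`` additionally keeps coordinates in sync and supports per-group
segmentation via ``groupby``:

```python
import numpy as np

# Toy signal; the real example segments a (sample, axis) DataArray.
signal = np.arange(100.0)

seg_len = 20
n_seg = signal.shape[0] // seg_len  # drop any incomplete trailing segment

# Non-overlapping windows of 20 timepoints each: (100,) -> (5, 20).
segments = signal[: n_seg * seg_len].reshape(n_seg, seg_len)
print(segments.shape)  # (5, 20)
```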

14 changes: 13 additions & 1 deletion docs/content/whatsnew.rst
@@ -2,6 +2,18 @@ What's New
==========


v0.5.0 (unreleased)
-------------------------

Enhancements
~~~~~~~~~~~~

- New ``Stacker`` transformer that provides a transformer interface to
xarray's ``stack``/``unstack`` methods (thanks to @mmann1123 for the input).
- Un-deprecated the ``transformer`` parameter of the ``Target`` class and
added an ``inverse_transform`` method that reverses the transformation.


v0.4.0 (June 18, 2020)
-------------------------

@@ -17,7 +29,7 @@ Enhancements

- The package can now be installed via conda::

conda install -c phausamann -c conda-forge sklearn-xarray
conda install -c phausamann sklearn-xarray



2 changes: 1 addition & 1 deletion docs/content/wrappers.rst
@@ -4,7 +4,7 @@ Wrappers for sklearn estimators
sklearn-xarray provides wrappers that let you use sklearn estimators on
xarray DataArrays and Datasets. The goal is to provide a seamless integration
of both packages by only applying estimator methods on the raw data while
metadata (coordinates in xarray) remains untouched whereever possible.
metadata (coordinates in xarray) remains untouched whenever possible.

There are two principal data types in xarray: ``DataArray`` and ``Dataset``.
The wrappers provided in this package will determine automatically which
12 changes: 5 additions & 7 deletions environment.yml
@@ -2,20 +2,18 @@ name: sklearn-xarray
channels:
- conda-forge
dependencies:
- python=3.6
- python=3.7
- numpy
- scipy
- scikit-learn
- pandas
- xarray
- pytest
- matplotlib
- sphinx
- pillow
- sphinx==2.4.4
- sphinx-gallery
- sphinx_rtd_theme
- numpydoc
- bump2version
- pre-commit
- flake8
- black=19.10b0
# install sklearn-xarray in development mode through pip
- pip:
- -e .
8 changes: 3 additions & 5 deletions examples/README.txt
@@ -1,6 +1,4 @@
.. _general_examples:
.. _examples:

General examples
================

Introductory examples.
Examples
========
4 changes: 2 additions & 2 deletions requirements.txt
@@ -1,5 +1,5 @@
scikit-learn==0.23.1
xarray==0.15.1
scikit-learn==0.24.1
xarray==0.16.2
pandas==1.0.4
numpy==1.18.5
scipy==1.4.1
4 changes: 1 addition & 3 deletions requirements_dev.txt
@@ -1,8 +1,6 @@
pytest==5.4.3
sphinx=2.4.4
sphinx==2.4.4
sphinx_rtd_theme==0.4.3
sphinx-gallery==0.7.0
numpydoc==1.0.0
matplotlib==3.2.1
pillow==7.1.2
bump2version==1.0.0