From e6879c118426fe57dfdf57eb598f1db823344eb1 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Mon, 10 Jul 2023 20:08:40 +0200
Subject: [PATCH 1/5] rewords rus user guide

---
 doc/under_sampling.rst | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 9f2795430..d763b8084 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -77,6 +77,12 @@ and are meant for cleaning the feature space.
 Controlled under-sampling techniques
 ------------------------------------
 
+Controlled under-sampling techniques reduce the number of observations of the majority
+classes (targeted classes) to a number specified by the user.
+
+Random under-sampling
+^^^^^^^^^^^^^^^^^^^^^
+
 :class:`RandomUnderSampler` is a fast and easy way to balance the data by
 randomly selecting a subset of data for the targeted classes::
 
@@ -91,9 +97,9 @@ randomly selecting a subset of data for the targeted classes::
    :scale: 60
    :align: center
 
-:class:`RandomUnderSampler` allows to bootstrap the data by setting
-``replacement`` to ``True``. The resampling with multiple classes is performed
-by considering independently each targeted class::
+:class:`RandomUnderSampler` allows bootstrapping the data by setting
+``replacement`` to ``True``. When there are multiple classes, each targeted class is
+under-sampled independently::
 
   >>> import numpy as np
   >>> print(np.vstack([tuple(row) for row in X_resampled]).shape)
@@ -103,8 +109,8 @@ by considering independently each targeted class::
   >>> print(np.vstack(np.unique([tuple(row) for row in X_resampled], axis=0)).shape)
   (181, 2)
 
-In addition, :class:`RandomUnderSampler` allows to sample heterogeneous data
-(e.g. containing some strings)::
+:class:`RandomUnderSampler` works with numrical and also categorical variables
+(e.g. where the values are strings)::
 
   >>> X_hetero = np.array([['xxx', 1, 1.0], ['yyy', 2, 2.0], ['zzz', 3, 3.0]],
   ...                     dtype=object)
@@ -116,7 +122,8 @@ In addition, :class:`RandomUnderSampler` allows to sample heterogeneous data
   >>> print(y_resampled)
   [0 1]
 
-It would also work with pandas dataframe::
+:class:`RandomUnderSampler` can also take a pandas dataframe as input for
+undersampling::
 
   >>> from sklearn.datasets import fetch_openml
   >>> df_adult, y_adult = fetch_openml(

From 1b84f58234e734712ba657338ed134e54e979121 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Mon, 10 Jul 2023 20:12:03 +0200
Subject: [PATCH 2/5] final touches

---
 doc/under_sampling.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index d763b8084..2a9b8ff7c 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -77,7 +77,7 @@ and are meant for cleaning the feature space.
 Controlled under-sampling techniques
 ------------------------------------
 
-Controlled under-sampling techniques reduce the number of observations of the majority
+Controlled under-sampling techniques reduce the number of observations from the majority
 classes (targeted classes) to a number specified by the user.
 
 Random under-sampling
@@ -109,8 +109,8 @@ under-sampled independently::
   >>> print(np.vstack(np.unique([tuple(row) for row in X_resampled], axis=0)).shape)
   (181, 2)
 
-:class:`RandomUnderSampler` works with numrical and also categorical variables
-(e.g. where the values are strings)::
+:class:`RandomUnderSampler` can undersample numerical and also categorical variables
+(i.e., where the values are strings)::
 
   >>> X_hetero = np.array([['xxx', 1, 1.0], ['yyy', 2, 2.0], ['zzz', 3, 3.0]],
   ...                     dtype=object)

From 16161e10ff77c0d53633375af74ab29f717c4980 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 12:03:54 +0200
Subject: [PATCH 3/5] reword

Co-authored-by: Guillaume Lemaitre
---
 doc/under_sampling.rst | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 2a9b8ff7c..c42c65451 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -109,8 +109,7 @@ under-sampled independently::
   >>> print(np.vstack(np.unique([tuple(row) for row in X_resampled], axis=0)).shape)
   (181, 2)
 
-:class:`RandomUnderSampler` can undersample numerical and also categorical variables
-(i.e., where the values are strings)::
+:class:`RandomUnderSampler` handles heterogeneous data types, i.e. numerical, categorical, date, etc.::
 
   >>> X_hetero = np.array([['xxx', 1, 1.0], ['yyy', 2, 2.0], ['zzz', 3, 3.0]],
   ...                     dtype=object)

From 7be1406fb369c7e6ab7fbfec9c845aa6bbf1cfda Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 12:04:23 +0200
Subject: [PATCH 4/5] reword

Co-authored-by: Guillaume Lemaitre
---
 doc/under_sampling.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index c42c65451..0442e0159 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -121,7 +121,7 @@ under-sampled independently::
   >>> print(y_resampled)
   [0 1]
 
-:class:`RandomUnderSampler` can also take a pandas dataframe as input for
+:class:`RandomUnderSampler` also supports pandas dataframes as input for
 undersampling::
 
   >>> from sklearn.datasets import fetch_openml

From 717d7ce6564439faf24098780b5ff4d3000c41c3 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 12:08:20 +0200
Subject: [PATCH 5/5] small cosmetic changes

---
 doc/under_sampling.rst | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 0442e0159..a581508bf 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -77,8 +77,8 @@ and are meant for cleaning the feature space.
 Controlled under-sampling techniques
 ------------------------------------
 
-Controlled under-sampling techniques reduce the number of observations from the majority
-classes (targeted classes) to a number specified by the user.
+Controlled under-sampling techniques reduce the number of observations from the
+targeted classes to a number specified by the user.
 
 Random under-sampling
 ^^^^^^^^^^^^^^^^^^^^^
@@ -109,7 +109,8 @@ under-sampled independently::
   >>> print(np.vstack(np.unique([tuple(row) for row in X_resampled], axis=0)).shape)
   (181, 2)
 
-:class:`RandomUnderSampler` handles heterogeneous data types, i.e. numerical, categorical, date, etc.::
+:class:`RandomUnderSampler` handles heterogeneous data types, i.e. numerical,
+categorical, dates, etc.::
 
   >>> X_hetero = np.array([['xxx', 1, 1.0], ['yyy', 2, 2.0], ['zzz', 3, 3.0]],
   ...                     dtype=object)
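
Note on the behaviour described by the reworded guide: the text says that each targeted class is under-sampled independently, that ``replacement=True`` turns the selection into a bootstrap, and that heterogeneous rows are supported. The sketch below illustrates that idea with the standard library only. It is not imbalanced-learn's implementation: the helper name `random_under_sample`, its `seed` parameter, and the choice to shrink every class to the size of the smallest one are assumptions made for illustration.

```python
import random
from collections import Counter


def random_under_sample(X, y, replacement=False, seed=0):
    """Hypothetical helper, not imbalanced-learn's implementation."""
    rng = random.Random(seed)

    # Group row indices by class label.
    indices_by_class = {}
    for index, label in enumerate(y):
        indices_by_class.setdefault(label, []).append(index)

    # Assumption: every larger class is reduced to the smallest class size.
    n_target = min(len(indices) for indices in indices_by_class.values())

    kept = []
    for indices in indices_by_class.values():
        if len(indices) == n_target:
            # Classes already at the target size are left untouched.
            kept.extend(indices)
        elif replacement:
            # Bootstrap: the same row may be drawn more than once.
            kept.extend(rng.choices(indices, k=n_target))
        else:
            # Plain random subset, each row drawn at most once.
            kept.extend(rng.sample(indices, n_target))

    kept.sort()
    # Rows are selected by index only, so mixed-type rows pass through as-is.
    return [X[i] for i in kept], [y[i] for i in kept]


# Mixed-type rows, in the spirit of the X_hetero example in the guide.
X = [["xxx", 1, 1.0], ["yyy", 2, 2.0], ["zzz", 3, 3.0],
     ["www", 4, 4.0], ["vvv", 5, 5.0]]
y = [0, 0, 0, 1, 1]

X_res, y_res = random_under_sample(X, y)
print(Counter(y_res))
```

Because rows are picked by index, the values inside each row are never inspected, which is one way to see why strings, numbers, and dates can all be handled by the same sampler.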