"""
Handling Data with Many Labels Using Linear Methods
====================================================

For datasets with a very large number of labels, the training time of the standard ``train_1vsrest`` method can be prohibitively long. LibMultiLabel offers tree-based methods like ``train_tree`` and ``train_ensemble_tree`` to vastly improve training time in such scenarios.
We will use the `EUR-Lex dataset <https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html#EUR-Lex>`_, which contains 3,956 labels. The data is assumed to be downloaded under the directory ``data/eur-lex``.
"""

import math
import libmultilabel.linear as linear
import time

# Load and preprocess the dataset
datasets = linear.load_dataset("txt", "data/eur-lex/train.txt", "data/eur-lex/test.txt")
preprocessor = linear.Preprocessor()
datasets = preprocessor.fit_transform(datasets)


######################################################################
# Standard Training and Prediction
# --------------------------------
#
# Users can use the following command to easily apply the ``train_tree`` method.
#
# .. code-block:: bash
#
#    $ python3 main.py --training_file data/eur-lex/train.txt \
#                      --test_file data/eur-lex/test.txt \
#                      --linear \
#                      --linear_technique tree
#
# Besides CLI usage, users can also use the API to apply the ``train_tree`` method.
# Below is an example.

training_start = time.time()
# the standard one-vs-rest method for multi-label problems
ovr_model = linear.train_1vsrest(datasets["train"]["y"], datasets["train"]["x"])
print("Score of 1vsrest:", metrics_in_batches(ovr_model))
print("Score of tree:", metrics_in_batches(tree_model))
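######################################################################
# To make the P@K numbers concrete, here is a self-contained sketch of how
# precision at k can be computed from decision values with NumPy alone.
# The arrays below are toy values rather than EUR-Lex data, and the helper
# ``precision_at_k`` is illustrative, not part of LibMultiLabel's API.

```python
import numpy as np

def precision_at_k(preds, target, k):
    """Fraction of the top-k ranked labels that are relevant, averaged over instances."""
    # indices of the k highest decision values per instance
    top_k = np.argsort(-preds, axis=1)[:, :k]
    # look up whether each of those labels is a true label
    hits = np.take_along_axis(target, top_k, axis=1)
    return hits.sum(axis=1).mean() / k

# two instances, three labels
preds = np.array([[0.9, 0.2, -0.1], [0.1, 0.8, 0.5]])
target = np.array([[1, 0, 1], [0, 1, 1]])
print(precision_at_k(preds, target, 2))  # -> 0.75
```

``linear.compute_metrics`` reports these values per k (e.g. P@1, P@3, P@5) in one call.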


######################################################################
# Ensemble of Tree Models
# -----------------------
#
# While the ``train_tree`` method offers a significant speedup, its accuracy can sometimes be slightly lower than the standard one-vs-rest approach.
# The ``train_ensemble_tree`` method can help bridge this gap by training multiple tree models and averaging their predictions.
#
# Users can use the following command to easily apply the ``train_ensemble_tree`` method.
# The number of trees in the ensemble can be controlled with the ``--tree_ensemble_models`` argument.
#
# .. code-block:: bash
#
#    $ python3 main.py --training_file data/eur-lex/train.txt \
#                      --test_file data/eur-lex/test.txt \
#                      --linear \
#                      --linear_technique tree \
#                      --tree_ensemble_models 3
#
# This command trains an ensemble of 3 tree models. If ``--tree_ensemble_models`` is not specified, it defaults to 1 (a single tree).
#
# Besides CLI usage, users can also use the API to apply the ``train_ensemble_tree`` method.
# Below is an example.

# We have already trained a single tree model as a baseline.
# Now, let's train an ensemble of 3 tree models.
training_start = time.time()
ensemble_model = linear.train_ensemble_tree(
    datasets["train"]["y"], datasets["train"]["x"], n_trees=3
)
training_end = time.time()
print("Training time of ensemble tree: {:10.2f}".format(training_end - training_start))
| 137 | + |
| 138 | +###################################################################### |
| 139 | +# On a machine with an AMD-7950X CPU, |
| 140 | +# the ``train_ensemble_tree`` function with 3 trees took `421.15` seconds, |
| 141 | +# while the single tree took `144.37` seconds. |
| 142 | +# As expected, training an ensemble takes longer, roughly proportional to the number of trees. |
| 143 | +# |
| 144 | +# Now, let's see if this additional training time translates to better performance. |
| 145 | +# We'll compute the same P@K metrics on the test set for both the single tree and the ensemble model. |
| 146 | + |
| 147 | +# `tree_preds` and `target` are already computed in the previous section. |
| 148 | +ensemble_preds = linear.predict_values(ensemble_model, datasets["test"]["x"]) |
| 149 | + |
| 150 | +# `tree_score` is already computed. |
| 151 | +print("Score of single tree:", tree_score) |
| 152 | + |
| 153 | +ensemble_score = linear.compute_metrics(ensemble_preds, target, ["P@1", "P@3", "P@5"]) |
| 154 | +print("Score of ensemble tree:", ensemble_score) |
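######################################################################
# The averaging behind the ensemble can be illustrated with a small NumPy
# sketch. The decision values below are toy numbers, and this is a
# conceptual picture of score averaging across trees, not the library's
# internal implementation.

```python
import numpy as np

# Toy decision values from three hypothetical trees:
# rows are test instances, columns are labels.
tree_outputs = [
    np.array([[0.9, -0.2, 0.1], [-0.5, 0.7, 0.3]]),
    np.array([[0.8, 0.0, -0.1], [-0.4, 0.6, 0.5]]),
    np.array([[1.0, -0.1, 0.2], [-0.6, 0.8, 0.4]]),
]

# Averaging the per-label decision values across trees gives the
# ensemble's scores; ranking labels by these scores yields predictions.
averaged = np.mean(tree_outputs, axis=0)
top1 = np.argmax(averaged, axis=1)
print(top1.tolist())  # highest-scoring label per instance: [0, 1]
```

Because each tree is trained with its own randomized label partition, averaging tends to smooth out the errors any single tree makes near its cluster boundaries.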

######################################################################
# While training an ensemble takes longer, it often leads to better predictive performance.
# The following table shows a comparison between a single tree and ensembles
# of 3, 10, and 15 trees on two benchmark datasets.
#
# .. table:: Benchmark Results for Single and Ensemble Tree Models (P@K in %)
#
#    +---------------+-----------------+-------+-------+-------+
#    | Dataset       | Model           | P@1   | P@3   | P@5   |
#    +===============+=================+=======+=======+=======+
#    | EURLex-4k     | Single Tree     | 82.35 | 68.98 | 57.62 |
#    |               +-----------------+-------+-------+-------+
#    |               | Ensemble-3      | 82.38 | 69.28 | 58.01 |
#    |               +-----------------+-------+-------+-------+
#    |               | Ensemble-10     | 82.74 | 69.66 | 58.39 |
#    |               +-----------------+-------+-------+-------+
#    |               | Ensemble-15     | 82.61 | 69.56 | 58.29 |
#    +---------------+-----------------+-------+-------+-------+
#    | EURLex-57k    | Single Tree     | 90.77 | 80.81 | 67.82 |
#    |               +-----------------+-------+-------+-------+
#    |               | Ensemble-3      | 91.02 | 81.06 | 68.26 |
#    |               +-----------------+-------+-------+-------+
#    |               | Ensemble-10     | 91.23 | 81.22 | 68.34 |
#    |               +-----------------+-------+-------+-------+
#    |               | Ensemble-15     | 91.25 | 81.31 | 68.34 |
#    +---------------+-----------------+-------+-------+-------+