Official implementation of *Revisiting Hierarchical Text Classification: Inference and Metrics* (CoNLL 2024).
Based on the HiTIN repository.
Hierarchical text classification (HTC) is the task of assigning labels to a text within a structured space organized as a hierarchy. Recent works treat HTC as a conventional multi-label classification problem and therefore evaluate it as such. We instead propose to evaluate models with specifically designed hierarchical metrics, and we demonstrate the intricacy of the choice of metric and of prediction inference method. We introduce a new challenging dataset and fairly evaluate recent sophisticated models, comparing them with a range of simple but strong baselines, including a new theoretically motivated loss. Finally, we show that those baselines are very often competitive with the latest models. This highlights the importance of carefully considering the evaluation methodology when proposing new methods for HTC.
We present Hierarchical Wikivitals, a novel high-quality HTC dataset, extracted from Wikipedia. Equipped with a deep and complex hierarchy, it provides a harder challenge.
The conditional probability is computed as follows:
The loss function is defined as:
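The two formulas were rendered as images in the original README. The sketch below gives the conditional softmax formulation in LaTeX; the notation (logits $s_n$, parent $\pi(n)$, children set $\mathcal{C}(\cdot)$, gold label set $\mathcal{Y}(x)$) is chosen here for illustration and may not match the article's exact symbols:

```latex
% Conditional probability of a node n given its parent pi(n):
% a softmax restricted to the siblings of n, i.e. the children of pi(n).
P\bigl(n \mid \pi(n), x\bigr)
  = \frac{\exp\bigl(s_n(x)\bigr)}{\sum_{n' \in \mathcal{C}(\pi(n))} \exp\bigl(s_{n'}(x)\bigr)}

% Loss: negative log-likelihood of the gold label set,
% summing conditional log-probabilities along the hierarchy.
\mathcal{L}\bigl(x, \mathcal{Y}(x)\bigr)
  = - \sum_{n \in \mathcal{Y}(x)} \log P\bigl(n \mid \pi(n), x\bigr)
```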
For full details regarding the derivation and implications of these formulas, please refer to the article.
We quantitatively evaluate HTC methods based on specifically designed hierarchical metrics and with a rigorous methodology.
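As an illustration of what a hierarchical metric looks like, here is a short sketch of set-augmented hierarchical F1, a standard metric in the HTC literature where gold and predicted label sets are expanded with their ancestors before computing F1. This is an illustrative implementation with our own naming, not necessarily the exact metric definitions used in the article:

```python
def ancestors(label, parent):
    """Walk up a child -> parent map, collecting every ancestor of `label`."""
    out = set()
    while label in parent:
        label = parent[label]
        out.add(label)
    return out

def hierarchical_f1(gold, pred, parent):
    """Hierarchical F1: F1 over label sets augmented with their ancestors.

    Note: whether the root node is excluded varies across papers; this
    sketch keeps every node on the path for simplicity.
    """
    aug = lambda labels: set(labels) | {a for l in labels for a in ancestors(l, parent)}
    g, p = aug(gold), aug(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)

# Toy hierarchy: Root -> Science -> {Physics, Chemistry}
parent = {"Physics": "Science", "Chemistry": "Science", "Science": "Root"}
# A wrong leaf still earns partial credit for the shared ancestors.
print(hierarchical_f1({"Physics"}, {"Chemistry"}, parent))  # -> 0.666...
```

The key property, and the reason flat multi-label F1 can be misleading, is that an error on a deep leaf sharing most of its path with the gold label is penalized less than a prediction in a completely different subtree.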
- Clone the repository:

```shell
git clone https://github.com/RomanPlaud/revisitingHTC.git
cd revisitingHTC
```

- Create and activate the conda environment:

```shell
conda create -n revisiting_htc_env --file revisiting-htc.txt
conda activate revisiting_htc_env
```
Our newly introduced dataset is available here. Feel free to use it for your experiments. This dataset is released under the MIT License.
Obtain the RCV1, WOS, and BGC datasets by referring to:
Ensure the datasets match the following format:

```json
{
    "token": ["Sample input text"],
    "label": ["Category", "Subcategory", "Further Subcategory"]
}
```
In addition, a taxonomy file (such as hwv.taxonomy) is required, where each line lists a parent category followed by its children, separated by tabs. Ensure that all labels used in the dataset are covered by the taxonomy.
Example:

```
Root	Science	Technology	Arts
Science	Physics	Chemistry	Biology
```
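The coverage requirement above can be sanity-checked with a short script. This is an illustrative sketch, not part of the repository; the helper names and the in-memory demo data are our own:

```python
def read_taxonomy(path):
    """Parse a tab-separated taxonomy file into a parent -> children dict."""
    taxonomy = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if parts and parts[0]:
                taxonomy[parts[0]] = parts[1:]
    return taxonomy

def missing_labels(samples, taxonomy):
    """Return the set of dataset labels absent from the taxonomy."""
    known = set(taxonomy) | {c for children in taxonomy.values() for c in children}
    used = {label for sample in samples for label in sample["label"]}
    return used - known

# Demo with the example taxonomy from above, inlined instead of read from disk.
taxonomy = {"Root": ["Science", "Technology", "Arts"],
            "Science": ["Physics", "Chemistry", "Biology"]}
samples = [{"token": ["Sample input text"], "label": ["Science", "Physics"]}]
print(missing_labels(samples, taxonomy))  # -> set()
```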
To accelerate the training process, you can tokenize your dataset. Below are the instructions for tokenizing the HWV dataset:
```shell
python3 tokenize_dataset.py \
    --data_train_path data/HWV/hwv_train.json \
    --data_test_path data/HWV/hwv_test.json \
    --data_valid_path data/HWV/hwv_val.json \
    --config_file data/HWV/config_hwv.json
```
To reproduce the results of our article, execute the following command:
```shell
bash bash_files/hwv/train_hwv_hitin_cond_softmax_la.sh
```
You may also use any other bash file contained in the bash_files folder.
Note: If your dataset is not tokenized, please set "tokenized" to false in the config file and update the paths to the dataset accordingly.
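For illustration, the change might look like the fragment below; apart from `"tokenized"`, the key names are hypothetical and depend on your actual config file:

```json
{
    "tokenized": false,
    "data_train_path": "data/HWV/hwv_train.json"
}
```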
To evaluate a trained model, run:

```shell
python3 evaluate.py \
    --config_file configs/aaaa_final_hwv/vanilla_bert_hwv_conditional_softmax_la.json \
    --model_file ckpt/1001_1653_vanilla_bert_hwv_conditional_softmax_la/best_micro_Origin \
    --output_file results_hwv_conditional_softmax_la.json
```
This project and its dataset are released under the MIT License.
```bibtex
@inproceedings{plaud-etal-2024-revisiting,
title = "Revisiting Hierarchical Text Classification: Inference and Metrics",
author = "Plaud, Roman and
Labeau, Matthieu and
Saillenfest, Antoine and
Bonald, Thomas",
editor = "Barak, Libby and
Alikhani, Malihe",
booktitle = "Proceedings of the 28th Conference on Computational Natural Language Learning",
month = nov,
year = "2024",
address = "Miami, FL, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.conll-1.18",
doi = "10.18653/v1/2024.conll-1.18",
pages = "231--242",
abstract = "Hierarchical text classification (HTC) is the task of assigning labels to a text within a structured space organized as a hierarchy. Recent works treat HTC as a conventional multilabel classification problem, therefore evaluating it as such. We instead propose to evaluate models based on specifically designed hierarchical metrics and we demonstrate the intricacy of metric choice and prediction inference method. We introduce a new challenging dataset and we evaluate fairly, recent sophisticated models, comparing them with a range of simple but strong baselines, including a new theoretically motivated loss. Finally, we show that those baselines are very often competitive with the latest models. This highlights the importance of carefully considering the evaluation methodology when proposing new methods for HTC",
}
```