Repository for Combining Autoregressive and Autoencoder Language Models for Text Classification
This repository contains the code and data for the paper:
"Combining Autoregressive and Autoencoder Language Models for Text Classification"
Author: João Gonçalves
Folders:
- Corona/: Contains data and scripts related to the CoronaNet dataset experiments.
- Hate/: Contains data and scripts for the hate speech classification experiments.
- Military/: Contains data and scripts for the military stance detection experiments.
- Morality/: Contains data and scripts for the morality stance detection experiments.
Scripts and Files:
- analysis_script_updated.R: R script for analyzing baseline experimental results.
- nemo_analysis.R: R script for analyzing results with Nemo labels only.
- autoregressive_generation.py: Python script for generating intermediate texts using an autoregressive model.
- test_loop_BERT.py: Python script for training and evaluating BERT-based models.
- test_loop_BERT_NLI.py: Python script for training and evaluating BERT-NLI models.
- all_results.csv, nemo_results.csv, zero_shot_results.csv: CSV files containing experimental results.
- *.png: Plot images visualizing the results.
The following datasets are used in this project:
- CoronaNet Dataset:
- Hate Speech Dataset:
- Military and Traditional Morality Stance Detection Datasets:
Intermediate texts can be generated by running autoregressive_generation.py. The script is currently configured to use Mistral Nemo as the autoregressive model and the hate speech detection dataset. To use different models, datasets, or classification instructions, edit the file directly.
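For reference, a minimal sketch of what this generation step could look like is shown below; the model ID, prompt wording, file paths, and column names are illustrative assumptions rather than the exact configuration in autoregressive_generation.py:

```python
# Illustrative sketch of the intermediate-text generation step (not the exact script configuration).
import pandas as pd
from transformers import pipeline

MODEL_NAME = "mistralai/Mistral-Nemo-Instruct-2407"    # assumed Hugging Face ID for Mistral Nemo
INSTRUCTION = ("Explain step by step whether the following text contains hate speech, "
               "then state your conclusion.")           # assumption: the real instruction differs per dataset

generator = pipeline("text-generation", model=MODEL_NAME, device_map="auto")

df = pd.read_csv("Hate/hate_speech.csv")                # hypothetical path; point this at your own dataset
intermediate = []
for text in df["text"]:
    prompt = f"{INSTRUCTION}\n\nText: {text}\n\nAnalysis:"
    out = generator(prompt, max_new_tokens=200, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    intermediate.append(out[0]["generated_text"][len(prompt):].strip())

df["intermediate_text"] = intermediate
df.to_csv("Hate/hate_speech_with_intermediate.csv", index=False)
```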
The test_loop Python files replicate the paper's analyses for the baseline BERT models and the BERT-NLI models. They can be adjusted to classify other datasets with CAALM-generated intermediate texts.
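As a rough illustration of how such a loop can be adapted, the sketch below fine-tunes a BERT-style classifier on texts concatenated with CAALM-generated intermediate texts; the encoder checkpoint, file paths, column names, and hyperparameters are assumptions, not the paper's exact settings:

```python
# Illustrative sketch: fine-tune a BERT-style classifier on CAALM-augmented inputs.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINT = "microsoft/deberta-v3-base"                     # assumption: any BERT-style encoder works here
df = pd.read_csv("Hate/hate_speech_with_intermediate.csv")   # hypothetical output of the generation step
# Concatenate the original text with the generated intermediate text before classification.
df["input"] = df["text"] + " " + df["intermediate_text"]

# Assumes integer class labels (0..n-1) in a "label" column.
dataset = Dataset.from_pandas(df[["input", "label"]]).train_test_split(test_size=0.2, seed=42)

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
dataset = dataset.map(lambda b: tokenizer(b["input"], truncation=True, max_length=512), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT,
                                                           num_labels=df["label"].nunique())

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert_caalm", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,   # enables dynamic padding via the default data collator
)
trainer.train()
print(trainer.evaluate())
```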
This research was funded by a VENI grant VI.Veni.221S.154 from the Dutch Research Council (NWO). The funding sources had no involvement in the study design; in the collection, analysis and interpretation of data; in the writing of the report; and in the decision to submit the article for publication.
The scripts and testing approach in this repository draw significantly from https://github.com/MoritzLaurer/less-annotating-with-bert-nli.
Laurer, M., Van Atteveldt, W., Casas, A., & Welbers, K. (2023). Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI. Political Analysis, 1–33. https://doi.org/10.1017/pan.2023.20
OpenAI's o1-preview model was used to speed up commenting and streamlining of the code in this repository, in order to provide contextual information and make the code more accessible. It was not used for the research paper.
To do:
- Create a demo file that makes the CAALM pipeline accessible for any user-defined dataset, model, and classification task.