Commit a0b9659

Merge pull request #12 from GateNLP/huggingface-models
Refactor to unify the BERTweet and Twitter-XLM code bases
2 parents 2a2fa21 + 197f4e1 commit a0b9659

23 files changed: 217 additions, 947 deletions

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -3,6 +3,6 @@ resources/
 resources.tar.gz
 models/
 models.tar.gz
+extra-models/
 __pycache__/
 training/models/
-*.pyc

README.md

Lines changed: 13 additions & 47 deletions
@@ -1,37 +1,23 @@
 # StanceClassifier
-Stance Classifier for the WeVerify project
+Stance Classifier for the WeVerify project, to determine the stance (support, deny, question, comment) of a "reply" tweet or other social media post towards the original "target" post to which it is replying.
 
-This is a re-implementation of Aker et al. (2017) ["Simple Open Stance Classification for Rumour Analysis"](https://arxiv.org/pdf/1708.05286.pdf). We replaced the Bag-of-words and BROWN features with [GloVe embeddings](https://nlp.stanford.edu/projects/glove/) (twitter embeddings with 200 dimensions).
+> This is the latest version of the classifier, with target-aware and target-oblivious models based on BERTweet and XLM-RoBERTa. For the older multilingual BERT-based models see the [bert-model branch](https://github.com/GateNLP/StanceClassifier/tree/bert-model)
 
-## Configuration
-1) Requirements:
+## Available models
 
-python3.7
-nltk
-numpy
-scipy
-sklearn
-
-Get Vader lexicon:
-`python -m nltk.downloader vader_lexicon`
+There are two versions of the stance classifier available:
 
-2) Clone this repository
-
-3) Download the [resources](https://github.com/GateNLP/StanceClassifier/releases/download/v0.1/resources.tar.gz) required for feature extraction and extract it inside the main folder (`StanceClassifer`)
-
-4) Download the trained model or models that you want to use - each model is provided as a `tar.gz` file which should be unpacked in this folder, e.g. `curl -L <download_url> | tar xvzf -`
-
-- [Ensemble model](https://github.com/GateNLP/StanceClassifier/releases/download/v0.2/model-ensemble.tar.gz): an English-language feature-based ensemble model built with Logistic Regression, Random Forest and Multi-Layer Perceptron classifiers
-- [English BERT-based model](https://github.com/GateNLP/StanceClassifier/releases/download/v0.2/model-bert-english.tar.gz): Monolingual English BERT-based model, i.e. fine-tuning of BERT for the rumour stance classification task, using threshold moving for imbalanced data treatment
-- [Multilingual BERT-based model](https://github.com/GateNLP/StanceClassifier/releases/download/v0.2/model-bert-multi.tar.gz): Multilingual version of the BERT-based model, tuned on the same English data as the above but based on the [multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) underlying model
+- "target aware" - a mode that uses both the reply post _and_ the target post together to determine the stance. This uses two models fine-tuned from [vinai/bertweet-base](https://huggingface.co/vinai/bertweet-base), one for just the reply and one for the target and reply together
+- "target oblivious" - a mode that uses just the reply post, without reference to the target. This model is fine tuned from [cardiffnlp/twitter-xlm-roberta-base](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base), which is a multilingual model, but the fine tuning data is still the RumourEval 2019 data set which is English only.
 
 ## Usage
 
 ### Basic usage
 ```
-python -m StanceClassifier -l <LANGUAGE> -s <ORIGINAL_JSON> -o <REPLY_JSON> -c <MODEL>
+python -m StanceClassifier reply.json [original.json]
 ```
-Supported model names for the `-c` option: `ens` (ensemble), `bert-english` (English BERT model), `bert-multilingual` (multilingual BERT model)
+
+The reply and (if provided) original arguments should be JSON files containing tweets in JSON format with at least a `"text"` or `"full_text"` property containing the text. If only a reply is provided then the target-oblivious model will be used. If an original tweet is provided as well then the ensemble model will be used, which combines a target-oblivious and a target-aware model and picks the more confident classification.
 
 The output is a class:
 - 0.0 = support
@@ -45,33 +31,13 @@ The folder `examples` contains examples of original tweets and replies:
 - original_old and reply_old are examples of the old JSON files (140 characters)
 - original_new and reply_new are examples of the new JSON files (280 characters)
 
-### `StanceClassifer` class (StanceClassifier.stance_classifier.StanceClassifier)
-This is the main class in this project. If you want to add this project as part of your own project, you should import this class.
+### Programmatic usage
 
-### Server usage
-We have implemented TCP and HTTP servers. Server parameters are defined in the `configurations.txt` file.
+The project provides three main classes: `StanceClassifier` for target-oblivious stance detection, `StanceClassifierEnsemble` for the target-aware ensemble model, and `StanceClassifierWithTarget`, which uses _only_ the target-aware model. All these classes can be imported `from StanceClassifier.stance_classifier`, and will download their models from HuggingFace on first use.
 
-To run the TCP server:
-```
-python Run_TCP_StanceClassifier_Server.py
-```
-
-Testing the TCP server:
-```
-python Test_Run_TCP_StanceClassifier_Server.py
-```
-
-The HTTP server uses a TCP server already running:
-```
-python Run_HTTP_StanceClassifier_Server.py
-```
-
-To test the HTTP server:
-```
-python Test_Run_HTTP_StanceClassifier_Server.py
-```
+### Server usage
 
-In addition, the `docker` directory contains configuration to build a Docker image running a particular model of the classifier as an HTTP endpoint compliant with the [API specification](https://european-language-grid.readthedocs.io/en/release1.1.2/all/A2_API/LTInternalAPI.html) of the [European Language Grid](https://www.european-language-grid.eu).
+The `docker` directory contains configuration to build a Docker image running a particular model of the classifier as an HTTP endpoint compliant with the [API specification](https://european-language-grid.readthedocs.io/en/release1.1.2/all/A2_API/LTInternalAPI.html) of the [European Language Grid](https://www.european-language-grid.eu).
 
 ### Training new models
 To train new models, you can edit `train_model.py` (more support will be given in the future). To run:

Run_HTTP_StanceClassifier_Server.py

Lines changed: 0 additions & 4 deletions
This file was deleted.

Run_TCP_StanceClassifier_Server.py

Lines changed: 0 additions & 6 deletions
This file was deleted.

StanceClassifier/__main__.py

Lines changed: 15 additions & 10 deletions
@@ -1,27 +1,32 @@
-import sys
 import argparse
 import json
-from StanceClassifier.stance_classifier import StanceClassifier
+from StanceClassifier import stance_classifier
 import time
 
 
 
 #Load config:
 #configurations = loadResources('../configurations.txt')
 
-parser = argparse.ArgumentParser(description='Multilingual Reply Only Stance classifier.')
-parser.add_argument('-s', help='stance file to be classified (json file)')
-
+parser = argparse.ArgumentParser(description='Stance classifier.', epilog='If only a reply is provided, the classifier will use the multilingual target-oblivious model. If both a reply and a target are provided then the classifier will use the target-aware ensemble model, which is currently English-only.')
+parser.add_argument('reply', type=argparse.FileType('r'), help='JSON file with the reply tweet')
+parser.add_argument('target', type=argparse.FileType('r'), nargs='?', help='JSON file with the target tweet (optional)')
 
 
 args = parser.parse_args()
 
-v_args = vars(args)
 start_time = time.time()
-stance = json.load(open(v_args['s'], "r"))
+reply = json.load(args.reply)
 
-classifier = StanceClassifier()
-print("--- %s seconds ---" % (time.time() - start_time))
+if args.target:
+    classifier = stance_classifier.StanceClassifierEnsemble()
+    target = json.load(args.target)
+
+    print("--- %s seconds ---" % (time.time() - start_time))
+    print(classifier.classify_with_target(reply, target))
+else:
+    classifier = stance_classifier.StanceClassifier()
+    print("--- %s seconds ---" % (time.time() - start_time))
+    print(classifier.classify(reply)) # result
 
-print(classifier.classify(stance)) # result
 print("--- %s seconds ---" % (time.time() - start_time))
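
Review note on the new command line: the optional second positional argument is what selects between the two modes. The sketch below reproduces only that argument handling, under the assumption that a missing `target` should mean target-oblivious mode. The helper names `build_parser` and `load_inputs` and the sample tweet texts are hypothetical, and no model is loaded.

```python
import argparse
import json
import os
import tempfile


def build_parser():
    # Same two positionals as the new __main__.py: a required reply file
    # and an optional target file (nargs="?" yields None when omitted).
    parser = argparse.ArgumentParser(description="Stance classifier.")
    parser.add_argument("reply", type=argparse.FileType("r"),
                        help="JSON file with the reply tweet")
    parser.add_argument("target", type=argparse.FileType("r"), nargs="?",
                        help="JSON file with the target tweet (optional)")
    return parser


def load_inputs(argv):
    """Hypothetical helper: parse argv and report which mode would run."""
    args = build_parser().parse_args(argv)
    with args.reply as fh:
        reply = json.load(fh)
    target = None
    if args.target:
        with args.target as fh:
            target = json.load(fh)
    mode = "ensemble" if target is not None else "target-oblivious"
    return mode, reply, target


with tempfile.TemporaryDirectory() as tmp:
    reply_path = os.path.join(tmp, "reply.json")
    target_path = os.path.join(tmp, "original.json")
    with open(reply_path, "w") as fh:
        json.dump({"text": "@user that is not true"}, fh)
    with open(target_path, "w") as fh:
        json.dump({"full_text": "Example original post"}, fh)

    only_reply = load_inputs([reply_path])
    with_target = load_inputs([reply_path, target_path])

print(only_reply[0])
print(with_target[0])
```

Unlike the committed script, this sketch closes the file handles that `argparse.FileType` opens. For a short-lived command that difference is cosmetic.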

StanceClassifier/features/extract_features.py

Lines changed: 33 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -2,24 +2,31 @@
 # deny = 1
 # query = 2
 # comment = 3
+import functools
 
 import numpy as np
-from joblib import dump, load
 from transformers import AutoTokenizer
 import json
 import glob
 import re
-from StanceClassifier.util import Util, path_from_root
 import emoji
 from nltk.tokenize import TweetTokenizer
 import string
 
-class Features():
+class Features:
 
-    def __init__(self, tokenizer_PATH):
-
-        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_PATH)
+    def __init__(self, tokenizer_PATH, tokenizer_kwargs=None, demojize=False):
+        if tokenizer_kwargs is None:
+            tokenizer_kwargs = {}
 
+        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_PATH, **tokenizer_kwargs)
+        # Wrap the tokenizer in an LRU cache so we don't need to re-tokenize the reply text
+        # twice when running in ensemble mode
+        self.tokenizer = functools.lru_cache(maxsize=10)(self.tokenizer)
+        self.demojize = demojize
+        self.nltk_tokenizer = TweetTokenizer()
+
+
     def process_tweet_dict(self, tweet_dict):
 
         if "text" not in tweet_dict.keys():
@@ -28,19 +35,31 @@ def process_tweet_dict(self, tweet_dict):
             text = tweet_dict["text"]
 
 
-        tknzr = TweetTokenizer()
         FLAGS = re.MULTILINE | re.DOTALL
         text = re.sub(r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", "http", text, flags=FLAGS)
         text = re.sub(r"@\w+", "@user", text, flags=FLAGS)
-        text_token = tknzr.tokenize(text)
+        text_token = self.nltk_tokenizer.tokenize(text)
         text = " ".join(text_token)
+
+        if self.demojize:
+            text = emoji.demojize(text)
 
         return text
 
-    def extract_bert_input(self, reply_tweet_dict):
-
-        r_text = self.process_tweet_dict(reply_tweet_dict)
-
+    def extract_bert_input(self, tweet_dict, text_pair=None):
+        """
+        Preprocess and tokenize the given tweet dict ready for the stance model.
+
+        :param tweet_dict: the tweet to process
+        :param text_pair: optional supplementary text to send to the tokenizer. Typically this
+        will be omitted in the target-oblivious case, or when encoding just the reply tweet,
+        or it will be the text returned when preprocessing the reply, when preparing the
+        original tweet (the target to which it is replying) in a target-aware scenario.
+        :return: tuple with the encoded result of this tweet ready to pass to the model, and the
+        preprocessed text that could be passed as text_pair to encode the next tweet in the chain.
+        """
+        text = self.process_tweet_dict(tweet_dict)
+
         # input of target-oblivious model
-        encoded_reply = self.tokenizer(text=r_text, add_special_tokens=True, truncation=True, padding='max_length', max_length = 128, return_tensors="pt")
-        return encoded_reply
+        encoded = self.tokenizer(text=text, text_pair=text_pair, add_special_tokens=True, truncation=True, padding='max_length', max_length = 128, return_tensors="pt")
+        return encoded, text
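
Review note on the cached tokenizer: the following is a minimal sketch of the two ideas in this file, the URL and mention normalization and the `functools.lru_cache` wrapper around a callable object. `CountingTokenizer` is a hypothetical stand-in for the Hugging Face tokenizer, and the NLTK tokenization and emoji steps are left out so the example runs with the standard library alone.

```python
import functools
import re

FLAGS = re.MULTILINE | re.DOTALL


def normalize(text):
    """Apply the two substitutions used in process_tweet_dict."""
    text = re.sub(r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", "http", text, flags=FLAGS)
    text = re.sub(r"@\w+", "@user", text, flags=FLAGS)
    return text


class CountingTokenizer:
    """Hypothetical stand-in for a real tokenizer; it only counts its calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, text, text_pair=None):
        self.calls += 1
        pair_tokens = text_pair.split() if text_pair else None
        return (text.split(), pair_tokens)


raw = CountingTokenizer()
# lru_cache accepts any callable, as long as every argument is hashable.
cached = functools.lru_cache(maxsize=10)(raw)

reply = normalize("@alice that is false, see https://example.com/proof")
first = cached(text=reply, text_pair=None)
second = cached(text=reply, text_pair=None)  # identical keywords: cache hit

print(reply)
print(raw.calls, cached.cache_info().hits)
```

Two properties of `lru_cache` matter here. A hit returns the very same object as the first call, so callers should treat the encoded result as read-only. The cache key also depends on how the arguments are passed, so the same text passed positionally and by keyword may occupy separate entries.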

StanceClassifier/stance_classifier.py

Lines changed: 51 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,32 +1,70 @@
11
import os
22
import json
33
import sys
4-
from joblib import load
54
import numpy as np
65
from .features.extract_features import Features
7-
from .util import Util, path_from_root
86
from .testing import test
97
from transformers import AutoModelForSequenceClassification,AutoTokenizer
108

119

1210

13-
class StanceClassifier():
11+
class StanceClassifier:
1412

15-
def __init__(self):
16-
17-
RESOURCES_PATH = path_from_root("resources.txt")
13+
def __init__(self, model="GateNLP/stance-twitter-xlm-target-oblivious", feature_extractor=None):
14+
if not feature_extractor:
15+
# Create a plain Features() instance loading its tokenizer from
16+
# the same place as the model
17+
feature_extractor = Features(model)
1818

19-
print("Loading resources")
20-
util = Util()
21-
self.resources = util.loadResources(RESOURCES_PATH)
22-
self.feature_extractor = Features(path_from_root(self.resources["tokenizer"]))
23-
self.model = AutoModelForSequenceClassification.from_pretrained(path_from_root(self.resources["model"]), num_labels=4)
19+
self.feature_extractor = feature_extractor
20+
self.model = AutoModelForSequenceClassification.from_pretrained(model, num_labels=4)
2421

2522

2623
def classify(self, reply):
2724

28-
encoded_reply = self.feature_extractor.extract_bert_input(reply)
25+
encoded_reply, _ = self.feature_extractor.extract_bert_input(reply)
2926
#print("stanceclassifier.classify....................", encoded_reply, encoded_source_reply)
3027
stance_class, stance_prob = test.predict_bertweet(encoded_reply, self.model)
3128

32-
return stance_class, stance_prob
29+
return stance_class, stance_prob
30+
31+
32+
class StanceClassifierWithTarget(StanceClassifier):
33+
34+
def __init__(self, model="GateNLP/stance-bertweet-target-aware", feature_extractor=None):
35+
if not feature_extractor:
36+
feature_extractor = Features(model, tokenizer_kwargs={"use_fast": False}, demojize=True)
37+
38+
super().__init__(model, feature_extractor)
39+
40+
def classify_with_target(self, reply, target):
41+
encoded_reply, reply_text = self.feature_extractor.extract_bert_input(reply)
42+
encoded_reply_and_target, _ = self.feature_extractor.extract_bert_input(target, reply_text)
43+
44+
stance_class, stance_prob = test.predict_bertweet(encoded_reply_and_target, self.model)
45+
46+
return stance_class, stance_prob
47+
48+
49+
class StanceClassifierEnsemble:
50+
"""
51+
Ensemble classifier that runs a target-oblivious and a target-aware model against
52+
the same pair of posts and returns whichever prediction is more confident.
53+
"""
54+
55+
def __init__(self, to_model="GateNLP/stance-bertweet-target-oblivious", ta_model="GateNLP/stance-bertweet-target-aware", feature_extractor=None):
56+
self.ta_classifier = StanceClassifierWithTarget(ta_model, feature_extractor)
57+
# Use the same feature extractor for both classifiers, whether that's the supplied one
58+
# or the one that was auto-created by the ta_classifier
59+
self.to_classifier = StanceClassifier(to_model, self.ta_classifier.feature_extractor)
60+
61+
62+
def classify_with_target(self, reply, target):
63+
# run both the target oblivious and the target aware model, and return whichever gives
64+
# the higher score
65+
stance_class_to, stance_prob_to = self.to_classifier.classify(reply)
66+
stance_class_ta, stance_prob_ta = self.ta_classifier.classify_with_target(reply, target)
67+
if stance_prob_to[stance_class_to] > stance_prob_ta[stance_class_ta]:
68+
return stance_class_to, stance_prob_to
69+
else:
70+
return stance_class_ta, stance_prob_ta
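
Review note on the ensemble rule: the selection in `StanceClassifierEnsemble.classify_with_target` can be checked in isolation. The sketch below assumes each prediction is a pair of a class index and a per-class probability sequence, which is what the indexing `stance_prob_to[stance_class_to]` implies. The function name `pick_most_confident` and the sample probabilities are made up for illustration.

```python
LABELS = {0: "support", 1: "deny", 2: "query", 3: "comment"}


def pick_most_confident(pred_to, pred_ta):
    """Return whichever (class index, probabilities) pair scores higher.

    pred_to is the target-oblivious prediction and pred_ta the target-aware
    one. On a tie the target-aware prediction wins, because the comparison
    in the diff is a strict ">".
    """
    class_to, probs_to = pred_to
    class_ta, probs_ta = pred_ta
    if probs_to[class_to] > probs_ta[class_ta]:
        return class_to, probs_to
    return class_ta, probs_ta


# Made-up predictions: target-oblivious says "comment", target-aware says "deny".
oblivious = (3, [0.10, 0.10, 0.20, 0.60])
aware = (1, [0.05, 0.80, 0.05, 0.10])

chosen_class, chosen_probs = pick_most_confident(oblivious, aware)
print(LABELS[chosen_class], chosen_probs[chosen_class])

# Equal confidence in both models falls through to the target-aware result.
tie_class, _ = pick_most_confident((3, [0.1, 0.1, 0.2, 0.6]),
                                   (0, [0.6, 0.2, 0.1, 0.1]))
print(LABELS[tie_class])
```

Comparing raw top probabilities from two separately fine-tuned models assumes their confidence scores are on a comparable scale. Nothing in the diff calibrates them, so the rule is best read as a simple heuristic.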

StanceClassifier/testing/test.py

Lines changed: 2 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,4 @@
11
import numpy as np
2-
from transformers import AutoModelForSequenceClassification
32
import torch
43
from scipy.special import softmax
54

@@ -23,9 +22,9 @@ def predict_bertweet(encoded_reply, model):
2322
return stance_prob, stance_prediction
2423

2524
def process_model_output(output_):
26-
# input: logits of output_TO and output_TA;
25+
# input: logits of output_
2726

28-
id2label = {0:"support", 1:"deny", 2:"query", 3:"comment"}
27+
#id2label = {0:"support", 1:"deny", 2:"query", 3:"comment"}
2928
output_ = softmax(output_) # transform logits
3029
ranking_ = np.argsort(output_)[::-1] # rank
3130
#return output_[ranking_[0]], id2label[ranking_[0]]
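
Review note on `process_model_output`: the visible lines turn logits into probabilities and rank the classes from most to least likely. Below is a rough standard-library equivalent of those two steps, written without SciPy or NumPy so it runs anywhere. The function `rank_classes` and the sample logits are hypothetical, and the rest of the real function is not shown in this hunk, so its return value is not reproduced here.

```python
import math

ID2LABEL = {0: "support", 1: "deny", 2: "query", 3: "comment"}


def softmax(logits):
    # Subtract the maximum first for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_classes(logits):
    """Return (probabilities, class indices sorted by descending probability)."""
    probs = softmax(logits)
    ranking = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return probs, ranking


# Made-up logits for the four stance classes.
probs, ranking = rank_classes([0.2, 2.5, -1.0, 1.1])
best = ranking[0]
print(ID2LABEL[best], round(probs[best], 3))
```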

StanceClassifier/util.py

Lines changed: 0 additions & 37 deletions
This file was deleted.

Test_Run_HTTP_StanceClassifier_Server.py

Lines changed: 0 additions & 20 deletions
This file was deleted.
