GateNLP
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 13 additions & 47 deletions
diff --git a/Run_HTTP_StanceClassifier_Server.py b/Run_HTTP_StanceClassifier_Server.py
Lines changed: 0 additions & 4 deletions
diff --git a/Run_TCP_StanceClassifier_Server.py b/Run_TCP_StanceClassifier_Server.py
Lines changed: 0 additions & 6 deletions
diff --git a/StanceClassifier/__main__.py b/StanceClassifier/__main__.py
Lines changed: 15 additions & 10 deletions
diff --git a/StanceClassifier/features/extract_features.py b/StanceClassifier/features/extract_features.py
Lines changed: 33 additions & 14 deletions
diff --git a/StanceClassifier/stance_classifier.py b/StanceClassifier/stance_classifier.py
Lines changed: 51 additions & 13 deletions
diff --git a/StanceClassifier/testing/test.py b/StanceClassifier/testing/test.py
Lines changed: 2 additions & 3 deletions
diff --git a/StanceClassifier/util.py b/StanceClassifier/util.py
Lines changed: 0 additions & 37 deletions
diff --git a/Test_Run_HTTP_StanceClassifier_Server.py b/Test_Run_HTTP_StanceClassifier_Server.py
Lines changed: 0 additions & 20 deletions
@@ -3,6 +3,6 @@ resources/
 resources.tar.gz
 models/
 models.tar.gz
+extra-models/
 __pycache__/
 training/models/
-*.pyc
@@ -1,37 +1,23 @@
 # StanceClassifier
-Stance Classifier for the WeVerify project
+Stance Classifier for the WeVerify project, to determine the stance (support, deny, question, comment) of a "reply" tweet or other social media post towards the original "target" post to which it is replying.
 
-This is a re-implementation of Aker et al. (2017) ["Simple Open Stance Classification for Rumour Analysis"](https://arxiv.org/pdf/1708.05286.pdf). We replaced the Bag-of-words and BROWN features with [GloVe embeddings](https://nlp.stanford.edu/projects/glove/) (twitter embeddings with 200 dimensions). 
+> This is the latest version of the classifier, with target-aware and target-oblivious models based on BERTweet and XLM-RoBERTa.  For the older multilingual BERT-based models, see the [bert-model branch](https://github.com/GateNLP/StanceClassifier/tree/bert-model).
 
-## Configuration
-1) Requirements:
+## Available models
 
-        python3.7
-        nltk
-        numpy
-        scipy
-        sklearn
-    
-        Get Vader lexicon: 
-        `python -m nltk.downloader vader_lexicon`
+There are two versions of the stance classifier available:
 
-2) Clone this repository
-
-3) Download the [resources](https://github.com/GateNLP/StanceClassifier/releases/download/v0.1/resources.tar.gz) required for feature extraction and extract it inside the main folder (`StanceClassifer`)
-
-4) Download the trained model or models that you want to use - each model is provided as a `tar.gz` file which should be unpacked in this folder, e.g. `curl -L <download_url> | tar xvzf -`
-
- - [Ensemble model](https://github.com/GateNLP/StanceClassifier/releases/download/v0.2/model-ensemble.tar.gz): an English-language feature-based ensemble model built with Logistic Regression, Random Forest and Multi-Layer Perceptron classifiers
- - [English BERT-based model](https://github.com/GateNLP/StanceClassifier/releases/download/v0.2/model-bert-english.tar.gz): Monolingual English BERT-based model, i.e. fine-tuning of BERT for the rumour stance classification task, using threshold moving for imbalanced data treatment
- - [Multilingual BERT-based model](https://github.com/GateNLP/StanceClassifier/releases/download/v0.2/model-bert-multi.tar.gz): Multilingual version of the BERT-based model, tuned on the same English data as the above but based on the [multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) underlying model
+- "target aware" - a mode that uses both the reply post _and_ the target post together to determine the stance.  This uses two models fine-tuned from [vinai/bertweet-base](https://huggingface.co/vinai/bertweet-base), one for just the reply and one for the target and reply together.
+- "target oblivious" - a mode that uses just the reply post, without reference to the target.  This model is fine-tuned from [cardiffnlp/twitter-xlm-roberta-base](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base), which is a multilingual model, but the fine-tuning data is still the RumourEval 2019 data set, which is English-only.
 
 ## Usage
 
 ### Basic usage
 ```
-python -m StanceClassifier -l <LANGUAGE> -s <ORIGINAL_JSON> -o <REPLY_JSON> -c <MODEL>
+python -m StanceClassifier reply.json [original.json]
 ```
-Supported model names for the `-c` option: `ens` (ensemble), `bert-english` (English BERT model), `bert-multilingual` (multilingual BERT model)
+
+The reply and (if provided) original arguments should be JSON files, each containing a tweet in JSON format with at least a `"text"` or `"full_text"` property holding the text.  If only a reply is provided, the target-oblivious model is used.  If an original tweet is provided as well, the ensemble model is used: it runs a target-oblivious and a target-aware model and returns whichever classification is more confident.
 
 The output is a class:
  - 0.0 = support
@@ -45,33 +31,13 @@ The folder `examples` contains examples of original tweets and replies:
  - original_old and reply_old are examples of the old JSON files (140 characters)
  - original_new and reply_new are examples of the new JSON files (280 characters)
 
-### `StanceClassifer` class (StanceClassifier.stance_classifier.StanceClassifier)
-This is the main class in this project. If you want to add this project as part of your own project, you should import this class. 
+### Programmatic usage
 
-### Server usage
-We have implemented TCP and HTTP servers. Server parameters are defined in the `configurations.txt` file.
+The project provides three main classes: `StanceClassifier` for target-oblivious stance detection, `StanceClassifierEnsemble` for the target-aware ensemble model, and `StanceClassifierWithTarget`, which uses _only_ the target-aware model.  All three classes can be imported `from StanceClassifier.stance_classifier`, and will download their models from HuggingFace on first use.
 
-To run the TCP server:
-```
-python Run_TCP_StanceClassifier_Server.py
-```
-
-Testing the TCP server:
-```
-python Test_Run_TCP_StanceClassifier_Server.py
-```
-
-The HTTP server uses a TCP server already running:
-```
-python Run_HTTP_StanceClassifier_Server.py
-```
-
-To test the HTTP server:
-```
-python Test_Run_HTTP_StanceClassifier_Server.py
-```
+### Server usage
 
-In addition, the `docker` directory contains configuration to build a Docker image running a particular model of the classifier as an HTTP endpoint compliant with the [API specification](https://european-language-grid.readthedocs.io/en/release1.1.2/all/A2_API/LTInternalAPI.html) of the [European Language Grid](https://www.european-language-grid.eu).
+The `docker` directory contains configuration to build a Docker image running a particular model of the classifier as an HTTP endpoint compliant with the [API specification](https://european-language-grid.readthedocs.io/en/release1.1.2/all/A2_API/LTInternalAPI.html) of the [European Language Grid](https://www.european-language-grid.eu).
 
 ### Training new models
 To train new models, you can edit `train_model.py` (more support will be given in the future). To run:
 
@@ -1,27 +1,32 @@
-import sys
 import argparse
 import json
-from StanceClassifier.stance_classifier import StanceClassifier
+from StanceClassifier import stance_classifier
 import time
 
 
 
 #Load config: 
 #configurations = loadResources('../configurations.txt')
 
-parser = argparse.ArgumentParser(description='Multilingual Reply Only Stance classifier.')
-parser.add_argument('-s', help='stance file to be classified (json file)')
-
+parser = argparse.ArgumentParser(description='Stance classifier.', epilog='If only a reply is provided, the classifier will use the multilingual target-oblivious model.  If both a reply and a target are provided then the classifier will use the target-aware ensemble model, which is currently English-only.')
+parser.add_argument('reply', type=argparse.FileType('r'), help='JSON file with the reply tweet')
+parser.add_argument('target', type=argparse.FileType('r'), nargs='?', help='JSON file with the target tweet (optional)')
 
 
 args = parser.parse_args()
 
-v_args = vars(args)
 start_time = time.time()
-stance = json.load(open(v_args['s'], "r"))
+reply = json.load(args.reply)
 
-classifier = StanceClassifier()
-print("--- %s seconds ---" % (time.time() - start_time))
+if args.target:
+    classifier = stance_classifier.StanceClassifierEnsemble()
+    target = json.load(args.target)
+
+    print("--- %s seconds ---" % (time.time() - start_time))
+    print(classifier.classify_with_target(reply, target))
+else:
+    classifier = stance_classifier.StanceClassifier()
+    print("--- %s seconds ---" % (time.time() - start_time))
+    print(classifier.classify(reply)) # result
 
-print(classifier.classify(stance)) # result
 print("--- %s seconds ---" % (time.time() - start_time))
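Review note: the new command line relies on an optional positional argument (`nargs='?'`) to choose between the two code paths. The sketch below isolates that dispatch so it can be tried without any model. It is an illustration under assumptions, not the project's code: `build_parser` and `select_mode` are hypothetical helper names, and plain string paths replace the `argparse.FileType('r')` used in the diff so that no files need to exist.

```python
import argparse


def build_parser():
    # Same positional layout as the diff, but with plain string paths so the
    # sketch does not need real files on disk (the diff uses argparse.FileType).
    parser = argparse.ArgumentParser(description="Stance classifier (sketch).")
    parser.add_argument("reply", help="JSON file with the reply tweet")
    parser.add_argument("target", nargs="?", help="JSON file with the target tweet (optional)")
    return parser


def select_mode(argv):
    # Hypothetical helper: reports which branch of __main__ would run.
    args = build_parser().parse_args(argv)
    return "ensemble" if args.target else "target-oblivious"


print(select_mode(["reply.json"]))
print(select_mode(["reply.json", "original.json"]))
```

When `target` is omitted, argparse leaves it as `None`, which is why a simple truthiness check is enough to pick the branch.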
@@ -2,24 +2,31 @@
 # deny = 1
 # query = 2
 # comment = 3
+import functools
 
 import numpy as np
-from joblib import dump, load
 from transformers import AutoTokenizer
 import json
 import glob
 import re
-from StanceClassifier.util import Util, path_from_root
 import emoji
 from nltk.tokenize import TweetTokenizer
 import string
 
-class Features():
+class Features:
 
-    def __init__(self, tokenizer_PATH):
-    
-        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_PATH)
+    def __init__(self, tokenizer_PATH, tokenizer_kwargs=None, demojize=False):
+        if tokenizer_kwargs is None:
+            tokenizer_kwargs = {}
 
+        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_PATH, **tokenizer_kwargs)
+        # Wrap the tokenizer in an LRU cache so the reply text is not tokenized
+        # twice when running in ensemble mode
+        self.tokenizer = functools.lru_cache(maxsize=10)(self.tokenizer)
+        self.demojize = demojize
+        self.nltk_tokenizer = TweetTokenizer()
+
+
     def process_tweet_dict(self, tweet_dict):
 
         if "text" not in tweet_dict.keys():
@@ -28,19 +35,31 @@ def process_tweet_dict(self, tweet_dict):
             text = tweet_dict["text"]
 
 
-        tknzr = TweetTokenizer()
         FLAGS = re.MULTILINE | re.DOTALL
         text = re.sub(r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", "http", text, flags=FLAGS)
         text = re.sub(r"@\w+", "@user", text, flags=FLAGS)
-        text_token = tknzr.tokenize(text)
+        text_token = self.nltk_tokenizer.tokenize(text)
         text = " ".join(text_token)
+
+        if self.demojize:
+            text = emoji.demojize(text)
 
         return text
 
-    def extract_bert_input(self, reply_tweet_dict):
-       
-        r_text = self.process_tweet_dict(reply_tweet_dict)
-        
+    def extract_bert_input(self, tweet_dict, text_pair=None):
+        """
+        Preprocess and tokenize the given tweet dict ready for the stance model.
+
+        :param tweet_dict: the tweet to process
+        :param text_pair: optional second text to pass to the tokenizer.  Omit it in the
+                target-oblivious case, or when encoding just the reply tweet.  In a target-aware
+                scenario, when encoding the original tweet (the target being replied to), pass
+                the preprocessed reply text returned by the earlier call for the reply.
+        :return: tuple of the encoded tweet, ready to pass to the model, and the preprocessed
+                text, which can be passed as text_pair when encoding the next tweet in the chain.
+        """
+        text = self.process_tweet_dict(tweet_dict)
+
         # input of target-oblivious model
-        encoded_reply = self.tokenizer(text=r_text, add_special_tokens=True, truncation=True, padding='max_length', max_length = 128, return_tensors="pt")
-        return encoded_reply
+        encoded = self.tokenizer(text=text, text_pair=text_pair, add_special_tokens=True, truncation=True, padding='max_length', max_length = 128, return_tensors="pt")
+        return encoded, text
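Review note: `Features.__init__` wraps the tokenizer in `functools.lru_cache` so that an identical call is served from the cache. The sketch below shows that mechanism with a stand-in callable instead of a real tokenizer. `CountingTokenizer` is a hypothetical class invented for this illustration, and its return value is a plain tuple rather than the encoded output a real tokenizer would produce.

```python
import functools


class CountingTokenizer:
    """Hypothetical stand-in for a tokenizer: counts how often it really runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, text, text_pair=None, max_length=128):
        self.calls += 1
        return (text, text_pair, max_length)


raw = CountingTokenizer()
# lru_cache accepts any callable; every argument must be hashable.
cached = functools.lru_cache(maxsize=10)(raw)

first = cached(text="@user thanks", text_pair=None, max_length=128)
second = cached(text="@user thanks", text_pair=None, max_length=128)  # cache hit
third = cached(text="@user thanks", text_pair="other text", max_length=128)  # new key

print(raw.calls)        # the underlying callable ran only for the two distinct keys
print(first is second)  # a cache hit returns the very same object
```

Two consequences seem worth keeping in mind: a cache hit hands back the same object as the first call, so callers should treat the result as read-only, and every argument must be hashable, which holds for the strings and scalar keyword arguments used in `extract_bert_input`.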
@@ -1,32 +1,70 @@
 import os
 import json
 import sys
-from joblib import load
 import numpy as np
 from .features.extract_features import Features
-from .util import Util, path_from_root
 from .testing import test
 from transformers import AutoModelForSequenceClassification,AutoTokenizer
 
 
 
-class StanceClassifier():
+class StanceClassifier:
 
-    def __init__(self):
-        
-        RESOURCES_PATH = path_from_root("resources.txt")
+    def __init__(self, model="GateNLP/stance-twitter-xlm-target-oblivious", feature_extractor=None):
+        if not feature_extractor:
+            # Create a plain Features() instance loading its tokenizer from
+            # the same place as the model
+            feature_extractor = Features(model)
 
-        print("Loading resources")
-        util = Util()
-        self.resources = util.loadResources(RESOURCES_PATH)
-        self.feature_extractor = Features(path_from_root(self.resources["tokenizer"])) 
-        self.model = AutoModelForSequenceClassification.from_pretrained(path_from_root(self.resources["model"]), num_labels=4)
+        self.feature_extractor = feature_extractor
+        self.model = AutoModelForSequenceClassification.from_pretrained(model, num_labels=4)
 
 
     def classify(self, reply): 	
 
-        encoded_reply = self.feature_extractor.extract_bert_input(reply)
+        encoded_reply, _ = self.feature_extractor.extract_bert_input(reply)
         #print("stanceclassifier.classify....................", encoded_reply, encoded_source_reply)
         stance_class, stance_prob = test.predict_bertweet(encoded_reply, self.model)
 
-        return stance_class, stance_prob
+        return stance_class, stance_prob
+
+
+class StanceClassifierWithTarget(StanceClassifier):
+
+    def __init__(self, model="GateNLP/stance-bertweet-target-aware", feature_extractor=None):
+        if not feature_extractor:
+            feature_extractor = Features(model, tokenizer_kwargs={"use_fast": False}, demojize=True)
+
+        super().__init__(model, feature_extractor)
+
+    def classify_with_target(self, reply, target):
+        encoded_reply, reply_text = self.feature_extractor.extract_bert_input(reply)
+        encoded_reply_and_target, _ = self.feature_extractor.extract_bert_input(target, reply_text)
+
+        stance_class, stance_prob = test.predict_bertweet(encoded_reply_and_target, self.model)
+
+        return stance_class, stance_prob
+
+
+class StanceClassifierEnsemble:
+    """
+    Ensemble classifier that runs a target-oblivious and a target-aware model against
+    the same pair of posts and returns whichever prediction is more confident.
+    """
+
+    def __init__(self, to_model="GateNLP/stance-bertweet-target-oblivious", ta_model="GateNLP/stance-bertweet-target-aware", feature_extractor=None):
+        self.ta_classifier = StanceClassifierWithTarget(ta_model, feature_extractor)
+        # Use the same feature extractor for both classifiers, whether that's the supplied one
+        # or the one that was auto-created by the ta_classifier
+        self.to_classifier = StanceClassifier(to_model, self.ta_classifier.feature_extractor)
+
+
+    def classify_with_target(self, reply, target):
+        # run both the target oblivious and the target aware model, and return whichever gives
+        # the higher score
+        stance_class_to, stance_prob_to = self.to_classifier.classify(reply)
+        stance_class_ta, stance_prob_ta = self.ta_classifier.classify_with_target(reply, target)
+        if stance_prob_to[stance_class_to] > stance_prob_ta[stance_class_ta]:
+            return stance_class_to, stance_prob_to
+        else:
+            return stance_class_ta, stance_prob_ta
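Review note: the selection rule in `StanceClassifierEnsemble.classify_with_target` can be exercised on its own, without loading either model. The sketch below assumes each classifier result is a `(class_index, probabilities)` pair, as the indexing `stance_prob_to[stance_class_to]` in the diff suggests. `pick_more_confident` is a hypothetical name, and the probability values are made up for illustration.

```python
def pick_more_confident(to_result, ta_result):
    """Each result is assumed to be (class_index, probabilities)."""
    to_class, to_probs = to_result
    ta_class, ta_probs = ta_result
    # Strict '>' mirrors the diff: on a tie the target-aware result wins.
    if to_probs[to_class] > ta_probs[ta_class]:
        return to_result
    return ta_result


# Made-up outputs: target-oblivious says "comment" (3), target-aware says "deny" (1).
to_result = (3, [0.10, 0.05, 0.15, 0.70])
ta_result = (1, [0.05, 0.80, 0.05, 0.10])
print(pick_more_confident(to_result, ta_result))
```

Because the comparison is strict, equal confidence falls through to the target-aware prediction.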
@@ -1,5 +1,4 @@
 import numpy as np
-from transformers import AutoModelForSequenceClassification
 import torch
 from scipy.special import softmax
 
@@ -23,9 +22,9 @@ def predict_bertweet(encoded_reply, model):
     return stance_prob, stance_prediction
 
 def process_model_output(output_): 
-    # input: logits of output_TO and output_TA;
+    # input: logits of output_
 
-    id2label = {0:"support", 1:"deny", 2:"query", 3:"comment"}
+    #id2label = {0:"support", 1:"deny", 2:"query", 3:"comment"}
     output_ = softmax(output_) # transform logits
     ranking_ = np.argsort(output_)[::-1] # rank
     #return output_[ranking_[0]], id2label[ranking_[0]]
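Review note: `process_model_output` turns logits into probabilities with a softmax and then ranks the classes from most to least likely. The sketch below reproduces those two steps using only the standard library, in place of the `scipy.special.softmax` and `np.argsort` calls in the diff. `rank_classes` is a hypothetical name and the logits are made up.

```python
import math


def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_classes(logits):
    # Returns the probabilities and the class indices sorted by descending probability.
    probs = softmax(logits)
    ranking = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return probs, ranking


# Index order follows the comment in the diff: 0 support, 1 deny, 2 query, 3 comment.
probs, ranking = rank_classes([1.0, 3.0, 0.5, 2.0])
print(ranking)  # [1, 3, 0, 2]
```

The first entry of `ranking` is the predicted class index, and `probs` indexed by that entry gives its confidence.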