Does it make sense to use a compositional approach for social media language ? #39

mspronesti · 2022-08-08T08:00:30Z

mspronesti
Aug 8, 2022

Hello lambeq dev team,
in the past few days I saw a couple of quantum hackathons where the organizers proposed the usual Twitter sentiment analysis task, in a quantum fashion, using lambeq. Considering that all the examples I have seen for this library use datasets of short, grammatically correct sentences - and also considering that state-of-the-art NLP models don't yield great results unless they are pre-trained on tweets and carefully fine-tuned - I was wondering whether a compositional approach to this kind of task makes sense to you.

I have in mind a preprocessing stage where I try to correct the spelling, remove tags and URLs, convert slang to regular English, and remove emojis or convert them to regular text. Unlike the usual NLP pipeline, however, I'm not planning to remove stopwords or perform stemming and lemmatization, because I see you usually feed "correct" sentences as they are and then process them in the diagram rewriting stage.

Could you also please suggest a proper ansatz and its configuration for the above task ?

Kind regards,
Massimiliano

Answered by dimkart

Aug 11, 2022


dimkart · 2022-08-11T11:30:31Z

dimkart
Aug 11, 2022
Maintainer

Hi @mspronesti, in general, using syntax on social media text, such as Twitter, is not recommended due to the special nature of the text (emojis, hashtags, abbreviations, URLs). I would recommend using one of the linear readers in lambeq, such as cups_reader (with swaps removed), stairs_reader, or even spiders_reader. Regarding pre-processing, do not bother to "correct the spelling" -- this is a separate task in itself and can't be done reliably. My advice (if you use one of the linear readers) is to include emojis as separate tokens in your vocabulary -- they have a special meaning in social media and are actually very useful in determining the "meaning" of a sentence. There are no strict rules, though: working out exactly what to keep and what to remove might require some trial and error depending on your dataset -- it's more an art than a science :)
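
As a minimal sketch of the pre-processing advice above, the following tokenizer leaves spelling untouched and keeps each emoji as its own token. It uses only the Python standard library; the function name `tokenize_tweet`, the emoji code-point ranges, and the decision to drop URLs and @mentions are assumptions made for illustration, not part of lambeq or of the answer.

```python
import re

# Rough emoji ranges: an assumption for this sketch, not a complete
# definition of what Unicode considers an emoji.
EMOJI = r"[\U0001F300-\U0001FAFF\u2600-\u27BF]"
URL = r"https?://\S+"
MENTION = r"@\w+"

# A token is either a single emoji or a word (optionally a hashtag),
# with internal apostrophes kept so that "Couldn't" stays in one piece.
TOKEN = re.compile(rf"{EMOJI}|#?\w+(?:'\w+)*")


def tokenize_tweet(text):
    """Drop URLs and @mentions, then split into words and single emojis."""
    text = re.sub(rf"{URL}|{MENTION}", " ", text)
    return TOKEN.findall(text)


tokens = tokenize_tweet(
    "Couldn't sleep last night 😞 I'm exhausted https://example.com"
)
print(tokens)
```

Which elements to drop (hashtags, mentions, URLs) is exactly the kind of choice that may need trial and error on a given dataset, so the regular expressions are meant to be adjusted.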

6 replies

dimkart Aug 12, 2022
Maintainer

Changed the results in a good or a bad way? I suggested linear readers (such as stairs readers, cups reader and spiders reader) since they don't require syntax, which might be easier for social media text. These readers just connect the words in a sequence, like an RNN.

Re including emoticons, this will only work using a linear reader: For each emoticon, add a separate token in your vocabulary, and then treat each emoticon as another word. E.g. imagine the sentence:

Couldn't sleep last night 😞 I'm exhausted

Using a linear reader your "sentence" could have the following form:

   <S>    Couldn't  sleep   last    night  <EMOJ1>  I'm    exhausted
    o--------o--------o------o-------o--------o------o--------o------

Hope this helps.

mspronesti Aug 12, 2022
Author

Hello @dimkart, thanks for your usual fast and helpful answers!

> Changed the results in a good or a bad way?

It changed the results in a good way. However, I'm wondering: how does the compositional approach impact this problem if syntax is no longer a requirement? I thought the strength of this approach lay in the "understanding" of the grammatical meaning of a sentence.

> Re including emoticons, this will only work using a linear reader: For each emoticon, add a separate token in your vocabulary, and then treat each emoticon as another word

So you're suggesting, in a sense, associating each emoji and emoticon with a word?
For instance, I might have a dictionary of emojis which maps the one you used in your example to the word "tired", this one 😄 to the word "happy", and so on, as libraries like emot do. Is this what you mean?

dimkart Aug 15, 2022
Maintainer

Regarding your first question, not all tasks have the same requirements. For example, for the "meaning classification" task we presented in the "QNLP in practice" paper, a simple examination of the words in isolation is sufficient -- e.g. if the sentence contains one or more words from Set 1, it is related to cooking; if it contains a word from Set 2, it is related to IT. This is what a bag-of-words model, such as the spiders reader, does. There is still a compositional relationship there, albeit a really loose one. The order of the words and other syntactic features are simply not necessary, and a model that is more complicated than the task requires might also be more difficult to train properly and could need more data. So I'm guessing something similar happens in your case (although I do not know the details). Syntax can be useful in more complicated tasks, though (such as the RP task from the same paper).
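
The set-membership idea described above can be sketched in a few lines of plain Python. The two word sets and the function name `classify` are hypothetical and chosen only for illustration; they are not the vocabulary of the paper's dataset. The point of the sketch is that word order plays no role in the decision.

```python
# Hypothetical keyword sets, for illustration only.
COOKING_WORDS = {"cook", "bake", "meal", "sauce", "dinner"}
IT_WORDS = {"software", "program", "code", "application", "debug"}


def classify(sentence):
    """Label a sentence by looking at its words in isolation (bag of words)."""
    words = set(sentence.lower().split())
    if words & COOKING_WORDS:
        return "cooking"
    if words & IT_WORDS:
        return "IT"
    return "unknown"


# Shuffling the words does not change the label, because only
# membership in a set matters, not position in the sentence.
print(classify("skillful man prepares sauce"))
print(classify("sauce prepares man skillful"))
print(classify("woman writes useful software"))
```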

Regarding your second question, you don't need to do anything that fancy: just keep a running index and increase it every time you meet a new emoticon, e.g. by keeping a dictionary from emoticons to indexed labels. Then use these labels (EMOJ01, EMOJ02, etc.) as additional words in the vocabulary. Or directly use the str representation of the Unicode code point as the label. Hope this makes sense.
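
A minimal sketch of the running-index scheme described above, using only the Python standard library. The helper names (`is_emoji`, `label_emojis`, `codepoint_label`) and the code-point ranges used to detect emojis are assumptions made for this example; the input is assumed to be already split into tokens with each emoji on its own.

```python
def is_emoji(token):
    """Rough single-character emoji check (assumed ranges, not exhaustive)."""
    if len(token) != 1:
        return False
    cp = ord(token)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


def label_emojis(tokens, table=None):
    """Replace each emoji with an indexed label such as EMOJ01.

    `table` maps emoji -> label and grows whenever a new emoji is met,
    so passing the same table again keeps labels consistent across sentences.
    """
    table = {} if table is None else table
    labelled = []
    for tok in tokens:
        if is_emoji(tok):
            if tok not in table:
                table[tok] = f"EMOJ{len(table) + 1:02d}"
            labelled.append(table[tok])
        else:
            labelled.append(tok)
    return labelled, table


def codepoint_label(emoji):
    """Alternative: use the Unicode code point itself as the label."""
    return f"U+{ord(emoji):04X}"


tokens = ["Couldn't", "sleep", "last", "night", "😞", "I'm", "exhausted"]
labelled, table = label_emojis(tokens)
print(labelled)
print(codepoint_label("😞"))
```

Reusing the returned `table` for every sentence in the dataset gives each distinct emoji one stable vocabulary entry, which is all a linear reader needs in order to treat it as another word.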

mspronesti Aug 16, 2022
Author

Thanks a lot for your answer @dimkart. I only have one final question: if we apply a compositional approach to this problem but essentially discard the grammatical relations between words, what is the advantage with respect to the classical bag-of-words?
Is it that these models (even the classical ones you developed) are less data-hungry and less complex, while the "non-compositional" classical models are rather complex and usually require huge amounts of data?

dimkart Aug 19, 2022
Maintainer

Well, computation on a quantum machine is inherently different from classical computation. It is probabilistic, and it proceeds through properties such as entanglement and interference, which are not present in the classical case. The reduction in space complexity you mentioned is also something that comes for free in QC. Admittedly, though, it's still too early to talk about a clear "quantum advantage", and I would advise not paying too much attention to such claims when you see them.

Something else I wanted to mention is that you seem to believe that "compositionality" is only a property of quantum models. Although it's true that compositionality manifests itself very strongly in QC, it's also present (to different degrees) in almost every classical NLP model -- for example, all deep learning models are compositional, although not exactly in the same way.
