Does it make sense to use a compositional approach for social media language? #39
-
Hello lambeq dev team, I have in mind a preprocessing stage where I correct spelling, remove tags and URLs, convert slang to standard English, and remove emojis or convert them to regular text. Unlike the usual NLP pipeline, however, I'm not planning to remove stopwords or perform stemming and lemmatization, because I see that you usually feed "correct" sentences as they are and then process them in the diagram rewriting stage. Could you also please suggest a suitable ansatz and its configuration for the above task? Kind regards,
Replies: 1 comment 6 replies
-
Hi @mspronesti, in general, using syntax for social media text, such as Twitter posts, is not recommended due to the special nature of the text (emojis, hashtags, abbreviations, URLs). I would recommend using one of the linear readers in lambeq, such as `cups_reader` (with swaps removed), `stairs_reader`, or even `spiders_reader`. Regarding pre-processing, do not bother to "correct the spelling": this is a separate task in itself and can't be done reliably. My advice (if you use one of the linear readers) is to include emojis as separate tokens in your vocabulary; they have a special meaning in social media and are actually very useful in determining the "meaning" of a sentence. There are not strict rul…
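
As a rough illustration of the pre-processing advice above, here is a minimal standard-library sketch that strips URLs and @-mentions, leaves spelling and slang untouched, and emits each emoji as its own token. The function name `tokenise_post` and the regular expressions are hypothetical choices for this example, not part of lambeq. Keeping hashtags as single tokens is also an assumption, and multi-codepoint emoji (skin tones, ZWJ sequences) would be split into pieces by this simple approach.

```python
import re

# Hypothetical patterns for this sketch; adjust to your own data.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# A token is a word (optionally a hashtag, optionally with internal
# apostrophes) or any other single non-space character, e.g. an emoji.
TOKEN_RE = re.compile(r"#?\w+(?:'\w+)*|\S")


def tokenise_post(text):
    """Split a social media post into tokens without correcting spelling."""
    text = URL_RE.sub(" ", text)      # drop URLs
    text = MENTION_RE.sub(" ", text)  # drop @-mentions
    # Emoji are not word characters, so each one becomes a separate token.
    return TOKEN_RE.findall(text)


tokens = tokenise_post("@bob gr8 game 2nite 😂😂 https://t.co/abc #win")
print(tokens)
```

If lambeq is installed, the resulting token list could presumably be passed to one of the linear readers with something like `spiders_reader.sentence2diagram(tokens, tokenised=True)`; check the lambeq documentation for the exact signature in your version.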