Replies: 1 comment
-
Thanks for your comment @ACE07-Sev. Whether you need tokenisation or not depends strictly on the text data you process: for example, most NLP datasets have already undergone some pre-processing and come in a pre-tokenised form. Furthermore, the nature of the data may require specialised tokenisation treatment; if your data come from Twitter, for example, the idiosyncratic nature of that medium might mean you need a special tokeniser, or perhaps have to create your own. Another example is the Japanese support PR that is open in the repo, which shows that it is not possible to fix a single tokeniser for all possible use cases of lambeq. So it is important that users have the liberty to choose a tokeniser when they need one, in whatever way makes the most sense for them. Regarding the second part of your post (specialised handling of non-ASCII letters, numbers, acronyms etc.), this is a good idea. We are thinking of adding an NLP module which will help with all these tasks.
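As a rough sketch of what explicit, user-chosen tokenisation looks like in practice (class and method names follow the lambeq docs at the time of writing and may differ between versions; any other tokeniser with the same interface could be swapped in):

```python
from lambeq import BobcatParser, SpacyTokeniser

tokeniser = SpacyTokeniser()   # or a custom tokeniser suited to your data
parser = BobcatParser()

sentence = "John doesn't walk in the park."

# Tokenise explicitly, then tell the parser the input is already tokenised.
tokens = tokeniser.tokenise_sentence(sentence)
diagram = parser.sentence2diagram(tokens, tokenised=True)
```

The point is that tokenisation stays a separate, replaceable step rather than something fixed inside the parser.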
-
Given that tokenization will be present in all use cases of NLP models, it would be convenient to have it enabled (set to True) by default, along with a SpaCy tokenizer, since that would provide a more generalized model (for example, indicating to the model that "he's" and "he is", "they're" and "they are", and "I'm" and "I am" are the same). These tokenizers are needed in all use cases, so having them built in would make using the models more efficient and enjoyable.
Furthermore, tokenizers for non-ASCII letters, numbers, and acronyms (such as idk, tbh, rn, etc.) would serve as additional tokenization features, which could be exposed as additional bool params inside the parser. A rough illustration of the contraction handling is sketched below.
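For illustration, a minimal sketch with plain spaCy (not lambeq), assuming the `en_core_web_sm` model has been downloaded:

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("I'm sure they're coming, but he's not.")
print([token.text for token in doc])
# spaCy splits contractions into separate tokens,
# e.g. "I'm" -> ["I", "'m"] and "he's" -> ["he", "'s"],
# which a later lemmatisation/normalisation step can map
# onto the same forms as "I am" / "he is".
```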