Replies: 1 comment
-
Thanks for your comment @ACE07-Sev. Whether you need tokenisation or not depends strictly on the text data you process: for example, most NLP datasets have already undergone some pre-processing and come in a pre-tokenised form. Furthermore, the nature of the data may require specialised tokenisation treatment; if your data come from Twitter, for example, the idiosyncratic nature of that medium might mean you need a special tokeniser, or perhaps have to create your own. Another example is the Japanese support PR that is open in the repo, which shows that it is not possible to fix a single tokeniser for all possible use cases of lambeq. So it is important that users have the liberty to choose a tokeniser when they need one, in whatever way makes the most sense for them. Regarding the second part of your post (specialised handling of non-ASCII letters, numbers, acronyms etc.), this is a good idea. We are thinking of adding an NLP module which will help with all these tasks.
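As a rough sketch of what explicit, user-chosen tokenisation looks like in practice (class and method names follow the lambeq docs at the time of writing and may differ between versions; any other tokeniser with the same interface could be swapped in):

```python
from lambeq import BobcatParser, SpacyTokeniser

tokeniser = SpacyTokeniser()   # or a custom tokeniser suited to your data
parser = BobcatParser()

sentence = "John doesn't walk in the park."

# Tokenise explicitly, then tell the parser the input is already tokenised.
tokens = tokeniser.tokenise_sentence(sentence)
diagram = parser.sentence2diagram(tokens, tokenised=True)
```

The point is that tokenisation stays a separate, replaceable step rather than something fixed inside the parser.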
-
Given that tokenization will be present in all use cases of NLP models, it would be convenient to have it enabled (set to True) by default, along with a SpaCy tokenizer, since that would provide a more generalized model (for example, indicating to the model that "he's" and "he is", "they're" and "they are", and "I'm" and "I am" are the same). These tokenizers are needed in all use cases, so having them built in would make using the models more efficient and enjoyable.
Furthermore, tokenizers for non-ASCII letters, numbers, and acronyms (such as idk, tbh, rn, etc.) would serve as additional tokenization features, which could be exposed as additional bool params inside the parser. A rough illustration of the contraction handling is sketched below.
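For illustration, a minimal sketch with plain spaCy (not lambeq), assuming the `en_core_web_sm` model has been downloaded:

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("I'm sure they're coming, but he's not.")
print([token.text for token in doc])
# spaCy splits contractions into separate tokens,
# e.g. "I'm" -> ["I", "'m"] and "he's" -> ["he", "'s"],
# which a later lemmatisation/normalisation step can map
# onto the same forms as "I am" / "he is".
```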