Conversation

@irinakhismatullina (Contributor) commented on Oct 29, 2019

Encoding works exactly the same way as in YouTokenToMe.

  • Input is space-separated words; no normalization is performed.
  • Each word is tokenized separately, and a special BOW (begin-of-word, or space) token is prepended to each word.
  • Tokenization proceeds left-first (aaa -> aa a, not a aa); see the sketch after this list.
  • Both the numerical encoding (token ids) and the string tokens are returned at once. This may be changed if needed.
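
For reference, here is a minimal Python sketch of the rules above. It is not the PR's actual code; `merges` (pair -> merge rank), `vocab` (token -> id), and the `BOW` marker are hypothetical stand-ins for a trained BPE model.

```python
BOW = "\u2581"  # assumed begin-of-word (space) marker, as in SentencePiece-style BPE

def encode_word(word, merges, vocab):
    """Tokenize a single word; the BOW token is prepended first."""
    symbols = [BOW] + list(word)
    while True:
        # Pick the adjacent pair with the lowest merge rank; the strict "<"
        # keeps the leftmost pair on ties, giving the left-first order
        # described above ("aaa" -> "aa a", not "a aa").
        best_i, best_rank = None, None
        for i in range(len(symbols) - 1):
            rank = merges.get((symbols[i], symbols[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_i, best_rank = i, rank
        if best_i is None:
            break
        symbols[best_i:best_i + 2] = [symbols[best_i] + symbols[best_i + 1]]
    return [vocab[s] for s in symbols], symbols

def encode(text, merges, vocab):
    """Split on spaces (no normalization) and tokenize each word separately,
    returning both the numeric ids and the string tokens at once."""
    ids, tokens = [], []
    for word in text.split(" "):
        word_ids, word_tokens = encode_word(word, merges, vocab)
        ids += word_ids
        tokens += word_tokens
    return ids, tokens

# Hypothetical toy model: one merge rule ("a", "a") and a tiny vocab.
merges = {("a", "a"): 0}
vocab = {BOW: 0, "a": 1, "aa": 2}
print(encode("aaa", merges, vocab))
# -> ([0, 2, 1], ['▁', 'aa', 'a'])  # left-first: "aaa" -> "aa a"
```

The toy example shows the left-first rule in action: the leftmost "a a" pair is merged first, so "aaa" comes out as "aa a" rather than "a aa".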

Closes #4 (Implement encoding text with BPE model)

@vmarkovtsev (Collaborator) commented
Good job @irinakhismatullina! I like where we are heading.

@vmarkovtsev merged commit 2feb9cc into src-d:master on Oct 30, 2019