
Conversation

Contributor

@l0rinc l0rinc commented Dec 26, 2023

Continuing #76 (and as an alternative to the optimized regular expressions in #75), the `'(?:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+` regex for the cl100k parser is eliminated entirely and replaced with a custom character-by-character parser that handles each segment with optimized Unicode category detection.

The first category is contractions, followed by words, numbers, punctuation, and whitespace.

The start and end indexes are collected, and when a match is found we convert it to UTF-8 into a reusable array, which is presented to the fragmentConsumer (which then attempts to tokenize it).
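
As a rough illustration of that flow, a matched segment could be handed over like this (a hedged sketch: the class, method, and Consumer<byte[]> signature are hypothetical and not the PR's actual API, and unlike the real parser this version allocates a fresh byte[] per match instead of reusing a ByteArrayList via addUtf8Bytes):

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

final class FragmentEmitSketch {
    // Convert the matched [start, end) slice to UTF-8 and present it to the fragment consumer,
    // which will attempt to tokenize it.
    static void emit(String text, int start, int end, Consumer<byte[]> fragmentConsumer) {
        byte[] utf8 = text.substring(start, end).getBytes(StandardCharsets.UTF_8);
        fragmentConsumer.accept(utf8);
    }
}
```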

The last segment type, whitespace, is handled by first consuming the whole run and splitting it by newlines; if the last (non-newline) whitespace is followed by a non-whitespace character (e.g. "\n a"), we pop off the last space so it can start the next token.
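
To see that whitespace rule in isolation, here is a quick check against the original cl100k regex quoted above (only the pattern comes from this PR; the class and main method are just scaffolding):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceSplitCheck {
    public static void main(String[] args) {
        // The original cl100k pattern that the new character-by-character parser has to reproduce.
        Pattern cl100k = Pattern.compile(
                "'(?:[sdmt]|ll|ve|re)|[^\\r\\n\\p{L}\\p{N}]?+\\p{L}+|\\p{N}{1,3}"
                        + "| ?[^\\s\\p{L}\\p{N}]++[\\r\\n]*|\\s*[\\r\\n]|\\s+(?!\\S)|\\s+");
        Matcher matcher = cl100k.matcher("\n a");
        while (matcher.find()) {
            // Prints "\n" and then " a": the space before the non-whitespace 'a'
            // is popped off the whitespace run and starts the next piece.
            System.out.println("piece: [" + matcher.group().replace("\n", "\\n") + "]");
        }
    }
}
```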

The isLetter, isNumeric, isLetterOrNumeric, isWhitespace, isNewline, isNotWhitespaceOrLetterOrNumeric, and isNotNewlineOrLetterOrNumeric helpers are highly optimized (checking the common cases first, before falling back to the heavier calculations) to decide whether the next character belongs to the expected category.
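
A minimal sketch of that "common case first" idea (not the PR's exact code): ASCII is classified with cheap range checks before falling back to Character.getType / Character.isLetter for the rest of Unicode, mirroring the \p{L} and \p{N} classes of the original regex.

```java
final class CharCategorySketch {
    // \p{L}: ASCII letters short-circuit, everything else goes through Character.isLetter.
    static boolean isLetter(int codePoint) {
        if (codePoint < 0x80) {
            return (codePoint >= 'a' && codePoint <= 'z') || (codePoint >= 'A' && codePoint <= 'Z');
        }
        return Character.isLetter(codePoint);
    }

    // \p{N}: ASCII digits short-circuit, the rest checks the decimal/letter/other number categories.
    static boolean isNumeric(int codePoint) {
        if (codePoint < 0x80) {
            return codePoint >= '0' && codePoint <= '9';
        }
        int type = Character.getType(codePoint);
        return type == Character.DECIMAL_DIGIT_NUMBER
                || type == Character.LETTER_NUMBER
                || type == Character.OTHER_NUMBER;
    }

    static boolean isNewline(int codePoint) {
        return codePoint == '\n' || codePoint == '\r';
    }

    static boolean isLetterOrNumeric(int codePoint) {
        return isLetter(codePoint) || isNumeric(codePoint);
    }
}
```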

addUtf8Bytes had to be reimplemented, since I couldn't find another way to convert characters directly into a reusable list (which avoids creating a lot of temporary garbage).
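
For reference, encoding one code point straight into such a reusable list looks roughly like this (a sketch, assuming a ByteArrayList with an add(byte) method like the one outlined below; surrogate validation is omitted):

```java
final class Utf8Sketch {
    // Append the UTF-8 bytes of codePoint to dst without allocating a temporary String or byte[].
    static void addUtf8Bytes(int codePoint, ByteArrayList dst) {
        if (codePoint < 0x80) {
            dst.add((byte) codePoint);                           // 1 byte: ASCII
        } else if (codePoint < 0x800) {
            dst.add((byte) (0xC0 | (codePoint >> 6)));           // 2 bytes
            dst.add((byte) (0x80 | (codePoint & 0x3F)));
        } else if (codePoint < 0x10000) {
            dst.add((byte) (0xE0 | (codePoint >> 12)));          // 3 bytes
            dst.add((byte) (0x80 | ((codePoint >> 6) & 0x3F)));
            dst.add((byte) (0x80 | (codePoint & 0x3F)));
        } else {
            dst.add((byte) (0xF0 | (codePoint >> 18)));          // 4 bytes: supplementary planes
            dst.add((byte) (0x80 | ((codePoint >> 12) & 0x3F)));
            dst.add((byte) (0x80 | ((codePoint >> 6) & 0x3F)));
            dst.add((byte) (0x80 | (codePoint & 0x3F)));
        }
    }
}
```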

Lastly, to avoid boxing primitives, I've added simple growable byte[]- and int[]-backed lists for the UTF-8 bytes and the tokens themselves.
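
A minimal version of such a primitive-backed list could look like the sketch below (the PR's ByteArrayList/IntArrayList may differ in details; the point is avoiding Byte/Integer boxing and reusing the backing array):

```java
import java.util.Arrays;

// Growable list backed by a plain byte[]; clear() keeps the backing array so it can be
// reused across fragments. An IntArrayList for the resulting tokens works the same way.
final class ByteArrayList {
    private byte[] array = new byte[16];
    private int size;

    void clear() {
        size = 0;
    }

    void add(byte element) {
        if (size >= array.length) {
            array = Arrays.copyOf(array, array.length * 2);
        }
        array[size++] = element;
    }

    byte get(int index) {
        return array[index];
    }

    int size() {
        return size;
    }

    byte[] toArray() {
        return Arrays.copyOf(array, size);
    }
}
```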

Before:

Benchmark                                              (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                        data    ss   10  4.470 ± 0.086   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount              data    ss   10  3.799 ± 0.020   s/op
SingleThreadedBenchmark.benchmarkP50kBase                          data    ss   10  5.290 ± 0.067   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                          data    ss   10  5.308 ± 0.086   s/op
SingleThreadedBenchmark.benchmarkR50kBase                          data    ss   10  4.962 ± 0.096   s/op

After:

Benchmark                                              (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                        data    ss   10  2.315 ± 0.025   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount              data    ss   10  2.095 ± 0.030   s/op
SingleThreadedBenchmark.benchmarkP50kBase                          data    ss   10  4.266 ± 0.027   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                          data    ss   10  4.298 ± 0.120   s/op
SingleThreadedBenchmark.benchmarkR50kBase                          data    ss   10  3.914 ± 0.026   s/op

As usual, I recommend reviewing the PR commit-by-commit for the changes to make sense.

After the changes, both the small and large token encoders became a lot faster, so the VERY_LARGE_TOKENIZER_BYTE_THRESHOLD is now set to 500.

@l0rinc l0rinc requested a review from tox-p as a code owner December 26, 2023 20:59
@l0rinc l0rinc force-pushed the optimize-token-splitter branch 3 times, most recently from eddff1e to 18da404 on December 26, 2023 21:17
@l0rinc l0rinc changed the title Add a dedicated fragment splitter for the cl100k encoding 3) Add a dedicated fragment splitter for the cl100k encoding Dec 27, 2023
@l0rinc l0rinc changed the title 3) Add a dedicated fragment splitter for the cl100k encoding 3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2.2s Dec 27, 2023
@l0rinc l0rinc force-pushed the optimize-token-splitter branch from 66c6ca7 to 3973ffd on December 27, 2023 19:32
@l0rinc l0rinc changed the title 3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2.2s 3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2.1s Dec 27, 2023
…e piece

Since whole pieces were already checked, we don't have to try to re-encode them.

Before:
Benchmark                                              (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                        data    ss   10  2.365 ± 0.019   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount              data    ss   10  2.130 ± 0.024   s/op
SingleThreadedBenchmark.benchmarkP50kBase                          data    ss   10  4.393 ± 0.026   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                          data    ss   10  4.408 ± 0.015   s/op
SingleThreadedBenchmark.benchmarkR50kBase                          data    ss   10  4.073 ± 0.017   s/op

After:
Benchmark                                              (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                        data    ss   10  2.340 ± 0.023   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount              data    ss   10  2.096 ± 0.029   s/op
SingleThreadedBenchmark.benchmarkP50kBase                          data    ss   10  4.385 ± 0.017   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                          data    ss   10  4.372 ± 0.041   s/op
SingleThreadedBenchmark.benchmarkR50kBase                          data    ss   10  4.059 ± 0.026   s/op
@l0rinc l0rinc changed the title 3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2.1s 3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2s Jan 4, 2024
@l0rinc l0rinc changed the title 3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2s 3) Add a dedicated fragment splitter for the cl100k encoding - 3.8s to 2.1s Jan 4, 2024
Lőrinc added 2 commits January 6, 2024 20:10
…count

We also skip the last getMinRankIndex calculation when only 2 tokens remain.
@l0rinc l0rinc force-pushed the optimize-token-splitter branch from 852f440 to 0411aaa on January 8, 2024 14:04
@l0rinc l0rinc force-pushed the optimize-token-splitter branch 2 times, most recently from e3534f5 to fce667a on January 12, 2024 13:10
@l0rinc l0rinc force-pushed the optimize-token-splitter branch 2 times, most recently from b026dee to fad6ece on January 12, 2024 21:10
@l0rinc l0rinc force-pushed the optimize-token-splitter branch from fad6ece to d04d660 on January 12, 2024 21:11
Contributor

@tox-p tox-p left a comment

Looks good to me :) See my in-line comment for a question about how you want to proceed; after that we can get this merged.

fetchUnicodeData().entrySet().stream().parallel().forEach(e -> {
    var expected = Character.toString(e.getKey());
    if (isValidUTF8(expected)) {
        var dst = new ByteArrayList();
Contributor

I believe that at the point of this commit (89d6212) this class did not exist yet. I personally do not care whether this commit compiles on its own, but since you purposefully split your contributions into meaningful commits, you may prefer to have each commit in a compiling state? Let me know whether you want me to merge it this way or whether you want to re-organize your commits first

Contributor Author

Yeah, this was likely the result of moving a change back to a previous commit.
Rebasing would mess with the comments; I usually only rebase before reviews.
I'm fine with merging as-is, if you are.

@tox-p tox-p merged commit 4de1e4e into knuddelsgmbh:main Feb 4, 2024
@l0rinc l0rinc mentioned this pull request Feb 12, 2024