Closed
Description
The README.md says the following about the pre-training input:

> It is important that these be actual sentences for the "next sentence prediction" task

and in the example sample_text.txt, every line does end with either "." or ";".
The BERT paper, however, says:

> ... we sample two spans of text from the corpus, which we refer to as "sentences" even though they are typically much longer than single sentences (but can be shorter also)
So it is unclear whether this implementation expects one actual sentence per line, or whether documents can simply be broken into multiple lines arbitrarily.
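To make the two readings concrete, here is a minimal Python sketch of both ways of turning one document into lines. The helper names `split_into_sentences` and `split_into_chunks` are hypothetical and not part of this repository, the sample document is made up, and the regex is only a naive stand-in for a real sentence segmenter.

```python
import re


def split_into_sentences(document):
    """Reading 1: one (roughly) actual sentence per line.

    Naive assumption: a sentence ends at '.', '!', '?' or ';' followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?;])\s+", document.strip())
    return [p for p in parts if p]


def split_into_chunks(document, words_per_line=8):
    """Reading 2: break the document into arbitrary fixed-size spans of words."""
    words = document.split()
    return [
        " ".join(words[i:i + words_per_line])
        for i in range(0, len(words), words_per_line)
    ]


# Made-up sample document, for illustration only.
doc = (
    "BERT is pre-trained on unlabeled text. It uses two tasks; "
    "one is masked language modeling. The other is next sentence prediction."
)

sentences = split_into_sentences(doc)
chunks = split_into_chunks(doc)

print("One sentence per line:")
for line in sentences:
    print("  " + line)

print("Arbitrary spans per line:")
for line in chunks:
    print("  " + line)
```

With the first reading, every line ends in sentence-final punctuation, as in sample_text.txt. With the second, line boundaries fall in the middle of sentences, which is the case I am unsure the pre-training script handles as intended.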