tokenizer: capture non-ASCII identifiers #3748

smola · 2017-07-31T12:19:38Z

`\w` captures only ASCII letters and numbers. Changed it to `[[:alnum:]]` to capture any Unicode letter and digit. This makes the tokenizer work properly on non-English-based languages (e.g. 1C Enterprise).
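
A minimal sketch of the difference described above, assuming a plain `String#scan` over each pattern. The constant and method names are hypothetical and this is not the actual Linguist tokenizer code; the sample words are the ones used in the PR's test.

```ruby
# Hypothetical patterns for illustration; not the actual Linguist tokenizer code.
ASCII_WORD   = /\w+/           # in Ruby, \w is ASCII-only by default: [a-zA-Z0-9_]
UNICODE_WORD = /[[:alnum:]]+/  # POSIX bracket class: Unicode letters and digits

# Return every run of characters in `text` that matches `pattern`.
def extract_tokens(text, pattern)
  text.scan(pattern)
end

sample = "Функция الكون"
p extract_tokens(sample, ASCII_WORD)    # finds no tokens
p extract_tokens(sample, UNICODE_WORD)  # finds both words
```

With the `\w+` pattern the Cyrillic and Arabic identifiers are skipped entirely, while `[[:alnum:]]+` returns each of them as one token.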

Alhadis · 2017-07-31T12:33:10Z

> `\w` captures only ASCII letters and numbers.

That's dependent on the regex engine in question. Ruby uses Oniguruma, which is Unicode-aware by default, IIRC. Did you test to see if \w doesn't match word-characters in other alphabets?

smola · 2017-07-31T12:36:49Z

@Alhadis If I run the test I added without the fix, this is its output:

  1) Failure:
TestTokenizer#test_utf8_tokens [test/test_tokenizer.rb:118]:
Expected: ["Функция", "الكون"]
  Actual: []

pchaigno · 2017-07-31T12:37:58Z

@Alhadis [[:alnum:]] is safer in any case, no?

Alhadis · 2017-07-31T12:39:44Z

Possibly; it seems like Oniguruma only defaults to Unicode in Atom, but not Ruby... never mind then. :)
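
For anyone comparing the options raised in this exchange, here is a small sketch of how three Ruby character classes treat a single character. The helper name is hypothetical; `\p{Word}` is included only as a point of comparison and is not something this PR uses.

```ruby
# Hypothetical helper: report which character classes match a single character.
def matching_classes(char)
  {
    w:      char.match?(/\w/),           # ASCII letters, digits, underscore
    alnum:  char.match?(/[[:alnum:]]/),  # Unicode letters and digits, no underscore
    p_word: char.match?(/\p{Word}/)      # Unicode letters, marks, numbers, connector punctuation
  }
end

p matching_classes("a")  # matched by all three classes
p matching_classes("Ф")  # not matched by \w
p matching_classes("_")  # not matched by [[:alnum:]]
```

One thing the sketch makes visible: `[[:alnum:]]` does not match the underscore, which `\w` does, so the two classes are not interchangeable for identifiers that contain `_`.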

smola · 2017-12-02T11:44:54Z

Closing. I guess #3846 makes this obsolete.

tokenizer: capture non-ASCII identifiers

1333fc2

`\w` captures only ASCII letters and numbers. Changed it to `[[:alnum:]]` to capture any Unicode letter and digit. This makes the tokenizer work properly on non-English-based languages (e.g. 1C Enterprise).

lildude mentioned this pull request Oct 5, 2017

Replace the tokenizer with a flex-based scanner #3846

Merged

smola closed this Dec 2, 2017

github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
