Skip to content

Conversation

@smola
Copy link
Contributor

@smola smola commented Jul 31, 2017

\w captures only ASCII letters and numbers. Changed to [[:alnum:]]
to capture any Unicode letter and digit. This makes the tokenizer
work properly on non-English based langiages (e.g. 1C Enterprise).

`\w` captures only ASCII letters and numbers. Changed to [[:alnum:]]
to capture any Unicode letter and digit. This makes the tokenizer
work properly on non-English based langiages (e.g. 1C Enterprise).
@Alhadis
Copy link
Collaborator

Alhadis commented Jul 31, 2017

\w captures only ASCII letters and numbers.

That's dependent on the regex engine in question. Ruby uses Oniguruma, which is Unicode-aware by default, IIRC. Did you test to see if \w doesn't match word-characters in other alphabets?

@smola
Copy link
Contributor Author

smola commented Jul 31, 2017

@Alhadis If I run the test I added without the fix, this is its output:

  1) Failure:
TestTokenizer#test_utf8_tokens [test/test_tokenizer.rb:118]:
Expected: ["Функция", "الكون"]
  Actual: []

@pchaigno
Copy link
Contributor

@Alhadis [[:alnum:]] is safer in any case, no?

@Alhadis
Copy link
Collaborator

Alhadis commented Jul 31, 2017

Possibly; it seems like Oniguruma only defaults to Unicode in Atom, but not Ruby... never mind then. :)

@smola
Copy link
Contributor Author

smola commented Dec 2, 2017

Closing. I guess #3846 makes this obsolete.

@smola smola closed this Dec 2, 2017
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants