Skip to content

Fix character class range matching #570

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 25, 2022

Conversation

hamishknight
Copy link
Contributor

Previously we performed a lexicographic comparison with the bounds of a character class range. However this produced surprising results, and our implementation didn't properly handle case sensitivity.

Update the logic to instead only allow single scalar NFC bounds. The input is then converted to NFC in grapheme semantic mode, and checked against the range. In scalar semantic mode, the input scalar is checked on its own. Additionally, fix the case sensitivity handling such that we check both the lowercase and uppercase version of the input against the range.

Resolves #401
Resolves #395
rdar://96898279

@hamishknight hamishknight changed the title Rip out unused _CharacterClassModel API Fix character class range matching Jul 12, 2022
@hamishknight hamishknight requested review from milseman and Azoy July 12, 2022 19:16
@hamishknight
Copy link
Contributor Author

Hrm, I thought the macOS CI would be new enough by now

@hamishknight hamishknight force-pushed the is-this-nfc branch 2 times, most recently from 00fdbb5 to 4a58679 Compare July 13, 2022 10:24
@hamishknight
Copy link
Contributor Author

So it turns out the macOS CI does have a new enough toolchain, it's just that when testing through swift test with a development toolchain, the OS stdlib is used. This differs from when testing within Xcode where it seems dyld is told to use the toolchain stdlib instead. Seems like it would be nice to figure out a way to get the macOS CI using the toolchain stdlib, but for now let's guard certain tests against older stdlibs.

@hamishknight
Copy link
Contributor Author

Splitting off the _CharacterClassModel work into #578

@milseman
Copy link
Member

We talked about this, and we want to do the following errors. Workaround for 1-2 is to use a scalar escape, which is clearer and more explicit anyways and doesn't have the bug potential (specially since copy-past might normalize to NFC on them, etc)

  1. Parse-time: error for any non-NFC literal content (using old stdlib SPI)
  2. Parse-time: error for any literal multi-scalar custom character class range bound
  3. Run-time compilation: error for any (even escaped) multi-scalar custom character class range bound

Replace a couple of `#if os(Linux)` checks with
a check to see if we have a newer stdlib
available. This lets us emit an expected failure
in the case where we're testing on an older
stdlib.
Previously we performed a lexicographic
comparison with the bounds of a character class
range. However this produced surprising results,
and our implementation didn't properly handle
case sensitivity.

Update the logic to instead only allow single
scalar NFC bounds. The input is then converted to
NFC in grapheme semantic mode, and checked against
the range. In scalar semantic mode, the input
scalar is checked on its own. Additionally, fix
the case sensitivity handling such that we check
both the lowercase and uppercase version of the
input against the range.
@hamishknight
Copy link
Contributor Author

@swift-ci please test

Copy link
Contributor

@Azoy Azoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thank you!

@hamishknight
Copy link
Contributor Author

@swift-ci please test

@hamishknight hamishknight merged commit b8a729c into swiftlang:main Jul 25, 2022
@hamishknight hamishknight deleted the is-this-nfc branch July 25, 2022 17:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Digit matching behaving as intended? Case insensitivity behavior of character class ranges differs from PCRE
3 participants