Fix character class range matching #570

hamishknight · 2022-07-12T19:15:50Z

Previously we performed a lexicographic comparison with the bounds of a character class range. However, this produced surprising results, and our implementation didn't properly handle case sensitivity.
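As a rough illustration of the kind of surprise a lexicographic comparison can produce (an assumed example, not one taken from the linked issues), a plain `ClosedRange<Character>` orders graphemes like strings, so a multi-scalar grapheme that merely starts with a letter can land inside `a...z`:

```swift
// Illustrative only: this models lexicographic range checks with a plain
// ClosedRange<Character>, not with the regex engine itself.
let letters: ClosedRange<Character> = "a"..."z"

// "a" followed by U+20DD COMBINING ENCLOSING CIRCLE is a single Character
// made of two scalars.
let circledA: Character = "a\u{20DD}"

// Character ordering is lexicographic over the normalized scalars, so this
// grapheme sorts after "a" and before "z" and is reported as inside the range.
let surprising = letters.contains(circledA)
print(surprising)
```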

Update the logic to instead only allow single scalar NFC bounds. The input is then converted to NFC in grapheme semantic mode, and checked against the range. In scalar semantic mode, the input scalar is checked on its own. Additionally, fix the case sensitivity handling such that we check both the lowercase and uppercase versions of the input against the range.
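A minimal sketch of the updated matching logic, under some assumptions: `singleNFCScalar`, `matches`, and `matchesScalar` are hypothetical names, Foundation's `precomposedStringWithCanonicalMapping` stands in for whatever normalization the engine actually uses, and inputs whose NFC form is more than one scalar are assumed never to match.

```swift
import Foundation

/// Hypothetical helper: the single NFC scalar for `string`, or nil if its NFC
/// form has more than one scalar. Using Foundation here is an assumption.
func singleNFCScalar(_ string: String) -> Unicode.Scalar? {
  let scalars = string.precomposedStringWithCanonicalMapping.unicodeScalars
  return scalars.count == 1 ? scalars.first : nil
}

/// Grapheme semantics: normalize the input to NFC, then compare its single
/// scalar against the range. With `caseInsensitive`, the lowercase and
/// uppercase versions of the input are checked as well.
func matches(
  _ input: Character,
  in range: ClosedRange<Unicode.Scalar>,
  caseInsensitive: Bool = false
) -> Bool {
  let text = String(input)
  var candidates = [text]
  if caseInsensitive {
    candidates += [text.lowercased(), text.uppercased()]
  }
  return candidates.contains { candidate in
    guard let scalar = singleNFCScalar(candidate) else { return false }
    return range.contains(scalar)
  }
}

/// Scalar semantics: the scalar is compared directly, with no normalization.
func matchesScalar(
  _ input: Unicode.Scalar,
  in range: ClosedRange<Unicode.Scalar>
) -> Bool {
  range.contains(input)
}

let asciiLower: ClosedRange<Unicode.Scalar> = "a"..."z"
let accented: ClosedRange<Unicode.Scalar> = "\u{E0}"..."\u{FF}"
```

With these definitions, `matches("e\u{301}", in: accented)` succeeds because the decomposed input normalizes to the single scalar U+00E9, while `matches("E", in: asciiLower)` only succeeds when `caseInsensitive` is set.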

Resolves #401
Resolves #395
rdar://96898279

hamishknight · 2022-07-12T19:24:59Z

Hrm, I thought the macOS CI would be new enough by now

hamishknight · 2022-07-13T10:24:41Z

So it turns out the macOS CI does have a new enough toolchain; it's just that when testing through `swift test` with a development toolchain, the OS stdlib is used. This differs from testing within Xcode, where it seems dyld is told to use the toolchain stdlib instead. It would be nice to figure out a way to get the macOS CI using the toolchain stdlib, but for now let's guard certain tests against older stdlibs.
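One way such a guard could look is sketched below. The names `hasNewerStdlib`, `TestExpectation`, and `expectation` are hypothetical, and the OS versions in the availability check are an assumption about which releases ship a new enough stdlib; the PR's actual check may differ.

```swift
/// Hypothetical check: true when the stdlib the tests run against is assumed
/// to be new enough. On non-Apple platforms the `*` clause makes this true,
/// since the toolchain's own stdlib is used there.
func hasNewerStdlib() -> Bool {
  if #available(macOS 13, iOS 16, tvOS 16, watchOS 9, *) {
    return true
  }
  return false
}

enum TestExpectation {
  case pass
  case expectedFailure
}

/// Tests that depend on newer stdlib behaviour are expected to fail, rather
/// than being skipped, when only an older stdlib is available.
func expectation(
  requiresNewerStdlib: Bool,
  stdlibIsNew: Bool = hasNewerStdlib()
) -> TestExpectation {
  (requiresNewerStdlib && !stdlibIsNew) ? .expectedFailure : .pass
}
```

A runtime check like this, unlike `#if os(Linux)`, also covers the case where `swift test` on macOS picks up the OS stdlib.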

hamishknight · 2022-07-14T16:56:34Z

Splitting off the _CharacterClassModel work into #578

milseman · 2022-07-14T17:23:49Z

We talked about this, and we want to emit the following errors. The workaround for 1-2 is to use a scalar escape, which is clearer and more explicit anyway and doesn't have the same bug potential (especially since copy-paste might normalize literals to NFC, etc.).

1. Parse-time: error for any non-NFC literal content (using old stdlib SPI)
2. Parse-time: error for any literal multi-scalar custom character class range bound
3. Run-time compilation: error for any (even escaped) multi-scalar custom character class range bound
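The single-scalar and NFC checks described in this comment could be sketched as a small validation helper. This is only an illustration: `RangeBoundError`, `isNFC`, and `validateRangeBound` are hypothetical names, and Foundation's `precomposedStringWithCanonicalMapping` is used as a stand-in for the stdlib SPI mentioned above.

```swift
import Foundation

enum RangeBoundError: Error, Equatable {
  case notSingleScalar
  case notNFC
}

/// True when `string` is already in NFC. The scalars are compared directly,
/// because `String` equality treats canonically equivalent strings as equal.
func isNFC(_ string: String) -> Bool {
  let normalized = string.precomposedStringWithCanonicalMapping
  return Array(string.unicodeScalars) == Array(normalized.unicodeScalars)
}

/// Validates a custom character class range bound, returning its scalar.
func validateRangeBound(_ bound: String) throws -> Unicode.Scalar {
  let scalars = Array(bound.unicodeScalars)
  guard scalars.count == 1, let scalar = scalars.first else {
    throw RangeBoundError.notSingleScalar
  }
  guard isNFC(bound) else {
    throw RangeBoundError.notNFC
  }
  return scalar
}
```

For example, a decomposed `"e\u{301}"` bound is rejected as multi-scalar, and U+212B ANGSTROM SIGN is rejected as non-NFC because its NFC form is U+00C5.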


hamishknight · 2022-07-19T10:42:35Z

@swift-ci please test

Azoy

This looks great, thank you!

hamishknight · 2022-07-25T17:24:47Z

@swift-ci please test

hamishknight changed the title ~~Rip out unused _CharacterClassModel API~~ Fix character class range matching Jul 12, 2022

hamishknight requested review from milseman and Azoy July 12, 2022 19:16

This was referenced Jul 12, 2022

Limit custom character class ranges to single scalars #422

Open

[\n] should not match \r\n #568

Closed

hamishknight force-pushed the is-this-nfc branch 2 times, most recently from 00fdbb5 to 4a58679 Compare July 13, 2022 10:24

hamishknight force-pushed the is-this-nfc branch from 4a58679 to 5c6adcd Compare July 18, 2022 18:55

hamishknight commented Jul 18, 2022

Sources/_RegexParser/Regex/Parse/LexicalAnalysis.swift (outdated; resolved)

hamishknight force-pushed the is-this-nfc branch from 5c6adcd to 087377b Compare July 18, 2022 19:09

hamishknight mentioned this pull request Jul 18, 2022

Better coalesce adjacent scalars #574

Merged

hamishknight added 3 commits July 19, 2022 11:41

Guard against testing with older stdlibs

148ccbc

Replace a couple of `#if os(Linux)` checks with a check to see if we have a newer stdlib available. This lets us emit an expected failure in the case where we're testing on an older stdlib.

Add some extra character class newline matching tests

8d64450

hamishknight force-pushed the is-this-nfc branch from 087377b to cd5cc37 Compare July 19, 2022 10:42

hamishknight mentioned this pull request Jul 19, 2022

[5.7] Character class and scalar coalescing fixes #588

Merged

Azoy approved these changes Jul 25, 2022


hamishknight merged commit b8a729c into swiftlang:main Jul 25, 2022

hamishknight deleted the is-this-nfc branch July 25, 2022 17:30
