Fix handling of empty matches in iterators, using PCRE2_NOTEMPTY_ATSTART #52

bemoody · 2025-09-28T00:01:59Z

The find_iter and captures_iter functions iterate over the distinct, non-overlapping matches within the subject string.

Previously, this required tricky logic to ensure that the iterator would make forward progress. In particular, if the previous match was an empty match at byte position J, the next search would start at position J+1.

That didn't work correctly if the regex was in UTF or UCP mode. For example, if the pattern was (?<=á) and the subject string was áá (four bytes in UTF-8, with empty matches at byte positions 2 and 4), then after the match at position 2 the next search would start at position 3, in the middle of the second á:

- In "match-invalid-UTF" mode, trying to search at position 3 would fail to find the match at position 4 (because the byte at position 3 was regarded as invalid).
- In "non-match-invalid-UTF" mode, trying to search at position 3 would give an error ("bad offset into UTF string").
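To make the byte arithmetic concrete, here is a small sketch using only the Rust standard library. It is not code from this PR, and the helper names naive_next_start and char_boundaries are made up for illustration. It shows that stepping one byte past an empty match at J = 2 lands inside the second á:

```rust
/// The old strategy: after an empty match at byte `j`, restart the search
/// one byte later. (Hypothetical helper, for illustration only.)
fn naive_next_start(j: usize) -> usize {
    j + 1
}

/// For each byte offset in 0..=len, report whether a UTF-8 character
/// may start there (the end of the string counts as a boundary).
fn char_boundaries(subject: &str) -> Vec<bool> {
    (0..=subject.len())
        .map(|i| subject.is_char_boundary(i))
        .collect()
}
```

For the subject áá, char_boundaries reports offsets 0, 2 and 4 as boundaries, while naive_next_start(2) is 3, which is not one.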

PCRE2 has a mechanism to do what we really want: search for the next match that has start >= J and does not have start == end == J. This is done by setting the PCRE2_NOTEMPTY_ATSTART flag.
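The rule "start >= J, but not start == end == J" can be modelled without PCRE2 at all. The sketch below is a toy stand-in for searching with the pattern (?<=á), not the crate's code: the function find_after_a_acute and its signature are assumptions made for illustration, and it only inspects character boundaries the way a UTF-mode search would.

```rust
const A_ACUTE: char = '\u{e1}'; // 'á', two bytes in UTF-8

/// Toy model of a search for `(?<=á)`: an empty match at every position
/// directly preceded by 'á'. `start` must be a character boundary.
/// With `notempty_atstart` set, an empty match at `start` itself is rejected.
fn find_after_a_acute(
    subject: &str,
    start: usize,
    notempty_atstart: bool,
) -> Option<(usize, usize)> {
    subject
        .char_indices()
        .map(|(i, _)| i)
        // The end of the string is also a valid match position.
        .chain(std::iter::once(subject.len()))
        .filter(|&i| i >= start)
        // Every match of this pattern is empty, so the flag simply
        // rules out the position `start`.
        .filter(|&i| !(notempty_atstart && i == start))
        .find(|&i| subject[..i].ends_with(A_ACUTE))
        .map(|i| (i, i))
}
```

With the subject áá, a search from 0 returns (2, 2); searching again from 2 with the flag set returns (4, 4) without ever touching byte position 3.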

This PR is, I think, a better solution to the problem than my original PR #36. It avoids making assumptions about what the acceptable matching positions are, by deferring to PCRE2.

The internal find_at_with_match_data function is identical to captures_read_at, except that find_at_with_match_data expects a &mut MatchDataPoolGuard whereas captures_read_at expects a &mut CaptureLocations. find_at_with_match_data really only wants a &mut MatchData. The additional work of dereferencing a &mut MatchDataPoolGuard into a &mut MatchData can just as well be done by the caller. This lets us avoid implementing the same logic twice.
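The shape of this refactoring might look roughly like the following. This is a sketch with stand-in types, not the crate's actual definitions: the struct fields, the simplified signatures, the helper find_with_guard, and the toy "search" (which just looks for the byte x) are all assumptions for illustration.

```rust
use std::ops::{Deref, DerefMut};

/// Stand-in for the match data block (the real one wraps PCRE2 state).
struct MatchData {
    last: Option<(usize, usize)>,
}

/// Stand-in for a pooled guard that dereferences to `MatchData`.
struct MatchDataPoolGuard {
    inner: MatchData,
}

impl Deref for MatchDataPoolGuard {
    type Target = MatchData;
    fn deref(&self) -> &MatchData {
        &self.inner
    }
}

impl DerefMut for MatchDataPoolGuard {
    fn deref_mut(&mut self) -> &mut MatchData {
        &mut self.inner
    }
}

/// Stand-in for `CaptureLocations`, assumed here to own a `MatchData`.
struct CaptureLocations {
    data: MatchData,
}

/// The single search routine: it only needs `&mut MatchData`.
/// Toy search: the first byte `x` at or after `start`.
fn find_at_with_match_data(
    md: &mut MatchData,
    subject: &[u8],
    start: usize,
) -> Option<(usize, usize)> {
    let found = subject
        .get(start..)?
        .iter()
        .position(|&b| b == b'x')
        .map(|i| (start + i, start + i + 1));
    md.last = found;
    found
}

/// Caller holding a pool guard: dereference it, then share the routine.
fn find_with_guard(
    guard: &mut MatchDataPoolGuard,
    subject: &[u8],
    start: usize,
) -> Option<(usize, usize)> {
    find_at_with_match_data(&mut **guard, subject, start)
}

/// Caller holding capture locations: borrow the inner data, same routine.
fn captures_read_at(
    locs: &mut CaptureLocations,
    subject: &[u8],
    start: usize,
) -> Option<(usize, usize)> {
    find_at_with_match_data(&mut locs.data, subject, start)
}
```

Both callers end up in the same function, so the search logic exists only once.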

In Matches and CaptureMatches, instead of trying to advance (after an empty match) by moving the start position one byte forward, simply tell PCRE2 to ignore any empty match at the starting position.

In non-UTF mode, this should give the same behavior as before:

- For the first iteration, any match is accepted.
- After a non-empty match, if we see another non-empty match starting at last_end, that match is returned.
- After a non-empty match, if we see an *empty* match starting at last_end, that match is ignored. (This is consistent with the behavior of the regex crate, although Perl and many other regex APIs do not behave this way.)
- After an empty match, another (empty) match at the same position is not allowed; we're forced to advance by at least one byte, after which any empty or non-empty match is accepted.

In UTF mode, this will work in the same way and will now correctly handle cases where the following character is multi-byte.
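The iteration rules can be sketched with a toy matcher in place of PCRE2. Everything below is an illustrative model rather than the PR's implementation: find_a_star stands in for a UTF-mode search with the pattern a*, and find_all is a hypothetical driver that searches from last_end and sets the flag on every iteration except the first.

```rust
/// Toy stand-in for a UTF-mode search with the pattern `a*`.
/// Returns the first match starting at or after `start` (which must be a
/// character boundary). With `notempty_atstart` set, an empty match
/// located exactly at `start` is rejected.
fn find_a_star(
    subject: &str,
    start: usize,
    notempty_atstart: bool,
) -> Option<(usize, usize)> {
    let mut pos = start;
    loop {
        // Length in bytes of the run of 'a' beginning at `pos`.
        let run = subject[pos..].bytes().take_while(|&b| b == b'a').count();
        let end = pos + run;
        let rejected = notempty_atstart && pos == start && end == pos;
        if !rejected {
            return Some((pos, end));
        }
        // Advance by one whole character, never into the middle of one;
        // at the end of the subject there is nothing left to try.
        let ch = subject[pos..].chars().next()?;
        pos += ch.len_utf8();
    }
}

/// Collect all matches: any match is accepted on the first iteration,
/// afterwards an empty match at `last_end` is ignored.
fn find_all(subject: &str) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut last_end = 0;
    let mut first = true;
    while let Some((start, end)) = find_a_star(subject, last_end, !first) {
        out.push((start, end));
        last_end = end;
        first = false;
    }
    out
}
```

For the subject baac this yields (0, 0), (1, 3) and (4, 4): the empty match at 3, directly after the non-empty match, is skipped. For éa it yields (0, 0) and (2, 3), stepping over both bytes of é after the initial empty match.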

Test that find_iter and captures_iter correctly handle a mixture of empty and non-empty matches. The iterator should skip forward by a character following an empty match, and any empty match that occurs immediately after a non-empty match should be ignored. Look-behind assertions should work correctly in UCP mode after an empty match.

Benjamin Moody added 6 commits September 26, 2025 21:54

pcre2: implement CaptureMatches using find_at_with_match_data

41d763c

Instead of using captures_read_at, we can use an equivalent call to find_at_with_match_data.

pcre2: add options to find_at_with_match_data

dfcdabb

Add a structure to allow setting "match-time" options when calling find_at_with_match_data.

pcre2: add notempty_atstart to FindOptions

2eb9128

This flag tells PCRE2 to ignore an empty match at the start position (but not a nonempty match at the start position, or an empty match later in the string), which can be useful for iterating.
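As a rough picture of what such an options structure could look like, here is a minimal sketch. Only the names FindOptions and notempty_atstart come from the commit titles above; the field layout, the builder-style setter and the bits method are assumptions, and the constant is declared locally with the value that pcre2.h gives PCRE2_NOTEMPTY_ATSTART rather than taken from any binding crate.

```rust
/// Value of PCRE2_NOTEMPTY_ATSTART as defined in pcre2.h
/// (declared locally so the sketch is self-contained).
const PCRE2_NOTEMPTY_ATSTART: u32 = 0x0000_0008;

/// Hypothetical shape of the match-time options structure.
#[derive(Clone, Copy, Debug, Default)]
struct FindOptions {
    notempty_atstart: bool,
}

impl FindOptions {
    /// Builder-style setter (an assumed API, for illustration).
    fn notempty_atstart(mut self, yes: bool) -> Self {
        self.notempty_atstart = yes;
        self
    }

    /// Translate the structure into the option bits that would be
    /// passed to the match call.
    fn bits(&self) -> u32 {
        let mut bits = 0;
        if self.notempty_atstart {
            bits |= PCRE2_NOTEMPTY_ATSTART;
        }
        bits
    }
}
```

An iterator could then pass FindOptions::default() for its first search and FindOptions::default().notempty_atstart(true) for every later one.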
