@bemoody bemoody commented Sep 28, 2025

The find_iter and captures_iter functions iterate over the distinct, non-overlapping matches within the subject string.

Previously, this required tricky logic to ensure that the iterator would make forward progress. In particular, if the previous match was an empty match at byte position J, the next search would start at position J+1.

That didn't work correctly if the regex was in UTF or UCP mode. For example, if the pattern was (?<=á) and the subject string was áá:

  • In "match-invalid-UTF" mode, trying to search at position 3 would fail to find the match at position 4 (because the byte at position 3 was regarded as invalid).

  • In "non-match-invalid-UTF" mode, trying to search at position 3 would give an error ("bad offset into UTF string").
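The offsets in this example can be checked with the standard library alone. The following is a minimal sketch, not code from this PR: the helper names `char_boundaries` and `naive_next_start` are hypothetical, and the subject is assumed to use the precomposed character U+00E1, which occupies two bytes in UTF-8.

```rust
/// Byte offsets at which a search may legally start in a UTF-8 subject.
/// (Hypothetical helper, for illustration only.)
fn char_boundaries(subject: &str) -> Vec<usize> {
    (0..=subject.len())
        .filter(|&i| subject.is_char_boundary(i))
        .collect()
}

/// The old "advance one byte" rule: after an empty match at byte
/// position `j`, the next search starts at `j + 1`.
/// (Hypothetical helper, for illustration only.)
fn naive_next_start(j: usize) -> usize {
    j + 1
}
```

For the subject `"\u{e1}\u{e1}"`, the character boundaries are 0, 2 and 4. After the empty match at position 2, the one-byte rule yields position 3, which falls in the middle of the second character.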

PCRE2 has a mechanism to do what we really want: search for the next match that has start >= J and does not have start == end == J. This is done by setting the PCRE2_NOTEMPTY_ATSTART flag.
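To make the intended semantics concrete, here is a toy model of a single search, written without PCRE2. It is only a sketch under stated assumptions: `search_after_a_acute` is a hypothetical stand-in for searching with the pattern `(?<=á)`, whose matches are all empty, and the `notempty_atstart` parameter imitates the effect of the flag rather than calling the real library.

```rust
/// Toy stand-in for one search with the pattern `(?<=á)`.
/// Every match of this pattern is empty, so a match is described by a
/// single byte position `p` with start == end == p.
///
/// Returns the first position `p >= start` that lies on a character
/// boundary and is immediately preceded by U+00E1. When
/// `notempty_atstart` is true, a match at `p == start` is rejected,
/// imitating the effect described for PCRE2_NOTEMPTY_ATSTART.
fn search_after_a_acute(subject: &str, start: usize, notempty_atstart: bool) -> Option<usize> {
    (start..=subject.len())
        // Only character boundaries are candidate match positions.
        .filter(|&p| subject.is_char_boundary(p))
        // Skip an (empty) match at the start position if requested.
        .filter(|&p| !(notempty_atstart && p == start))
        // The look-behind: the text before `p` must end with U+00E1.
        .find(|&p| subject[..p].ends_with('\u{e1}'))
}
```

With the subject `"\u{e1}\u{e1}"`, a plain search from 0 finds the empty match at 2. Searching again from 2 with the flag set finds the match at 4, without the caller ever having to compute a start position inside a character.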

This PR is, I think, a better solution to the problem than my original PR #36. It avoids making assumptions about what the acceptable matching positions are, by deferring to PCRE2.

Benjamin Moody added 6 commits September 26, 2025 21:54
The internal find_at_with_match_data function is identical to
captures_read_at, except that find_at_with_match_data expects a
&mut MatchDataPoolGuard whereas captures_read_at expects a
&mut CaptureLocations.

find_at_with_match_data really only wants a &mut MatchData.  The
additional work of dereferencing a &mut MatchDataPoolGuard into a
&mut MatchData can just as well be done by the caller.  This lets us
avoid implementing the same logic twice.

Instead of using captures_read_at, we can use an equivalent call to
find_at_with_match_data.

Add a structure to allow setting "match-time" options when calling
find_at_with_match_data.

The PCRE2_NOTEMPTY_ATSTART flag tells PCRE2 to ignore an empty match at the start
position (but not a nonempty match at the start position, or an empty
match later in the string), which can be useful for iterating.

In Matches and CaptureMatches, instead of trying to advance (after an
empty match) by moving the start position one byte forward, simply
tell PCRE2 to ignore any empty match at the starting position.

In non-UTF mode, this should give the same behavior as before:

- For the first iteration, any match is accepted.

- After a non-empty match, if we see another non-empty match starting
  at last_end, that match is returned.

- After a non-empty match, if we see an *empty* match starting at
  last_end, that match is ignored.  (This is consistent with the
  behavior of the regex crate, although Perl and many other regex APIs
  do not behave this way.)

- After an empty match, another (empty) match at the same position is
  not allowed; we're forced to advance by at least one byte, after
  which any empty or non-empty match is accepted.
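The rules above can be traced with a small, library-free model. This is a sketch only, not the code in this PR: `search_a_star` is a hypothetical stand-in for one non-UTF search with the pattern `a*`, and `find_all` assumes that the "ignore an empty match at the start position" option is set on every search after the first, which is an inference from the rules listed here rather than something the commit message states.

```rust
/// Toy stand-in for one search with the pattern `a*` in non-UTF mode.
/// Returns the leftmost match `(start, end)` beginning at or after
/// `start`. When `notempty_atstart` is true, an empty match located
/// exactly at `start` is rejected and the search moves on.
fn search_a_star(subject: &[u8], start: usize, notempty_atstart: bool) -> Option<(usize, usize)> {
    for s in start..=subject.len() {
        // `a*` is greedy: take the longest run of b'a' beginning at `s`.
        let run = subject[s..].iter().take_while(|&&b| b == b'a').count();
        let e = s + run;
        if notempty_atstart && s == start && e == start {
            continue; // empty match at the start position: ignored
        }
        return Some((s, e));
    }
    None
}

/// Collects all matches the way the iteration rules describe
/// (assumption: the flag is set on every search except the first).
fn find_all(subject: &[u8]) -> Vec<(usize, usize)> {
    let mut matches = Vec::new();
    let mut last_end = 0;
    let mut first = true;
    while let Some((s, e)) = search_a_star(subject, last_end, !first) {
        matches.push((s, e));
        last_end = e;
        first = false;
    }
    matches
}
```

For the subject `b"baab"`, this model yields `(0, 0)`, `(1, 3)` and `(4, 4)`. The empty match at position 3, directly after the non-empty match `(1, 3)`, is skipped, and the loop always terminates because each search after the first either starts or ends strictly past `last_end`.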

In UTF mode, this will work in the same way and will now correctly
handle cases where the following character is multi-byte.

Test that find_iter and captures_iter correctly handle a mixture of
empty and non-empty matches.  The iterator should skip forward by a
character following an empty match, and any empty match that occurs
immediately after a non-empty match should be ignored.

Look-behind assertions should work correctly in UCP mode after an
empty match.