gh-69456: Add method to detect if a string contains surrogates #135265

Draft: wants to merge 2 commits into main

@StanFromIreland (Contributor) commented Jun 8, 2025

Requesting review from @bitdancer (and maybe @ezio-melotti and @malemburg, going by the devguide).

The name contains_surrogates is misleading, since the method does not require more than one surrogate. To me, the singular form suffers less from this, though it can be misleading too.

And, should this also be implemented for bytes?


📚 Documentation preview 📚: https://cpython-previews--135265.org.readthedocs.build/

@bitdancer (Member)

Thanks for working on this!

_has_surrogates is an internal email library function name, and as such is not a suitable model for the method name, for more than one reason. Importantly, it is a bit misleading to say we are detecting 'surrogates', since what we actually want the function to do is detect that a string contains surrogateescape-encoded bytes. I'm a bit rusty on the Python C code, but if I understand correctly, surrogates only appear in Python's internal Unicode representation if there are in fact escaped bytes, so the function itself is probably correct. I'd like someone with fresher knowledge of the relevant C code to review, though.

But I can give my opinion on the name :) We probably want something like issurrogateescaped to harmonize with the analogous existing string function names. Specifically, this is analogous to istitle, which returns True if and only if the string has at least one capital letter and conforms to the title casing rules. In this case, we return True if and only if the string contains at least one escaped byte.
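To make the proposed semantics concrete, here is a minimal pure-Python sketch, not the C implementation in this PR. The helper name is_surrogateescaped is hypothetical, and the sketch assumes that "escaped byte" means a code point in U+DC80..U+DCFF, which is the range the surrogateescape error handler uses for undecodable bytes 0x80..0xFF. That is narrower than "any surrogate", so a check over the full U+D800..U+DFFF range would behave differently on the last example.

```python
def is_surrogateescaped(s: str) -> bool:
    """Hypothetical stand-in for the proposed str method.

    True if and only if s contains at least one code point in
    U+DC80..U+DCFF (assumed here to be what "escaped byte" means).
    """
    return any("\udc80" <= ch <= "\udcff" for ch in s)


raw = b"caf\xe9"  # Latin-1 bytes; not valid UTF-8
text = raw.decode("utf-8", errors="surrogateescape")

print(repr(text))                      # 'caf\udce9'
print(is_surrogateescaped(text))       # True: one escaped byte
print(is_surrogateescaped("café"))     # False: ordinary non-ASCII text
print(is_surrogateescaped("\ud800"))   # False: a surrogate, but not an escaped byte
```

The last line shows why the distinction in naming matters: a lone high surrogate written as a string literal is a surrogate without being the product of surrogateescape decoding.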

And no, there is no corresponding method for bytes: by definition this is something that only has meaning in a Unicode string, because surrogateescape is a decode error handler.
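As a small illustration of that point, the sketch below decodes undecodable bytes with surrogateescape and encodes them back. It uses only the standard bytes.decode and str.encode calls; the sample bytes are arbitrary. The escapes exist only on the str side, and encoding with the same handler restores the original bytes.

```python
raw = b"\xff\xfeabc"  # arbitrary bytes that are not valid ASCII

# Decoding maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("ascii", errors="surrogateescape")
print(repr(text))  # '\udcff\udcfeabc'

# Encoding with the same handler turns the surrogates back into bytes.
round_tripped = text.encode("ascii", errors="surrogateescape")
print(round_tripped == raw)  # True

# A strict encode refuses the lone surrogates.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False
```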

@ZeroIntensity (Member)

For what it's worth, changes to builtins (especially ones as prominent as str) need PEPs these days, because they'll have to be reflected across implementations. Perhaps @bitdancer would be willing to author/sponsor one?


3 participants