@SHJordan SHJordan commented Dec 4, 2025

Description

This PR replaces usages of mb_split with preg_split in the Illuminate\Support\Str class.

Motivation

The mb_split function is part of the mbstring extension's regex module (mbregex). While mbstring is a requirement for Laravel, it is possible to compile mbstring without mbregex support (using --disable-mbregex). In such environments, or in certain PHP runtime configurations (e.g., specific MCP server environments), mb_split may be undefined even if mbstring is loaded.
preg_split is part of the PCRE extension, which is a core requirement of PHP and is always available. Using preg_split with the u (Unicode) modifier achieves the same result as mb_split for splitting strings by whitespace, making the framework more robust across different PHP builds.
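To confirm whether a given build is affected, the extension and the function can be checked separately. The following is a minimal diagnostic sketch, not part of this PR; the variable names are illustrative.

```php
<?php

// mbstring can be loaded while its regex module (mbregex) is compiled out,
// so the extension and the function are checked independently.
$hasMbstring = extension_loaded('mbstring');
$hasMbSplit  = function_exists('mb_split');

// PCRE is compiled into PHP itself, so preg_split is expected to exist.
$hasPregSplit = function_exists('preg_split');

printf("mbstring loaded:      %s\n", $hasMbstring ? 'yes' : 'no');
printf("mb_split available:   %s\n", $hasMbSplit ? 'yes' : 'no');
printf("preg_split available: %s\n", $hasPregSplit ? 'yes' : 'no');

if ($hasMbstring && ! $hasMbSplit) {
    echo "mbstring is present but mbregex appears to be missing.\n";
}
```

On a build affected by the problem described above, the last message is the one to look for.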

Changes

  • Replaced mb_split('\s+', $value) with preg_split('/\s+/u', $value) in:
    • Str::headline()
    • Str::apa()
    • Str::studly()

Verification

I have verified that preg_split('/\s+/u', ...) correctly splits strings containing Unicode characters by whitespace, identical to mb_split('\s+', ...).

// Test script
$value = 'Application Info';
$parts = preg_split('/\s+/u', $value);
// Result: ['Application', 'Info']
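
The test script above uses ASCII input only. A slightly broader check, sketched here under the assumption that the input is valid UTF-8, uses multi-byte words and mixed whitespace, and compares against mb_split only on builds where that function exists. The sample strings are made up for illustration.

```php
<?php

// Sample inputs (illustrative only); all are valid UTF-8.
$samples = [
    'Application Info',
    "café  au\tlait",        // multi-byte letter, repeated space, tab
    "日本語 テキスト\nです",    // CJK words separated by a space and a newline
];

$results = [];

foreach ($samples as $value) {
    $parts = preg_split('/\s+/u', $value);
    $results[] = $parts;

    echo implode(' | ', $parts), PHP_EOL;

    // Compare against mb_split only when mbregex is compiled in.
    if (function_exists('mb_split')) {
        mb_regex_encoding('UTF-8');
        echo $parts === mb_split('\s+', $value)
            ? "  matches mb_split\n"
            : "  differs from mb_split\n";
    }
}
```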

@shaedrich
Contributor

We shouldn't build our own solution when there's an extension explicitly taking care of multi-byte strings :/

@SHJordan
Author

SHJordan commented Dec 4, 2025

> We shouldn't build our own solution when there's an extension explicitly taking care of multi-byte strings :/

Good point! But if we can achieve the same result using core functions instead of optional PHP extensions, wouldn't that be better in the long run?

@shaedrich
Contributor

PHP doesn't have built-in multi-byte support, so no. We have to rely on an extension. And since Laravel wants \Illuminate\Support\Str to be both multi-byte-safe and consistent, I don't see a way around that.

@SHJordan
Author

SHJordan commented Dec 4, 2025

Thanks for the feedback!

I understand the concern about relying on extensions, but I'd like to clarify a few technical points regarding "built-in" support and consistency:

  1. PCRE is Core: preg_split is part of the PCRE extension, which is bundled with PHP and cannot be disabled easily. It is as "built-in" as it gets for regex support in PHP.

  2. mbstring vs mbregex: The specific issue is that mb_split belongs to the mbregex module of mbstring. It is possible (and happens in some environments, like the one I encountered) to have mbstring enabled but mbregex disabled or missing. This causes Call to undefined function mb_split errors even when mb_strlen works fine.

  3. Consistency: The Illuminate\Support\Str class already heavily relies on PCRE for multibyte-safe operations. A quick search shows 36 usages of preg_* functions in this file alone, many using the /u (UTF-8) modifier to ensure multibyte safety (e.g., preg_match('/^[\pL\pM\pN]+$/u', ...) in isAscii).

  4. Safety: preg_split with the u modifier (/\s+/u) is a standard, robust, and multibyte-safe way to split strings by whitespace in PHP. It is widely considered a best practice when mbregex might not be available.
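
One behaviour to account for when relying on the u modifier: preg_split returns false if the subject is not valid UTF-8. The sketch below wraps the call in a hypothetical helper, splitOnWhitespace (not part of Illuminate\Support\Str), and retries without the modifier in that case. Whether a fallback or an exception is the right response is a design choice left open here.

```php
<?php

/**
 * Hypothetical helper: split a string on runs of whitespace.
 *
 * @return array<int, string>
 */
function splitOnWhitespace(string $value): array
{
    $parts = preg_split('/\s+/u', $value);

    if ($parts === false) {
        // With /u, PCRE rejects subjects that are not valid UTF-8.
        // Retry without /u: the subject is then treated as raw bytes
        // and only ASCII whitespace is matched.
        $parts = preg_split('/\s+/', $value);
    }

    // If both attempts fail, return the input unchanged as a single part.
    return $parts === false ? [$value] : $parts;
}

echo count(splitOnWhitespace('Application Info')), PHP_EOL; // valid UTF-8
echo count(splitOnWhitespace("bad\xFF bytes")), PHP_EOL;    // invalid UTF-8
```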

Given that the framework already trusts preg_* with /u for 90% of its regex needs, replacing these last few instances of mb_split seems to increase consistency and portability, rather than decreasing it.

Would you be open to reconsidering this to make the framework more robust across different PHP builds?

@taylorotwell
Member

Thanks for your pull request to Laravel!

Unfortunately, I'm going to delay merging this code for now. To preserve our ability to adequately maintain the framework, we need to be very careful regarding the amount of code we include.

If applicable, please consider releasing your code as a package so that the community can still take advantage of your contributions!

@shaedrich
Contributor

shaedrich commented Dec 4, 2025

> 1. PCRE is Core: preg_split is part of the PCRE extension, which is bundled with PHP and cannot be disabled easily. It is as "built-in" as it gets for regex support in PHP.

I never said that it isn't core. I said it is not primarily(!) meant for multi-byte string handling and therefore should not be expected to be the most bullet-proof solution. And even if it has fairly good multi-byte support, in terms of consistency (I know, point 3 on your list, but you only refer to one multi-byte use case; I mean the entirety of them), we should try to depend on one "tool" (multi-byte-first extensions) rather than a patchwork of different solutions for every use case.

> 2. mbstring vs mbregex: The specific issue is that mb_split belongs to the mbregex module of mbstring. It is possible (and happens in some environments, like the one I encountered) to have mbstring enabled but mbregex disabled or missing. This causes Call to undefined function mb_split errors even when mb_strlen works fine.

Good point. Since Taylor closed this PR, maybe we should add this dependency to make sure it is installed.

There might be cases where no mb_* function exists for the task. In those cases, solutions fell back to PCRE, but that is a different scenario from the one here.

> 3. Consistency: The Illuminate\Support\Str class already heavily relies on PCRE for multibyte-safe operations. A quick search shows 36 usages of preg_* functions in this file alone, many using the /u (UTF-8) modifier to ensure multibyte safety (e.g., preg_match('/^[\pL\pM\pN]+$/u', ...) in isAscii). […] Given that the framework already trusts preg_* with /u for 90% of its regex needs, replacing these last few instances of mb_split seems to increase consistency and portability, rather than decreasing it.

The framework is not perfect. People constantly argue that because someone added something ten years back (I'm exaggerating here) and nobody noticed or changed it, it is the way to go for all eternity. What people have done, however, is gradually improve multi-byte support in the helper. But there's still room for even more improvement. That this takes time is the reality of open-source, community-driven software.
