-
-
Notifications
You must be signed in to change notification settings - Fork 18.8k
PDEP-14: Dedicated string data type for pandas 3.0 #58551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 20 commits
fbeb69d
f03f54d
561de87
86f4e51
30c7b43
54a43b3
5b5835b
9ede2e6
f5faf4e
f554909
ac2d21a
82027d2
5b24c24
f9c55f4
2c58c4c
0a68504
8974c5b
cca3a7f
d24a80a
9c5342a
b5663cc
1c4c2d9
c44bfb5
af5ad3c
bd52f39
f8fbc61
d78462d
4de20d1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,373 @@ | ||||||
# PDEP-14: Dedicated string data type for pandas 3.0 | ||||||
|
||||||
- Created: May 3, 2024 | ||||||
- Status: Under discussion | ||||||
- Discussion: https://github.com/pandas-dev/pandas/pull/58551 | ||||||
- Author: [Joris Van den Bossche](https://github.com/jorisvandenbossche) | ||||||
- Revision: 1 | ||||||
|
||||||
## Abstract | ||||||
|
||||||
This PDEP proposes to introduce a dedicated string dtype that will be used by | ||||||
default in pandas 3.0: | ||||||
|
||||||
* In pandas 3.0, enable a string dtype (`"str"`) by default, using PyArrow if available | ||||||
or otherwise a string dtype using numpy object-dtype under the hood as fallback. | ||||||
* The default string dtype will use missing value semantics (using NaN) consistent | ||||||
with the other default data types. | ||||||
|
||||||
This will give users a long-awaited proper string dtype for 3.0, while 1) not | ||||||
(yet) making PyArrow a _hard_ dependency, but only a dependency used by default, | ||||||
and 2) leaving room for future improvements (different missing value semantics, | ||||||
using NumPy 2.0 strings, etc). | ||||||
|
||||||
## Background | ||||||
|
||||||
Currently, pandas by default stores text data in an `object`-dtype NumPy array. | ||||||
The current implementation has two primary drawbacks. First, `object` dtype is | ||||||
not specific to strings: any Python object can be stored in an `object`-dtype | ||||||
array, not just strings, and seeing `object` as the dtype for a column with | ||||||
strings is confusing for users. Second: this is not efficient (all string | ||||||
methods on a Series are eventually calling Python methods on the individual | ||||||
string objects). | ||||||
|
||||||
To solve the first issue, a dedicated extension dtype for string data has | ||||||
already been | ||||||
[added in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#dedicated-string-data-type). | ||||||
This has always been opt-in for now, requiring users to explicitly request the | ||||||
dtype (with `dtype="string"` or `dtype=pd.StringDtype()`). The array backing | ||||||
this string dtype was initially almost the same as the default implementation, | ||||||
i.e. an `object`-dtype NumPy array of Python strings. | ||||||
|
||||||
To solve the second issue (performance), pandas contributed to the development | ||||||
of string kernels in the PyArrow package, and a variant of the string dtype | ||||||
backed by PyArrow was | ||||||
[added in pandas 1.3](https://pandas.pydata.org/docs/whatsnew/v1.3.0.html#pyarrow-backed-string-data-type). | ||||||
This could be specified with the `storage` keyword in the opt-in string dtype | ||||||
(`pd.StringDtype(storage="pyarrow")`). | ||||||
|
||||||
Since its introduction, the `StringDtype` has always been opt-in, and has used | ||||||
the experimental `pd.NA` sentinel for missing values (which was also [introduced | ||||||
in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)). | ||||||
However, up to this date, pandas has not yet taken the step to use `pd.NA` by | ||||||
default for any dtype, and thus the `StringDtype` deviates in missing value behaviour compared | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
to the default data types. | ||||||
|
||||||
In 2023, [PDEP-10](https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html) | ||||||
proposed to start using a PyArrow-backed string dtype by default in pandas 3.0 | ||||||
(i.e. infer this type for string data instead of object dtype). To ensure we | ||||||
could use the variant of `StringDtype` backed by PyArrow instead of Python | ||||||
objects (for better performance), it proposed to make `pyarrow` a new required | ||||||
runtime dependency of pandas. | ||||||
|
||||||
In the meantime, NumPy has also been working on a native variable-width string | ||||||
data type, which will be available [starting with NumPy | ||||||
2.0](https://numpy.org/devdocs/release/2.0.0-notes.html#stringdtype-has-been-added-to-numpy). | ||||||
This can provide a potential alternative to PyArrow for implementing a string | ||||||
data type in pandas that is not backed by Python objects. | ||||||
|
||||||
After acceptance of PDEP-10, two aspects of the proposal have been under | ||||||
reconsideration: | ||||||
|
||||||
- Based on feedback from users and maintainers from other packages (mostly | ||||||
around installation complexity and size), it has been considered to relax the | ||||||
new `pyarrow` requirement to not be a _hard_ runtime dependency. In addition, | ||||||
NumPy 2.0 could in the future potentially reduce the need to make PyArrow a | ||||||
required dependency specifically for a dedicated pandas string dtype. | ||||||
- PDEP-10 did not consider the usage of the experimental `pd.NA` as a | ||||||
consequence of adopting one of the existing implementations of the | ||||||
`StringDtype`. | ||||||
|
||||||
For the second aspect, another variant of the `StringDtype` was | ||||||
[introduced in pandas 2.1](https://pandas.pydata.org/docs/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings) | ||||||
that is still backed by PyArrow but follows the default missing values semantics | ||||||
pandas uses for all other default data types (and using `NaN` as the missing | ||||||
value sentinel) ([GH-54792](https://github.com/pandas-dev/pandas/issues/54792)). | ||||||
At the time, the `storage` option for this new variant was called | ||||||
`"pyarrow_numpy"` to disambiguate from the existing `"pyarrow"` option using | ||||||
`pd.NA` (but this PDEP proposes a better naming scheme, see the "Naming" | ||||||
subsection below). | ||||||
|
||||||
This last dtype variant is what users currently (pandas 2.2) get for string data | ||||||
when enabling the ``future.infer_string`` option (to enable the behaviour which | ||||||
is intended to become the default in pandas 3.0). | ||||||
|
||||||
## Proposal | ||||||
|
||||||
To be able to move forward with a string data type in pandas 3.0, this PDEP proposes: | ||||||
|
||||||
1. For pandas 3.0, a `"str"` string dtype is enabled by default, which will use PyArrow | ||||||
if installed, and otherwise falls back to an in-house functionally-equivalent | ||||||
(but slower) version. | ||||||
2. This default string dtype will follow the same behaviour for missing values | ||||||
as other default data types, and use `NaN` as the missing value sentinel. | ||||||
3. The version that is not backed by PyArrow can reuse (with minor code | ||||||
additions) the existing numpy object-dtype backed StringArray for its | ||||||
implementation. | ||||||
4. Installation guidelines are updated to clearly encourage users to install | ||||||
pyarrow for the default user experience. | ||||||
|
||||||
Those string dtypes enabled by default will then no longer be considered as | ||||||
experimental. | ||||||
|
||||||
### Default inference of a string dtype | ||||||
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
By default, pandas will infer this new string dtype instead of object dtype for | ||||||
string data (when creating pandas objects, such as in constructors or IO | ||||||
functions). | ||||||
|
||||||
In pandas 2.2, the existing `future.infer_string` option can be used to opt-in to the future | ||||||
default behaviour: | ||||||
|
||||||
```python | ||||||
>>> pd.options.future.infer_string = True | ||||||
>>> pd.Series(["a", "b", None]) | ||||||
0 a | ||||||
1 b | ||||||
2 NaN | ||||||
dtype: string | ||||||
``` | ||||||
|
||||||
Right now (pandas 2.2), the existing option only enables the PyArrow-based | ||||||
future dtype. For the remaining 2.x releases, this option will be expanded to | ||||||
also work when PyArrow is not installed to enable the object-dtype fallback in | ||||||
that case. | ||||||
|
||||||
### Missing value semantics | ||||||
|
||||||
As mentioned in the background section, the original `StringDtype` has always | ||||||
used the experimental `pd.NA` sentinel for missing values. In addition to using | ||||||
`pd.NA` as the scalar for a missing value, this essentially means that: | ||||||
|
||||||
- String columns follow ["NA-semantics"](https://pandas.pydata.org/docs/user_guide/missing_data.html#na-semantics) | ||||||
for missing values, where `NA` propagates in boolean operations such as | ||||||
comparisons or predicates. | ||||||
- Operations on the string column that give a numeric or boolean result use the | ||||||
nullable Integer/Float/Boolean data types (e.g. `ser.str.len()` returns the | ||||||
nullable `'Int64"` / `pd.Int64Dtype()` dtype instead of the numpy `int64` | ||||||
dtype (or `float64` in case of missing values)). | ||||||
|
||||||
However, up to this date, all other default data types still use `NaN` semantics | ||||||
for missing values. Therefore, this proposal says that a new default string | ||||||
dtype should also still use the same default missing value semantics and return | ||||||
default data types when doing operations on the string column, to be consistent | ||||||
with the other default dtypes at this point. | ||||||
|
||||||
In practice, this means that the default string dtype will use `NaN` as | ||||||
the missing value sentinel, and: | ||||||
|
||||||
- String columns will follow NaN-semantics for missing values, where `NaN` gives | ||||||
False in boolean operations such as comparisons or predicates. | ||||||
- Operations on the string column that give a numeric or boolean result will use | ||||||
the default data types (i.e. numpy `int64`/`float64`/`bool`). | ||||||
|
||||||
Because the original `StringDtype` implementations already use `pd.NA` and | ||||||
return masked integer and boolean arrays in operations, a new variant of the | ||||||
existing dtypes that uses `NaN` and default data types was needed. The original | ||||||
variant of `StringDtype` using `pd.NA` will continue to be available for those | ||||||
who were already using it. | ||||||
|
||||||
### Object-dtype "fallback" implementation | ||||||
|
||||||
To avoid a hard dependency on PyArrow for pandas 3.0, this PDEP proposes to keep | ||||||
a "fallback" option in case PyArrow is not installed. The original `StringDtype` | ||||||
backed by a numpy object-dtype array of Python strings can be mostly reused for | ||||||
this (adding a new variant of the dtype) and a new `StringArray` subclass only | ||||||
needs minor changes to follow the above-mentioned missing value semantics | ||||||
([GH-58451](https://github.com/pandas-dev/pandas/pull/58451)). | ||||||
|
||||||
For pandas 3.0, this is the most realistic option given this implementation has | ||||||
already been available for a long time. Beyond 3.0, further improvements such as | ||||||
using NumPy 2.0 ([GH-58503](https://github.com/pandas-dev/pandas/issues/58503)) | ||||||
or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)) can | ||||||
still be explored, but at that point that is an implementation detail that | ||||||
should not have a direct impact on users (except for performance). | ||||||
|
||||||
For the original variant of `StringDtype` using `pd.NA`, currently the default | ||||||
storage is `"python"` (the object-dtype based implementation). Also for this | ||||||
variant, it is proposed follow the same logic for determining the default | ||||||
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
storage, i.e. the default to `"pyarrow"` if available, and otherwise | ||||||
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
fall back to `"python"`. | ||||||
|
||||||
### Naming | ||||||
|
||||||
Given the long history of this topic, the naming of the dtypes is a difficult | ||||||
topic. | ||||||
|
||||||
In the first place, it should be acknowledged that most users should not need to | ||||||
use storage-specific options. Users are expected to specify a generic name (such | ||||||
as `"str"` or `"string"`), and that will give them their default string dtype | ||||||
(which depends on whether PyArrow is installed or not). | ||||||
|
||||||
For the generic string alias to specify the dtype, `"string"` is already used | ||||||
for the `StringDtype` using `pd.NA`. This PDEP proposes to use `"str"` for the | ||||||
new default `StringDtype` using `NaN`. This ensures backwards compatibility for | ||||||
code using `dtype="string"`, and was also chosen because `dtype="str"` or | ||||||
`dtype=str` currently already works to ensure your data is converted to | ||||||
strings (only using object dtype for the result). | ||||||
|
||||||
But for testing purposes and advanced use cases that want control over the exact | ||||||
variant of the `StringDtype`, we need some way to specify this and distinguish | ||||||
them from the other string dtypes. | ||||||
|
||||||
Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used for the new variant using `NaN`, | ||||||
where the `"pyarrow_numpy"` storage was used to disambiguate from the existing | ||||||
`"pyarrow"` option using `pd.NA`. However, `"pyarrow_numpy"` is a rather confusing | ||||||
option and doesn't generalize well. Therefore, this PDEP proposes a new naming | ||||||
scheme as outlined below, and `"pyarrow_numpy"` will be deprecated and removed | ||||||
before pandas 3.0. | ||||||
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
The `storage` keyword of `StringDtype` is kept to disambiguate the underlying | ||||||
WillAyd marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
storage of the string data (using pyarrow or python objects), but an additional | ||||||
`na_value` is introduced to disambiguate the the variants using NA semantics | ||||||
mroeschke marked this conversation as resolved.
Show resolved
Hide resolved
WillAyd marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
and NaN semantics. | ||||||
|
||||||
Overview of the different ways to specify a dtype and the resulting concrete | ||||||
dtype of the data: | ||||||
|
||||||
| User specification | Concrete dtype | String alias | Note | | ||||||
|---------------------------------------------|---------------------------------------------------------------|---------------------------------------|----------| | ||||||
| Unspecified (inference) | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "str" | (1) | | ||||||
| `"str"` or `StringDtype(na_value=np.nan)` | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "str" | (1) | | ||||||
| `StringDtype("pyarrow", na_value=np.nan)` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "str" | | | ||||||
| `StringDtype("python", na_value=np.nan)` | `StringDtype(storage="python", na_value=np.nan)` | "str" | | | ||||||
| `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "string[pyarrow]" | | | ||||||
| `StringDtype("python")` | `StringDtype(storage="python", na_value=pd.NA)` | "string[python]" | | | ||||||
Comment on lines
+235
to
+236
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can a user specify There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, nothing changes there (it's mentioned in one of the paragraphs below the table that in theory we could also stop supporting the storage in the string alias for those, but that's out of scope for the PDEP). The fact that they are listed in the "String alias" column is meant to indicate that you can use that string as a string alias when specifying a dtype. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I am confused by the "string alias" aspect. There are string aliases for specifying the dtype, but also string aliases for representing the dtype (i.e., what you see if you do Also, if we are saying that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I know, and I had been going back and forth in my draft calling this columm "String alias" vs "Dtype repr", because both are partly overlapping concepts, but then also "dtype repr" is confusing because there is both In the end, this column actually shows the dtype
As mentioned in my previous comment, a paragraph is included to say this is left for a separate discussion (this would introduce a backwards incompatible change for existing users, while otherwise are not any on this front, and the exact way to indicate backends in string aliases, I would like to defer any final decision about that to the PDEP on logical types) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea I find this really unfortunate but agree with Joris that we shouldn't rock the boat too much on this PDEP, especially since it is just about strings. The larger discussion on resolving those aliases should be had in PDEP-13 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Maybe relabel "String alias" to "String alias for dtype input" ? Part of the confusion is this with 2.2: >>> pd.Series(["abc"], dtype="string[pyarrow]")
0 abc
dtype: string
>>> pd.Series(["abc"], dtype="string[pyarrow]").dtype
string[pyarrow]
>>> pd.Series(["abc"], dtype="string")
0 abc
dtype: string
>>> pd.Series(["abc"], dtype="string").dtype
string[python] When you print a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, that's the difference between There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe showing the storage in both There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @arnaudlegout it's definitely true that there are trade-offs in deciding which option is best (and for now, you will always be able to specify the storage and inspect it through an attribute). I think the intent at the moment, though, is to consider the default of pyarrow (if installed) as the general default that almost everyone is expected to use, and that we don't incentivize too much to switch to the non-default storage (i.e. that it is more considered as an implementation detail which storage is used, for most users). My personal take on showing the storage in the repr/str is that for most users, I expect that they shouldn't have to care about the exact storage (or even be aware of the concept), and therefore I think it is more confusing to show this by default in the What exactly the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done #59342 |
||||||
| `"string"` or `StringDtype()` | `StringDtype(storage="pyarrow"\|"python", na_value=pd.NA)` | "string[pyarrow]" or "string[python]" | (1) | | ||||||
| `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (2) | | ||||||
|
||||||
Notes: | ||||||
WillAyd marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
- (1) You get "pyarrow" or "python" depending on pyarrow being installed. | ||||||
- (2) "pyarrow_numpy" is kept temporarily because this is already in a released | ||||||
version, but we can deprecate it in 2.x and have it removed for 3.0. | ||||||
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
For the new default string dtype, only the `"str"` alias can be used to | ||||||
specify the dtype as a string, i.e. pandas would not provide a way to make the | ||||||
underlying storage (pyarrow or python) explicit through the string alias. This | ||||||
string alias is only a convenience shortcut and for most users `"str"` is | ||||||
sufficient (they don't need to specify the storage), and the explicit | ||||||
`pd.StringDtype(storage=..., na_value=np.nan)` is still available for more | ||||||
fine-grained control. | ||||||
|
||||||
Also for the existing variant using `pd.NA`, specifying the storage through the | ||||||
string alias could be deprecated, but that is left for a separate decision. | ||||||
Dr-Irv marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
## Alternatives | ||||||
|
||||||
### Why not delay introducing a default string dtype? | ||||||
|
||||||
To avoid introducing a new string dtype while other discussions and changes are | ||||||
in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as | ||||||
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
the default missing value sentinel? using the new NumPy 2.0 capabilities? | ||||||
overhauling all our dtypes to use a logical data type system?), introducing a | ||||||
default string dtype could also be delayed until there is more clarity in those | ||||||
other discussions. Specifically, it would avoid temporarily switching to use | ||||||
`NaN` for the string dtype, while in a future version we might switch back | ||||||
to `pd.NA` by default. | ||||||
|
||||||
However: | ||||||
|
||||||
1. Delaying has a cost: it further postpones introducing a dedicated string | ||||||
dtype that has significant benefits for users, both in usability as (for the | ||||||
part of the user base that has PyArrow installed) in performance. | ||||||
2. In case pandas eventually transitions to use `pd.NA` as the default missing value | ||||||
sentinel, a migration path for _all_ pandas data types will be needed, and thus | ||||||
the challenges around this will not be unique to the string dtype and | ||||||
therefore not a reason to delay this. | ||||||
|
||||||
Making this change now for 3.0 will benefit the majority of users, and the PDEP | ||||||
author believes this is worth the cost of the added complexity around "yet | ||||||
another dtype" (also for other data types we already have multiple variants). | ||||||
|
||||||
### Why not use the existing StringDtype with `pd.NA`? | ||||||
|
||||||
Wouldn't adding even more variants of the string dtype make things only more | ||||||
confusing? Indeed, this proposal unfortunately introduces more variants of the | ||||||
string dtype. However, the reason for this is to ensure the actual default user | ||||||
experience is _less_ confusing, and the new string dtype fits better with the | ||||||
other default data types. | ||||||
|
||||||
If the new default string data type would use `pd.NA`, then after some | ||||||
operations, a user can easily end up with a DataFrame that mixes columns using | ||||||
`NaN` semantics and columns using `NA` semantics (and thus a DataFrame that | ||||||
could have columns with two different int64, two different float64, two different | ||||||
bool, etc dtypes). This would lead to a very confusing default experience. | ||||||
|
||||||
With the proposed new variant of the StringDtype, this will ensure that for the | ||||||
_default_ experience, a user will only see only 1 kind of integer dtype, only | ||||||
kind of 1 bool dtype, etc. For now, a user should only get columns using `pd.NA` | ||||||
when explicitly opting into this. | ||||||
|
||||||
### Naming alternatives | ||||||
|
||||||
An initial version of this PDEP proposed to use the `"string"` alias and the | ||||||
default `pd.StringDtype()` class constructor for the new default dtype. | ||||||
However, that caused a lot of discussion around backwards compatibility for | ||||||
existing users of the `StringDtype` using `pd.NA`. | ||||||
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
During the discussion, several alternatives have been brought up. Both | ||||||
alternative keyword names as using a different constructor. In the end, | ||||||
this PDEP proposes to use a different string alias (`"str"`) but to keep | ||||||
using the existing `pd.StringDtype` (with the existing `storage` keyword but | ||||||
with an additional `na_value` keyword) for now to keep the changes as | ||||||
minimal as possible, leaving a larger overhaul of the dtype system (potentially | ||||||
including different constructor functions or namespace) for a future discussion. | ||||||
See [GH-58613](https://github.com/pandas-dev/pandas/issues/58613) for the full | ||||||
discussion. | ||||||
|
||||||
One consequence is that when using the class constructor for the default dtype, | ||||||
it has to be used with non-default arguments, i.e. a user needs to specify | ||||||
`pd.StringDtype(na_value=np.nan)` to get the default dtype using `NaN`. | ||||||
Therefore, the pandas documentation will focus on the usage of `dtype="str"`. | ||||||
|
||||||
## Backward compatibility | ||||||
|
||||||
The most visible backwards incompatible change will be that columns with string | ||||||
data will no longer have an `object` dtype. Therefore, code that assumes | ||||||
`object` dtype (such as `ser.dtype == object`) will need to be updated. This | ||||||
change is done as a hard break in a major release, as warning in advance for the | ||||||
changed inference is deemed too noisy. | ||||||
|
||||||
To allow testing code in advance, the | ||||||
`pd.options.future.infer_string = True` option is available for users. | ||||||
|
||||||
Otherwise, the actual string-specific functionality (such as the `.str` accessor | ||||||
methods) should generally all keep working as is. | ||||||
|
||||||
By preserving the current missing value semantics, this proposal is also mostly | ||||||
backwards compatible on this aspect. When storing strings in object dtype, pandas | ||||||
however did allow using `None` as the missing value indicator as well (and in | ||||||
certain cases such as the `shift` method, pandas even introduced this itself). | ||||||
For all the cases where currently `None` was used as the missing value sentinel, | ||||||
this will change to use `NaN` consistently. | ||||||
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
### For existing users of `StringDtype` | ||||||
|
||||||
Existing code that already opted in to use the `StringDtype` using `pd.NA` | ||||||
should generally keep working as is. The latest version of this PDEP preserves | ||||||
the behaviour of `dtype="string"` or `dtype=pd.StringDtype()` to mean the | ||||||
`pd.NA` variant of the dtype. | ||||||
|
||||||
It does propose the change the default storage to `"pyarrow"` (if available) for | ||||||
the opt-in `pd.NA` variant as well, but this should not have much user-visible | ||||||
impact. | ||||||
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
## Timeline | ||||||
|
||||||
The future PyArrow-backed string dtype was already made available behind a feature | ||||||
flag in pandas 2.1 (enabled by `pd.options.future.infer_string = True`). | ||||||
|
||||||
The variant using numpy object-dtype can also be backported to the 2.2.x branch | ||||||
to allow easier testing. It is proposed to release this as 2.3.0 (created from | ||||||
the 2.2.x branch, given that the main branch already includes many other changes | ||||||
targeted for 3.0), together with the changes to the naming scheme. | ||||||
|
||||||
The 2.3.0 release would then have all future string functionality available | ||||||
(both the pyarrow and object-dtype based variants of the default string dtype). | ||||||
|
||||||
For pandas 3.0, this `future.infer_string` flag becomes enabled by default. | ||||||
|
||||||
## PDEP-XX History | ||||||
|
||||||
- 3 May 2024: Initial version |
Uh oh!
There was an error while loading. Please reload this page.