Skip to content

BUG: concat in pandas 1.4.2 unexpectedly drops rows or duplicates rows if multiIndex has pd.NA #48852

Closed
@Stephano120

Description

@Stephano120

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

First issue: unexpected dropped rows (concat issue), and values on the wrong row (??)

test_K  = pd.DataFrame([13, pd.NA], columns=['predK'], index=pd.MultiIndex.from_tuples([(4.0, 2.0),(pd.NA, pd.NA)], names=["first", "second"]))
test_L  = pd.DataFrame([14, 13], columns=['predL'], index=pd.MultiIndex.from_tuples([(pd.NA, pd.NA), (pd.NA, 3.0)], names=["first", "second"]))

pd.concat([test_K, test_L], axis=1).sort_index()
_____________________________________________________________________________________________________________________

Second issue: duplicated rows when multiIndex is like (pd.NA, pd.NA)

uuu_1
one  two
4.0  3.0    22
     4.0    44
NaN  1.0    14
     2.0     2
     4.0     6
     NaN    15
Name: providerA, dtype: int64


uuu_2
one  two
3.0  3.0     30
     4.0     53
4.0  2.0      8
     3.0     20
     4.0     99
NaN  NaN     12
Name: providerB, dtype: int64


pd.concat([uuu_1, uuu_2], axis=1).sort_index()
         providerA  providerB
one two                      
3.0 3.0        NaN       30.0
    4.0        NaN       53.0
4.0 2.0        NaN        8.0
    3.0       22.0       20.0
    4.0       44.0       99.0
NaN 1.0       14.0        NaN
    2.0        2.0        NaN
    4.0        6.0        NaN
    NaN       15.0       12.0
    NaN       15.0       12.0

Issue Description

Regarding the first issue,

test_K is:

             predK
first second      
4.0   2.0       13
NaN   NaN     <NA>

test_L is:

              predL
first second       
NaN   NaN        14
      3.0        13

The output df is:

             predK  predL
first second             
4.0   2.0       13     **14**
NaN   NaN     <NA>     14

Question 1: Why is the row with index (pd.NA, 3.0) dropped ? And why does "14" appear on the first row?!


Regarding the second issue:
given that uuu_1 is:

one  two
4.0  3.0    22
     4.0    44
NaN  1.0    14
     2.0     2
     4.0     6
     NaN    15
Name: providerA, dtype: int64

and uuu_2 is:

one  two
3.0  3.0     30
     4.0     53
4.0  2.0      8
     3.0     20
     4.0     99
NaN  NaN     12
Name: providerB, dtype: int64

then ,
pd.concat([uuu_1, uuu_2], axis=1).sort_index()

gives:

         providerA  providerB
one two                      
3.0 3.0        NaN       30.0
    4.0        NaN       53.0
4.0 2.0        NaN        8.0
    3.0       22.0       20.0
    4.0       44.0       99.0
NaN 1.0       14.0        NaN
    2.0        2.0        NaN
    4.0        6.0        NaN
    **NaN       15.0       12.0**
    **NaN       15.0       12.0**

Question 2: Why are are the two last rows duplicated?

A NaN is different from another NaN. When concat is involved, it seems that pandas is nevertheless able to put the values on a (pd.Na, pd.NA) row from input A and input B on the same row. Hence, here, we get "(pd.NA, pd.NA) 12" and "(pd.NA, pd.NA) 15" on the same row in the final output. I would have expected another behaviour but so far it's fine.

What I do not get, is why I end with duplicated rows in the output.

If I replace the NaN by strings, the problem is gone. But why do I get duplicated rows if I keep the pd.NA ? I would like to understand this.

Expected Behavior

Regarding the first issue, shouldn't we get an output like this:

             predK  predL
first second             
4.0   2.0       13     **NaN**
NaN   NaN     <NA>     14
NaN   3.0        <NA>     13

I dont get why the (pd.NA, 3.0) is dropped.
Since test_K and test_L are results of groupby, getting a final output in which there might be rows showing NaN values can be critical. It does not convey the same message as an output from which rows full of pd.NA or NaN values are dropped (which seems the case here)
And I would expect to see "NaN" rather than "14" on the first row (index(4.0. 2.0)) neither.


Regarding the second issue, shouldn't we end with either:

         providerA  providerB
one two                      
3.0 3.0        NaN       30.0
    4.0        NaN       53.0
4.0 2.0        NaN        8.0
    3.0       22.0       20.0
    4.0       44.0       99.0
NaN 1.0       14.0        NaN
    2.0        2.0        NaN
    4.0        6.0        NaN
    **NaN       15.0       12.0**

or

         providerA  providerB
one two                      
3.0 3.0        NaN       30.0
    4.0        NaN       53.0
4.0 2.0        NaN        8.0
    3.0       22.0       20.0
    4.0       44.0       99.0
NaN 1.0       14.0        NaN
    2.0        2.0        NaN
    4.0        6.0        NaN
    **NaN       15.0       Nan**
    **NaN       NaN       12.0**

Installed Versions

pandas : 1.4.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 28.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.1.1
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

Metadata

Metadata

Assignees

No one assigned

    Labels

    Duplicate ReportDuplicate issue or pull requestMissing-datanp.nan, pd.NaT, pd.NA, dropna, isnull, interpolateMultiIndexReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions