Description
Pandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
First issue: unexpected dropped rows (concat issue), and values on the wrong row (??)
test_K = pd.DataFrame([13, pd.NA], columns=['predK'], index=pd.MultiIndex.from_tuples([(4.0, 2.0),(pd.NA, pd.NA)], names=["first", "second"]))
test_L = pd.DataFrame([14, 13], columns=['predL'], index=pd.MultiIndex.from_tuples([(pd.NA, pd.NA), (pd.NA, 3.0)], names=["first", "second"]))
pd.concat([test_K, test_L], axis=1).sort_index()
_____________________________________________________________________________________________________________________
Second issue: duplicated rows when multiIndex is like (pd.NA, pd.NA)
uuu_1
one two
4.0 3.0 22
4.0 44
NaN 1.0 14
2.0 2
4.0 6
NaN 15
Name: providerA, dtype: int64
uuu_2
one two
3.0 3.0 30
4.0 53
4.0 2.0 8
3.0 20
4.0 99
NaN NaN 12
Name: providerB, dtype: int64
pd.concat([uuu_1, uuu_2], axis=1).sort_index()
providerA providerB
one two
3.0 3.0 NaN 30.0
4.0 NaN 53.0
4.0 2.0 NaN 8.0
3.0 22.0 20.0
4.0 44.0 99.0
NaN 1.0 14.0 NaN
2.0 2.0 NaN
4.0 6.0 NaN
NaN 15.0 12.0
NaN 15.0 12.0
Issue Description
Regarding the first issue,
test_K is:
predK
first second
4.0 2.0 13
NaN NaN <NA>
test_L is:
predL
first second
NaN NaN 14
3.0 13
The output df is:
predK predL
first second
4.0 2.0 13 **14**
NaN NaN <NA> 14
Question 1: Why is the row with index (pd.NA, 3.0) dropped ? And why does "14" appear on the first row?!
Regarding the second issue:
given that uuu_1 is:
one two
4.0 3.0 22
4.0 44
NaN 1.0 14
2.0 2
4.0 6
NaN 15
Name: providerA, dtype: int64
and uuu_2 is:
one two
3.0 3.0 30
4.0 53
4.0 2.0 8
3.0 20
4.0 99
NaN NaN 12
Name: providerB, dtype: int64
then ,
pd.concat([uuu_1, uuu_2], axis=1).sort_index()
gives:
providerA providerB
one two
3.0 3.0 NaN 30.0
4.0 NaN 53.0
4.0 2.0 NaN 8.0
3.0 22.0 20.0
4.0 44.0 99.0
NaN 1.0 14.0 NaN
2.0 2.0 NaN
4.0 6.0 NaN
**NaN 15.0 12.0**
**NaN 15.0 12.0**
Question 2: Why are are the two last rows duplicated?
A NaN is different from another NaN. When concat is involved, it seems that pandas is nevertheless able to put the values on a (pd.Na, pd.NA) row from input A and input B on the same row. Hence, here, we get "(pd.NA, pd.NA) 12" and "(pd.NA, pd.NA) 15" on the same row in the final output. I would have expected another behaviour but so far it's fine.
What I do not get, is why I end with duplicated rows in the output.
If I replace the NaN by strings, the problem is gone. But why do I get duplicated rows if I keep the pd.NA ? I would like to understand this.
Expected Behavior
Regarding the first issue, shouldn't we get an output like this:
predK predL
first second
4.0 2.0 13 **NaN**
NaN NaN <NA> 14
NaN 3.0 <NA> 13
I dont get why the (pd.NA, 3.0) is dropped.
Since test_K and test_L are results of groupby, getting a final output in which there might be rows showing NaN values can be critical. It does not convey the same message as an output from which rows full of pd.NA or NaN values are dropped (which seems the case here)
And I would expect to see "NaN" rather than "14" on the first row (index(4.0. 2.0)) neither.
Regarding the second issue, shouldn't we end with either:
providerA providerB
one two
3.0 3.0 NaN 30.0
4.0 NaN 53.0
4.0 2.0 NaN 8.0
3.0 22.0 20.0
4.0 44.0 99.0
NaN 1.0 14.0 NaN
2.0 2.0 NaN
4.0 6.0 NaN
**NaN 15.0 12.0**
or
providerA providerB
one two
3.0 3.0 NaN 30.0
4.0 NaN 53.0
4.0 2.0 NaN 8.0
3.0 22.0 20.0
4.0 44.0 99.0
NaN 1.0 14.0 NaN
2.0 2.0 NaN
4.0 6.0 NaN
**NaN 15.0 Nan**
**NaN NaN 12.0**
Installed Versions
pandas : 1.4.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 28.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.1.1
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None