
Inconsistent index in result of groupby apply  #30533

@fujiaxiang

I'm not really sure if this is an "issue" or the intended behavior, but it just seems unnatural to me to get this result.

import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 2, 2],
    'b': [3, 3, 4, 5],
})
df

   a  b
0  1  3
1  1  3
2  2  4
3  2  5

df1 = df.groupby('a').apply(lambda x: x)
df1

   a  b
0  1  3
1  1  3
2  2  4
3  2  5

df2 = df.groupby('a').apply(lambda x: x.copy())
df2

     a  b
a
1 0  1  3
  1  1  3
2 2  2  4
  3  2  5

Notice that the two calls produce different results: df1 keeps the original index, while df2 gains the group key 'a' as an extra index level.
A quick trace through the code suggests the difference comes from BaseGrouper.apply, which returns mutated=False for .apply(lambda x: x) and mutated=True for .apply(lambda x: x.copy()).
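To check which of the two shapes a given apply call returned, one can inspect the index of the result. The sketch below is only an illustration: describe_index is a hypothetical helper, not a pandas function, and what it reports for the two calls depends on the installed pandas version (the outputs quoted in this issue are from 0.25.3).

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [3, 3, 4, 5]})


def describe_index(result):
    """Hypothetical helper: return (index class name, number of index levels)."""
    return type(result.index).__name__, result.index.nlevels


same_obj = df.groupby("a").apply(lambda x: x)
copied = df.groupby("a").apply(lambda x: x.copy())

# On the version in this report the first result has one index level and the
# second has two; other pandas versions may report something different.
print(describe_index(same_obj))
print(describe_index(copied))
```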

This can cause unwanted results when we try to concatenate the two:

pd.concat([df1, df2])

        a  b
0       1  3
1       1  3
2       2  4
3       2  5
(1, 0)  1  3
(1, 1)  1  3
(2, 2)  2  4
(2, 3)  2  5
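One possible way to avoid the tuple-valued index is to drop the group-key level before concatenating. This is a sketch under assumptions: drop_group_levels is a hypothetical helper, and the MultiIndex is built by hand to mimic df2 above so that the example does not depend on how a particular pandas version's apply behaves.

```python
import pandas as pd


def drop_group_levels(frame):
    """Hypothetical helper: keep only the innermost index level, if there are several."""
    if isinstance(frame.index, pd.MultiIndex):
        outer_levels = list(range(frame.index.nlevels - 1))
        return frame.droplevel(outer_levels)
    return frame


# flat mimics df1 (original index); keyed mimics df2 (group key prepended).
flat = pd.DataFrame({"a": [1, 1, 2, 2], "b": [3, 3, 4, 5]})
keyed = flat.copy()
keyed.index = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (2, 2), (2, 3)], names=["a", None]
)

combined = pd.concat([flat, drop_group_levels(keyed)])
print(combined)
```

After normalizing, both frames share a single-level index, so the concatenated result no longer mixes integers with tuples.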

The good news is that the apply method handles this concatenation automatically when a single function returns both kinds of result:

def func(data):
    if 1 in data['a'].values:
        return data
    else:
        return data.copy()

df.groupby('a').apply(func)

   a  b
0  1  3
1  1  3
2  2  4
3  2  5
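If the aim is simply to get the same index whether or not the function copies its input, passing group_keys=False to groupby may be enough. This is a minimal sketch, not a statement about what the intended behavior should be; group_keys is an existing groupby parameter, but its exact interaction with apply has varied between pandas releases, so the result is worth verifying on the version in use.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [3, 3, 4, 5]})

# group_keys=False asks pandas not to add the group keys to the result index.
grouped = df.groupby("a", group_keys=False)

df1 = grouped.apply(lambda x: x)
df2 = grouped.apply(lambda x: x.copy())

# Compare the two indexes directly instead of relying on the printed frames.
print(df1.index.equals(df2.index))
```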

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2.post20191203
Cython : 0.29.14
pytest : 5.3.2
hypothesis : 4.44.2
sphinx : 2.3.0
blosc : None
feather : None
xlsxwriter : 1.2.6
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.10.2
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.2
sqlalchemy : 1.3.11
tables : 3.6.1
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.6
