Skip to content

3 way joins column suffix looks wrong #18541

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
barisser opened this issue Nov 28, 2017 · 5 comments
Closed

3 way joins column suffix looks wrong #18541

barisser opened this issue Nov 28, 2017 · 5 comments

Comments

@barisser
Copy link

barisser commented Nov 28, 2017

Code Sample

import pandas as pd
import numpy as np


dataframe1 = pd.DataFrame(data={'a': np.random.rand(100), 'b': np.ones(100)})

dataframe2 = pd.DataFrame(data={'a': np.random.rand(100), 'b': np.ones(100)})

dataframe3 = pd.DataFrame(data={'a': np.random.rand(100), 'b': np.ones(100)})

joined_dataframe = dataframe1.join(dataframe2, lsuffix='_first', rsuffix='_second').join(dataframe3, lsuffix='_nothing', rsuffix='_third')

print joined_dataframe.columns

This outputs the columns on the final dataframe. These have the wrong suffixes. The above script prints the following with pandas 0.21.0

Index([u'a_first', u'b_first', u'a_second', u'b_second', u'a', u'b'], dtype='object')

Problem description

In my above example, the first join operation gives the correct column suffixes. But the second operation does not. In my experience, it does not matter how you divide them steps, the second join operation never assigns the correct column suffixes.

The joined_dataframe should of course have columns which reflect the suffixes in my second join, namely "_nothing" on columns from dataframe 1 and 2, and "_third" on columns from dataframe 3.

Expected Output

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-40-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.21.0
pytest: 3.2.5
pip: 8.1.1
setuptools: 20.7.0
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jschendel
Copy link
Member

This is the expected behavior. Per the documentation of DataFrame.join, lsuffix and rsuffix are only applied to overlapping columns. There are no overlapping columns at the time you're performing the join against dataframe3, as you're joining against the joined version of dataframe1 and dataframe2, which have had their columns renamed per the lsuffix and rsuffix specifications of the first join.

In other words, what you're doing is essentially being broken down analogous to:

temp = dataframe1.join(dataframe2, lsuffix='_first', rsuffix='_second').
joined_dataframe = temp.join(dataframe3, lsuffix='_nothing', rsuffix='_third')

Where temp and dataframe3 have no overlapping columns, and hence lsuffix and rsuffix aren't being applied.

@jschendel
Copy link
Member

xref #17834, which I think is proposing something similar to what you'd need

@jorisvandenbossche
Copy link
Member

Something you can do in the meantime to solve this is using concat and massaging the columns afterwards a bit:

In [22]: res = pd.concat([dataframe1, dataframe2, dataframe3], axis=1, keys=['first', 'second', 'third'])

In [23]: res.columns = [i + '_' + j for i, j in res.columns.swaplevel(0, 1)]

In [24]: res.head()
Out[24]: 
    a_first  b_first  a_second  b_second   a_third  b_third
0  0.369871      1.0  0.318002       1.0  0.676463      1.0
1  0.573191      1.0  0.877985       1.0  0.419603      1.0
2  0.810390      1.0  0.216200       1.0  0.198755      1.0
3  0.782245      1.0  0.968843       1.0  0.199693      1.0
4  0.162413      1.0  0.522495       1.0  0.852937      1.0

@jorisvandenbossche
Copy link
Member

Let's close this issue, as indeed a possible solution is #17834, so let's continue the discussion whether we want to add something like that or not there.

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Nov 28, 2017
@barisser
Copy link
Author

Makes sense. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants