3 way joins column suffix looks wrong #18541

barisser · 2017-11-28T05:01:24Z

Code Sample

import pandas as pd
import numpy as np


dataframe1 = pd.DataFrame(data={'a': np.random.rand(100), 'b': np.ones(100)})

dataframe2 = pd.DataFrame(data={'a': np.random.rand(100), 'b': np.ones(100)})

dataframe3 = pd.DataFrame(data={'a': np.random.rand(100), 'b': np.ones(100)})

joined_dataframe = dataframe1.join(dataframe2, lsuffix='_first', rsuffix='_second').join(dataframe3, lsuffix='_nothing', rsuffix='_third')

print joined_dataframe.columns

This outputs the columns on the final dataframe. These have the wrong suffixes. The above script prints the following with pandas 0.21.0

Index([u'a_first', u'b_first', u'a_second', u'b_second', u'a', u'b'], dtype='object')

Problem description

In my above example, the first join operation gives the correct column suffixes. But the second operation does not. In my experience, it does not matter how you divide them steps, the second join operation never assigns the correct column suffixes.

The joined_dataframe should of course have columns which reflect the suffixes in my second join, namely "_nothing" on columns from dataframe 1 and 2, and "_third" on columns from dataframe 3.

Expected Output

Output of `pd.show_versions()`

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-40-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.21.0
pytest: 3.2.5
pip: 8.1.1
setuptools: 20.7.0
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jschendel · 2017-11-28T07:32:50Z

This is the expected behavior. Per the documentation of DataFrame.join, lsuffix and rsuffix are only applied to overlapping columns. There are no overlapping columns at the time you're performing the join against dataframe3, as you're joining against the joined version of dataframe1 and dataframe2, which have had their columns renamed per the lsuffix and rsuffix specifications of the first join.

In other words, what you're doing is essentially being broken down analogous to:

temp = dataframe1.join(dataframe2, lsuffix='_first', rsuffix='_second').
joined_dataframe = temp.join(dataframe3, lsuffix='_nothing', rsuffix='_third')

Where temp and dataframe3 have no overlapping columns, and hence lsuffix and rsuffix aren't being applied.

jschendel · 2017-11-28T07:33:25Z

xref #17834, which I think is proposing something similar to what you'd need

jorisvandenbossche · 2017-11-28T08:54:18Z

Something you can do in the meantime to solve this is using concat and massaging the columns afterwards a bit:

In [22]: res = pd.concat([dataframe1, dataframe2, dataframe3], axis=1, keys=['first', 'second', 'third'])

In [23]: res.columns = [i + '_' + j for i, j in res.columns.swaplevel(0, 1)]

In [24]: res.head()
Out[24]: 
    a_first  b_first  a_second  b_second   a_third  b_third
0  0.369871      1.0  0.318002       1.0  0.676463      1.0
1  0.573191      1.0  0.877985       1.0  0.419603      1.0
2  0.810390      1.0  0.216200       1.0  0.198755      1.0
3  0.782245      1.0  0.968843       1.0  0.199693      1.0
4  0.162413      1.0  0.522495       1.0  0.852937      1.0

jorisvandenbossche · 2017-11-28T08:55:35Z

Let's close this issue, as indeed a possible solution is #17834, so let's continue the discussion whether we want to add something like that or not there.

barisser · 2017-11-28T17:48:05Z

Makes sense. Thanks.

jorisvandenbossche closed this as completed Nov 28, 2017

jorisvandenbossche added this to the No action milestone Nov 28, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

3 way joins column suffix looks wrong #18541

3 way joins column suffix looks wrong #18541

barisser commented Nov 28, 2017 •

edited

Loading

jschendel commented Nov 28, 2017

Uh oh!

jschendel commented Nov 28, 2017

Uh oh!

jorisvandenbossche commented Nov 28, 2017

Uh oh!

jorisvandenbossche commented Nov 28, 2017

Uh oh!

barisser commented Nov 28, 2017

Uh oh!

Uh oh!

3 way joins column suffix looks wrong #18541

3 way joins column suffix looks wrong #18541

Comments

barisser commented Nov 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Sample

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

jschendel commented Nov 28, 2017

Uh oh!

jschendel commented Nov 28, 2017

Uh oh!

jorisvandenbossche commented Nov 28, 2017

Uh oh!

jorisvandenbossche commented Nov 28, 2017

Uh oh!

barisser commented Nov 28, 2017

Uh oh!

barisser commented Nov 28, 2017 •

edited

Loading

Output of `pd.show_versions()`