Skip to content

join type changes silently when left-joining with a list of dataframes #19607

Closed
@elrubio

Description

@elrubio

Code sample, copy-pastable

import pandas as pd

# Note that index of d2 is _not_ unique
d1 = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'v': [1, 2, 3]}).set_index('k')
d2 = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'v': [4, 5, 6]}).set_index('k')

print('DataFrame d1:', d1, sep="\n")
print('DataFrame d2:', d2, sep="\n")

print("d1.join(d2, how='left') looks fine:")
d = d1.join(d2, how='left', lsuffix='_left', rsuffix='_right')
print(d)

print("But with d1.join([d2], how='left', ...), K3 shows up in index but shouldn't:")
d = d1.join([d2], how='left', lsuffix='_left', rsuffix='_right')
print(d)

print("pd.merge(d1, d2, how='left') works as expected:")
d = pd.merge(d1, d2, how='left', left_index=True, right_index=True)
print(d)

print("d1.join([d3], ...) also works as expected when indices are unique:")
d3 = pd.DataFrame({'k': ['K0', 'K4', 'K3'], 'v': [4, 5, 6]}).set_index('k')
d = d1.join(d3, how='left', lsuffix='_left', rsuffix='_right')
print(d)

# Output:
#
# DataFrame d1:
#     v
# k    
# K0  1
# K1  2
# K2  3
#
# DataFrame d2:
#     v
# k    
# K0  4
# K0  5
# K3  6
#
# d1.join(d2, how='left') looks fine:
#     v_left  v_right
# k                  
# K0       1      4.0
# K0       1      5.0
# K1       2      NaN
# K2       3      NaN
#
# But with d1.join([df], how='left', ...), K3 shows up in index but shouldn't:
#     v_x  v_y
# k           
# K0  1.0  4.0
# K0  1.0  5.0
# K1  2.0  NaN
# K2  3.0  NaN
# K3  NaN  6.0        <----- UNEXPECTED
#
# pd.merge(d1, d2, how='left') works as expected:
#     v_x  v_y
# k           
# K0    1  4.0
# K0    1  5.0
# K1    2  NaN
# K2    3  NaN
#
# d1.join([d3], ...) also works as expected when indices are unique:
#     v_left  v_right
# k                  
# K0       1      4.0
# K1       2      NaN
# K2       3      NaN

Problem description

df.join() silently changes a 'left' join to an 'outer' join when these conditions are met:

  • join() is called with a sequence of dataframes, e.g. d1.join([d2], how='left', ...)
  • some dataframe has a non-unique index

Note that the equivalent call to merge(d1, d2, how='left', ...) works just fine.

When looking at the join() code, I see concat() being discarded when dataframe indices are not unique and merge() being called instead. This would work perfectly if not how='left' had just been changed to how='outer' a couple of lines above.

Expected Output

d1.join(d2, how='left', ...) should give the same result as pd.merge(d1, d2, how='left', ...). The join type should not be changed.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 44 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions