Description
Code sample, copy-pastable
import pandas as pd
# Note that index of d2 is _not_ unique
d1 = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'v': [1, 2, 3]}).set_index('k')
d2 = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'v': [4, 5, 6]}).set_index('k')
print('DataFrame d1:', d1, sep="\n")
print('DataFrame d2:', d2, sep="\n")
print("d1.join(d2, how='left') looks fine:")
d = d1.join(d2, how='left', lsuffix='_left', rsuffix='_right')
print(d)
print("But with d1.join([d2], how='left', ...), K3 shows up in index but shouldn't:")
d = d1.join([d2], how='left', lsuffix='_left', rsuffix='_right')
print(d)
print("pd.merge(d1, d2, how='left') works as expected:")
d = pd.merge(d1, d2, how='left', left_index=True, right_index=True)
print(d)
print("d1.join([d3], ...) also works as expected when indices are unique:")
d3 = pd.DataFrame({'k': ['K0', 'K4', 'K3'], 'v': [4, 5, 6]}).set_index('k')
d = d1.join(d3, how='left', lsuffix='_left', rsuffix='_right')
print(d)
# Output:
#
# DataFrame d1:
# v
# k
# K0 1
# K1 2
# K2 3
#
# DataFrame d2:
# v
# k
# K0 4
# K0 5
# K3 6
#
# d1.join(d2, how='left') looks fine:
# v_left v_right
# k
# K0 1 4.0
# K0 1 5.0
# K1 2 NaN
# K2 3 NaN
#
# But with d1.join([df], how='left', ...), K3 shows up in index but shouldn't:
# v_x v_y
# k
# K0 1.0 4.0
# K0 1.0 5.0
# K1 2.0 NaN
# K2 3.0 NaN
# K3 NaN 6.0 <----- UNEXPECTED
#
# pd.merge(d1, d2, how='left') works as expected:
# v_x v_y
# k
# K0 1 4.0
# K0 1 5.0
# K1 2 NaN
# K2 3 NaN
#
# d1.join([d3], ...) also works as expected when indices are unique:
# v_left v_right
# k
# K0 1 4.0
# K1 2 NaN
# K2 3 NaN
Problem description
df.join()
silently changes a 'left' join to an 'outer' join when these conditions are met:
join()
is called with a sequence of dataframes, e.g.d1.join([d2], how='left', ...)
- some dataframe has a non-unique index
Note that the equivalent call to merge(d1, d2, how='left', ...)
works just fine.
When looking at the join() code, I see concat()
being discarded when dataframe indices are not unique and merge()
being called instead. This would work perfectly if not how='left'
had just been changed to how='outer'
a couple of lines above.
Expected Output
d1.join(d2, how='left', ...)
should give the same result as pd.merge(d1, d2, how='left', ...)
. The join type should not be changed.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 44 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None