pd.DataFrame.update much slower in 0.20 compared to 0.19.2 #16290


Closed
cfrancois7 opened this issue May 8, 2017 · 5 comments
Labels: Performance (Memory or execution speed performance)
Milestone: 0.20.2

Comments

@cfrancois7
Same DataFrame shape on both sides; left: an empty DataFrame, right: one full of zeros (6000 x 6000 each), with a 3-level MultiIndex:

# with 0.19.2
%timeit test.update(A)
1 loop, best of 3: 7.85 s per loop

# with 0.20
%timeit test.update(A)
1 loop, best of 3: 3min 25s per loop
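The original frames were not shared, but the setup presumably looked something like the following scaled-down sketch (names `test` and `A` match the timings above; sizes are reduced here so it runs quickly):

```python
import numpy as np
import pandas as pd

# Scaled-down stand-in for the reported setup: the issue used
# 6000 x 6000 frames with a 3-level MultiIndex on the columns.
cols = pd.MultiIndex.from_product([range(10), range(10), range(10)])

test = pd.DataFrame(np.nan, index=range(100), columns=cols)  # "empty" frame
A = pd.DataFrame(0.0, index=range(100), columns=cols)        # full of zeros

test.update(A)  # non-NA values of A overwrite test, aligned on labels
assert test.eq(0.0).all().all()
```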

Problem description

The update function in pandas 0.20 is much slower with the same data and the same routine.

Output of pd.show_versions()

# 0.20
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0.20.1
pytest: 3.0.5
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
pandas_gbq: None
pandas_datareader: None

# 0.19.2
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None

@jreback (Contributor) commented May 8, 2017

pls show a specific example.

@cfrancois7 (Author)

Where can I share my index and my sparse matrix with you? (around 50 MB)

@jreback (Contributor) commented May 9, 2017

pls show df.info() and df.head(), or code to create (best). for both original and updated.

@sinhrks added the Can't Repro, Performance (Memory or execution speed performance), and Needs Info (Clarification about behavior needed to assess issue) labels May 11, 2017
@jreback (Contributor) commented May 11, 2017

this is almost certainly related to the fixes in #16234 , but would need an example to see.

@jorisvandenbossche (Member)

And confirmed this with a simple example:

In [12]: df1 = pd.DataFrame(np.random.randn(6000,6000), columns=pd.MultiIndex.from_product([np.arange(60), np.arange(10), np.arange(10)]))

In [13]: df2 = pd.DataFrame(np.random.randn(6000,6000), columns=pd.MultiIndex.from_product([np.arange(60), np.arange(10), np.arange(10)]))

In [14]: %time df1.update(df2)
CPU times: user 3.81 s, sys: 368 ms, total: 4.18 s
Wall time: 4.26 s

In [15]: pd.__version__
Out[15]: '0.19.2'

In [22]: %time df1.update(df2)
CPU times: user 3min 34s, sys: 1.15 s, total: 3min 35s
Wall time: 3min 36s

In [23]: pd.__version__
Out[23]: '0.20.1'

In [17]: %time df1.update(df2)
CPU times: user 3.12 s, sys: 404 ms, total: 3.52 s
Wall time: 3.55 s

In [18]: pd.__version__
Out[18]: '0.21.0.dev+29.ge88b658.dirty'

(when more than 10000 columns, there is still a considerable slowdown because then the hashtable engine is used)
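For context on what the benchmark exercises: `DataFrame.update` is an aligned, in-place modification in which non-NA values from the passed frame overwrite the matching cells, so every column goes through label alignment. A minimal sketch of the semantics:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
other = pd.DataFrame({"a": [np.nan, 20.0, np.nan]})

df.update(other)  # in place; NaNs in `other` leave df untouched
print(df["a"].tolist())  # [1.0, 20.0, 3.0]
```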

@jorisvandenbossche jorisvandenbossche added this to the 0.20.2 milestone May 14, 2017
@jorisvandenbossche removed the Needs Info (Clarification about behavior needed to assess issue) label May 14, 2017