Description
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
np.random.seed(3)
s = pd.Series(np.random.normal(0, 1, 100))
# What I'd like to be able to do:
h, b = s.histogram(20)
h
# array([ 1, 1, 1, 1, 3, 3, 4, 10, 7, 11, 11, 7, 7, 5, 9, 7, 3, 2, 4, 3])
len(b)
# 21
# Or, using a numpy automated bin selection algorithm:
ah, ab = s.histogram(bins='fd')
ah
# array([ 2, 5, 13, 22, 24, 15, 12, 7])
ab
# array([-2.91573775, -2.28150187, -1.64726598, -1.01303009, -0.3787942 , 0.25544168, 0.88967757, 1.52391346, 2.15814934])
Problem description
The proposed Series.histogram would be a lightweight wrapper around np.histogram, just as Series.hist seems to be a lightweight wrapper around matplotlib.pyplot.hist (at least from a user's perspective).
- It differs from Series.hist in that it returns the histogram counts and bin edges, rather than going straight to a plot.
- It also allows users to leverage the automatic binning algorithms and the density keyword from np.histogram.
- But it may not work well with missing data, or with non-numerical series. (I'm happy to pull that thread further, if there's interest.)
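As a rough illustration of the wrapper described above, here is a minimal sketch written as a free function rather than a Series method. The name series_histogram and its dropna parameter are hypothetical and not part of pandas; dropping missing values before binning is just one possible answer to the missing-data question, not a settled design.

```python
import numpy as np
import pandas as pd


def series_histogram(s, bins=10, dropna=True, **kwargs):
    """Hypothetical helper: return (counts, bin_edges) for a numeric Series."""
    # Assumption: missing values are simply dropped before binning.
    values = s.dropna() if dropna else s
    # bins may be an int, a sequence of edges, or a rule name such as 'fd';
    # extra keywords (e.g. density=True) are passed straight to np.histogram.
    return np.histogram(np.asarray(values, dtype=float), bins=bins, **kwargs)


np.random.seed(3)
s = pd.Series(np.random.normal(0, 1, 100))
s.iloc[0] = np.nan  # one missing value, to exercise the dropna path

h, b = series_histogram(s, 20)                        # fixed number of bins
ah, ab = series_histogram(s, bins='fd')               # Freedman-Diaconis rule
dens, _ = series_histogram(s, bins=20, density=True)  # normalised to unit area
```

Non-numerical series are not handled here; the float conversion would raise for them.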
In comparison, using pd.cut:
hb = pd.cut(s, 20).value_counts(sort=False)
# or
edges = np.arange(-3, 3, 0.5)
hbe = pd.cut(s, bins=edges).value_counts(sort=False)
# or
hbesi = pd.cut(s, bins=edges).value_counts().sort_index()
# but now plotting leaves an empty x-axis
hbesi.plot()
# or one with categorical labels, even though the categories are numerical intervals
hbesi.plot(kind='bar')
- requires more typing and function calls
- requires the user to sort by index afterward, or to remember to tell value_counts not to sort
- returns a series with a categorical index, which leads to a categorical axis when you eventually plot the data (I'm new to pandas, and still haven't figured out how I'm supposed to change a Categorical Index to regular floats. For my immediate application, I can just use pre-defined bin edges and keep that array around, but I'd like to be able to use automated binning in the future.)
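On the question of turning the categorical index into regular floats, one possible workaround (a sketch, not necessarily the recommended pandas idiom) is to read the midpoint off each Interval in the index and rebuild the series on those values. The variable names counts, mids and numeric are illustrative only.

```python
import numpy as np
import pandas as pd

np.random.seed(3)
s = pd.Series(np.random.normal(0, 1, 100))
edges = np.arange(-3, 3, 0.5)

# sort=False keeps the counts in bin order rather than sorted by frequency.
counts = pd.cut(s, bins=edges).value_counts(sort=False)

# Each index entry is a pandas Interval; use its midpoint as a plain float.
mids = np.array([interval.mid for interval in counts.index], dtype=float)

# Same counts, now on a float index, so plotting gives a numeric x-axis.
numeric = pd.Series(counts.values, index=mids)
```

Calling numeric.plot() should then draw against bin centres instead of category labels. This still requires the bin edges to be chosen up front, so it does not address automated binning.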
Finally, using np.histogram is much faster (x here is the data being binned):
%timeit hbesi = pd.cut(x,edges).value_counts().sort_index()
# 2.76 ms ± 8.75 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit hbe = pd.cut(x,edges).value_counts(sort=False)
# 2.58 ms ± 34.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit h, b = np.histogram(x,edges)
# 33.2 µs ± 46.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
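When comparing the two approaches it may help to confirm that they count the same thing. By default pd.cut uses right-closed bins (a, b], while np.histogram uses left-closed bins [a, b) with the last bin closed on both sides. The sketch below assumes x is a Series like s from the code sample (the original x is not shown) and passes right=False to pd.cut so that the bin conventions line up.

```python
import numpy as np
import pandas as pd

np.random.seed(3)
x = pd.Series(np.random.normal(0, 1, 100))  # assumed stand-in for x
edges = np.arange(-3, 3, 0.5)

h, b = np.histogram(x.values, edges)

# right=False gives left-closed bins [a, b), matching np.histogram
# everywhere except at the final right-hand edge.
cut_counts = pd.cut(x, edges, right=False).value_counts(sort=False)

# True when both approaches report identical per-bin counts.
same = np.array_equal(h, cut_counts.values)
```

For continuous data no value normally lands exactly on an edge, so the counts agree under either convention; the distinction matters mainly for integer or rounded data.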
Related issues and pull requests:
Output of pd.show_versions()
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.16-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 40.5.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.7.0