
ENH: add Series.histogram wrapping numpy.histogram #23710

Closed
@bluesquall

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
np.random.seed(3)
s = pd.Series(np.random.normal(0, 1, 100))
# What I'd like to be able to do:
h, b = s.histogram(20)
h
# array([ 1,  1,  1,  1,  3,  3,  4, 10,  7, 11, 11,  7,  7,  5,  9,  7, 3, 2,  4,  3])
len(b)
# 21
# Or, using a numpy automated bin selection algorithm:
ah, ab = s.histogram(bins='fd')
ah
# array([ 2,  5, 13, 22, 24, 15, 12,  7])
ab
# array([-2.91573775, -2.28150187, -1.64726598, -1.01303009, -0.3787942 , 0.25544168,  0.88967757,  1.52391346,  2.15814934])

Problem description

This would be a lightweight wrapper around np.histogram, much as Series.hist appears to be a lightweight wrapper around matplotlib.pyplot.hist (at least from a user's perspective).

  • It differs from Series.hist in that it returns the histogram counts and bin edges, rather than going straight to a plot.
  • It also allows users to leverage the automatic binning algorithms and the density keyword from np.histogram.
  • But it may not work well with missing data, or with non-numerical series. (I'm happy to pull that thread further, if there's interest.)
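As a rough illustration of the proposal, here is a minimal sketch written as a standalone function. The name series_histogram is hypothetical and is not part of pandas. It assumes missing values should simply be dropped before binning, and that non-numeric data should raise an error when converted to float; both are design choices that would need discussion.

```python
import numpy as np
import pandas as pd


def series_histogram(series, bins=10, **kwargs):
    """Hypothetical stand-in for the proposed Series.histogram (not a pandas API)."""
    # Assumption: drop missing values first, because np.histogram cannot
    # infer a finite range from data that contains NaN.
    values = np.asarray(series.dropna(), dtype=float)
    # Forward bins and any extra np.histogram keywords (e.g. range, density).
    return np.histogram(values, bins=bins, **kwargs)


np.random.seed(3)
s = pd.Series(np.random.normal(0, 1, 100))

h, b = series_histogram(s, 20)                    # 20 equal-width bins
ah, ab = series_histogram(s, bins='fd')           # automated bin selection
dens, _ = series_histogram(s, 20, density=True)   # normalized to unit area
```

Each call returns the counts (or densities) and the bin edges, so there is always one more edge than there are bins.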

In comparison, using pd.cut:

hb = pd.cut(s, 20).value_counts(sort=False)
# or
edges = np.arange(-3, 3, 0.5)
hbe = pd.cut(s, bins=edges).value_counts(sort=False)
# or
hbesi = pd.cut(s, bins=edges).value_counts().sort_index()
# but now plotting leaves an empty x-axis
hbesi.plot()
# or one with categorical labels, even though the categories are numerical intervals
hbesi.plot(kind='bar')
The pd.cut approach:

  • requires more typing and function calls
  • requires the user to sort by index afterward, or to remember to tell value_counts not to sort
  • returns a Series with a categorical index, which leads to a categorical axis when the data is eventually plotted. (I'm new to pandas and still haven't figured out how to convert a CategoricalIndex to regular floats. For my immediate application, I can use predefined bin edges and keep that array around, but I'd like to be able to use automated binning in the future.)
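One possible workaround for the categorical axis, sketched here under the assumption that the bin midpoint is an acceptable numeric label: each entry in the index produced by pd.cut is a pd.Interval, and its mid attribute gives a plain float. The names mids and counts_numeric are only for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(3)
s = pd.Series(np.random.normal(0, 1, 100))
edges = np.arange(-3, 3, 0.5)

# Bin, count, and put the bins back in ascending order.
counts = pd.cut(s, bins=edges).value_counts().sort_index()

# Each index label is a pd.Interval; take its midpoint as a plain float.
mids = np.array([interval.mid for interval in counts.index])

# Same counts on a numeric index, so plotting uses a numeric x-axis.
counts_numeric = pd.Series(counts.values, index=mids)
```

This still needs an extra step after pd.cut, which is part of the motivation for returning counts and edges directly.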

Finally, np.histogram is roughly 80 times faster in this measurement:

# x: the data being binned (not defined in the snippets above; presumably s or its values)
%timeit hbesi = pd.cut(x, edges).value_counts().sort_index()
# 2.76 ms ± 8.75 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit hbe = pd.cut(x, edges).value_counts(sort=False)
# 2.58 ms ± 34.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit h, b = np.histogram(x, edges)
# 33.2 µs ± 46.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
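The timing comparison is only meaningful if both approaches yield the same counts. The following sketch checks that, assuming x holds the same random sample used earlier. One caveat: pd.cut closes bins on the right by default, while np.histogram closes them on the left (except the last bin), so the counts could differ if a value fell exactly on an edge. With continuous random data that is not expected to happen.

```python
import numpy as np
import pandas as pd

np.random.seed(3)
x = np.random.normal(0, 1, 100)   # assumption: same sample as the Series above
edges = np.arange(-3, 3, 0.5)

# Counts and edges straight from NumPy.
h, b = np.histogram(x, edges)

# Counts via pd.cut, restored to ascending bin order.
cut_counts = pd.cut(pd.Series(x), edges).value_counts().sort_index()

# Both methods ignore values outside the outermost edges.
same = np.array_equal(h, cut_counts.values)
```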

Related issues and pull requests:

#23580, #3945, #4502, #265

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.16-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 40.5.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.7.0
