Description
I am not sure whether I should report this here or on numpy, but this is what led me to the problem:
```
In [11]: dAllTags.describe()
Out[11]:
                     finalPeriod
count                      74501
mean    -1 days +02:40:08.792662
std     500 days 06:32:37.640848
min       2 days 00:51:49.730000
25%     498 days 19:11:28.576000
50%     846 days 00:46:56.656000
75%    1245 days 17:11:58.493000
max    2224 days 07:03:26.593000
```
All the values are positive (the minimum is 2 days), yet the calculated mean is negative. This happens because the underlying type of `np.timedelta64` is `int64`, which overflows while calculating the mean.
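For illustration, here is a minimal sketch that reproduces the wrap-around (the count is taken from the `describe()` output above; the 800-day value is made up but of similar magnitude):

```python
import numpy as np

# ~75k durations of ~800 days each, stored as int64 nanoseconds.
# Their true sum (~5e21 ns) exceeds the int64 maximum (~9.2e18 ns,
# roughly 292 years), so the intermediate sum silently wraps around.
td = np.full(74501, np.timedelta64(800, 'D'), dtype='timedelta64[ns]')
print(td.sum())   # wrapped int64 total: nonsense, not the true sum
print(td.mean())  # likewise wrong (with my data it came out negative)
```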
Now, the issue of numerical stability in `numpy` has had a long history:
- Numerical stability numpy/numpy#4694
- numpy.mean(): accumulator default type should not be single precision (Trac #435) numpy/numpy#1033
- ndarray's mean method should be computed using double precision (Trac #465) numpy/numpy#1063
- Numerical-stable sum (similar to math.fsum) (Trac #1855) numpy/numpy#2448
And though some steps have been taken to improve precision (e.g. `math.fsum` in the standard library and pairwise summation in `numpy`), there doesn't seem to be a consensus on using a numerically stable method for `mean`.
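For concreteness, one overflow-free alternative is an incremental running mean, which never materializes the huge intermediate sum. This is only a sketch; `running_mean` is a hypothetical helper, not an existing numpy or pandas function:

```python
import numpy as np

# Hypothetical helper: an incremental running mean whose accumulator
# stays on the order of a single value, so it cannot overflow int64
# the way a plain sum-then-divide does.
def running_mean(td: np.ndarray) -> np.timedelta64:
    mean = 0.0
    for i, v in enumerate(td.view('i8'), start=1):
        mean += (v - mean) / i  # incremental update in float64
    return np.timedelta64(int(round(mean)), 'ns')

td = np.full(74501, np.timedelta64(800, 'D'), dtype='timedelta64[ns]')
print(running_mean(td))  # 800 days expressed in ns, no wrap-around
```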
I was wondering if something could be done at the Pandas level to resolve this issue. Currently, I am working around it with the rather elaborate scheme `df.finalPeriod.view(int).astype(float).mean()`, since `timedelta64` cannot be directly converted to `float64`. Is there a better/more intuitive way to do this?
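For reference, here is that workaround end to end on stand-in data (the column name follows the output above), using the same `.view` trick and converting the float result back into a `Timedelta`:

```python
import numpy as np
import pandas as pd

# Stand-in data; in the report above this would be dAllTags.
df = pd.DataFrame({'finalPeriod': pd.to_timedelta(
    np.full(74501, 800, dtype='int64'), unit='D')})

# Average the underlying nanosecond integers in float64 (whose range is
# ample, so nothing wraps), then wrap the result back into a Timedelta.
mean_ns = df['finalPeriod'].view('i8').astype('float64').mean()
print(pd.to_timedelta(mean_ns, unit='ns'))  # Timedelta('800 days 00:00:00')
```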