
Numerically unstable mean calculation for Timedeltas. #9670

Closed
@musically-ut


I am not sure whether I should report this here or on numpy, but this is what led me to the problem:

In [11]: dAllTags.describe()
Out[11]:
                     finalPeriod
count                      74501
mean    -1 days +02:40:08.792662
std     500 days 06:32:37.640848
min       2 days 00:51:49.730000
25%     498 days 19:11:28.576000
50%     846 days 00:46:56.656000
75%    1245 days 17:11:58.493000
max    2224 days 07:03:26.593000

All the values are positive (the minimum is 2 days), but the calculated mean is negative. This happens because the underlying type of np.timedelta64 is int64, and the sum of the values overflows while the mean is being calculated.
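To make the overflow concrete, here is a rough back-of-the-envelope check using only the numbers from the describe() output above. It assumes the values are stored as int64 nanoseconds, and `wrap_int64` is a hypothetical helper that only emulates two's-complement wraparound; it is not a pandas or numpy function.

```python
INT64_MAX = 2**63 - 1
NS_PER_DAY = 24 * 60 * 60 * 10**9

count = 74501          # "count" row of the describe() output
min_days = 2           # every value is at least 2 days ("min" row)

# Lower bound on the true sum, in nanoseconds (Python ints never overflow).
lower_bound_sum = count * min_days * NS_PER_DAY


def wrap_int64(x):
    """Hypothetical helper: emulate int64 two's-complement wraparound."""
    return (x + 2**63) % 2**64 - 2**63


print(lower_bound_sum > INT64_MAX)        # True: even the lower bound does not fit
print(wrap_int64(lower_bound_sum) < 0)    # True: the wrapped lower bound is negative
```

So even if every one of the 74501 values were only 2 days, the sum would not fit in an int64. The wrapped number here is for the lower bound only and is not the sum of the actual data, which may wrap more than once.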

Now the issue of numerical stability in numpy has had a long history:

And though some steps have been taken to improve accuracy (e.g. math.fsum in the Python standard library, and pairwise summation in numpy), there doesn't seem to be a consensus on using a numerically stable method for mean.

I was wondering if something could be done at the pandas level to resolve this issue.


Currently, I am working around the issue by using the rather elaborate scheme:

df.finalPeriod.view(int).astype(float).mean()

since timedelta64 cannot be directly converted to float64. Is there a better/more intuitive way to do this?
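One possibly more readable alternative is to divide by a fixed Timedelta first, which yields a float64 Series, and take the mean of that. This is only a sketch: `timedelta_mean_via_float` is a hypothetical helper name, the sample values are made up for illustration, and going through float64 can lose sub-microsecond precision for large values.

```python
import pandas as pd


def timedelta_mean_via_float(s: pd.Series) -> pd.Timedelta:
    """Hypothetical helper: mean of a timedelta Series computed in float64 days."""
    # Dividing a timedelta Series by a Timedelta gives plain floats (days here),
    # so the summation happens in float64 rather than int64 nanoseconds.
    days = s / pd.Timedelta(days=1)
    return pd.to_timedelta(days.mean(), unit="D")


# Made-up sample values, not the data from this report.
periods = pd.Series(pd.to_timedelta([2, 846, 2224], unit="D"))
print(timedelta_mean_via_float(periods))
```

The same idea works with any unit, e.g. dividing by pd.Timedelta(seconds=1) to get a mean in seconds.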
