What kind of variance should var() return? #149

Closed
@milancurcic

Description

Playing with a toy example today, I was surprised to see the result (I guess I should have read the specification first, LOL).

Program:

use stdlib_experimental_stats, only: mean, var
real :: a(5) = [1, 2, 3, 4, 5]
print *, var(a), mean((a - mean(a))**2)
end

I expected two identical numbers. Result:

   2.50000000       2.00000000    

Then I looked at the code for var() and saw that it divides the sum of squared deviations by (N - 1) rather than by N.

Then I looked at this issue and the spec, and it's all good: it says that the variance is defined in such a way that we divide by (N - 1). The code works as advertised.

But then I wondered why N - 1 and not N, did some Google searching, and found that there are all kinds of variances out there: dividing by (N - 1) gives the sample variance, the best unbiased estimator of the population variance (Bessel's correction), while dividing by just N gives the population, or biased, variance.
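
To make the difference concrete for this example, here is a minimal, self-contained sketch (no stdlib dependency; the program name is just for illustration) that computes both flavors:

program variance_flavors
  implicit none
  real :: a(5) = [1, 2, 3, 4, 5]
  real :: ss
  ! Sum of squared deviations about the mean: 4 + 1 + 0 + 1 + 4 = 10
  ss = sum((a - sum(a) / size(a))**2)
  print *, ss / (size(a) - 1)  ! (N - 1) divisor, unbiased sample variance: 10/4 = 2.5
  print *, ss / size(a)        ! N divisor, population variance:            10/5 = 2.0
end program variance_flavors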

How we define this affects not only the numerical result but, in some cases, also the behavior of the program: the N-divisor variance of a scalar is 0, while the (N - 1)-divisor variance of a scalar is NaN. Some discussion here.

NumPy's np.var(), for example, defines it as a simple variance (divide by N), but you can optionally set a "delta degrees of freedom", so that np.var(x, ddof=1) corresponds to the unbiased (N - 1) sample variance.

I'm not a statistician but a wave physicist. I expect my variance to divide by N, and NumPy has served me well so far.

Question: Should we consider adding an optional "delta degrees of freedom" argument to make both statisticians and physicists happy?
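
For illustration only, here is a sketch of what such an option could look like; the function name var_ddof, the argument name ddof, and its default are assumptions for this example, not the actual stdlib interface:

! Hypothetical sketch; not the actual stdlib implementation.
pure function var_ddof(x, ddof) result(v)
  real, intent(in) :: x(:)
  integer, intent(in), optional :: ddof  ! delta degrees of freedom; assumed default of 1, i.e. the (N - 1) divisor
  real :: v
  integer :: d
  d = 1
  if (present(ddof)) d = ddof
  v = sum((x - sum(x) / size(x))**2) / (size(x) - d)
end function var_ddof

With this sketch, var_ddof(a) would return 2.5 and var_ddof(a, ddof=0) would return 2.0; note that NumPy makes the opposite choice and defaults ddof to 0.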
