Python: Numpy standard deviation error

This is a simple test:
import numpy as np
data = np.array([-1,0,1])
print data.std()
>> 0.816496580928
I don't understand how this result was generated. Obviously:
( (1^2 + 1^2 + 0^2)/(3-1) )^0.5 = 1
and in MATLAB it gives me std([-1,0,1]) = 1. Could you help me understand how numpy.std() works?

The crux of this problem is that numpy divides by N (3), not by N - 1 (2). As larsmans pointed out, numpy uses the population variance by default, not the sample variance.
So the real answer is sqrt(2/3), which is exactly that: 0.8164965...
If you deliberately want to use a different value (than the default of 0) for the delta degrees of freedom, use the keyword argument ddof with a positive value:
np.std(data, ddof=1)
... but doing so here would reintroduce your original problem, as numpy will then divide by N - ddof.
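As a quick sanity check (my own illustration, not part of the original answer), both divisors can be reproduced by hand:
import numpy as np

data = np.array([-1, 0, 1])
# Population std (numpy's default, ddof=0): divide the summed squared
# deviations by N = 3.
print(np.sqrt(((data - data.mean())**2).sum() / len(data)))        # 0.8164..., same as data.std()
# Sample std (ddof=1): divide by N - 1 = 2, which is MATLAB's default.
print(np.sqrt(((data - data.mean())**2).sum() / (len(data) - 1)))  # 1.0, same as np.std(data, ddof=1)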

It is worth reading the help page for the function/method before suggesting it is incorrect. The method does exactly what the docstring says it should: it divides by 3 because, by default, ddof is zero:
In [3]: numpy.std?
String form: <function std at 0x104222398>
File: /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/fromnumeric.py
Definition: numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False)
Docstring:
Compute the standard deviation along the specified axis.
...
ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations
is ``N - ddof``, where ``N`` represents the number of elements.
By default `ddof` is zero.

When getting into NumPy from MATLAB, you'll probably want to keep the docs for both handy. They're similar, but they often differ in small yet important details. Basically, they calculate the standard deviation differently. I would strongly recommend checking the documentation for anything you use that calculates a standard deviation, whether a pocket calculator or a programming language, since the default is not (sorry!) standardized.
Numpy STD: http://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html
Matlab STD: http://www.mathworks.com/help/matlab/ref/std.html
The NumPy docs for std are a bit opaque, IMHO, especially considering that NumPy docs are generally fairly clear. If you read far enough: "The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population." (In English: the default is the population std dev; set ddof=1 for the sample std dev.)
OTOH, the Matlab docs make clear the difference that's tripping you up:
"There are two common textbook definitions for the standard deviation s of a data vector X. [equations omitted] n is the number of elements in the sample. The two forms of the equation differ only in n - 1 versus n in the divisor."
So, by default, MATLAB calculates the sample standard deviation (N - 1 in the divisor, hence a bigger result, to compensate for the fact that this is a sample) and NumPy calculates the population standard deviation (N in the divisor). You use the ddof parameter to switch to the sample standard deviation, or any other denominator you want (which goes beyond my statistics knowledge).
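For example (a quick sketch using the data from the question), the two defaults line up like this:
import numpy as np

x = [-1, 0, 1]
print(np.std(x))           # 0.8164..., population std (divisor N), numpy's default
print(np.std(x, ddof=1))   # 1.0, sample std (divisor N - 1), MATLAB's default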
Lastly, it doesn't help on this problem, but you'll probably find this helpful at some point. Link

Related

How can I apply Chebyshev's inequality to this case?

I have a dataframe which includes heights. The data cannot go below zero, which is why I cannot use the standard deviation directly: this data is not normally distributed. I cannot use the 68-95-99.7 rule here because it fails in my case. Here are my dataframe, mean, and SD.
0.77132064
0.02075195
0.63364823
0.74880388
0.49850701
0.22479665
0.19806286
0.76053071
0.16911084
0.08833981
Mean: 0.41138725956196015
Std: 0.2860541519582141
If I go 2 standard deviations out, as you can see, the number becomes negative:
mean - 2 x std = 0.41138725956196015 - 2 x 0.2860541519582141 = -0.160721044354468
I have tried using percentiles and am not satisfied with them, to be honest. How can I apply Chebyshev's inequality to this problem? Here is what I have done so far:
np.polynomial.Chebyshev(df['Heights'])
But this returns numbers, not an SD level I can measure. Or do you think Chebyshev is even the right choice in my case?
Expected solution:
I am expecting to get a range, e.g. "with 75% probability, the next height will be between 0.40 and 0.43", etc.
EDIT1: Added histogram
To be clearer, I have added my real data's histogram.
EDIT2: Some values from real data
Mean: 0.007041500928135767
Percentile 50: 0.0052000000000000934
Percentile 90: 0.015500000000000047
Std: 0.0063790857035425025
Var: 4.06873389299246e-05
Thanks a lot
You seem to be confusing two ideas from the same mathematician, Chebyshev. The ideas are not the same.
Chebyshev's inequality states a fact that is true for many probability distributions. For two standard deviations, it states that at least three-fourths of the data items will lie within two standard deviations of the mean. As you state, for normal distributions about 19/20 of the items will lie in that interval, but Chebyshev's inequality is an absolute bound that is met by practically all distributions. The fact that your data values are never negative does not change the truth of the inequality; it just makes the actual proportion of values in the interval even larger, so the inequality holds all the more comfortably.
Chebyshev polynomials do not involve statistics, but are simply a series (or two series) of polynomials, commonly used in calculating approximations for computer functions. That is what np.polynomial.Chebyshev involves, and therefore does not seem useful to you at all.
So calculate Chebyshev's inequality yourself. There is no need for a special function for that, since it is so easy (this is Python 3 code):
def Chebyshev_inequality(num_std_deviations):
    return 1 - 1 / num_std_deviations**2
You can change that to handle the case where k <= 1 but the idea is obvious.
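For instance, a sketch of that extension (the function name is mine, not from any library):
def chebyshev_lower_bound(k):
    # Fraction of any distribution guaranteed to lie within k standard
    # deviations of the mean; for k <= 1 the bound is vacuous, so return 0.
    if k <= 1:
        return 0.0
    return 1 - 1 / k**2

print(chebyshev_lower_bound(2))  # 0.75: at least 75% within 2 SDs
print(chebyshev_lower_bound(3))  # 0.888...: at least ~88.9% within 3 SDs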
In your particular case: the inequality says that at least 3/4, or 75%, of the data items will lie within 2 standard deviations of the mean, i.e. above 0.41138725956196015 - 2 * 0.2860541519582141 and below 0.41138725956196015 + 2 * 0.2860541519582141 (note the different signs), which works out to the interval
[-0.16072104435446805, 0.9834955634783884]
In your data, 100% of your data values are in that interval, so Chebyshev's inequality was correct (of course).
Now, if your goal is to predict or estimate where a certain percentile is, Chebyshev's inequality does not help much. It is an absolute lower bound, so it gives one limit on a percentile. For example, by what we did above we know that the 12.5th percentile is at or above -0.16072104435446805 and the 87.5th percentile is at or below 0.9834955634783884. Those facts are true but are probably not what you want. If you want an estimate that is closer to the actual percentile, this is not the way to go. The 68-95-99.7 rule is an estimate: the actual locations may be higher or lower, but if the distribution is normal then the estimate will not be far off. Chebyshev's inequality does not give that kind of estimate.
If you want to estimate the 12.5th and 87.5th percentiles (showing where 75 percent of the population will fall) you should calculate those percentiles of your sample and use those values. If you don't know more details about the kind of distribution you have, I don't see any better way. There are reasons why normal distributions are so popular!
It sounds like you want the boundaries for the middle 75% of your data.
The middle 75% of the data is between the 12.5th percentile and the 87.5th percentile, so you can use the quantile function to get the values at the locations:
[df['Heights'].quantile(0.5 - 0.75/2), df['Heights'].quantile(0.5 + 0.75/2)]
#[0.09843618875, 0.75906485625]
As per "What does it mean when the standard deviation is higher than the mean? What does that tell you about the data?" on Quora, SD is a measure of "spread" and the mean is a measure of "position". As you can see, these are more or less independent things. Now, if all your samples are positive, the SD cannot be greater than the mean because of the way it's calculated, but 2 or 3 SDs very well can be.
So, basically, SD being roughly equal to the mean means that your data are all over the place.
Now, a random variable that's strictly positive indeed cannot be normally distributed. But for a rough estimation, seeing that you still have a bell shape, we can pretend it is and still use the SD as a rough measure of spread (though, since mean - 2 SD and mean - 3 SD go negative, those multiples lack any physical meaning here whatsoever and so are unusable for the sake of our pretense):
E.g. to get a rough prediction of grass growth, you can still take the mean and apply whatever growth model you're using to it -- that will get the new, prospective mean. Then applying the same to mean±SD will give an idea of the new SD.
This is very rough, of course. But to get anything better, you'll have to somehow determine which distribution you're dealing with and use its peak and spread characteristics instead of the mean and SD. And in any case, your prediction will be no better than your growth model, and studies of growth models are anything but conclusive, judging by e.g. https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1365-3040.2005.01490.x (not a single formula there).

What is the difference between numpy var() and statistics variance() in python?

I was trying a Dataquest exercise and figured out that the variance I am getting is different for the two packages.
E.g. for [1,2,3,4]:
from statistics import variance
import numpy as np
print(np.var([1,2,3,4]))
print(variance([1,2,3,4]))
# 1.25
# 1.6666666666666667
The expected answer of the exercise is calculated with np.var()
Edit
I guess it has to do with the latter being the sample variance rather than the population variance. Could anyone explain the difference?
Use this:
print(np.var([1,2,3,4],ddof=1))
1.66666666667
Delta Degrees of Freedom: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default, ddof is zero.
The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead.
In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
Statistical libraries like numpy use the population variance (with divisor n) for what they call var or variance, and likewise for the standard deviation.
For more information, refer to this documentation: numpy doc
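To see the correspondence directly (a small check of my own, using Python 3's statistics module): np.var with ddof=1 matches statistics.variance, while numpy's default matches statistics.pvariance:
import numpy as np
from statistics import variance, pvariance

data = [1, 2, 3, 4]
print(np.var(data))          # 1.25, population variance (divisor N)
print(pvariance(data))       # 1.25, statistics' population variance
print(np.var(data, ddof=1))  # 1.666..., sample variance (divisor N - 1)
print(variance(data))        # 1.666..., statistics' sample variance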
It is correct that dividing by N-1 gives an unbiased estimate of the variance, which can give the impression that dividing by N-1 is therefore slightly more accurate, albeit a little more complex. What is too often not stated is that dividing by N gives the minimum-variance estimate of the variance, which is likely to be closer to the true variance than the unbiased estimate, as well as being somewhat simpler.
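A rough simulation sketch of that claim (my own illustration, assuming normally distributed data with true variance 1): the divisor-N estimator is biased low, yet has the smaller mean squared error on small samples:
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=(100000, 5))  # 100k samples of size N=5
var_n = samples.var(axis=1, ddof=0)    # divide by N
var_nm1 = samples.var(axis=1, ddof=1)  # divide by N - 1 (unbiased)

print(var_n.mean(), var_nm1.mean())  # ~0.8 (biased low) vs ~1.0 (unbiased)
print(((var_n - 1.0)**2).mean())     # MSE with divisor N: ~0.36
print(((var_nm1 - 1.0)**2).mean())   # MSE with divisor N - 1: ~0.5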

Numpy 1 Degree of Freedom

In the following code I'm confused as to what the third line means. What does ddof = 1 do? I tried looking it up, but I still don't quite understand the concept or the purpose. I would really appreciate it if somebody could point me in the right direction.
Thanks
import numpy as np
from scipy import stats

data = stats.binom.rvs(n=10, p=0.3, size=10000)
print "Mean: %g" % np.mean(data)
print "SD: %g" % np.std(data, ddof=1)
First read the documentation:
Means Delta Degrees of Freedom. The divisor used in
calculations is N - ddof, where N represents the number of elements.
By default ddof is zero.
Searching for Degrees of Freedom then explains the statistical concept (emphasis mine):
Estimates of statistical parameters can be based upon different
amounts of information or data. The number of independent pieces of
information that go into the estimate of a parameter are called the
degrees of freedom. In general, the degrees of freedom of an estimate
of a parameter are equal to the number of independent scores that go
into the estimate minus the number of parameters used as intermediate
steps in the estimation of the parameter itself (most of the time the
sample variance has N − 1 degrees of freedom, since it is computed
from N random scores minus the only 1 parameter estimated as
intermediate step, which is the sample mean).
Degrees of freedom is an important concept which you may want to look up, but the computational difference is actually straightforward. Consider these:
In [20]:
x = np.array([6,5,4,6,6,7,2])
In [21]:
np.std(x)
Out[21]:
1.5518257844571737
#default is ddof=0, what this actually does:
In [22]:
np.sqrt((((x-x.mean())**2)/len(x)).sum())
Out[22]:
1.5518257844571737
In [23]:
np.std(x, ddof=1)
Out[23]:
1.6761634196950517
#what ddof=1 does:
In [24]:
np.sqrt((((x-x.mean())**2)/(len(x)-1)).sum())
Out[24]:
1.6761634196950517
In most languages (R, SAS, etc.), the default is to return the std with ddof=1. numpy's default is ddof=0, which is something worth noting.
It refers to denominator degrees of freedom
In some cases (e.g. working with population level data) your denominator is N. In other cases (e.g. sample level data) your denominator is N-1 (or whatever your ddof value is set to).
So the difference here is: do you want to divide by N, or by N - ddof? Where and when you divide by each is a more domain/context-specific question.
In numpy, the default is ddof = 0 (divide by N), so if you want a different denominator value you have to manually specify it
ddof = 1 refers to degrees of freedom.
It's a statistical concept.
The formula for the standard deviation of a population is:
sigma = sqrt( sum((x_i - mu)^2) / N )
But most of the time, we are trying to use the sample standard deviation to estimate the true standard deviation of the population. The formula above is a downward-biased estimate when applied to a sample; using N - 1 instead of N gives us a correction.
Check wiki: https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation for more information
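A brief sketch of both formulas in code (my own illustration, not from the answer), checked against numpy on the array used earlier:
import numpy as np

def population_std(x):
    # sigma = sqrt( sum((x_i - mu)^2) / N ), what np.std(x) computes
    x = np.asarray(x, dtype=float)
    return np.sqrt(((x - x.mean())**2).sum() / len(x))

def sample_std(x):
    # s = sqrt( sum((x_i - xbar)^2) / (N - 1) ), what np.std(x, ddof=1) computes
    x = np.asarray(x, dtype=float)
    return np.sqrt(((x - x.mean())**2).sum() / (len(x) - 1))

x = [6, 5, 4, 6, 6, 7, 2]
print(population_std(x), np.std(x))      # both 1.5518...
print(sample_std(x), np.std(x, ddof=1))  # both 1.6761...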

Why does numpy std() give a different result to matlab std()?

I'm trying to convert MATLAB code to numpy and figured out that numpy gives a different result for the std function.
in matlab
std([1,3,4,6])
ans = 2.0817
in numpy
np.std([1,3,4,6])
1.8027756377319946
Is this normal? And how should I handle this?
The NumPy function np.std takes an optional parameter ddof: "Delta Degrees of Freedom". By default, this is 0. Set it to 1 to get the MATLAB result:
>>> np.std([1,3,4,6], ddof=1)
2.0816659994661326
To add a little more context, in the calculation of the variance (of which the standard deviation is the square root) we typically divide by the number of values we have.
But if we select a random sample of N elements from a larger distribution and calculate the variance, division by N can lead to an underestimate of the actual variance. To fix this, we can lower the number we divide by (the degrees of freedom) to a number less than N (usually N-1). The ddof parameter allows us to change the divisor by the amount we specify.
Unless told otherwise, NumPy will calculate the biased estimator for the variance (ddof=0, dividing by N). This is what you want if you are working with the entire distribution (and not a subset of values which have been randomly picked from a larger distribution). If the ddof parameter is given, NumPy divides by N - ddof instead.
The default behaviour of MATLAB's std is to correct the bias for sample variance by dividing by N-1. This gets rid of some (but probably not all) of the bias in the standard deviation. This is likely to be what you want if you're using the function on a random sample of a larger distribution.
The nice answer by hbaderts (below) gives further mathematical details.
The standard deviation is the square root of the variance. The variance of a random variable X is defined as
Var(X) = E[ (X - E[X])^2 ]
An estimator for the variance would therefore be
s_n^2 = (1/n) * sum((x_i - xbar)^2)
where xbar denotes the sample mean. For randomly selected x_i, it can be shown that this estimator does not converge to the real variance sigma^2, but to
(n-1)/n * sigma^2
If you randomly select samples and estimate the sample mean and variance, you will have to use a corrected (unbiased) estimator
s^2 = (1/(n-1)) * sum((x_i - xbar)^2)
which will converge to sigma^2. The use of n - 1 instead of n is also called Bessel's correction.
Now by default, MATLAB's std calculates the unbiased estimator, with n-1 in the divisor. NumPy, however (as ajcr explained), calculates the biased estimator with no correction by default. The parameter ddof allows setting any divisor n - ddof. Setting it to 1 gives the same result as MATLAB.
Similarly, MATLAB allows a second parameter w, which specifies the "weighting scheme". The default, w=0, results in the divisor n-1 (unbiased estimator), while for w=1 the divisor is n (biased estimator).
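So, as a rough translation table (a sketch, using the question's data; in MATLAB, std(x) and std(x, 0) are equivalent):
import numpy as np

x = [1, 3, 4, 6]
# MATLAB std(x) or std(x, 0): divisor n - 1  ->  numpy ddof=1
print(np.std(x, ddof=1))   # 2.0816659994661326
# MATLAB std(x, 1): divisor n                ->  numpy ddof=0 (the default)
print(np.std(x, ddof=0))   # 1.8027756377319946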
For people who aren't great with statistics, a simplistic guide is:
Include ddof=1 if you're calculating np.std() for a sample taken from your full dataset.
Ensure ddof=0 if you're calculating np.std() for the full population.
The DDOF is included for samples in order to counterbalance bias that can occur in the numbers.

Scipy standard deviation

I'm trying to calculate the standard deviation for some distribution and keep getting two different results from two paths. It doesn't make much sense to me; could someone explain why this is happening?
scipy.stats.binom(189, 100/189).std()
6.8622115305451707
scipy.stats.tstd([1]*100 + [0]*89)
0.50047821327986164
Why aren't those two numbers equal?
The basic reason is that you're taking the standard deviation of two quite different things there. I think you're misunderstanding what scipy.stats.binom does. From the documentation:
The probability mass function for binom is:
binom.pmf(k) = choose(n,k) * p**k * (1-p)**(n-k)
for k in {0,1,...,n}.
binom takes n and p as shape parameters.
When you do binom(189, 100/189), you are creating a distribution that could take on any value from 0 to 189. This distribution unsurprisingly has a much larger variance than the other sample data you're using, which is restricted to values of either zero or one.
It looks like what you want would be scipy.stats.binom(1, 100/189).std(). However, you still can't expect the exact same value as what you're getting with your sample data, because the binom.std is computing the standard deviation of the overall distribution, whereas the other version (scipy.stats.tstd([1]*100 + [0]*89)) is computing the standard deviation only of a sample. If you increase the size of your sample (e.g., do scipy.stats.tstd([1]*1000 + [0]*890)), the sample standard deviation will approach the value you're getting from binom.std.
You can also get the population (not sample) std by using numpy.std instead of scipy.stats.tstd, which computes a sample std (divisor N - 1) by default.
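A quick sketch of that convergence check (my own illustration; printed values are approximate):
import numpy as np
from scipy import stats

p = 100 / 189
print(stats.binom(1, p).std())  # ~0.4992, population std of a single 0/1 trial

sample = [1] * 100 + [0] * 89
print(stats.tstd(sample))       # 0.50047..., sample std (divisor N - 1)
print(np.std(sample))           # ~0.4992, population std of the same sample

bigger = [1] * 10000 + [0] * 8900
print(stats.tstd(bigger))       # ~0.4992, approaching the binomial's std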
