In the following code I'm confused as to what the third line means. What does ddof=1 do? I tried looking it up, but I still don't quite understand the concept or its purpose. I would really appreciate it if somebody could point me in the right direction.
Thanks
data = stats.binom.rvs(n = 10, p = 0.3, size = 10000)
print "Mean: %g" % np.mean(data)
print "SD: %g" % np.std(data, **ddof=1**)
First read the documentation:
Means Delta Degrees of Freedom. The divisor used in
calculations is N - ddof, where N represents the number of elements.
By default ddof is zero.
Searching for Degrees of Freedom then explains the statistical concept (emphasis mine):
Estimates of statistical parameters can be based upon different
amounts of information or data. The number of independent pieces of
information that go into the estimate of a parameter are called the
degrees of freedom. In general, the degrees of freedom of an estimate
of a parameter are equal to the number of independent scores that go
into the estimate minus the number of parameters used as intermediate
steps in the estimation of the parameter itself (most of the time the
sample variance has N − 1 degrees of freedom, since it is computed
from N random scores minus the only 1 parameter estimated as
intermediate step, which is the sample mean).
Degrees of freedom is an important concept that you may want to look up, but the computational difference is actually straightforward. Consider these:
In [20]:
x = np.array([6,5,4,6,6,7,2])
In [21]:
np.std(x)
Out[21]:
1.5518257844571737
#default is ddof=0, what this actually does:
In [22]:
np.sqrt((((x-x.mean())**2)/len(x)).sum())
Out[22]:
1.5518257844571737
In [23]:
np.std(x, ddof=1)
Out[23]:
1.6761634196950517
#what ddof=1 does:
In [24]:
np.sqrt((((x-x.mean())**2)/(len(x)-1)).sum())
Out[24]:
1.6761634196950517
In most languages (R, SAS, etc.), the default is to return the std with ddof=1. numpy's default is ddof=0, which is worth noting.
It refers to the denominator degrees of freedom.
In some cases (e.g. working with population-level data) your denominator is N. In other cases (e.g. sample-level data) your denominator is N-1 (or, more generally, N minus whatever ddof is set to).
So the difference comes down to whether you want to divide by N or by N-ddof. Where and when you divide by each is a more domain/context-specific question.
In numpy, the default is ddof=0 (divide by N), so if you want a different denominator value you have to specify it manually.
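As a minimal sketch (the array values here are arbitrary, chosen only for illustration), you can verify both divisors against np.std by hand:
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Default ddof=0: divide the summed squared deviations by N
print(np.std(x))                                          # 2.0
print(np.sqrt(((x - x.mean())**2).sum() / len(x)))        # same value

# ddof=1: divide by N - 1 instead
print(np.std(x, ddof=1))                                  # ~2.138
print(np.sqrt(((x - x.mean())**2).sum() / (len(x) - 1)))  # same value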
ddof = 1 refers to Degrees of Freedom.
It's a statistical concept.
The formula for the standard deviation of a population is
sigma = sqrt( (1/N) * sum_i (x_i - mu)^2 )
But most of the time we are trying to use the sample standard deviation to estimate the true standard deviation of the population. When applied to a sample, the formula above is a downward-biased estimate; using N-1 instead of N as the divisor corrects for this.
Check wiki: https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation for more information
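To see the correction at work, here is a small simulation sketch (the normal population with sigma=2 and the sample size of 5 are just assumptions for illustration): averaging many ddof=0 estimates understates the true variance, while ddof=1 does not.
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0  # population: normal with sigma = 2

biased, corrected = [], []
for _ in range(20000):
    sample = rng.normal(0, 2, size=5)          # small sample from the population
    biased.append(np.var(sample))              # divides by N
    corrected.append(np.var(sample, ddof=1))   # divides by N - 1

print(np.mean(biased))     # ~3.2, systematically below 4
print(np.mean(corrected))  # ~4.0, unbiased on average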
Related
I was trying a Dataquest exercise and noticed that the variance I get is different for the two packages.
e.g. for [1,2,3,4]
from statistics import variance
import numpy as np
print(np.var([1,2,3,4]))
print(variance([1,2,3,4]))
# 1.25
# 1.6666666666666667
The expected answer of the exercise is calculated with np.var()
Edit
I guess it has to do with the fact that the latter is the sample variance rather than the population variance. Could anyone explain the difference?
Use this
print(np.var([1,2,3,4],ddof=1))
1.66666666667
Delta Degrees of Freedom: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default, ddof is zero.
The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead.
In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
Statistical libraries like numpy use the population variance (dividing by n) by default for what they call var or variance, and the corresponding square root for the standard deviation.
For more information, refer to the numpy documentation.
It is correct that dividing by N-1 gives an unbiased estimate of the variance, which can give the impression that dividing by N-1 is therefore slightly more accurate, albeit a little more complex. What is too often not stated is that dividing by N gives the minimum-variance estimate of the variance, which is likely to be closer to the true variance than the unbiased estimate, as well as being somewhat simpler.
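A rough simulation sketch of that claim (the normal population and sample size of 5 are arbitrary choices here): the N divisor is biased, but its mean squared error typically comes out smaller than that of the unbiased N-1 estimator.
import numpy as np

rng = np.random.default_rng(1)
true_var = 1.0
n = 5

est_n, est_nm1 = [], []
for _ in range(50000):
    s = rng.normal(0, 1, size=n)
    est_n.append(np.var(s))             # divide by N
    est_nm1.append(np.var(s, ddof=1))   # divide by N - 1

def mse(estimates):
    return np.mean((np.array(estimates) - true_var) ** 2)

print(mse(est_n), mse(est_nm1))  # the N divisor has the smaller mean squared error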
from datetime import datetime
from pandas.io.data import DataReader
from numpy import cumsum, log, polyfit, sqrt, std, subtract
from numpy.random import randn

def hurst(ts):
    """Returns the Hurst Exponent of the time series vector ts"""
    # Create the range of lag values
    lags = range(2, 100)

    # Calculate the array of the variances of the lagged differences
    # Here it calculates the variances, but why does it use the
    # standard deviation and then take a square root of it?
    tau = [sqrt(std(subtract(ts[lag:], ts[:-lag]))) for lag in lags]

    # Use a linear fit to estimate the Hurst Exponent
    poly = polyfit(log(lags), log(tau), 1)

    # Return the Hurst exponent from the polyfit output
    return poly[0]*2.0

# Download the stock prices series from Yahoo
aapl = DataReader("AAPL", "yahoo", datetime(2012,1,1), datetime(2015,9,18))

# Call the function
hurst(aapl['Adj Close'])
From this code for estimating the Hurst exponent: when we want to calculate the variance of the lagged differences, why do we still use the standard deviation and then take a square root of it? I have been confused about this for a long time, and I don't know why others don't share this confusion. Am I misunderstanding the math behind it? Thanks!
I'm just as confused. I don't understand where the sqrt of std comes from either, and have spent 3 days trying to figure it out. In the end I noticed QuantStart credits Dr Tom Starke who uses a slightly different code. Dr Tom Starke credits Dr Ernie Chan, and going to his blog. I was able to find enough information to put together my own code from his principles. This doesn't use sqrt, uses variance instead of std and uses a 2.0 divisor at the end instead of a 2.0 multiplier. In the end, it seems to give the same results as the quantstart code you post, but I am able to understand it from first principles, which I guess is important. I put together a Jupyter Notebook which makes it clearer, but I'm not sure if I can post that here, so I will try to explain as best I can here. Code is pasted first, then an explanation.
from numpy import subtract, var, log10, polyfit

lags = range(2, 100)

def hurst_ernie_chan(p):
    variancetau = []
    tau = []

    for lag in lags:
        # Write the different lags into a vector to compute a set of tau or lags
        tau.append(lag)

        # Compute the log returns on all days, then compute the variance of the
        # difference in log returns; call this pp, the price difference
        pp = subtract(p[lag:], p[:-lag])
        variancetau.append(var(pp))

    # We now have a set of tau (lags) and a corresponding set of variances.
    # print tau
    # print variancetau

    # Plot the log of those variances against the log of tau and get the slope
    m = polyfit(log10(tau), log10(variancetau), 1)

    hurst = m[0] / 2

    return hurst
Dr Chan doesn't give any code on this page (I believe he works in MATLAB not Python anyway). Hence I needed to put together my own code from the notes he gives in his blog and answers he gives to questions posed on his blog.
Dr Chan states that if z is the log price, then volatility, sampled at intervals of τ, is volatility(τ)=√(Var(z(t)-z(t-τ))). To me another way of describing volatility is standard deviation, so std(τ)=√(Var(z(t)-z(t-τ)))
std is just the root of variance so var(τ)=(Var(z(t)-z(t-τ)))
Dr Chan then states: In general, we can write Var(τ) ∝ τ^(2H) where H is the Hurst exponent
Hence (Var(z(t)-z(t-τ))) ∝ τ^(2H)
Taking the log of each side we get log (Var(z(t)-z(t-τ))) ∝ 2H log τ
[ log (Var(z(t)-z(t-τ))) / log τ ] / 2 ∝ H (gives the Hurst exponent) where we know the term in square brackets on far left is the slope of a log-log plot of tau and a corresponding set of variances.
If you run that function and compare the answers to the Quantstart function, they should be the same. Not sure if that helped.
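For anyone who wants to compare the two implementations without the Yahoo data, here is a purely illustrative check (it assumes both the hurst function from the question and hurst_ernie_chan above have been defined, and uses a synthetic geometric Brownian motion, for which both estimates should land near 0.5):
import numpy as np

rng = np.random.default_rng(42)
log_prices = np.cumsum(rng.normal(0, 0.01, size=2000))  # random walk in log space
prices = np.exp(log_prices)

print(hurst(prices))                 # QuantStart version, fed the price series
print(hurst_ernie_chan(log_prices))  # Chan-style version, fed the log prices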
All that is going on here is a variation on math notation
I'll define
d = subtract(ts[lag:], ts[:-lag])
Then it is clear that
np.log(np.std(d)**2) == np.log(np.var(d))
np.log(np.std(d)) == .5*np.log(np.var(d))
Then you have the equivalence
2*np.log(np.sqrt(np.std(d))) == .5*np.log(np.var(d))
The slope returned by polyfit scales linearly with its y input, so fitting to a scaled log-variance and rescaling the slope recovers the same exponent.
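A quick numeric sketch of that last point (the straight line below is made up purely for illustration): halving the y values halves the slope polyfit returns, which is why the final multiplication by 2 undoes the extra square root.
import numpy as np

x = np.log(np.arange(2, 100))
y = 1.3 * x + 0.2                            # an arbitrary line in log-log space

slope_full = np.polyfit(x, y, 1)[0]          # slope fitted to y
slope_half = np.polyfit(x, 0.5 * y, 1)[0]    # slope fitted to 0.5 * y
print(slope_full, 2 * slope_half)            # both ~1.3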
As per intuitive definition taken from Ernest Chan's "Algorithmic trading" (p.44):
Intuitively speaking, a “stationary” price series means that the prices diffuse
from its initial value more slowly than a geometric random walk would.
one would want to check the variance of the time series at increasing lags against the lag(s). This is because for a normal distribution -- and log prices are believed to be normal (to a certain extent) -- the variance of a sum of independent normal variables is the sum of the constituents' variances.
As per Ernest Chan's citation, for mean reverting processes the realized variance will be less than theoretically projected.
Putting this in code:
import numpy as np
from scipy.stats import linregress

def hurst(p, l):
    """
    Arguments:
        p: ndarray -- the price series to be tested
        l: list of integers or an integer -- lag(s) to test for mean reversion
    Returns:
        Hurst exponent
    """
    if isinstance(l, int):
        lags = [1, l]
    else:
        lags = l

    assert lags[-1] >= 2, "Lag in prices must be greater than or equal to 2"
    print(f"Price lags of {lags[1:]} are included")

    lp = np.log(p)
    var = [np.var(lp[l:] - lp[:-l]) for l in lags]
    hr = linregress(np.log(lags), np.log(var))[0] / 2
    return hr
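Purely as a usage sketch (the synthetic series and the particular lags are my own assumptions, not anything from Dr Chan): feeding the function a plain random walk should return a Hurst exponent close to 0.5.
import numpy as np

rng = np.random.default_rng(0)
prices = np.exp(np.cumsum(rng.normal(0, 0.01, size=5000)))  # random-walk log prices

print(hurst(prices, [2, 20, 100, 300]))  # expect roughly 0.5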
The code posted by OP is correct.
The reason for the confusion is that it does a square-root first, and then counters it by multiplying the slope (returned by polyfit) with 2.
For a more detailed explanation, continue reading.
tau is calculated with an "extra" square-root. Then, its log is calculated. log(sqrt(x)) = log(x^0.5) = 0.5*log(x) (this is the key).
polyfit now conducts the fit with y scaled by that "extra" 0.5, so the slope it returns is also scaled by 0.5. Returning twice that value (return poly[0]*2.0) cancels the seemingly extra square root.
Hope this makes it clearer.
I am trying to convert MATLAB code to NumPy and found that numpy gives a different result for the std function.
in matlab
std([1,3,4,6])
ans = 2.0817
in numpy
np.std([1,3,4,6])
1.8027756377319946
Is this normal? And how should I handle this?
The NumPy function np.std takes an optional parameter ddof: "Delta Degrees of Freedom". By default, this is 0. Set it to 1 to get the MATLAB result:
>>> np.std([1,3,4,6], ddof=1)
2.0816659994661326
To add a little more context, in the calculation of the variance (of which the standard deviation is the square root) we typically divide by the number of values we have.
But if we select a random sample of N elements from a larger distribution and calculate the variance, division by N can lead to an underestimate of the actual variance. To fix this, we can lower the number we divide by (the degrees of freedom) to a number less than N (usually N-1). The ddof parameter allows us change the divisor by the amount we specify.
Unless told otherwise, NumPy will calculate the biased estimator for the variance (ddof=0, dividing by N). This is what you want if you are working with the entire distribution (and not a subset of values which have been randomly picked from a larger distribution). If the ddof parameter is given, NumPy divides by N - ddof instead.
The default behaviour of MATLAB's std is to correct the bias for sample variance by dividing by N-1. This gets rid of some (but probably not all) of the bias in the standard deviation. This is likely to be what you want if you're using the function on a random sample of a larger distribution.
The nice answer by @hbaderts gives further mathematical details.
The standard deviation is the square root of the variance. The variance of a random variable X is defined as
Var(X) = E[(X - E[X])^2]
An estimator for the variance would therefore be
S_N^2 = (1/N) * sum_i (x_i - x_bar)^2
where x_bar denotes the sample mean. For randomly selected x_i, it can be shown that this estimator does not converge to the real variance sigma^2, but to
((N-1)/N) * sigma^2
If you randomly select samples and estimate the sample mean and variance, you will have to use a corrected (unbiased) estimator
S^2 = (1/(N-1)) * sum_i (x_i - x_bar)^2
which will converge to sigma^2. The use of N-1 instead of N is also called Bessel's correction.
Now by default, MATLAB's std calculates the unbiased estimator with the correction term n-1. NumPy, however (as @ajcr explained), calculates the biased estimator with no correction term by default. The parameter ddof lets you set any correction term n-ddof; by setting it to 1 you get the same result as in MATLAB.
Similarly, MATLAB lets you pass a second parameter w, which specifies the "weighting scheme". The default, w=0, results in the divisor n-1 (unbiased estimator), while for w=1, n is used as the divisor (biased estimator).
For people who aren't great with statistics, a simplistic guide is:
Include ddof=1 if you're calculating np.std() for a sample taken from your full dataset.
Use ddof=0 (the default) if you're calculating np.std() for the full population.
ddof=1 is used for samples in order to counterbalance the downward bias that arises because the sample mean is estimated from the same data.
from random import *

def main():
    t = 0
    for i in range(1000): # thousand
        t += random()
    print(t/1000)

main()
I was looking at the source code for a sample program my professor gave me and I came across this RNG. Can anyone explain how it works?
If you plotted the points, you would see that this actually produces a Gaussian ("normal") distribution about the mean of the random function.
Generate random numbers following a normal distribution in C/C++ talks about random number generation; it's a pretty common technique to do this if all you have is a uniform number generator like in standard C.
What I've given you here is a histogram of 100,000 values drawn from your function (returned rather than printed, of course, if you aren't familiar with Python). The y axis is the frequency with which a value appears, the x axis is the bin the value falls into. As you can see, the average value is 1/2, and beyond 3 standard deviations from the mean (which covers 99.7 percent of the data) there are almost no values at all. That should be intuitive; we "usually" get something close to 1/2 and very rarely get something close to 0.99999.
Have a look at the documentation. It's quite well written:
https://docs.python.org/2/library/random.html
The idea is that the program generates a random number 1000 times, which is enough for the average to come out very close to 0.5.
The program is using the Central Limit Theorem - sums of independent and identically distributed random variables X with finite variance asymptotically converge to a normal (a.k.a. Gaussian) distribution whose mean is the sum of the means, and variance is the sum of the variances. Scaling this by N, the number of X's summed, gives the sample mean (a.k.a. average). If the expected value of X is μ and the variance of X is σ2, the expected value of the sample mean is also μ and it has variance σ2 / N.
Since a Uniform(0,1) has mean 0.5 and variance 1/12, your algorithm will generate results that are pretty close to normally distributed with a mean of 0.5 and a variance of 1/12000. Consequently 99.7% of the outcomes should fall within +/-3 standard deviations of the mean, i.e., in the range 0.5+/-0.0274.
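A small simulation sketch (the 10,000 repetitions are an arbitrary choice) of the professor's program bears those numbers out:
import numpy as np

rng = np.random.default_rng(0)
# Each row is one run of the program: the average of 1000 Uniform(0,1) draws
averages = rng.random((10_000, 1000)).mean(axis=1)

print(averages.mean())  # ~0.5
print(averages.std())   # ~0.00913, i.e. sqrt(1/12000)
print(np.mean(np.abs(averages - 0.5) < 3 * np.sqrt(1 / 12000)))  # ~0.997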
This is a ridiculously inefficient way to generate normals. Better alternatives include the Box-Muller method, Polar method, or ziggurat method.
The thing making this random is the random() function being called. random() generates one (for most practical purposes) random float between 0 and 1.
>>> random()
0.1759916412898097
>>> random()
0.5489228122596088
etc.
The rest of it is just adding each random to a total and then dividing by the number of randoms, essentially finding the average of all 1000 randoms, which as Cyber pointed out is actually not a random number at all.
This is a simple test
import numpy as np
data = np.array([-1,0,1])
print data.std()
>> 0.816496580928
I don't understand how this result was generated. Obviously:
( (1^2 + 1^2 + 0^2)/(3-1) )^0.5 = 1
and in MATLAB std([-1,0,1]) gives me 1. Could you help me understand how numpy.std() works?
The crux of this problem is that you need to divide by N (3), not N-1 (2). As Iarsmans pointed out, numpy will use the population variance, not the sample variance.
So the real answer is sqrt(2/3) which is exactly that: 0.8164965...
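A one-line check (nothing assumed beyond the array from the question):
import numpy as np

data = np.array([-1, 0, 1])
print(data.std())      # 0.816496580927726
print(np.sqrt(2 / 3))  # the same: squared deviations sum to 2, divided by N = 3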
If you happen to be trying to deliberately use a different value for the degrees of freedom (the default is 0), pass the keyword argument ddof with a nonzero value:
np.std(data, ddof=1)
... but doing so here would reintroduce your original problem as numpy will divide by N - ddof.
It is worth reading the help page for the function/method before suggesting it is incorrect. The method does exactly what the doc-string says it should: it divides by 3, because by default ddof is zero:
In [3]: numpy.std?
String form: <function std at 0x104222398>
File: /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/fromnumeric.py
Definition: numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False)
Docstring:
Compute the standard deviation along the specified axis.
...
ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations
is ``N - ddof``, where ``N`` represents the number of elements.
By default `ddof` is zero.
When getting into NumPy from Matlab, you'll probably want to keep the docs for both handy. They're similar but often differ in small but important details. Basically, they calculate the standard deviation differently. I would strongly recommend checking the documentation for anything you use that calculates standard deviation, whether a pocket calculator or a programming language, since the default is not (sorry!) standardized.
Numpy STD: http://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html
Matlab STD: http://www.mathworks.com/help/matlab/ref/std.html
The Numpy docs for std are a bit opaque, IMHO, especially considering that NumPy docs are generally fairly clear. If you read far enough: The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. (In English: the default is the population standard deviation; set ddof=1 for the sample standard deviation.)
OTOH, the Matlab docs make clear the difference that's tripping you up:
There are two common textbook definitions for the standard deviation s of a data vector X. [equations omitted] n is the number of elements in the sample. The two forms of the equation differ only in n – 1 versus n in the divisor.
So, by default, Matlab calculates the sample standard deviation (N-1 in the divisor, so bigger to compensate for the fact this is a sample) and Numpy calculates the population standard deviation (N in the divisor). You use the ddof parameter to switch to the sample standard, or any other denominator you want (which goes beyond my statistics knowledge).
Lastly, it doesn't help on this problem, but you'll probably find this helpful at some point. Link