I have the code below:
import numpy as np

result = 0
loop_n = 10000
for i in range(loop_n):
    result += np.random.rand(3, 4, 10).std()
result = result / loop_n
print(result)
As I understand it, if I run this multiple times, the result should differ each time because it comes from random numbers, but the result is actually always around 0.287.
Is there some theory behind this?
It is just evidence that np.random.rand is a good uniform random generator. You have 10000 observations of the standard deviation of samples that all follow the same law. Standard deviation is the square root of variance, so for a uniform distribution the theoretical (probabilistic) standard deviation is (max - min) / sqrt(12). You average a fairly large number of observations, so the observed estimate will be close to the theoretical standard deviation, which is 1/sqrt(12), about 0.28867513459481287. But it now becomes a mathematical question :-)
Assuming a uniform distribution on [0, 1], the probabilistic (theoretical) mean E(X) is the integral of x over the segment [0, 1], which is 0.5. The variance is by definition E((X - E(X))^2), which can be computed as the integral of x^2 over the segment [-0.5, 0.5]; that integral is 1/12, and its square root gives the result written above.
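A quick numerical check of that integral, as a minimal sketch. The grid size and the midpoint rule are choices made here for illustration and are not part of the answer above:

```python
import numpy as np

# Midpoint grid on [0, 1]: a deterministic stand-in for the uniform distribution
n = 1_000_000
x = (np.arange(n) + 0.5) / n

mean = x.mean()                       # approximates E(X)
variance = ((x - mean) ** 2).mean()   # approximates E((X - E(X))^2)
std = np.sqrt(variance)

print(mean, variance, std)
print(1 / 12, 1 / np.sqrt(12))        # theoretical variance and std
```

The two printed lines should agree to many decimal places, which is the 1/sqrt(12) value quoted above.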
1. Why so little variability?
That is the Law of large numbers. If you sample from a random variable often enough you expect to get a good estimate of the true mean.
https://en.wikipedia.org/wiki/Law_of_large_numbers
2. Why 0.287?
rand returns uniformly distributed numbers between 0 and 1. The true mean is therefore 1/2, and the true variance is integral[-1/2..1/2] x^2 dx, which you can check to be 1/12. The std is the square root of that, ~0.289.
3. Why not exactly sqrt(1/12) ~ 0.289?
But wait, that's a bit off. Why? Because numpy by default returns the sample var/std with divisor N, which is a biased estimator of the real thing: it systematically underestimates them. As you sample in relatively small batches of size N=120, this makes a small but consistent difference. Once we plug in the correction N/(N-1) (the sqrt of that for the std) we get a better match. You can try this in your code by passing the keyword ddof=1 to std.
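The effect of ddof=1 can be seen in a small experiment. This is a sketch only: the seed, the number of batches, and the use of np.random.default_rng instead of np.random.rand are assumptions made here so the run is repeatable and vectorised.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for repeatability
batches = rng.random((20_000, 120))     # 120 values per batch, like rand(3, 4, 10)

mean_std_default = batches.std(axis=1).mean()            # ddof=0, divisor N
mean_std_corrected = batches.std(axis=1, ddof=1).mean()  # ddof=1, divisor N-1
target = np.sqrt(1 / 12)

print(mean_std_default, mean_std_corrected, target)
```

The ddof=1 average should land noticeably closer to sqrt(1/12) than the default one, while still sitting slightly below it.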
4. But with the correction the result seems a smidge too small?
That is correct. The correction factor N/(N-1) yields an unbiased estimator for the var but not for the std, basically because taking the mean and then the sqrt is not the same as taking the sqrt and then the mean.
You can check this by using var (still with argument ddof=1) instead of std and taking the sqrt after taking the mean:
import math
import numpy as np

loop_n = 1000000
result = 0
print_at = 1
for i in range(1, loop_n + 1):
    result += np.random.rand(3, 4, 10).var(ddof=1)
    # Print the running estimate at i = 1, 10, 100, ...
    if i == print_at:
        print(math.sqrt(result / i))
        print_at *= 10
print("...")
print(math.sqrt(1 / 12))
Sample run:
0.28103387158480164
0.2952158859220745
0.2902562660869275
0.28882685146952614
0.2887019908636715
0.2886783761564752
0.2886714244895549
...
0.28867513459481287
Let's look at what you are doing:
In each step, you have np generate 120 random values between 0 and 1 and get their standard deviation. It is always around 0.2887, sometimes more, sometimes less. See the explanation below.
You add up all those standard deviations and divide them by their count. Essentially, you get their mean value.
Because you have so many of them, they come closer and closer to the expected value of 0.2887.
Explanation:
If you do while 1: np.random.rand(3,4,10).std() in a Python console, you see a lot of numbers emitted (until you press Ctrl-C), and they are sometimes .266, sometimes .297 and so on.
But what do they mean? Well, the standard deviation is (very roughly speaking) the mean value of the distances of a collection of values from their mean value.
If you take [.5, .5, .5], the mean value is .5, the std is 0.
But with [0, .5, 1], the mean value is .5 as well, but the std is .408248.
With np.array([.0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1]).std(), you get .316.
With np.random.rand(300,300,300).std(), you get about the same result as you do: always something around .2887.
Why the expected value is about .2887 follows from the definition of the standard deviation. Essentially, it stems from the uniform distribution of the values np.random.rand() produces.
The numpy function rand draws a random number from the uniform distribution [0, 1), which means that there is an equal probability to get any number between 0 and 1. Your code draws 120 random numbers from this distribution and computes an estimate of the standard deviation by using the formula
std = sqrt(mean(abs(x - x.mean())**2))
Your code then computes an average of the standard deviation estimate which should make the estimate converge to the theoretical value.
To compute the theoretical value we can use that variance(x) = 1/12 for a random variable X in the uniform distribution. This implies that std(x) = sqrt(1/12) = 0.2887, which is close to the simulation result.
Related
I have to produce randomly generated, normally distributed numbers, based on an astronomical table containing a most probable value and a standard deviation. The peculiar thing is that the standard deviation is not given by one, but two numbers - an upper standard deviation of the error and a lower, something like this:
mass_object, error_up, error_down
7.33, 0.12, 0.07
9.40, 0.04, 0.02
6.01, 0.11, 0.09
...
For example, for the first object this means that a randomly generated mass m with m>7.33 will probably lie further from 7.33 than one with m<7.33. So I am now looking for a way to randomly generate the numbers that takes both standard deviations into account. If I were dealing with just one standard deviation per object, I would create the random number (mass) of the first object like this:
mass_random = np.random.normal(loc=7.33, scale=0.12)
Do you have ideas how to create these random numbers with upper and lower standard deviation of the scatter? Tnx
As we discussed in the comments, a normal distribution has the same standard deviation in each direction (it's symmetric around the mean). So we know our distribution won't be normal. We can try a lognormal approach, since this allows us to introduce the idea of skewness. To do this in Python, you'll need Scipy. Here's a crude approach, assuming that 68% of data is on the mean, 16% is at the high point, and 16% is at the low point. We fit the distribution to that crude dataset, then we can calculate new points from the distribution:
import scipy.stats
# Choose one of the rows
mean, high, low = 7.33, 0.12, 0.07
# Create a dummy dataset to fit the distribution
values = [mean] * 68 + [mean + high] * 16 + [mean - low ] * 16
# Print the fit distribution
fit_dist = scipy.stats.lognorm.fit(values)
print(fit_dist)
# Calculate 10 new random values based on the fit
scipy.stats.lognorm.rvs(*fit_dist, size=10)
array([7.25541865, 7.34873107, 7.33831589, 7.36387121, 7.26912469,
7.33084677, 7.35626689, 7.33907124, 7.32522422, 7.31688687])
The immediate solution would be a two step sampling:
for a given row i, one samples from a uniform distribution over the interval error_down and error_up obtaining \sigma_i, and then one samples the final value from a normal distribution with mean m_i and standard deviation \sigma_i.
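In practice, one imports numpy, defines a custom sampling function, and then applies it to the whole table:
import numpy as np

# Rows of (mass_object, error_up, error_down), taken from the question
table = [(7.33, 0.12, 0.07), (9.40, 0.04, 0.02), (6.01, 0.11, 0.09)]

def sampling(row):
    m, error_up, error_down = row
    # Draw sigma uniformly between the lower and the upper error
    sigma = np.random.uniform(error_down, error_up)
    return np.random.normal(m, sigma)

sampled_values = list(map(sampling, table))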
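Another option, not proposed in the answers above, is a two-piece ("split") normal: a half-normal with error_down below the most probable value and a half-normal with error_up above it. The sketch below assumes the tabulated errors are meant as one-sided standard deviations; the function name sample_split_normal and the seed are hypothetical choices made here.

```python
import numpy as np

def sample_split_normal(mode, sigma_up, sigma_down, size, rng):
    """Draw from a two-piece normal with different spreads on each side of mode."""
    # For a continuous density, the lower side carries mass sigma_down / (sigma_up + sigma_down)
    p_down = sigma_down / (sigma_up + sigma_down)
    below = rng.random(size) < p_down
    half_normal = np.abs(rng.standard_normal(size))
    return np.where(below,
                    mode - sigma_down * half_normal,
                    mode + sigma_up * half_normal)

rng = np.random.default_rng(0)  # arbitrary seed
samples = sample_split_normal(7.33, 0.12, 0.07, 100_000, rng)

frac_below = np.mean(samples < 7.33)
above = samples[samples >= 7.33]
spread_up = np.sqrt(np.mean((above - 7.33) ** 2))
print(frac_below, spread_up)
```

Unlike the two-step sampling, this keeps 7.33 as the most probable value and makes the scatter above it wider than the scatter below it.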
I couldn't understand how to properly use this function, could someone please explain it to me?
Let's say I have:
a mean of 172.7815
a standard deviation of 4.1532
N = 50 (50 samples)
When I'm asked to calculate the (95%) margin of error using norm.ppf() will the code look like below?
norm.ppf(0.95, loc=172.78, scale=4.15)
or will it look like this?
norm.ppf(0.95, loc=0, scale=1)
Because I know it's calculating the area of the curve to the right of the confidence interval (95%, 97.5% etc...see image below), but when I have a mean and a standard deviation, I get really confused as to how to use the function.
The method norm.ppf() takes a percentage and returns a standard deviation multiplier for what value that percentage occurs at.
It is equivalent to a 'one-tail test' on the density plot.
From scipy.stats.norm:
ppf(q, loc=0, scale=1) Percent point function (inverse of cdf — percentiles).
Standard Normal Distribution
The code:
norm.ppf(0.95, loc=0, scale=1)
Returns a 95% significance interval for a one-tail test on a standard normal distribution (i.e. a special case of the normal distribution where the mean is 0 and the standard deviation is 1).
Our Example
To calculate the value for OP-provided example at which our 95% significance interval lies (For a one-tail test) we would use:
norm.ppf(0.95, loc=172.7815, scale=4.1532)
This will return a value (that functions as a 'standard-deviation multiplier') marking where 95% of data points would be contained if our data is a normal distribution.
To get the exact number, we take the norm.ppf() output and multiply it by our standard deviation for the distribution in question.
A Two-Tailed Test
If we need to calculate a 'Two-tail test' (i.e. We're concerned with values both greater and less than our mean) then we need to split the significance (i.e. our alpha value) because we're still using a calculation method for one-tail. The split in half symbolizes the significance level being appropriated to both tails. A 95% significance level has a 5% alpha; splitting the 5% alpha across both tails returns 2.5%. Taking 2.5% from 100% returns 97.5% as an input for the significance level.
Therefore, if we were concerned with values on both sides of our mean, our code would input .975 to represent a 95% significance level across two-tails:
norm.ppf(0.975, loc=172.7815, scale=4.1532)
Margin of Error
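Margin of error is the half-width of a confidence interval used when estimating a population parameter with a sample statistic. We want to generate our 95% confidence interval using the two-tailed input to norm.ppf() since we're concerned with values both greater and less than our mean. The multiplier has to come from the standard normal distribution (loc=0, scale=1); passing loc and scale here would return a value on the scale of the data rather than a multiplier:
ppf = norm.ppf(0.975)
Next, we'd take the ppf (about 1.96) and multiply it by our standard deviation to return the interval value (for the margin of error of a sample mean, use the standard error std / sqrt(N) in place of std):
interval_value = std * ppf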
Finally, we'd mark the confidence intervals by adding & subtracting the interval value from the mean:
lower_95 = mean - interval_value
upper_95 = mean + interval_value
Plot with a vertical line:
_ = plt.axvline(lower_95, color='r', linestyle=':')
_ = plt.axvline(upper_95, color='r', linestyle=':')
James' statement that norm.ppf returns a "standard deviation multiplier" is wrong. This feels pertinent as his post is the top Google result when one searches for norm.ppf.
norm.ppf is the inverse of norm.cdf. In the example, it simply returns the value at the 95th percentile. There is no "standard deviation multiplier" involved.
A better answer exists here:
How to calculate the inverse of the normal cumulative distribution function in python?
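The inverse relationship can be checked directly. A minimal sketch using the numbers from the question (the variable names are chosen here for illustration):

```python
from scipy.stats import norm

mean, std = 172.7815, 4.1532

# Value below which 95% of a Normal(mean, std) distribution lies
x95 = norm.ppf(0.95, loc=mean, scale=std)

# Feeding that value back into the cdf recovers the probability
p = norm.cdf(x95, loc=mean, scale=std)

# With the default loc=0, scale=1 the result is a z-score instead
z95 = norm.ppf(0.95)

print(x95, p, z95, mean + z95 * std)
```

So with loc and scale supplied, norm.ppf already returns a value on the scale of the data; only the standard normal call returns a z-score that still needs to be scaled and shifted.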
You can figure out the confidence interval with norm.ppf directly, without calculating the margin of error first:
upper_of_interval = norm.ppf(0.975, loc=172.7815, scale=4.1532/np.sqrt(50))
lower_of_interval = norm.ppf(0.025, loc=172.7815, scale=4.1532/np.sqrt(50))
4.1532 is the sample standard deviation, not the standard deviation of the sampling distribution of the sample mean. So scale in norm.ppf is specified as scale = 4.1532 / np.sqrt(50), which is the estimator of the standard deviation of the sampling distribution.
(The value of standard deviation of the sampling distribution is equal to population standard deviation / np.sqrt(sample size). Here, we did not know the population standard deviation and the sample size is more than 30, so sample standard deviation / np.sqrt(sample size) can be used as a good estimator).
Margin of error can be calculated with (upper_of_interval - lower_of_interval) / 2.
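Putting the pieces of this answer together, as a sketch with the numbers from the question (the variable names are chosen here for illustration):

```python
import numpy as np
from scipy.stats import norm

mean, std, n = 172.7815, 4.1532, 50
standard_error = std / np.sqrt(n)   # estimated std of the sample mean

# Two-sided 95% interval: 2.5% in each tail
lower_of_interval = norm.ppf(0.025, loc=mean, scale=standard_error)
upper_of_interval = norm.ppf(0.975, loc=mean, scale=standard_error)

# Same margin of error via the standard normal z-score
margin_of_error = norm.ppf(0.975) * standard_error

print(lower_of_interval, upper_of_interval, margin_of_error)
```

Both routes give the same margin of error, roughly 1.15 for these numbers.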
Calculate the value at the 95th percentile, then draw a vertical line and an annotation with that value:
import matplotlib.pyplot as plt
from scipy.stats import norm

mean = 172.7815
std = 4.1532
N = 50
results = norm.rvs(mean, std, size=N)
pct_95 = norm.ppf(0.95, mean, std)  # value at the 95th percentile
plt.hist(results, bins=10)
plt.axvline(pct_95)
plt.annotate(f"{pct_95:.2f}", xy=(pct_95, 6))
plt.show()
As other answers pointed out, norm.ppf(1-alpha) returns the value at the (1-alpha)*100-th percentile of a normal distribution specified by the parameters passed to it. For example, in the OP it returns the 95th percentile of a normal distribution with mean 172.78 and standard deviation 4.15.
If you're looking for a function that returns the same value (the N-th percentile of the normal distribution) as a function of alpha instead, there's the inverse survival function, norm.isf(alpha), which tells you the number that a fraction alpha of the distribution lies above.
import numpy as np
from scipy.stats import norm
alpha = 0.05
v1 = norm.isf(alpha)
v2 = norm.ppf(1-alpha)
np.isclose(v1, v2) # True
I have a question:
Given mean and variance I want to calculate the probability of a sample using a normal distribution as probability basis.
The numbers are:
mean = -0.546369
var = 0.006443
curr_sample = -0.466102
prob = 1/(np.sqrt(2*np.pi*var))*np.exp( -( ((curr_sample - mean)**2)/(2*var) ) )
I get a probability which is larger than 1! I get prob = 3.014558...
What is causing this? The fact that the variance is too small messes something up? It's a totally legal input to the formula and should give something small not greater than 1! Any suggestions?
Ok, what you compute is not a probability, but a probability density (which may be larger than one). In order to get 1 you have to integrate over the normal distribution like so:
import numpy as np
mean = -0.546369
var = 0.006443
curr_sample = np.linspace(-10,10,10000)
prob = np.sum( 1/(np.sqrt(2*np.pi*var))*np.exp( -( ((curr_sample - mean)**2)/(2*var) ) ) * (curr_sample[1]-curr_sample[0]) )
print(prob)
which results in
0.99999999999961509
The formula you give is a probability density, not a probability. The density formula is such that when you integrate it between two values of x, you get the probability of being in that interval. However, this means that the probability of getting any particular sample is, in fact, 0 (it's the density times the infinitesimally small dx).
So what are you actually trying to calculate? You probably want something like the probability of getting your value or larger, the so-called tail probability, which is often used in statistics (it so happens that this is given by the error function when you're talking about a normal distribution, although you need to be careful of exactly how it's defined).
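For the numbers in the question, the density and the upper tail probability can be compared side by side. This is a sketch that assumes the upper tail P(X >= curr_sample) is the quantity of interest; it uses the complementary error function math.erfc from the standard library.

```python
import math

mean = -0.546369
var = 0.006443
curr_sample = -0.466102
sigma = math.sqrt(var)

# Probability density at curr_sample (can exceed 1)
density = math.exp(-((curr_sample - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Upper tail probability P(X >= curr_sample), always between 0 and 1
z = (curr_sample - mean) / sigma
tail_probability = 0.5 * math.erfc(z / math.sqrt(2))

print(density, tail_probability)
```

The density reproduces the value of about 3.01 from the question, while the tail probability is a genuine probability below 1.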
When considering the bell-shaped probability density function (PDF) for a given mean and variance, the peak value of the curve (the height of the mode) is 1/sqrt(2*pi*var). For the standard normal distribution (mean 0 and var 1) it is 1/sqrt(2*pi), about 0.399, and it exceeds 1 whenever var is smaller than 1/(2*pi), about 0.159. Hence when calculating a specific value of a general normal distribution pdf, values larger than 1 are possible.
I am trying to convert MATLAB code to NumPy and found that NumPy gives a different result for the std function.
in matlab
std([1,3,4,6])
ans = 2.0817
in numpy
np.std([1,3,4,6])
1.8027756377319946
Is this normal? And how should I handle this?
The NumPy function np.std takes an optional parameter ddof: "Delta Degrees of Freedom". By default, this is 0. Set it to 1 to get the MATLAB result:
>>> np.std([1,3,4,6], ddof=1)
2.0816659994661326
To add a little more context, in the calculation of the variance (of which the standard deviation is the square root) we typically divide by the number of values we have.
But if we select a random sample of N elements from a larger distribution and calculate the variance, division by N can lead to an underestimate of the actual variance. To fix this, we can lower the number we divide by (the degrees of freedom) to a number less than N (usually N-1). The ddof parameter allows us to change the divisor by the amount we specify.
Unless told otherwise, NumPy will calculate the biased estimator for the variance (ddof=0, dividing by N). This is what you want if you are working with the entire distribution (and not a subset of values which have been randomly picked from a larger distribution). If the ddof parameter is given, NumPy divides by N - ddof instead.
The default behaviour of MATLAB's std is to correct the bias for sample variance by dividing by N-1. This gets rid of some (but probably not all) of the bias in the standard deviation. This is likely to be what you want if you're using the function on a random sample of a larger distribution.
The nice answer by @hbaderts gives further mathematical details.
The standard deviation is the square root of the variance. The variance of a random variable $X$ is defined as
$$\sigma^2 = E\left[(X - E[X])^2\right]$$
An estimator for the variance would therefore be
$$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
where $\bar{x}$ denotes the sample mean. For randomly selected $x_i$, it can be shown that the expected value of this estimator is not the real variance $\sigma^2$, but
$$\frac{n-1}{n} \sigma^2$$
If you randomly select samples and estimate the sample mean and variance, you will have to use a corrected (unbiased) estimator
$$\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
whose expected value is $\sigma^2$. The correction term $\frac{n}{n-1}$ is also called Bessel's correction.
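Both estimators can be computed by hand for the four values from the question. A minimal sketch using only the standard library (the variable names are chosen here for illustration):

```python
import math

values = [1, 3, 4, 6]
n = len(values)
sample_mean = sum(values) / n
squared_deviations = sum((v - sample_mean) ** 2 for v in values)

std_divide_by_n = math.sqrt(squared_deviations / n)                 # NumPy default (ddof=0)
std_divide_by_n_minus_1 = math.sqrt(squared_deviations / (n - 1))   # MATLAB default (ddof=1)

print(std_divide_by_n, std_divide_by_n_minus_1)
```

The first value matches the np.std result in the question and the second matches MATLAB's 2.0817.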
Now by default, MATLAB's std calculates the unbiased estimator with the correction term n-1. NumPy however (as @ajcr explained) calculates the biased estimator with no correction term by default. The parameter ddof lets you set the divisor to n-ddof. By setting it to 1 you get the same result as in MATLAB.
Similarly, MATLAB lets you pass a second parameter w, which specifies the "weighting scheme". The default, w=0, results in the divisor n-1 (unbiased estimator), while for w=1 the divisor is n (biased estimator).
For people who aren't great with statistics, a simplistic guide is:
Include ddof=1 if you're calculating np.std() for a sample taken from your full dataset.
Ensure ddof=0 if you're calculating np.std() for the full population
The DDOF is included for samples in order to counterbalance bias that can occur in the numbers.
from random import random

def main():
    t = 0
    for i in range(1000):  # thousand
        t += random()
    print(t / 1000)

main()
I was looking at the source code for a sample program my professor gave me and I came across this RNG. Can anyone explain how this RNG works?
If you plotted the points, you would see that this actually produces a Gaussian ("normal") distribution about the mean of the random function.
The question "Generate random numbers following a normal distribution in C/C++" talks about random number generation; it's a pretty common technique to do this if all you have is a uniform number generator like in standard C.
What I've given you here is a histogram of 100,000 values drawn from your function (of course, returned not printed, if you aren't familiar with python). The y axis is the frequency that the value appears, the x axis is the bin of the value. As you can see, the average value is 1/2, and by 3 standard deviations (99.7 percent of the data) we have almost no values in the range. That should be intuitive; we "usually" get 1/2, and very rarely get .99999
Have a look at the documentation. It's quite well written:
https://docs.python.org/2/library/random.html
The idea is that the program generates a random number 1000 times, which is enough for the mean to come out close to 0.5.
The program is using the Central Limit Theorem: sums of independent and identically distributed random variables X with finite variance asymptotically converge to a normal (a.k.a. Gaussian) distribution whose mean is the sum of the means and whose variance is the sum of the variances. Dividing this sum by N, the number of X's summed, gives the sample mean (a.k.a. average). If the expected value of X is μ and the variance of X is σ^2, the expected value of the sample mean is also μ and it has variance σ^2 / N.
Since a Uniform(0,1) has mean 0.5 and variance 1/12, your algorithm will generate results that are pretty close to normally distributed with a mean of 0.5 and a variance of 1/12000. Consequently 99.7% of the outcomes should fall within +/-3 standard deviations of the mean, i.e., in the range 0.5+/-0.0274.
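These figures can be checked by repeating the professor's averaging many times. A sketch under stated assumptions: the helper name average_of_uniforms, the seed, and the 2000 repetitions are choices made here, not part of the original program.

```python
import random
import statistics

random.seed(0)  # arbitrary seed so the run is repeatable

def average_of_uniforms(n=1000):
    """Same computation as main() above, but returning the average."""
    return sum(random.random() for _ in range(n)) / n

means = [average_of_uniforms() for _ in range(2000)]

observed_mean = statistics.mean(means)
observed_std = statistics.pstdev(means)
expected_std = (1 / 12000) ** 0.5   # sqrt of the variance 1/12000

# Fraction of averages within 3 standard deviations of 0.5
within_3_sigma = sum(abs(m - 0.5) <= 3 * expected_std for m in means) / len(means)

print(observed_mean, observed_std, expected_std, within_3_sigma)
```

The observed standard deviation should be close to 0.0091, and nearly all averages should fall inside 0.5 +/- 0.0274.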
This is a ridiculously inefficient way to generate normals. Better alternatives include the Box-Muller method, Polar method, or ziggurat method.
The thing making this random is the random() function being called. random() will generate 1 (for most practical purposes) random float between 0 and 1.
>>> random()
0.1759916412898097
>>> random()
0.5489228122596088
etc.
The rest of it is just adding each random to a total and then dividing by the number of randoms, essentially finding the average of all 1000 randoms, which as Cyber pointed out is actually not a random number at all.