I am trying to generate a series of random numbers with a Gaussian distribution, so I used numpy.random.normal(mean, standard deviation, size). However, when I converted these numbers into a probability density function using numpy.histogram, the result was not the same as the Gaussian distribution with the same mean and standard deviation produced by matplotlib.mlab.normpdf.
I understand it may be because numpy.random.normal is random sampling. So, the PDF of these numbers can't be perfectly Gaussian.
Could you please give me any advice on how to get a series of random numbers with a given mean and standard deviation whose PDF would be Gaussian, if that is possible?
The number of samples I am trying to generate is 660.
I will really appreciate any advice and help.
Best regards,
Isaac
Well, you could "z-score" the sample, by subtracting the sample mean and then dividing by the sample standard deviation:
import numpy as np

x = np.random.normal(0, 1, size=660)
x = (x - x.mean()) / x.std()
That will make your vector have a mean of 0 and a standard deviation of 1. But that doesn't mean you will have "perfectly gaussian random numbers." I don't think that's really a concept that makes sense.
It would be helpful to know what application you want to use this for, maybe then it would be easier to suggest alternatives.
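As an illustration of the z-scoring trick, here is a minimal sketch (assuming NumPy; the seed is arbitrary, just for reproducibility). After the transform, the sample mean and standard deviation are exactly 0 and 1, even though the individual draws are still random:

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded so the example is reproducible
x = rng.normal(0, 1, size=660)           # raw sample: mean/std only approximately 0/1
x = (x - x.mean()) / x.std()             # z-score: force sample mean 0, sample std 1

# x.mean() is now 0 and x.std() is now 1, up to floating-point rounding
```

Note that this only pins down the first two sample moments; the histogram of 660 values will still wiggle around the ideal bell curve.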
Related
I have a (very large) number X of integers and a probability p with which I want to draw a sample s (a number) from X following a Poisson distribution. For example, if X = 10^8 and p = 0.05, I expect s to be roughly the number of heads I would get from that many coin flips.
I was able to do this easily with numpy.random.binomial:
s = np.random.binomial(n=X, p=p)
How can I apply the same idea using random.poisson?
Just multiply p and X:
np.random.poisson(10**8 * 0.05)
The probability of getting a value greater than 10**8 is numerically zero.
As @pjs emphasizes, we are combining the probability and the count into a rate, which is the parameter of the Poisson process.
It is also worth mentioning that for such a large number you'll find the PMFs of the Binomial and the Poisson very similar to each other, and (comparing cumulative distribution functions, or CDFs) also very similar to a Gaussian.
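As a quick sanity check (a sketch assuming NumPy; sample sizes and seed chosen arbitrarily), the exact binomial draw and the rate-parameterized Poisson draw behave essentially the same at this scale:

```python
import numpy as np

rng = np.random.default_rng(1)
X, p = 10**8, 0.05

b = rng.binomial(n=X, p=p, size=10_000)   # exact model: X trials with probability p
s = rng.poisson(lam=X * p, size=10_000)   # Poisson approximation with rate X*p

# Both samples concentrate around X*p = 5e6; the binomial std is
# sqrt(X*p*(1-p)) ~ 2179 and the Poisson std is sqrt(X*p) ~ 2236.
```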
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.poisson.html
import numpy as np
s = np.random.poisson(lam=X * p)  # lam is the rate, i.e. X * p, not p alone
Let's say I have a column x with uniformly distributed values.
To these values, I applied a CDF function.
Now I want to calculate the Gaussian copula, but I can't find the function in Python. I have already read that the Gaussian copula is something like the "inverse of the CDF function".
The reason why I'm doing it comes from this paragraph:
A visual depiction of applying the Gaussian copula process to normalize
an observation by applying n = Φ⁻¹(F(x)). Calculating F(x) yields a value u ∈ [0, 1]
representing the proportion of shaded area at the left. Then Φ⁻¹(u) yields a value n
by matching the shaded area in a Gaussian distribution.
I need your help: does anyone have an idea how to calculate that?
I have 2 ideas so far:
1) gauss = 1/(sqrt(2*pi)*s) * e**(-0.5*((x - m)/s)**2)
--> transform all the values to new values with this
2) norm.ppf(array, loc, scale)
--> give the ppf function the mean, the std, and the array, and it will calculate the inverse of the CDF for me... but I have doubts about option 2
The thing is,
norm.cdf(norm.ppf(0.95))
is not what I want. The reason I'm doing this is to transform a non-normal/non-Gaussian distribution into a normal distribution.
Like here:
Transform from a non gaussian distribution to a gaussian distribution with Gaussian Copula
Any other ideas or tips?
Thank you very much :)
EDIT:
I found two links which are quite useful:
1. https://stats.stackexchange.com/questions/197283/how-to-transform-an-arcsine-distribution-to-a-normal-distribution
2. https://stats.stackexchange.com/questions/125648/transformation-chi-squared-to-normal-distribution/125653#125653
In these posts it is said that you have to:
All the details are in the answer already - you take your random variable, and transform it by its own cdf ..... yielding a uniform result.
That's true for me. If I take a random distribution and apply the norm.cdf(data, mean, std) function, I get a uniformly distributed CDF.
Compare:
import numpy as np
from scipy.stats import norm

data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
cdf = norm.cdf(data, np.mean(data), np.std(data))
print(cdf)
But how can I do the second step?
You then transform again, applying the quantile function (inverse cdf) of the desired distribution (in this case by the standard normal quantile function /inverse of the normal cdf, producing a variable with a standard normal distribution).
Because when I use e.g. the norm.ppf function, the values are not reasonable.
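One way to sketch the full two-step transform (assuming SciPy, and using the empirical CDF via ranks rather than a fitted normal CDF, since the input is from an unknown, non-normal distribution):

```python
import numpy as np
from scipy.stats import norm, rankdata

data = np.array([1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8)

# Step 1: empirical CDF via ranks, scaled into (0, 1) so that the
# quantile function below never receives 0 or 1 (which map to +/- infinity).
u = rankdata(data) / (len(data) + 1)

# Step 2: the standard normal quantile function (inverse CDF) maps the
# roughly uniform values u to roughly standard-normal values z.
z = norm.ppf(u)
```

Note that rankdata averages tied values, so repeated observations all map to the same z; that is why the u values must stay strictly inside (0, 1).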
from random import random

def main():
    t = 0
    for i in range(1000):  # a thousand iterations
        t += random()
    print(t / 1000)

main()
I was looking at the source code of a sample program my professor gave me and I came across this RNG. Can anyone explain how it works?
If you plotted the points, you would see that this actually produces a Gaussian ("normal") distribution about the mean of the random function.
Generate random numbers following a normal distribution in C/C++ talks about random number generation; it's a pretty common technique to do this if all you have is a uniform number generator like in standard C.
What I've given you here is a histogram of 100,000 values drawn from your function (with the value returned rather than printed, if you aren't familiar with Python). The y axis is the frequency with which a value appears, the x axis is the bin the value falls into. As you can see, the average value is 1/2, and beyond 3 standard deviations from the mean (the range containing 99.7 percent of the data) there are almost no values at all. That should be intuitive; we "usually" get something near 1/2, and very rarely get something like 0.99999.
Have a look at the documentation. Its quite well written:
https://docs.python.org/2/library/random.html
The idea is that the program generates a random number 1000 times, which is enough for the average to come out very close to 0.5.
The program is using the Central Limit Theorem: sums of independent and identically distributed random variables X with finite variance asymptotically converge to a normal (a.k.a. Gaussian) distribution whose mean is the sum of the means and whose variance is the sum of the variances. Scaling this by N, the number of X's summed, gives the sample mean (a.k.a. average). If the expected value of X is μ and the variance of X is σ², the expected value of the sample mean is also μ and its variance is σ²/N.
Since a Uniform(0,1) has mean 0.5 and variance 1/12, your algorithm will generate results that are pretty close to normally distributed, with mean 0.5 and variance 1/12000. Consequently 99.7% of the outcomes should fall within ±3 standard deviations of the mean, i.e., in the range 0.5 ± 0.0274.
This is a ridiculously inefficient way to generate normals. Better alternatives include the Box-Muller method, Polar method, or ziggurat method.
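As a sketch of one such alternative (using only Python's standard library), the Box-Muller method turns two independent Uniform(0,1) draws into two independent standard normal draws, so each normal costs one uniform instead of a thousand:

```python
import math
import random

def box_muller():
    """Return two independent standard normal variates from two uniforms."""
    u1 = random.random()
    u2 = random.random()
    # log1p(-u1) computes log(1 - u1), which is safe because u1 < 1;
    # this avoids log(0) when random() happens to return exactly 0.
    r = math.sqrt(-2.0 * math.log1p(-u1))
    z0 = r * math.cos(2.0 * math.pi * u2)
    z1 = r * math.sin(2.0 * math.pi * u2)
    return z0, z1
```

NumPy's np.random.normal already uses a fast method internally, so in practice you would only hand-roll this when, as the question says, all you have is a uniform generator.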
The thing making this random is the random() function being called: random() generates one (for most practical purposes) uniformly random float between 0 and 1.
>>> random()
0.1759916412898097
>>> random()
0.5489228122596088
etc.
The rest of it just adds each random draw to a running total and then divides by the number of draws, essentially finding the average of all 1000 randoms, which, as Cyber pointed out, is still random but tightly concentrated around 0.5.
I'm trying to calculate the standard deviation for a distribution and keep getting two different results from two approaches. It doesn't make much sense to me - could someone explain why this is happening?
scipy.stats.binom(189, 100/189).std()
6.8622115305451707
scipy.stats.tstd([1]*100 + [0]*89)
0.50047821327986164
Why aren't those two numbers equal?
The basic reason is that you're taking the standard deviation of two quite different things there. I think you're misunderstanding what scipy.stats.binom does. From the documentation:
The probability mass function for binom is:
binom.pmf(k) = choose(n,k) * p**k * (1-p)**(n-k)
for k in {0,1,...,n}.
binom takes n and p as shape parameters.
When you do binom(189, 100/189), you are creating a distribution that could take on any value from 0 to 189. This distribution unsurprisingly has a much larger variance than the other sample data you're using, which is restricted to values of either zero or one.
It looks like what you want would be scipy.stats.binom(1, 100/189).std(). However, you still can't expect the exact same value as what you're getting with your sample data, because the binom.std is computing the standard deviation of the overall distribution, whereas the other version (scipy.stats.tstd([1]*100 + [0]*89)) is computing the standard deviation only of a sample. If you increase the size of your sample (e.g., do scipy.stats.tstd([1]*1000 + [0]*890)), the sample standard deviation will approach the value you're getting from binom.std.
You can also get the population (not sample) std by using scipy.std or numpy.std instead of scipy.stats.tstd. scipy.stats.tstd doesn't have a ddof option to let you choose the degrees of freedom, and always computes a sample std.
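To see the two pieces line up, here is a small sketch (assuming NumPy and SciPy): a single-trial binom(1, p) distribution has std sqrt(p*(1-p)), and because the sample mean of this particular 0/1 data is exactly p, the population std of the sample matches it to rounding error:

```python
import numpy as np
from scipy import stats

p = 100 / 189
data = [1] * 100 + [0] * 89

dist_std = stats.binom(1, p).std()  # sqrt(p * (1 - p)) for a single 0/1 trial
pop_std = np.std(data)              # population std (ddof=0) of the 0/1 sample

# Both are ~0.4992; stats.tstd(data) would give the slightly larger
# sample std (ddof=1) instead.
```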
I am trying to perform inverse sampling from a custom probability density function (PDF). I am just wondering if this is even possible, i.e. integrating the PDF, inverting the result, and then solving it for a given uniform number. The PDF has the shape

f(x; alpha, m) = (1/(Gamma(alpha+1) * x)) * ((x*(alpha+1)/m)^(alpha+1)) * exp(-(alpha+1)*x/m)

where x > 0 and m = mean(x). From the shape, only values below 150 are relevant, and for what I am trying to do values below 80 are good enough. Extending the range shouldn't be too hard, though.
I have tried the inversion method, but only found a numerical way to do the integral, which isn't necessarily helpful considering that I need to invert the function to solve:

u = integral of f(x; alpha, m) dx from 0 to y, where y is unknown and u is a uniform random variable between 0 and 1.
The integral has a gamma function and an incomplete gamma function, so trying to invert it is kind of a mess. Any help is welcome.
Thanks a bunch in advance.
Cheers
Assuming you mean that you're trying to randomly choose values which will be distributed according to your PDF, then yes, it is possible. This is described on Wikipedia as inverse transform sampling. Basically, it's just what you said: integrate the PDF to produce the cumulative distribution (CDF), invert it (which can be done ahead of time), and then choose a random number and run it through the inverted CDF.
If your domain is 0 to positive infinity, your distribution appears to match the gamma distribution, which is built into NumPy and SciPy, with shape k = alpha+1 and scale theta = mean(x)/(alpha+1) (so that the mean comes out to k * theta = mean(x)).
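A sketch of that observation (assuming NumPy; the values of alpha and the target mean m are hypothetical, chosen only for illustration): drawing from the built-in gamma sampler with shape alpha+1 and scale m/(alpha+1) reproduces the desired mean without any numerical inversion:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, m = 2.0, 40.0               # hypothetical parameters: alpha and target mean m

k = alpha + 1                      # gamma shape
theta = m / (alpha + 1)            # gamma scale, so that mean = k * theta = m

samples = rng.gamma(shape=k, scale=theta, size=100_000)
# samples.mean() should come out close to m = 40
```

If the match to the gamma family turned out to be only approximate, scipy.stats.gamma.ppf applied to uniform draws would be the numerical version of the inverse transform described above.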