I have a question about bivariate standard normal random variable.
Suppose
M(x,y,rho)=P
(X is less than x,Y is less than y) where X and Y are bivariate standard normal random variables with correlation rho.
How can I solve M in Python. Suppose I want to solveM(1,3,0.9). How can I solve it in Python?
I have searched numpy.random.multivariate_normal, but I am still confused.
You are trying to compute the CDF of a multivariate normal distribution. Here is a way to do it:
from scipy.stats import mvn
import numpy as np
low = np.array([-100, -100])
upp = np.array([1, 3])
mu = np.array([0, 0])
S = np.array([[1,0.9],[0.9,1]])
p,i = mvn.mvnun(low,upp,mu,S)
print p
The low bound array is an approximation in negative infinity. You can make those numbers more negative if you want more precision.
Related
I was looking for a way to obtaining the mean value (Expected Value) from a drawn distribution that I used to fit a Kernel Density Estimation from scipy.stats.gaussian_kde. I remember from my statistics class that the Expected Value is just the Integral over the pdf(x) * x from -infinity to infinity:
I used the the scipy.integrate.quad function to do this task in my code, but I ran into this apperently strange behavior (that might have something to do with the bandwith parameter from the KDE).
Problem
import matplotlib.pyplot as plt
import numpy as np
import random
from scipy.stats import norm, gaussian_kde
from scipy.integrate import quad
from sklearn.neighbors import KernelDensity
np.random.seed(42)
# Generating sample data
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),\
np.random.normal(loc=4,scale=2.0,size=500)])
kde = gaussian_kde(test_array,bw_method=0.5)
X_range = np.arange(-16,20,0.1)
y_list = []
for X in X_range:
pdf = lambda x : kde.evaluate([[x]])
y_list.append(pdf(X))
y = np.array(y_list)
_ = plt.plot(X_range,y)
# Integrate over pdf * x to obtain the mean
mean_integration_low_bw = quad(lambda x: x * pdf(x), a=-np.inf, b=np.inf)[0]
# Calculate the cdf at point of the mean
zero_int_low = quad(lambda x: pdf(x), a=-np.inf, b=mean_integration_low_bw)[0]
print("The mean after integration: {}\n".format(round(mean_integration_low_bw,4)))
print("F({}): {}".format(round(mean_integration_low_bw,4),round(zero_int_low,4)))
plt.axvline(x=mean_integration_low_bw,color ="r")
plt.show()
If I execute this code I get a strange behavior of the result for the integrated mean and the cumulative distribution function at the point of the calculated mean:
First Question:
In my opinion it should always show: F(Mean) = 0.5 or am I wrong here? (Does this only apply to symetric distributions?)
Second Question:
The more stranger thing ist, that the value for the integrated mean does not change for the bandwith parameter. In my opinion the mean should change too if the shape of the underlying distribution differs. If i set the bandwith to 5 I got the following graph:
Why is the mean value still the same if the curve now has a different shape (due to the wider bandwith)?
I hope those question not only arise due to my flawed understanding of statistics ;)
Your initial data is generate here
# Generating sample data
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),\
np.random.normal(loc=4,scale=2.0,size=500)])
So you have 500 samples from a distribution with mean 4 and 100 samples from a distribution with mean -10, you can predict the expected average (500*4-10*100)/(500+100) = 1.66666.... that's pretty close to the result given by your code, and also very consistent with the result obtained from the with the first plot.
I am trying to do an integral via importance sampling formula. The algorithm is as follows:
Code
The value for the integral should give: 0.838932960013382
Also I must use the next
probability distribution to generate 1,000,000 random numbers between 0 and 1. I also use the next weight function.
Finally with these numbers I must calculate this formula.
But I am getting the numerical value wrong, and I am not sure about the calculation for the 1.000.000 random numbers.
You don't compute weights normalization and inverse properly, check your math
Code below, Python 3.9, Win 10 x64
import numpy as np
from scipy import integrate
def f(x):
return 1.0/((np.exp(x)+1.0)*np.sqrt(x))
def w(x):
return 0.5/np.sqrt(x)
def inv(x):
return x*x
rng = np.random.default_rng()
N = 100000
x = rng.random(N)
p = inv(x)
q = f(p)/w(p)
print(np.mean(q))
print(integrate.quad(f, 0, 1))
prints
0.8389948486429488
(0.8389329600133858, 2.0727863869751673e-13)
looks good?
I'm trying to compute the integral of a Gaussian in python like so:
from math import exp
from scipy import stats, integrate
import scipy.interpolate as interpolate
from numpy import cumsum, random, histogram, linspace, zeros, inf, pi,sqrt
import matplotlib.pyplot as plt
A = 1
mu = 0
sigma = 1
p = lambda x: A * exp(-(((x-mu)**2))/(2*(sigma**2)))
F = lambda x: integrate.quad(p, -inf, x)[0]
Ns = 1000;
x = linspace(-50,50,Ns);
y = zeros(Ns)
yy = zeros(Ns)
for i in range(Ns):
y[i] = F(x[i])
yy[i]= p(x[i])
plt.plot(x,y)
plt.plot(x,yy)
plt.show()
but if one looks on the plot, there is a drop to zero between the range 21.0 to 22, and after 38+.
does anyone know why it is doing that? Rounding errors perhaps?
thanks!!
I think the key to understand this problem is to recall that numerical integration methods calculate a weighted sum of function values at specific knots.
The gaussian quickly goes to zero as you deviate from the mean, so basically on the interval between (-50, 50) most of the function values are zero. If the integration method fails to sample points from your small area where the function is non-zero, it will see the whole function as completely flat and thus gives you the integral 0.
So what can you do?
Instead of choosing the fixed interval of (-50,50), choose an interval based on smaller sigma values, to avoid integrating over an overly large interval of zeros.
If you go only 5, 10 or 20 standard deviations to the left and to the right, you will not see this issue, and you still have a very accurate integration result.
This is the result if you integrate from 10 standard deviations to the left and to the right.
What function can I use in Python if I want to sample a truncated integer power law?
That is, given two parameters a and m, generate a random integer x in the range [1,m) that follows a distribution proportional to 1/x^a.
I've been searching around numpy.random, but I haven't found this distribution.
AFAIK, neither NumPy nor Scipy defines this distribution for you. However, using SciPy it is easy to define your own discrete distribution function using scipy.rv_discrete:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
def truncated_power_law(a, m):
x = np.arange(1, m+1, dtype='float')
pmf = 1/x**a
pmf /= pmf.sum()
return stats.rv_discrete(values=(range(1, m+1), pmf))
a, m = 2, 10
d = truncated_power_law(a=a, m=m)
N = 10**4
sample = d.rvs(size=N)
plt.hist(sample, bins=np.arange(m)+0.5)
plt.show()
I don't use Python, so rather than risk syntax errors I'll try to describe the solution algorithmically. This is a brute-force discrete inversion. It should translate quite easily into Python. I'm assuming 0-based indexing for the array.
Setup:
Generate an array cdf of size m with cdf[0] = 1 as the first entry, cdf[i] = cdf[i-1] + 1/(i+1)**a for the remaining entries.
Scale all entries by dividing cdf[m-1] into each -- now they actually are CDF values.
Usage:
Generate your random values by generating a Uniform(0,1) and
searching through cdf[] until you find an entry greater than your
uniform. Return the index + 1 as your x-value.
Repeat for as many x-values as you want.
For instance, with a,m = 2,10, I calculate the probabilities directly as:
[0.6452579827864142, 0.16131449569660355, 0.07169533142071269, 0.04032862392415089, 0.02581031931145657, 0.017923832855178172, 0.013168530260947229, 0.010082155981037722, 0.007966147935634743, 0.006452579827864143]
and the CDF is:
[0.6452579827864142, 0.8065724784830177, 0.8782678099037304, 0.9185964338278814, 0.944406753139338, 0.9623305859945162, 0.9754991162554634, 0.985581272236501, 0.9935474201721358, 1.0]
When generating, if I got a Uniform outcome of 0.90 I would return x=4 because 0.918... is the first CDF entry larger than my uniform.
If you're worried about speed you could build an alias table, but with a geometric decay the probability of early termination of a linear search through the array is quite high. With the given example, for instance, you'll terminate on the first peek almost 2/3 of the time.
Use numpy.random.zipf and just reject any samples greater than or equal to m
I want to use the gaussian function in python to generate some numbers between a specific range giving the mean and variance
so lets say I have a range between 0 and 10
and I want my mean to be 3 and variance to be 4
mean = 3, variance = 4
how can I do that ?
Use random.gauss. From the docs:
random.gauss(mu, sigma)
Gaussian distribution. mu is the mean, and sigma is the standard deviation. This is slightly
faster than the normalvariate() function defined below.
It seems to me that you can clamp the results of this, but that wouldn't make it a Gaussian distribution. I don't think you can satisfy all the constraints simultaneously. If you want to clamp it to the range [0, 10], you could get your numbers:
num = min(10, max(0, random.gauss(3, 4)))
But then the resulting distribution of numbers won't be truly Gaussian. In this case, it seems you can't have your cake and eat it, too.
There's probably a better way to do this, but this is the function I ended up creating to solve this problem:
import random
def trunc_gauss(mu, sigma, bottom, top):
a = random.gauss(mu,sigma))
while (bottom <= a <= top) == False:
a = random.gauss(mu,sigma))
return a
If we break it down line by line:
import random
This allows us to use functions from the random library, which includes a gaussian random number generator (random.gauss).
def trunc_gauss(mu, sigma, bottom, top):
The function arguments allow us to specify the mean (mu) and variance (sigma), as well as the top and bottom of our desired range.
a = random.gauss(mu,sigma))
Inside the function, we generate an initial random number according to a gaussian distribution.
while (bottom <= a <= top) == False:
a = random.gauss(mu,sigma))
Next, the while loop checks if the number is within our specified range, and generates a new random number as long as the current number is outside our range.
return a
As soon as the number is inside our range, the while loop stops running and the function returns the number.
This should give a better approximation of a gaussian distribution, since we don't artificially inflate the top and bottom boundaries of our range by rounding up or down the outliers.
I'm quite new to Python, so there are most probably simpler ways, but this worked for me.
I was working on some numerical analytical computation and I ran into this python tutorial site - http://www.python-course.eu/weighted_choice_and_sample.php
Now, this is what I proffer as a solution should anyone be too busy as to not hit the site.
I don't know how many gaussian values you need so I'll go with 100 as n, mu you gave as 3 and variance as 4 which makes sigma = 2. Here's the code:
from random import gauss
n = 100
values = []
frequencies = {}
while len(values) < n:
value = gauss(3, 2)
if 0 < value < 10:
frequencies[int(value)] = frequencies.get(int(value), 0) + 1
values.append(value)
print(values)
I hope this helps. You can get the plot as well. It's all in the tutorials.
If you have a small range of integers, you can create a list with a gaussian distribution of the numbers within that range and then make a random choice from it.
import numpy as np
from random import uniform
from scipy.special import erf,erfinv
import math
def trunc_gauss(mu, sigma,xmin=np.nan,xmax=np.nan):
"""Truncated Gaussian distribution.
mu is the mean, and sigma is the standard deviation.
"""
if np.isnan(xmin):
zmin=0
else:
zmin = erf((xmin-mu)/sigma)
if np.isnan(xmax):
zmax=1
else:
zmax = erf((xmax-mu)/sigma)
y = uniform(zmin,zmax)
z = erfinv(y)
# This will not come up often but if y >= 0.9999999999999999
# due to the truncation of the ervinv function max z = 5.805018683193454
while math.isinf(z):
z = erfinv(uniform(zmin,zmax))
return mu + z*sigma
You can use minimalistic code for 150 variables:
import numpy as np
s = np.random.normal(3,4,150) #<= mean = 3, variance = 4
print(s)
Normal distribution is another like random, stochastic distribution.
So, we can check it by:
import seaborn as sns
import matplotlib.pyplot as plt
AA1_plot = sns.distplot(s, kde=True, rug=False)
plt.show()