finding probability in multivariate normal distribution - python

I am using Python. I know that to find a probability in a multivariate normal distribution I have to use the following density:
$$f_X(x_1, \ldots, x_k) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
where x = [x1, x2].
I have different values of x1 and x2, but here I have to find the probability for:
0.5 < x1 < 1.5 and 4.5 < x2 < 5.5
I know how to use this formula for a single value of x1 and x2, but I am confused in this case. Please help.

What you need to do is find the integral of the function over the rectangle bounded by 0.5 < x1 < 1.5 and 4.5 < x2 < 5.5.
As a quick and dirty solution, you could use this code to do a two-variable Riemann sum to estimate the integral. A Riemann sum just divides the rectangle into small regions and approximates the volume over each region as if the function were flat.
This assumes you've defined your distribution as the function f.
x1Low = 0.5
x1Hi = 1.5
x2Low = 4.5
x2Hi = 5.5
x1steps = 1000
x2steps = 1000
x1resolution = (x1Hi - x1Low) / x1steps
x2resolution = (x2Hi - x2Low) / x2steps
area = x1resolution * x2resolution  # area of one small cell
x1vals = [x1Low + i * x1resolution for i in range(x1steps)]
x2vals = [x2Low + i * x2resolution for i in range(x2steps)]
total = 0.0  # named total to avoid shadowing the built-in sum()
for x1 in x1vals:
    for x2 in x2vals:
        # treat f as flat over each cell, evaluated at the cell's lower-left corner
        total += area * f(x1, x2)
print(total)
Keep in mind that this sum is only an approximation, and not a great one either. It will over- or under-estimate the integral in regions where the function changes quickly.
If you need more accuracy, you can try implementing the trapezoidal rule or Simpson's rule, or look into SciPy's numerical integration tools.
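As a rough sketch of the Simpson's rule suggestion, here is a two-dimensional composite Simpson's rule in plain Python. The density parameters (means 1 and 5, unit standard deviations, zero correlation) are made up for illustration, and the helper names `bivariate_normal_pdf`, `simpson_weights` and `simpson_2d` are hypothetical, not from the question.

```python
import math

def bivariate_normal_pdf(x1, x2, mu1, mu2, s1, s2, rho):
    """Density of a bivariate normal with std devs s1, s2 and correlation rho."""
    z1 = (x1 - mu1) / s1
    z2 = (x2 - mu2) / s2
    norm = 2.0 * math.pi * s1 * s2 * math.sqrt(1.0 - rho ** 2)
    quad = (z1 ** 2 - 2.0 * rho * z1 * z2 + z2 ** 2) / (1.0 - rho ** 2)
    return math.exp(-0.5 * quad) / norm

def simpson_weights(n):
    """Composite Simpson weights 1, 4, 2, ..., 4, 1 for an even number n of intervals."""
    return [1 if i in (0, n) else (4 if i % 2 else 2) for i in range(n + 1)]

def simpson_2d(f, a1, b1, a2, b2, n=50):
    """Integrate f(x1, x2) over [a1, b1] x [a2, b2] with composite Simpson's rule."""
    if n % 2:
        raise ValueError("n must be even")
    h1 = (b1 - a1) / n
    h2 = (b2 - a2) / n
    w = simpson_weights(n)
    total = 0.0
    for i in range(n + 1):
        for j in range(n + 1):
            total += w[i] * w[j] * f(a1 + i * h1, a2 + j * h2)
    return total * h1 * h2 / 9.0

# Assumed parameters, chosen only for illustration.
def f(x1, x2):
    return bivariate_normal_pdf(x1, x2, mu1=1.0, mu2=5.0, s1=1.0, s2=1.0, rho=0.0)

prob = simpson_2d(f, 0.5, 1.5, 4.5, 5.5)
print(prob)
```

With a smooth density like this one, a 50 x 50 grid is usually far more accurate than the 1000 x 1000 Riemann sum, because Simpson's rule fits parabolas instead of flat pieces.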

Related

Python: How can I test for a relatively straight line in a series of cubic lines?

I have a collection of curved lines, representing the third degree polynomial line of best fit for some datasets.
I want to differentiate relatively flat lines, filtering these plots, for further analyses.
For example I want to filter subplots 20935, 21004, 21010, 18761, 21037.
How can I do this, with a list of floats as input for these lines?
(using Python 3.8, NumPy, math, and matplotlib in an Anaconda env)
If you have a list of xs and their respective ys, you can compute the slope between each pair of consecutive points and check whether it stays (nearly) constant.
threshold = 0.001  # add your precision here; zero indicates a perfectly straight line
is_straight_line = True
slope = (y[1] - y[0]) / (x[1] - x[0])
for i in range(2, len(x)):
    # slope of the segment between consecutive points i-1 and i
    s = (y[i] - y[i-1]) / (x[i] - x[i-1])
    if abs(s - slope) > threshold:
        is_straight_line = False
        break
print(is_straight_line)
If you need the computation to be efficient, you should consider using NumPy instead.
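A possible NumPy version of the same check, as a sketch only. The function name `is_nearly_straight` and the sample data are made up here; it compares every segment slope against the first one, just like the loop.

```python
import numpy as np

def is_nearly_straight(x, y, threshold=0.001):
    """Return True if all consecutive-point slopes are within threshold of the first slope."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slopes = np.diff(y) / np.diff(x)  # slope of each segment between neighbours
    return bool(np.all(np.abs(slopes - slopes[0]) <= threshold))

# Illustrative data: one straight line, one clearly curved cubic.
xs = np.linspace(0.0, 10.0, 50)
print(is_nearly_straight(xs, 2.0 * xs + 1.0))
print(is_nearly_straight(xs, 0.05 * xs ** 3))
```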
Knowledge of first-year calculus is assumed. There's a geometric property called "curvature" that measures how much a curve bends at a given point (it is the inverse of the radius of the osculating circle at that point).
For a graph y = f(x), the curvature is k = |f''(x)| / (1 + f'(x)^2)^(3/2). We can use this to develop a formula for a cubic function a*x^3 + b*x^2 + c*x + d with coefficients [a, b, c, d], evaluated at a given x.
def cubic_curvature(a, b, c, d, x):
    # numerator is |f''(x)|, denominator is (1 + f'(x)^2)^(3/2)
    k = abs(6*a*x + 2*b) / (1 + (3*a*x**2 + 2*b*x + c)**2) ** 1.5
    return k
More general algorithms can be created for any polynomial, possibly with assistance from the sympy library depending on your needs.
With this in mind, you can set some threshold for curvature that determines whether the cubic is "straight" enough given its coefficients (I believe scipy or similar should be able to give you these from a list of points) and the x-value to be evaluated at (try the median independent variable).
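Putting that together, here is a hedged sketch that gets the cubic coefficients from the points with `np.polyfit` and tests the curvature at the median x. The function name `is_flat`, the threshold value 0.01 and the sample data are assumptions picked for illustration; you would tune the threshold to your plots.

```python
import numpy as np

def cubic_curvature(a, b, c, d, x):
    """Curvature of a*x^3 + b*x^2 + c*x + d at x."""
    return abs(6*a*x + 2*b) / (1 + (3*a*x**2 + 2*b*x + c)**2) ** 1.5

def is_flat(xs, ys, max_curvature=0.01):
    """Fit a cubic and check its curvature at the median x against a threshold."""
    # np.polyfit returns coefficients from highest degree to lowest: [a, b, c, d]
    a, b, c, d = np.polyfit(xs, ys, 3)
    x_mid = float(np.median(xs))
    return bool(cubic_curvature(a, b, c, d, x_mid) <= max_curvature)

# Illustrative data: a straight line and a parabola with its vertex at the median.
xs = np.linspace(0.0, 10.0, 30)
print(is_flat(xs, 0.5 * xs + 2.0))
print(is_flat(xs, (xs - 5.0) ** 2))
```

Checking a single x-value can miss a bend elsewhere on the curve, so evaluating the curvature at several x-values and taking the maximum may be safer.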

How to estimate the integral of an oscillating curve using the Monte Carlo method (in Python)

I am trying to estimate the integral below using the Monte-Carlo method (in python):
I am using 1000 random points to estimate the integral. Here's my code:
import random
import numpy as np

N = 1000  # total number of points to be generated
def f(x):
    return x*np.cos(x)
## Points between the x-axis and the curve will be stored in these empty lists.
red_points_x = []
red_points_y = []
blue_points_x = []
blue_points_y = []
## The loop checks if a point is between the x-axis and the curve or not.
i = 0
while i < N:
    x = random.uniform(0, 2*np.pi)
    y = random.uniform(3.426*np.cos(3.426), 2*np.pi*np.cos(2*np.pi))
    if (0<= x <= np.pi and 0<= y <= f(x)) or (np.pi/2 <= x <= 3*np.pi/2 and f(x) <= y <= 0) or (3*np.pi/2 <= x <= 2*np.pi and 0 <= y <= f(x)):
        red_points_x.append(x)
        red_points_y.append(y)
    else:
        blue_points_x.append(x)
        blue_points_y.append(y)
    i +=1
area_of_rectangle= (2*np.pi)*(2*np.pi*np.cos(2*np.pi))
area= area_of_rectangle*(len(red_points_x))/N
print(area)
Output:
7.658813015245341
But that's far from 0 (the analytic solution)
Here's a visual representation of the area I am trying to plot:
Am I doing something wrong or missing something in my code? Please help, your help will be much appreciated. Thank you so much in advance.
TLDR: I believe the way you calculate the approximation is slightly wrong.
Looking at the Wikipedia article on Monte Carlo integration, the following definition is made:
https://en.wikipedia.org/wiki/Monte_Carlo_integration#Example
V corresponds to the volume (area in this case) of the region of interest, x = [0, 2pi], y = [3.426*cos(3.426), 2pi*cos(2pi)].
So Q_N is the volume divided by N times the sum of the function evaluated at the randomly generated points. Hence:
total = 0
i = 0
while i < N:
    x = random.uniform(0, 2 * np.pi)
    total += f(x)
    i += 1
area_of_rectangle = (2*np.pi)*(2*np.pi*np.cos(2*np.pi)-3.426 * np.cos(3.426))
area = (area_of_rectangle * total) / N
This code yielded an average result of 0.0603 for 1000 runs with N=1000 (to remove the influence of randomly generated values). As you increase N the accuracy increases.
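For comparison, here is a self-contained sketch of the mean-value estimator in which V is taken to be the length of the sampled x-interval, 2*pi, since only x is drawn at random. That choice of V is this sketch's reading of the Wikipedia formula, not something stated in the code above, and the function name `mean_value_estimate` and the fixed seed are made up for illustration.

```python
import math
import random

def f(x):
    return x * math.cos(x)

def mean_value_estimate(func, a, b, n, rng):
    """Estimate the integral of func over [a, b] as (b - a) times the average of func at uniform samples."""
    total = 0.0
    for _ in range(n):
        total += func(rng.uniform(a, b))
    return (b - a) * total / n

rng = random.Random(0)  # fixed seed so the run is repeatable
estimate = mean_value_estimate(f, 0.0, 2.0 * math.pi, 100_000, rng)
print(estimate)
```

The estimate should land close to the analytic value of 0, with a spread that shrinks roughly like 1/sqrt(N).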
You are on the right track!
A couple pointers to put you on course...
Make your bounding box bigger in the y dimension to alleviate some of the confusing math. Yes, it will converge faster if you get it to "just touch" the max and min, but don't shoot for that yet. Heck, just make it -5 < y < 10 and you will have a nice (larger) box that covers the area you want to integrate. So, change your y generation to that and also change the area of your box calculation.
Don't change x, you have it right: 0 < x < 2*pi.
When you are checking whether a point is "under the curve", you do NOT need to check the x value... right? Just check whether y is between f(x) and the x-axis. If so, it is "red". More on this in the next point.
Also on the point above, you will need another category for the points that are BELOW the x-axis, because you will want to reduce your total by that amount. An alternate "trick" is to shift your whole function up by some constant such that the entire integral is positive, and then reduce your total by the size of that rectangle (constant * width).
Also, as you work on this, plot your points with matplotlib. It should be very easy, the way you have your points gathered, to overlay scatter plots with what you have and see if it looks visually accurate!
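The pointers above could be sketched roughly like this. It uses the -5 < y < 10 box, counts points above the axis as positive and points below as negative, and leaves out the plotting. The function name `hit_or_miss` and the fixed seed are made up for illustration, and the box is assumed to contain the whole curve and the x-axis.

```python
import math
import random

def f(x):
    return x * math.cos(x)

def hit_or_miss(func, x_lo, x_hi, y_lo, y_hi, n, rng):
    """Signed hit-or-miss estimate; the box must contain the curve and the x-axis."""
    above = 0  # points between the axis and a positive part of the curve
    below = 0  # points between a negative part of the curve and the axis
    for _ in range(n):
        x = rng.uniform(x_lo, x_hi)
        y = rng.uniform(y_lo, y_hi)
        fx = func(x)
        if 0.0 <= y <= fx:
            above += 1
        elif fx <= y <= 0.0:
            below += 1
    box_area = (x_hi - x_lo) * (y_hi - y_lo)
    return box_area * (above - below) / n

rng = random.Random(1)  # fixed seed so the run is repeatable
estimate = hit_or_miss(f, 0.0, 2.0 * math.pi, -5.0, 10.0, 200_000, rng)
print(estimate)
```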
Comment me back w/ further q's... you got this!

Monte Carlo Integration with Python

Estimate the following integral with Monte Carlo integration:
I am trying to do Monte Carlo Integration on the problem below, where p(x) is a Gaussian distribution with a mean of 1 and a variance of 2. (see image).
I was told that once we draw samples from a normal distribution the pdf vanishes in the integral. Please explain this concept and how do I solve this in Python. Below is my attempt.
import math
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt

def func(x):
    return (math.exp(x))*x
mu = 1
sigma = sqrt(2)
N = 1000
areas = []
for i in range(N):
    xrand = np.zeros(N)
    for i in range (len(xrand)):
        xrand[i] = np.random.normal(mu, sigma)
    integral = 0.0
    for i in range (N):
        integral += func(xrand[i])/N
    answer = integral
    areas.append(answer)
plt.title("Distribution of areas calculated")
plt.hist(areas, 60, ec = 'black')
plt.xlabel("Areas")
integral
Monte Carlo integration is a way of approximating complex integrals without computing their closed form solution. To answer your question, the PDF vanishes because all you need to do is to 1) sample some random value from the specified normal distribution, 2) calculate the value of the function in the integrand, and 3) compute the average of these values. Note that the PDF becomes irrelevant in the computation; it’s only relevant insofar as assuring that more likely values are more frequently sampled. You might understand this as taking the weighted average, if that makes things more intuitive.
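In symbols, the three steps amount to the following, where f is the integrand without the density and the x_i are samples drawn from p (the notation is introduced here for illustration):

```latex
\int f(x)\,p(x)\,dx \;=\; \mathbb{E}_{x \sim p}\left[f(x)\right] \;\approx\; \frac{1}{N}\sum_{i=1}^{N} f(x_i), \qquad x_i \sim \mathcal{N}(\mu, \sigma^2)
```

The density p(x) does not appear in the sum because it is already accounted for by how often each x_i gets drawn.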
Here is a Python implementation based on your original source code.
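import math
import numpy as np
import matplotlib.pyplot as plt

def func(x):
    return x * math.exp(x)
def monte_carlo(n_sample, mu, sigma):
    val_lst = []
    for _ in range(n_sample):
        # draw one sample from the normal distribution and evaluate the integrand
        x = np.random.normal(mu, sigma)
        val_lst.append(func(x))
    return np.mean(val_lst)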
You can change func to be any function of your choice to perform a Monte Carlo approximation of that function. You can also edit the parameters of the monte_carlo function if you are given a different probability distribution.
Here is a snippet you can use to visualize the gradual convergence of the Monte Carlo approximation. As you might expect, the values will converge with more samples, i.e. as you increase the value of n_sample.
MAX_SAMPLE = 200  # Adjust this value as you need
x = np.arange(1, MAX_SAMPLE + 1)  # start at 1: the mean of zero samples is undefined
y = [monte_carlo(i, 1, np.sqrt(2)) for i in x]
plt.plot(x, y)
plt.show()
The resulting plot will show you the value the estimate converges to, which approximates the value computed from the closed-form solution of the integral.
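If the integral in the question is the expectation of x*exp(x) under the N(1, 2) distribution (an assumption, since the integral itself is only shown as an image), it has a closed form to compare against: E[X e^X] = (mu + sigma^2) * exp(mu + sigma^2 / 2). The sketch below uses only the standard library. The extra `rng` parameter and the fixed seed are additions for repeatability, not part of the code above.

```python
import math
import random

def func(x):
    return x * math.exp(x)

def monte_carlo(n_sample, mu, sigma, rng):
    """Average func over n_sample draws from a normal distribution."""
    total = 0.0
    for _ in range(n_sample):
        total += func(rng.gauss(mu, sigma))
    return total / n_sample

mu, sigma = 1.0, math.sqrt(2.0)
# Closed form of E[X e^X] for X ~ N(mu, sigma^2), from the derivative of the
# moment generating function at t = 1.
exact = (mu + sigma ** 2) * math.exp(mu + sigma ** 2 / 2.0)
estimate = monte_carlo(400_000, mu, sigma, random.Random(0))
print(exact, estimate)
```

The integrand has a heavy right tail, so the estimate converges slowly; a few rare large samples carry much of the total.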

Polynomial regression (scikit learn), finding roots (Y = 0)

I have used scikit learn polynomial regression to fit a polynomial of order 4 to some data. I am interested in finding the roots or in other words at which point the curve crosses the x-axis when y = 0. When I looked at numpy's polynomial class, I found that they have a function called roots to do this. Is there something similar to this with scikit learn?
I have written the code below which checks for change from negative to positive giving me an approximate location of where the curve crosses the x-axis when y = 0. But I want to find if there is a better method.
#Equation for reference
YY4 = clf4.intercept_[0] + clf4.coef_[0][1] * XX + clf4.coef_[0][2]*np.power(XX,2) + clf4.coef_[0][3]*np.power(XX,3) + clf4.coef_[0][4]*np.power(XX,4)
def neutralState(Y): #Where Y takes the y-axis values
    for i in range(1,len(Y)):
        if (Y[i-1] < 0 and Y[i] > 0) or (Y[i-1] > 0 and Y[i] < 0): #ignore the very first value
            NS_val = (Y[i])
            NS_index = (i)
    return(NS_val,NS_index) #return the closest value to zero
print(neutralState(YY4))

Calculate the exact integral in Python

I need to write a python code to calculate the exact value of the integral (-5, 5) of 1/(1+x^2).
I know the answer is 2arctan(5), which is roughly equal to 2.746801...
I have below the code I have written, however I am getting a slightly different answer and I was wondering if there is anything I can do to make this code more accurate? Thanks for any help!
## The function to be integrated
def func(x):
    return 1/(1 + x**2)
## Defining variables
a = -5.0
b = 5.0
dx = 1.0
Area = 0
## Number of trapezoids
n = int((b-a)/dx)
## Loop to calculate area and sum
for i in range(1, n+1):
    x0 = a + (i-1)*dx
    x1 = a + i*dx
    ## Area of each trapezoid
    Ai = dx*(func(x0) + func(x1))/2.0
    ## Cumulative sum of areas
    Area = Area + Ai
print("The exact value is: ", Area)
The answer I am getting is 2.756108...
I know it's a small difference, however, it is a difference and I would like to try for something more exact.
The reason you are getting an approximate value for the integral is that you are using an approximation technique (the trapezoidal rule, which approximates the function by straight-line segments between sample points).
There are two ways to evaluate an integral: analytically or numerically (by approximation). Your method is of the second variety, and since it's an approximation it will generate a value that is within a certain margin of error of the real value.
The point of my answer is that there is no way for you to calculate the exact value of the integral using a numeric approach (definitely not in the case of this function). So you will have to settle for a certain margin of error that you're willing to accept and then choose a delta-x sufficiently small to get you within that range.
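To see how the margin of error depends on delta-x, here is a small sketch that repeats the trapezoid calculation with more and more trapezoids and compares each result with 2*arctan(5). The function name `trapezoid` and the particular values of n are chosen only for illustration.

```python
import math

def func(x):
    return 1.0 / (1.0 + x ** 2)

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n equal-width trapezoids on [a, b]."""
    dx = (b - a) / n
    total = 0.5 * (f(a) + f(b))  # endpoints count half
    for i in range(1, n):
        total += f(a + i * dx)
    return total * dx

exact = 2.0 * math.atan(5.0)
for n in (10, 100, 1000, 10000):
    approx = trapezoid(func, -5.0, 5.0, n)
    print(n, approx, abs(approx - exact))
```

With n = 10 this is the same calculation as in the question (dx = 1.0). The error keeps shrinking as n grows, but it never reaches exactly zero.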
