Discrepancy between log_prob and manual calculation - python

I want to define a multivariate normal distribution with mean [1, 1, 1] and a variance-covariance matrix with 0.3 on the diagonal. After that I want to calculate the log likelihood of the data point [2, 3, 4].
By torch distributions
import torch
import torch.distributions as td

input_x = torch.tensor([2.0, 3.0, 4.0])
loc = torch.ones(3)
scale = torch.eye(3) * 0.3

mvn = td.MultivariateNormal(loc=loc, scale_tril=scale)
mvn.log_prob(input_x)
tensor(-76.9227)
From scratch
By using the formula for the log likelihood of a multivariate normal,
log p(x) = -1/2 * ( k*log(2π) + log|Σ| + (x - μ)ᵀ Σ⁻¹ (x - μ) ),
we obtain:
import numpy as np

# normalising constant: -log( sqrt( (2*pi)^k * |Sigma| ) ), with |Sigma| = 0.3**3
first_term = (2 * np.pi * 0.3) ** 3
first_term = -np.log(np.sqrt(first_term))

# quadratic term: -1/2 * (x - mu)^T Sigma^-1 (x - mu)
x_center = input_x - loc
tmp = torch.matmul(x_center, scale.inverse())
tmp = -1 / 2 * torch.matmul(tmp, x_center)

first_term + tmp
tensor(-24.2842)
where I used the fact that |Σ| = 0.3³ for a diagonal covariance matrix with 0.3 on the diagonal.
My question is - what's the source of this discrepancy?

You are passing the covariance matrix to the scale_tril argument instead of covariance_matrix. From the docs of PyTorch's MultivariateNormal:
scale_tril (Tensor) – lower-triangular factor of covariance, with positive-valued diagonal
So, replacing scale_tril with covariance_matrix would yield the same results as your manual attempt.
In [1]: mvn = td.MultivariateNormal(loc = loc, covariance_matrix=scale)
In [2]: mvn.log_prob(input_x)
Out[2]: tensor(-24.2842)
However, it's more efficient to use scale_tril according to the authors:
...Using scale_tril will be more efficient:
You can calculate the lower Cholesky factor using torch.linalg.cholesky:
In [3]: mvn = td.MultivariateNormal(loc = loc, scale_tril=torch.linalg.cholesky(scale))
In [4]: mvn.log_prob(input_x)
Out[4]: tensor(-24.2842)
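To see concretely what the original call parameterised: with scale_tril = 0.3 * I, the implied covariance is scale_tril @ scale_tril.T = 0.09 * I rather than the intended 0.3 * I, which is exactly what produces the -76.9227 above. A quick check, reusing loc, scale and input_x from the question (mvn_wrong is just a name chosen here):
mvn_wrong = td.MultivariateNormal(loc=loc, scale_tril=scale)
print(mvn_wrong.covariance_matrix)   # diagonal entries are 0.09, not 0.3
print(mvn_wrong.log_prob(input_x))   # tensor(-76.9227), matching the value above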

Find the point of intersection of two linear equations using Numpy

The objective is to find the point of intersection of two linear equations. These two linear equations are derived using NumPy's polyfit function.
Given two time series (xLeft, yLeft) and (xRight, yRight), the linear least-squares fit to each of them was calculated using polyfit as shown below:
xLeft = [
6168, 6169, 6170, 6171, 6172, 6173, 6174, 6175, 6176, 6177,
6178, 6179, 6180, 6181, 6182, 6183, 6184, 6185, 6186, 6187
]
yLeft = [
0.98288751, 1.3639959, 1.7550986, 2.1539073, 2.5580614,
2.9651523, 3.3727503, 3.7784295, 4.1797948, 4.5745049,
4.9602985, 5.3350167, 5.6966233, 6.0432272, 6.3730989,
6.6846867, 6.9766307, 7.2477727, 7.4971657, 7.7240791
]
xRight = [
6210, 6211, 6212, 6213, 6214, 6215, 6216, 6217, 6218, 6219,
6220, 6221, 6222, 6223, 6224, 6225, 6226, 6227, 6228, 6229,
6230, 6231, 6232, 6233, 6234, 6235, 6236, 6237, 6238, 6239,
6240, 6241, 6242, 6243, 6244, 6245, 6246, 6247, 6248, 6249,
6250, 6251, 6252, 6253, 6254, 6255, 6256, 6257, 6258, 6259,
6260, 6261, 6262, 6263, 6264, 6265, 6266, 6267, 6268, 6269,
6270, 6271, 6272, 6273, 6274, 6275, 6276, 6277, 6278, 6279,
6280, 6281, 6282, 6283, 6284, 6285, 6286, 6287, 6288]
yRight=[
7.8625913, 7.7713094, 7.6833806, 7.5997391, 7.5211883,
7.4483986, 7.3819046, 7.3221073, 7.2692747, 7.223547,
7.1849418, 7.1533613, 7.1286001, 7.1103559, 7.0982385,
7.0917811, 7.0904517, 7.0936642, 7.100791, 7.1111741,
7.124136, 7.1389918, 7.1550579, 7.1716633, 7.1881566,
7.2039142, 7.218349, 7.2309117, 7.2410989, 7.248455,
7.2525721, 7.2530937, 7.249711, 7.2421637, 7.2302341,
7.213747, 7.1925621, 7.1665707, 7.1356878, 7.0998487,
7.0590014, 7.0131001, 6.9621005, 6.9059525, 6.8445964,
6.7779589, 6.7059474, 6.6284504, 6.5453324, 6.4564347,
6.3615761, 6.2605534, 6.1531439, 6.0391097, 5.9182019,
5.7901659, 5.6547484, 5.5117044, 5.360805, 5.2018456,
5.034656, 4.8591075, 4.6751242, 4.4826899, 4.281858,
4.0727611, 3.8556159, 3.6307325, 3.3985188, 3.1594861,
2.9142516, 2.6635408, 2.4081881, 2.1491354, 1.8874279,
1.6242117, 1.3607255, 1.0982931, 0.83831298
]
left_line = np.polyfit(xLeft, yLeft, 1)
right_line = np.polyfit(xRight, yRight, 1)
In this case, polyfit outputs the coefficients m and b of y = m*x + b, respectively.
The intersection of the two linear equations then can be calculated as follows:
x0 = -(left_line[1] - right_line[1]) / (left_line[0] - right_line[0])
y0 = x0 * left_line[0] + left_line[1]
However, I wonder whether there exists a NumPy built-in approach for the last two steps?
Not exactly a built-in approach, but you can simplify the problem. Say I have lines given by y = m1 * x + b1 and y = m2 * x + b2. You can trivially find an equation for the difference, which is also a line:
y = (m1 - m2) * x + (b1 - b2)
Notice that this line will have a root at the intersection of the two original lines, if they intersect. You can use the numpy.polynomial.Polynomial class to perform these operations:
>>> (np.polynomial.Polynomial(left_line[::-1]) - np.polynomial.Polynomial(right_line[::-1])).roots()
array([6192.0710885])
Notice that I had to swap the order of the coefficients, since Polynomial expects them from lowest to highest degree, while np.polyfit returns the opposite. In fact, np.polyfit is not recommended; instead, you can get Polynomial objects directly using the np.polynomial.Polynomial.fit class method. Your code would then look like:
left_line = np.polynomial.Polynomial.fit(xLeft, yLeft, 1, domain=[-1, 1])
right_line = np.polynomial.Polynomial.fit(xRight, yRight, 1, domain=[-1, 1])
x0 = (left_line - right_line).roots()
y0 = left_line(x0)
The domain is mapped to the window [-1, 1]. If you do not specify a domain, the interval spanned by the x-values is used instead, which rescales the input values; that is not what you want here. By explicitly specifying domain=[-1, 1], the domain maps to the identical window, so the coefficients stay in the original x units. An alternative would be to use the default domain and set e.g. window=[xLeft.min(), xLeft.max()]. The problem with that approach is that it would create different domains for the two polynomials, preventing the operation left_line - right_line.
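To make the effect of the domain concrete, here is a small sketch with made-up data (the names default_fit and identity_fit are chosen here): with the default domain the coefficients live in a rescaled variable, whereas domain=[-1, 1] keeps them in the original x units.
import numpy as np
from numpy.polynomial import Polynomial

x = np.array([10.0, 11.0, 12.0, 13.0])
y = 2.0 * x + 1.0

default_fit = Polynomial.fit(x, y, 1)                   # domain defaults to [10, 13]
identity_fit = Polynomial.fit(x, y, 1, domain=[-1, 1])  # domain maps to the window unchanged

print(default_fit.coef)    # coefficients of the rescaled variable, roughly [24., 3.]
print(identity_fit.coef)   # roughly [1., 2.], i.e. y = 1 + 2*x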
See https://numpy.org/doc/stable/reference/routines.polynomials.classes.html for more information.
You can model it as a linear system and use simple linear algebra:
def get_intersection(m1, b1, m2, b2):
    # solve the linear system A X = b, where X = [x, y]^T
    A = np.array([[-m1, 1], [-m2, 1]])
    b = np.array([[b1], [b2]])
    X = np.linalg.pinv(A) @ b
    x, y = np.round(np.squeeze(X), 4)
    return x, y  # point of intersection (x, y) with 4-decimal precision

m1, b1, m2, b2 = left_line[0], left_line[1], right_line[0], right_line[1]
print(get_intersection(m1, b1, m2, b2))
As an example, for the lines y - x = 1 and y + x = 1, we expect the intersection at (0, 1):
m1,b1,m2,b2 = 1, 1, -1, 1
print(get_intersection(m1,b1,m2,b2))
Output: (0.0, 1.0) as expected.
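If you prefer an exact solve to the pseudoinverse for this 2x2 system, np.linalg.solve does the same job. A minimal sketch under the same m/b convention (the helper name get_intersection_solve is chosen here):
def get_intersection_solve(m1, b1, m2, b2):
    # solve A X = b exactly; raises LinAlgError if the lines are parallel
    A = np.array([[-m1, 1.0], [-m2, 1.0]])
    b = np.array([b1, b2])
    x, y = np.linalg.solve(A, b)
    return x, y

print(get_intersection_solve(1, 1, -1, 1))   # (0.0, 1.0)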

Why does it work when columns are larger than rows in Python Sklearn (Linear Regression) [duplicate]

It's known that when the number of variables (p) is larger than the number of samples (n), the least-squares estimator is not uniquely defined.
In sklearn I receive this values:
In [30]: lm = LinearRegression().fit(xx,y_train)
In [31]: lm.coef_
Out[31]:
array([[ 0.20092363, -0.14378298, -0.33504391, ..., -0.40695124,
0.08619906, -0.08108713]])
In [32]: xx.shape
Out[32]: (1097, 3419)
I expected the call in [30] to raise an error. How does sklearn work when p > n, as in this case?
EDIT:
It seems that the b matrix is extended and filled with zeros:
if n > m:
    # need to extend b matrix as it will be filled with
    # a larger solution matrix
    if len(b1.shape) == 2:
        b2 = np.zeros((n, nrhs), dtype=gelss.dtype)
        b2[:m, :] = b1
    else:
        b2 = np.zeros(n, dtype=gelss.dtype)
        b2[:m] = b1
    b1 = b2
When the linear system is underdetermined, sklearn.linear_model.LinearRegression finds the minimum-L2-norm solution, i.e.
argmin_w l2_norm(w) subject to Xw = y
This is always well defined and obtainable by applying the pseudoinverse of X to y, i.e.
w = np.linalg.pinv(X).dot(y)
The specific implementation of scipy.linalg.lstsq, which is used by LinearRegression, uses get_lapack_funcs(('gelss',), ..., which is precisely a solver that finds the minimum-norm solution via singular value decomposition (provided by LAPACK).
Check out this example
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(5, 10)
y = rng.randn(5)
from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=False)
coef1 = lr.fit(X, y).coef_
coef2 = np.linalg.pinv(X).dot(y)
print(coef1)
print(coef2)
And you will see that coef1 == coef2. (Note that fit_intercept=False is specified in the constructor of the sklearn estimator, because otherwise it would subtract the mean of each feature before fitting the model, yielding different coefficients)
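As a further sanity check, here is a small sketch reusing X, y and coef1 from the snippet above (the names coef3 and w_alt are chosen here): scipy.linalg.lstsq, which LinearRegression delegates to, returns the same minimum-norm solution, and adding any null-space direction of X keeps Xw = y while only increasing the norm.
from scipy.linalg import lstsq, null_space

coef3, residues, rank, sv = lstsq(X, y)
print(np.allclose(coef1, coef3))                      # same coefficients as LinearRegression

w_alt = coef3 + null_space(X)[:, 0]                   # another exact solution of Xw = y
print(np.allclose(X @ w_alt, y))                      # still fits the data exactly
print(np.linalg.norm(w_alt) > np.linalg.norm(coef3))  # but with a larger norm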

Vectorize calculating all unit vectors for a set of points in numpy

I need to calculate all unit vectors between two sets of points.
Currently I have this:
import numpy as np

def all_unit_vectors(points_a, points_b):
    results = np.zeros((len(points_a) * len(points_b), 3), dtype=np.float32)
    count = 0
    for pt_a in points_a:
        for pt_b in points_b:
            results[count] = (pt_a - pt_b) / np.linalg.norm([pt_a, pt_b])
            count += 1
    return results
in_a = np.array([[51.34, 63.68, 7.98],
[53.16, 63.23, 7.19],
[77.50, 62.55, 4.23],
[79.54, 62.73, 3.61]])
in_b = np.array([[105.58, 61.09, 5.50],
[107.37, 60.66, 6.50],
[130.73, 58.30, 12.33],
[132.32, 58.48, 13.38]])
results = all_unit_vectors(in_a, in_b)
print(results)
which (correctly) outputs:
[[-0.368511 0.01759667 0.01684932]
[-0.3777128 0.02035861 0.00997707]
[-0.47964868 0.03250422 -0.02628129]
[-0.4851439 0.03115273 -0.03235091]
[-0.3551545 0.01449887 0.01145004]
[-0.3644423 0.01727756 0.00463872]
[-0.46762046 0.02971985 -0.03098581]
[-0.4732132 0.02839518 -0.03700341]
[-0.17814296 0.00926242 -0.00805704]
[-0.18821244 0.01190899 -0.01430339]
[-0.3044056 0.02430441 -0.04632135]
[-0.31113514 0.0230996 -0.05193153]
[-0.16408844 0.0103343 -0.01190965]
[-0.1741932 0.01295652 -0.01808905]
[-0.29113463 0.02519489 -0.04959355]
[-0.29793915 0.02399093 -0.05515092]]
Can the loops in all_unit_vectors() be vectorized?
The norm here is calculated as the square root of a sum of squares, so you can implement the norm calculation yourself and then vectorize the whole solution with broadcasting:
diff = (in_a[:, None] - in_b).reshape(-1, 3)
norm = ((in_a[:, None] ** 2 + in_b ** 2).sum(2) ** 0.5).reshape(-1, 1)
diff / norm
gives:
[[-0.36851098 0.01759667 0.01684932]
[-0.3777128 0.02035861 0.00997706]
[-0.47964868 0.03250422 -0.02628129]
[-0.4851439 0.03115273 -0.03235091]
[-0.35515452 0.01449887 0.01145004]
[-0.36444229 0.01727756 0.00463872]
[-0.46762047 0.02971985 -0.03098581]
[-0.4732132 0.02839518 -0.03700341]
[-0.17814297 0.00926242 -0.00805704]
[-0.18821243 0.01190899 -0.01430339]
[-0.30440561 0.02430441 -0.04632135]
[-0.31113513 0.0230996 -0.05193153]
[-0.16408845 0.0103343 -0.01190965]
[-0.1741932 0.01295652 -0.01808905]
[-0.29113461 0.02519489 -0.04959355]
[-0.29793917 0.02399093 -0.05515092]]
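If what you actually want is each difference divided by its own Euclidean length, so that every row is a true unit vector (a different norm convention than the one reproduced above), np.linalg.norm with an axis argument vectorises that directly. A short sketch reusing in_a and in_b:
diff = (in_a[:, None] - in_b).reshape(-1, 3)
unit = diff / np.linalg.norm(diff, axis=1, keepdims=True)
print(np.linalg.norm(unit, axis=1))   # every row has length 1.0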

numpy vectorized approach to regression - multiple dependent columns (x) on a single independent column (y)

Consider the below (3, 13) np.array:
import numpy as np
import pandas as pd
from scipy.stats import linregress

a = [-0.00845,-0.00568,-0.01286,-0.01302,-0.02212,-0.01501,-0.02132,-0.00783,-0.00942,0.00158,-0.00016,0.01422,0.01241]
b = [0.00115,0.00623,0.00160,0.00660,0.00951,0.01258,0.00787,0.01854,0.01462,0.01479,0.00980,0.00607,-0.00106]
c = [-0.00233,-0.00467,0.00000,0.00000,-0.00952,-0.00949,-0.00958,-0.01696,-0.02212,-0.01006,-0.00270,0.00763,0.01005]
array = np.array([a, b, c])

yvalues = pd.to_datetime(['2019-12-15','2019-12-16','2019-12-17','2019-12-18','2019-12-19','2019-12-22','2019-12-23','2019-12-24',
                          '2019-12-25','2019-12-26','2019-12-29','2019-12-30','2019-12-31'], errors='coerce')
I can run the OLS regression on one column at a time successfully, as in below:
out = linregress(array[0], y=yvalues.to_julian_date())
print(out)
LinregressResult(slope=329.141087037396, intercept=2458842.411731361, rvalue=0.684426534581417, pvalue=0.009863937200252878, stderr=105.71465449878443)
However, what I wish to accomplish is to run the regression on the whole matrix in one go, with the 'y' variable (yvalues) held constant for all columns (a loop is a possible solution, but tiresome). I tried to extend yvalues to match the array shape with np.tile, but that seems not to be the right approach. Thank you all for your help.
IIUC you are looking for something like the following list comprehension in a vectorized way:
out = [linregress(array[i], y=yvalues.to_julian_date()) for i in range(array.shape[0])]
out
[LinregressResult(slope=329.141087037396, intercept=2458842.411731361, rvalue=0.684426534581417, pvalue=0.009863937200252876, stderr=105.71465449878443),
LinregressResult(slope=178.44888292241782, intercept=2458838.7056912296, rvalue=0.1911788042719021, pvalue=0.5315353013148307, stderr=276.24376878908953),
LinregressResult(slope=106.86168938856262, intercept=2458840.7656617565, rvalue=0.17721031419860186, pvalue=0.5624701260912525, stderr=178.940293876864)]
To be honest I've never seen what you are looking for implemented using scipy or statsmodels functionalities.
Therefore we can implement it ourselves exploiting numpy broadcasting:
x = array
y = np.array(yvalues.to_julian_date())

# means of the inputs and of the (shared) output
x_mean = np.mean(x, axis=1)
y_mean = np.mean(y)

# least-squares slope and intercept for each row of x against the shared y
num = np.sum((x - x_mean[:, np.newaxis]) * (y - y_mean), axis=1)
den = np.sum((x - x_mean[:, np.newaxis]) ** 2, axis=1)
slopes = num / den
intercepts = y_mean - slopes * x_mean
slopes
array([329.14108704, 178.44888292, 106.86168939])
intercepts
array([2458842.41173136, 2458838.70569123, 2458840.76566176])
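If you also need the correlation coefficient that linregress reports, the same broadcasting pattern extends to it. A sketch reusing x, y, x_mean and y_mean from above (the names ss_xy, ss_xx, ss_yy are chosen here):
# vectorised Pearson r for every row of x against the shared y
ss_xy = np.sum((x - x_mean[:, np.newaxis]) * (y - y_mean), axis=1)
ss_xx = np.sum((x - x_mean[:, np.newaxis]) ** 2, axis=1)
ss_yy = np.sum((y - y_mean) ** 2)
rvalues = ss_xy / np.sqrt(ss_xx * ss_yy)
print(rvalues)   # roughly [0.684, 0.191, 0.177], matching the linregress output above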

Root mean square of a function in python

I want to calculate the root mean square of a function in Python. My function has a simple form like y = f(x), where x and y are arrays.
I searched the NumPy and SciPy docs and couldn't find anything.
I'm going to assume that you want to compute the expression given by the following pseudocode:
ms = 0
for i = 1 ... N
ms = ms + y[i]^2
ms = ms / N
rms = sqrt(ms)
i.e. the square root of the mean of the squared values of elements of y.
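Written as a formula, the pseudocode above computes
rms(y) = sqrt( (1/N) * Σ_{i=1..N} y[i]² ).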
In numpy, you can simply square y, take its mean and then its square root as follows:
rms = np.sqrt(np.mean(y**2))
So, for example:
>>> y = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 1]) # Six 1's
>>> y.size
10
>>> np.mean(y**2)
0.59999999999999998
>>> np.sqrt(np.mean(y**2))
0.7745966692414834
Do clarify your question if you mean to ask something else.
You could use the sklearn function:
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_actual, [0 for _ in y_actual], squared=False)
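This works because the squared error against an all-zeros prediction is just y squared. A quick hypothetical check, reusing the example array from the first answer and assuming a scikit-learn version that still accepts squared=False:
import numpy as np
from sklearn.metrics import mean_squared_error

y_actual = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 1])
print(mean_squared_error(y_actual, np.zeros_like(y_actual), squared=False))  # 0.7745966692414834
print(np.sqrt(np.mean(y_actual ** 2)))                                       # same value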
numpy.std(x) tends to rms(x) when mean(x) tends to 0 (thanks to @Seb), as is often the case with sound recordings, vibrations, and other signals that fluctuate around zero.
rms = lambda x_seq: (sum(x*x for x in x_seq)/len(x_seq))**(1/2)
In case you'd like to frame your array before computing the RMS, this is a numpy solution:
nframes = 1000
rms = np.array([
    np.sqrt(np.mean(frame ** 2))
    for frame in np.array_split(arr, nframes)
])
If you'd like to specify the frame length instead of the frame count, you'd do this first:
frame_length = 200
arr_length = arr.shape[0]
nframes = -(-arr_length // frame_length)  # ceiling division, so each frame is at most frame_length long
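Putting the two snippets together, a minimal runnable sketch with a made-up signal arr:
import numpy as np

rng = np.random.default_rng(0)
arr = rng.standard_normal(1000)                 # hypothetical 1-D signal

frame_length = 200
nframes = -(-arr.shape[0] // frame_length)      # ceiling division

rms_per_frame = np.array([
    np.sqrt(np.mean(frame ** 2))
    for frame in np.array_split(arr, nframes)
])
print(rms_per_frame.shape)                      # (5,)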
