I am trying to call scipy.stats.multivariate_normal with four different sets of parameters mu and sigma, and then, for each generated probability density function, call that pdf on an array of, say, 10 values.
For simplicity, let's say the above-mentioned function is addXY:
def addXY(x, y):
    return x + y
params = [[1, 2], [1, 3], [1, 4], [1, 5]]  # mu and sigma, four versions
inputs = [1, 2, 3]  # values, in this case 3 of them

matrix = []
for pdf_params in params:
    row = []
    for inp in inputs:
        entry = addXY(*pdf_params)
        row.append(entry * inp)
    matrix.append(row)
print(matrix)
Is this pythonic?
Is there a way to pass params and inputs and get a matrix with all combinations in it that is more pythonic/vectorized/faster?
Important note: the inputs in the example are scalar values (I set scalar values to simplify the problem description; I am actually using an array of n-dimensional vectors, hence the multivariate_normal pdf).
Hints and tips about similar operations are welcome.
Based on your description of what you are trying to compute, you don't need multivariate_normal: you are calling the PDF with scalar values for a distribution with a scalar mu and sigma, so you can use the pdf() method of scipy.stats.norm. That method broadcasts its arguments, so by passing in arrays with the proper shapes, you can compute the PDF for all the values of mu and sigma in one call. Here's an example.
Here are your x values (you called them inputs), and the parameters:
In [23]: x = np.array([1, 2, 3])
In [24]: params = np.array([[1, 2], [1, 3], [1, 4], [1, 5]])
For convenience, separate the parameters into arrays of mu and sigma values.
In [25]: mu = params[:, 0]
In [26]: sig = params[:, 1]
We'll use scipy.stats.norm to compute the PDF.
In [27]: from scipy.stats import norm
This call computes the PDF for the desired combinations of x and parameters. mu.reshape(-1, 1) and sig.reshape(-1, 1) are 2D arrays with shape (4, 1). x has shape (3,), so when these arguments are broadcast, the result has shape (4, 3). Each row is the PDF evaluated at x for one of the pairs of mu and sigma.
In [28]: p = norm.pdf(x, loc=mu.reshape(-1, 1), scale=sig.reshape(-1, 1))
In [29]: p
Out[29]:
array([[ 0.19947114,  0.17603266,  0.12098536],
       [ 0.13298076,  0.12579441,  0.10648267],
       [ 0.09973557,  0.09666703,  0.08801633],
       [ 0.07978846,  0.07820854,  0.07365403]])
In other words, the rows of p are:
norm.pdf(x, loc=mu[0], scale=sig[0])
norm.pdf(x, loc=mu[1], scale=sig[1])
norm.pdf(x, loc=mu[2], scale=sig[2])
norm.pdf(x, loc=mu[3], scale=sig[3])
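For the actual use case in the question (n-dimensional inputs and multivariate_normal), the pdf call broadcasts over the array of points but not over the mean/cov parameters, so a minimal sketch (with made-up example parameters) still loops over the parameter sets and stacks the rows:
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical example data: four (mean, cov) pairs for 2-d distributions,
# and ten 2-d input vectors.
rng = np.random.default_rng(0)
param_sets = [(np.full(2, k), np.eye(2) * (k + 1)) for k in range(1, 5)]
xs = rng.normal(size=(10, 2))

# pdf() vectorizes over the rows of xs, so one call per parameter set suffices.
p = np.array([multivariate_normal(mean=m, cov=c).pdf(xs) for m, c in param_sets])
print(p.shape)  # (4, 10): one row per (mean, cov) pair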
This is just an idea for shortening the code and making more use of the libraries. Your code doesn't actually use numpy or scipy at all; the question is whether you want a numpy.array for further data processing.
Option 1: just use a list to represent an array, and a list of lists to represent a matrix:
from itertools import product

matrix_list = [sum(param) * input_x for param, input_x in product(params, inputs)]
matrix = list(zip(*[iter(matrix_list)] * len(inputs)))
print(matrix)
Credit for the zip trick goes to convert a flat list to list of list in python.
Option 2: use numpy.array and numpy.matrix for further processing
from itertools import product
import numpy as np

matrix_array = np.array([sum(param) * input_x for param, input_x in product(params, inputs)])
matrix = matrix_array.reshape(len(params), len(inputs))
print(matrix)
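For this particular toy example, note that the whole matrix is just an outer product of the row sums of params with inputs, so (as a sketch of the fully vectorized route) no explicit product loop is needed at all:
import numpy as np

params = np.array([[1, 2], [1, 3], [1, 4], [1, 5]])
inputs = np.array([1, 2, 3])
# matrix[i][j] == sum(params[i]) * inputs[j], i.e. an outer product
matrix = np.outer(params.sum(axis=1), inputs)
print(matrix)
# [[ 3  6  9]
#  [ 4  8 12]
#  [ 5 10 15]
#  [ 6 12 18]]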
Related
I created a cosine similarity method which gives the correct results when called with individual vectors, but when I supply a list of vectors I suddenly get different results. Isn't numpy supposed to apply the formula to every element in the list? Is my understanding wrong?
Cosine similarity:
def cosine_similarity(vec1, vec2):
    return np.inner(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
Example:
a = [1, 2, 3]
b = [4, 5, 6]
print(cosine_similarity(a, a), cosine_similarity(a, b), cosine_similarity(a, [a, b]))
With the result:
1.0 0.9746318461970762 [0.39223227 0.8965309 ]
The first two values are correct; the batched result should contain the same values, but doesn't.
Is this just not possible or do I have to change something?
Your understanding is actually correct. Many functions in numpy accept an axis keyword argument, and np.linalg.norm, for example, computes the norm along the specified axis. In your case, since no axis is specified, norm calculates the norm of the whole 2x3 matrix [a, b] instead of calculating the norm per row.
To fix the code just do the following:
def cosine_similarity(vec1, vec2):
    return np.inner(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2, axis=-1))
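A quick self-contained check of the fixed function (repeating it here so the snippet runs on its own); the batched call now matches the individual calls from the question:
import numpy as np

def cosine_similarity(vec1, vec2):
    return np.inner(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2, axis=-1))

a = [1, 2, 3]
b = [4, 5, 6]
print(cosine_similarity(a, a))       # 1.0
print(cosine_similarity(a, b))       # 0.9746318461970762
print(cosine_similarity(a, [a, b]))  # [1.         0.97463185]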
I need to generate a multivariate normal distribution using only a generator of single random values, without the scipy or numpy generators. Concretely, I need to generate samples with mean A = [1, 2] and covariance V = [[1, 2], [2, 5]], as in the code below.
This is my attempt
import numpy as np

V = np.array([
    [1, 2],
    [2, 5]])
B = np.linalg.cholesky(V)
A = np.array([1, 2])
# norm() returns one number from the standard normal distribution
n1 = np.array([norm() for _ in range(40)])
n2 = np.array([norm() for _ in range(40)])
np.array([n1, n2]).T.dot(B) + A
Here, I used Cholesky decomposition as in this post
However, I reckon this is not correct.
Your code is almost correct, but you can check that your numbers don't have the desired covariance property by applying numpy's cov function:
res = np.array([n1, n2]).T.dot(B) + A
np.cov(res.T).round()
# returns ~
# array([[5., 2.],
#        [2., 1.]])
Note that the (1,1) and (2,2) elements are exchanged compared to the desired values.
To leverage numpy's vectorized matrix multiplication you use its dot function, and you properly arranged the N two-dimensional input vectors z into an Nx2 array, np.array([n1, n2]).T. But, as the Cholesky decomposition and variance question points out, the z values have to be multiplied by B from the left, and here lies the problem: np.array([n1, n2]).T.dot(B) multiplies the (array of) z by B from the right, not from the left. Since (Bz)^T = z^T B^T, computing the left product while keeping the Nx2 layout requires dot(B.T).
This example also shows that the covariance matrix now has the right form:
import random
import numpy as np

random.seed(0)
N = 10000
V = np.array([
    [1, 2],
    [2, 5]])
B = np.linalg.cholesky(V)
A = np.array([1, 2])
# random.gauss(0, 1) returns one number from the standard normal distribution
n1 = np.array([random.gauss(0, 1) for _ in range(N)])
n2 = np.array([random.gauss(0, 1) for _ in range(N)])
res = np.array([n1, n2]).T.dot(B.T) + A
np.cov(res.T).round()
# returns ~ array([[1., 2.],
#                  [2., 5.]])
In the figure below, the random points are plotted together with the eigenvectors of the covariance matrix, each scaled by the square root of its eigenvalue, as on Wikipedia.
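Equivalently, since (Bz)^T = z^T B^T, you can keep the z vectors as columns and multiply by B from the left directly. A minimal sketch (using NumPy's generator here only for brevity; under the question's constraint you would keep the random.gauss loop):
import numpy as np

rng = np.random.default_rng(0)
V = np.array([[1, 2],
              [2, 5]])
B = np.linalg.cholesky(V)            # lower triangular, B @ B.T == V
A = np.array([1, 2])

Z = rng.standard_normal((2, 10000))  # each column is one standard-normal vector z
res = (B @ Z).T + A                  # left-multiply by B, then shift by the mean
print(np.cov(res.T).round())         # ~ [[1. 2.]
                                     #    [2. 5.]]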
I have a matrix (numpy 2d array) in which each row is a valid probability distribution. I have another vector (numpy 1d array), again a prob dist. I need to compute KL divergence between each row of the matrix and the vector. Is it possible to do this without using for loops?
This question asks the same thing, but none of the answers solve my problem. One of them suggests using a for loop, which I want to avoid since I have a lot of data. Another provides a solution in tensorflow, but I want one for numpy arrays.
scipy.stats.entropy computes the KL divergence between two vectors, but I couldn't figure out how to use it when one of them is a matrix.
The function scipy.stats.entropy can, in fact, do the vectorized calculation, but you have to reshape the arguments appropriately for it to work. When the inputs are two-dimensional arrays, entropy expects the columns to hold the probability vectors. In the case where p is two-dimensional and q is one-dimensional, a trivial dimension must be added to q to make the arguments compatible for broadcasting.
Here's an example. First, the imports:
In [10]: import numpy as np
In [11]: from scipy.stats import entropy
Create a two-dimensional p whose rows are the probability vectors, and a one-dimensional probability vector q:
In [12]: np.random.seed(8675309)
In [13]: p = np.random.rand(3, 5)
In [14]: p /= p.sum(axis=1, keepdims=True)
In [15]: q = np.random.rand(5)
In [16]: q /= q.sum()
In [17]: p
Out[17]:
array([[0.32085531, 0.29660176, 0.14113073, 0.07988999, 0.1615222 ],
       [0.05870513, 0.15367858, 0.29585406, 0.01298657, 0.47877566],
       [0.1914319 , 0.29324935, 0.1093297 , 0.17710131, 0.22888774]])
In [18]: q
Out[18]: array([0.06804561, 0.35392387, 0.29008139, 0.04580467, 0.24214446])
For comparison with the vectorized result, here's the result computed using a Python loop.
In [19]: [entropy(t, q) for t in p]
Out[19]: [0.32253909299531597, 0.17897138916539493, 0.2627905326857023]
To make entropy do the vectorized calculation, the columns of the first argument must be the probability vectors, so we'll transpose p. Then, to make q compatible with p.T, we'll reshape it into a two-dimensional array with shape (5, 1) (i.e. it contains a single column):
In [20]: entropy(p.T, q.reshape(-1, 1))
Out[20]: array([0.32253909, 0.17897139, 0.26279053])
Note: it is tempting to use q.T as the second argument, but that won't work. In NumPy, the transpose operation only swaps the lengths of existing dimensions; it never creates new dimensions, so the transpose of a one-dimensional array is itself. That is, q.T has the same shape as q.
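A quick shape check makes the note concrete:
In [21]: q.shape, q.T.shape
Out[21]: ((5,), (5,))
In [22]: q.reshape(-1, 1).shape
Out[22]: (5, 1)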
Older version of this answer follows...
You can use scipy.special.kl_div or scipy.special.rel_entr to do this. Here's an example.
In [17]: import numpy as np
...: from scipy.stats import entropy
...: from scipy.special import kl_div, rel_entr
Make p and q for the example.
p has shape (3, 5); the rows are the probability distributions. q is a 1-d array with length 5.
In [18]: np.random.seed(8675309)
...: p = np.random.rand(3, 5)
...: p /= p.sum(axis=1, keepdims=True)
...: q = np.random.rand(5)
...: q /= q.sum()
This is the calculation that you want, using a Python loop and scipy.stats.entropy. I include this here so the result can be compared to the vectorized calculation below.
In [19]: [entropy(t, q) for t in p]
Out[19]: [0.32253909299531597, 0.17897138916539493, 0.2627905326857023]
We have constructed p and q so that the probability vectors each sum to 1. In this case, the above result can also be computed in a vectorized calculation with scipy.special.rel_entr or scipy.special.kl_div. (I recommend rel_entr; kl_div adds and subtracts additional terms that ultimately cancel out in the sum, so it does a bit more work than necessary.) These functions compute only the point-wise part of the calculation; you have to sum the result to get the actual entropy or divergence.
In [20]: rel_entr(p, q).sum(axis=1)
Out[20]: array([0.32253909, 0.17897139, 0.26279053])
In [21]: kl_div(p, q).sum(axis=1)
Out[21]: array([0.32253909, 0.17897139, 0.26279053])
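For reference, this agrees with computing the definition KL(p, q) = sum(p * log(p/q)) directly:
In [22]: np.sum(p * np.log(p / q), axis=1)
Out[22]: array([0.32253909, 0.17897139, 0.26279053])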
I would like a numpy-sh way of vectorizing the calculation of eigenvalues, such that I can feed it a matrix of matrices and it would return a matrix of the respective eigenvalues.
For example, in the code below, B is the block 6x6 matrix composed of 4 copies of the 3x3 matrix A.
C is what I would like to see as output, i.e. an array of dimension (2,2,3) (because A has 3 eigenvalues).
This is of course a very simplified example, in the general case the matrices A can have any size (although they are still square), and the matrix B is not necessarily formed of copies of A, but different A1, A2, etc (all of same size but containing different elements).
import numpy as np
A = np.array([[0, 1, 0],
              [0, 2, 0],
              [0, 0, 3]])
B = np.bmat([[A, A], [A, A]])
C = np.array([[np.linalg.eigvals(B[0:3, 0:3]), np.linalg.eigvals(B[0:3, 3:6])],
              [np.linalg.eigvals(B[3:6, 0:3]), np.linalg.eigvals(B[3:6, 3:6])]])
Edit: if you're using a version of numpy >= 1.8.0, then np.linalg.eigvals operates over the last two dimensions of whatever array you hand it, so if you reshape your input to an (n_subarrays, nrows, ncols) array you'll only have to call eigvals once:
import numpy as np
A = np.array([[0, 1, 0],
              [0, 2, 0],
              [0, 0, 3]])

# the input needs to be an array, since matrices can only be 2D
B = np.repeat(A[np.newaxis, ...], 4, 0)

# for arbitrary input arrays you could do something like:
#   B = np.vstack([a[np.newaxis, ...] for a in input_arrays])
# but for this to work, each element in 'input_arrays' must have the same shape

# eigvals operates over the last two dimensions of the array and returns
# a (4, 3) array of eigenvalues
C = np.linalg.eigvals(B)

# reshape this output so that it matches your original example
C.shape = (2, 2, 3)
If your input arrays don't all have the same dimensions, e.g. input_arrays[0].shape == (2, 2), input_arrays[1].shape == (3, 3) etc. then you could only vectorize this calculation across subsets with matching dimensions.
If you're using an older version of numpy then unfortunately I don't think there's any way to vectorize the calculation of the eigenvalues over multiple input arrays - you'll just have to loop over your inputs in Python instead.
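For concreteness, here is a sketch of the >= 1.8.0 approach with four distinct (hypothetical) blocks of equal size, rather than copies of one A:
import numpy as np

# four distinct 3x3 blocks; any equal-size square matrices work
A1 = np.diag([1., 2., 3.])
A2 = np.diag([4., 5., 6.])
A3 = np.diag([7., 8., 9.])
A4 = np.diag([10., 11., 12.])

B = np.array([A1, A2, A3, A4])  # shape (4, 3, 3)
C = np.linalg.eigvals(B)        # one call, shape (4, 3)
C = C.reshape(2, 2, 3)          # match the (2, 2, 3) layout from the question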
You could just do something like this
C = np.array([[np.linalg.eigvals(B[i:i+3, j:j+3])
               for j in range(0, B.shape[1], 3)]
              for i in range(0, B.shape[0], 3)])
Perhaps a nicer approach is to use the block_view function from https://stackoverflow.com/a/5078155/1352250:
B_blocks = block_view(B)
C = np.array([[np.linalg.eigvals(m) for m in v] for v in B_blocks])
Update
As ali_m points out, this method is a form of syntactic sugar that will not reduce the overhead incurred by calling eigvals a large number of times. While that overhead should be small if each matrix it is applied to is largish, it is not trivial for the 6x6 matrices the OP is interested in (see the comments below; according to ali_m, there may be about a factor of three between the version I give above and the version he posted that uses Numpy >= 1.8.0).
Suppose I have two vectors of length 25, and I want to compute their covariance matrix. I try doing this with numpy.cov, but always end up with a 2x2 matrix.
>>> import numpy as np
>>> x=np.random.normal(size=25)
>>> y=np.random.normal(size=25)
>>> np.cov(x,y)
array([[ 0.77568388,  0.15568432],
       [ 0.15568432,  0.73839014]])
Using the rowvar flag doesn't help either - I get exactly the same result.
>>> np.cov(x,y,rowvar=0)
array([[ 0.77568388,  0.15568432],
       [ 0.15568432,  0.73839014]])
How can I get the 25x25 covariance matrix?
You have two vectors, not 25. The computer I'm on doesn't have Python so I can't test this, but try:
z = list(zip(x, y))
np.cov(z)
Of course, what you really want is probably more like:
n = 100          # number of points in each vector
num_vects = 25
vals = []
for _ in range(num_vects):
    vals.append(np.random.normal(size=n))
np.cov(vals)
This takes the covariance (I think/hope) of num_vects vectors of length n.
Try this:
import numpy as np
x=np.random.normal(size=25)
y=np.random.normal(size=25)
z = np.vstack((x, y))
c = np.cov(z.T)
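A shape check clarifies why this gives 25x25: z.T has 25 rows, and with np.cov's default rowvar=True each row is treated as a variable with just the two observations x[i] and y[i]:
import numpy as np

x = np.random.normal(size=25)
y = np.random.normal(size=25)
z = np.vstack((x, y))      # shape (2, 25)
c = np.cov(z.T)            # 25 "variables", 2 observations each
print(z.T.shape, c.shape)  # (25, 2) (25, 25)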
Covariance matrix from sample vectors
To clear up the small confusion about what a covariance matrix computed from two N-dimensional vectors is, note that there are two possibilities.
The question you have to ask yourself is whether you consider:
each vector as N realizations/samples of one single variable (for example two 3-dimensional vectors [X1,X2,X3] and [Y1,Y2,Y3], where you have 3 realizations for the variables X and Y respectively)
each vector as 1 realization for N variables (for example two 3-dimensional vectors [X1,Y1,Z1] and [X2,Y2,Z2], where you have 1 realization for the variables X,Y and Z per vector)
Since a covariance matrix is, intuitively, the generalization of variance to pairs of different variables:
in the first case, you have 2 variables, N example values for each, so you end up with a 2x2 matrix where the covariances are computed thanks to N samples per variable
in the second case, you have N variables, 2 samples for each, so you end up with a NxN matrix
About the actual question, using numpy
if you consider that you have 25 variables per vector (using 3 instead of 25 to simplify the example code), i.e. one realization of several variables per vector, use rowvar=0:
import numpy
# [X1, Y1, Z1]
X_realization1 = [1, 2, 3]
# [X2, Y2, Z2]
X_realization2 = [2, 1, 8]
numpy.cov([X_realization1, X_realization2], rowvar=0)  # rowvar=0: each column is a variable
Code returns, considering 3 variables:
array([[ 0.5, -0.5,  2.5],
       [-0.5,  0.5, -2.5],
       [ 2.5, -2.5, 12.5]])
otherwise, if you consider that one vector is 25 samples of one variable, use rowvar=1 (numpy's default):
# [X1, X2, X3]
X = [1, 2, 3]
# [Y1, Y2, Y3]
Y = [2, 1, 8]
numpy.cov([X, Y], rowvar=1)  # rowvar=1 (default): each row is a variable
Code returns, considering 2 variables:
array([[ 1.        ,  3.        ],
       [ 3.        , 14.33333333]])
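The two conventions are just transposes of each other's input, which a quick check confirms:
import numpy

M = numpy.array([[1, 2, 3],
                 [2, 1, 8]])
# columns-as-variables on M equals rows-as-variables on M.T
print(numpy.allclose(numpy.cov(M, rowvar=0), numpy.cov(M.T, rowvar=1)))  # True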
Reading the documentation, via
>>> print(np.cov.__doc__)
or looking at Numpy Covariance: Numpy treats each row of the array as a separate variable, so you have two variables and hence you get a 2x2 covariance matrix.
I think the previous post has the right solution; here is the explanation. :-)
I suppose what you're looking for is actually a covariance function, i.e. a function of the time lag. I compute the autocovariance like this:
def autocovariance(Xi, N, k):
    Xs = np.average(Xi)
    aCov = 0.0
    for i in np.arange(0, N - k):
        aCov = (Xi[i + k] - Xs) * (Xi[i] - Xs) + aCov
    return (1. / N) * aCov

autocov[h] = autocovariance(my_vector, N, h)
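A vectorized sketch of the same quantity (same 1/N normalization, no inner Python loop):
import numpy as np

def autocovariance_vec(x, k):
    # biased (1/N) sample autocovariance at lag k
    x = np.asarray(x, dtype=float)
    xm = x - x.mean()
    return np.dot(xm[k:], xm[:len(x) - k]) / len(x)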
You should change
np.cov(x, y, rowvar=0)
to
np.cov((x, y), rowvar=0)
What you got (2 by 2) is more useful than 25 by 25: the covariance of X and Y is the off-diagonal entry of that symmetric matrix.
If you insist on 25 by 25, which I think is not useful, you can write out the definition:
x = np.random.normal(size=25).reshape(25, 1)  # reshape to make it a 2-d array
y = np.random.normal(size=25).reshape(25, 1)
cov = np.matmul(x - np.mean(x), (y - np.mean(y)).T) / len(x)
As pointed out above, you only have two vectors so you'll only get a 2x2 cov matrix.
IIRC the two main diagonal terms will be sum((x - mean(x))**2) / (n - 1), and similarly for y.
The two off-diagonal terms will be sum((x - mean(x)) * (y - mean(y))) / (n - 1). n = 25 in this case.
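A quick numerical check of those formulas against np.cov (a sketch):
import numpy as np

x = np.random.normal(size=25)
y = np.random.normal(size=25)
n = len(x)
c = np.cov(x, y)
print(np.isclose(c[0, 0], np.sum((x - x.mean())**2) / (n - 1)))                # True
print(np.isclose(c[0, 1], np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)))  # True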
According to the documentation, you should expect the variable vectors in columns:
"If we examine N-dimensional samples, X = [x1, x2, ..., xn]^T"
though later it says each row is a variable:
"Each row of m represents a variable."
so you need to pass in your matrix transposed:
x = np.random.normal(size=25)
y = np.random.normal(size=25)
X = np.array([x, y])  # shape (2, 25)
np.cov(X.T)           # rows of X.T are treated as variables -> 25x25
and according to Wikipedia (https://en.wikipedia.org/wiki/Covariance_matrix), X is a column vector of variables:
X = [X1, X2, ..., Xn]^T
COV = E[X X^T] - μx μx^T, where μx = E[X]
you can implement it yourself:
# each column of X is a variable, each row an observation
X = X - X.mean(axis=0)
h, w = X.shape
COV = X.T @ X / (h - 1)
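A quick check that this matches np.cov when columns are the variables and rows the observations (a sketch):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # 100 observations of 3 variables (columns)
Xc = X - X.mean(axis=0)
COV = Xc.T @ Xc / (X.shape[0] - 1)
print(np.allclose(COV, np.cov(X, rowvar=False)))  # True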
I don't think you understand the definition of a covariance matrix. If you need a 25 x 25 covariance matrix, you need 25 vectors, each with n data points.