Compute KL divergence between rows of a matrix and a vector - python

I have a matrix (numpy 2d array) in which each row is a valid probability distribution. I have another vector (numpy 1d array) that is also a probability distribution. I need to compute the KL divergence between each row of the matrix and the vector. Is it possible to do this without using for loops?
This question asks the same thing, but none of the answers solve my problem. One of them suggests using a for loop, which I want to avoid since I have a large amount of data. Another answer provides a solution in TensorFlow, but I need one that works with NumPy arrays.
scipy.stats.entropy computes the KL divergence between 2 vectors, but I couldn't work out how to use it when one of them is a matrix.

The function scipy.stats.entropy can, in fact, do the vectorized calculation, but you have to reshape the arguments appropriately for it to work. When the inputs are two-dimensional arrays, entropy expects the columns to hold the probability vectors. In the case where p is two-dimensional and q is one-dimensional, a trivial dimension must be added to q to make the arguments compatible for broadcasting.
Here's an example. First, the imports:
In [10]: import numpy as np
In [11]: from scipy.stats import entropy
Create a two-dimensional p whose rows are the probability vectors, and a one-dimensional probability vector q:
In [12]: np.random.seed(8675309)
In [13]: p = np.random.rand(3, 5)
In [14]: p /= p.sum(axis=1, keepdims=True)
In [15]: q = np.random.rand(5)
In [16]: q /= q.sum()
In [17]: p
Out[17]:
array([[0.32085531, 0.29660176, 0.14113073, 0.07988999, 0.1615222 ],
       [0.05870513, 0.15367858, 0.29585406, 0.01298657, 0.47877566],
       [0.1914319 , 0.29324935, 0.1093297 , 0.17710131, 0.22888774]])
In [18]: q
Out[18]: array([0.06804561, 0.35392387, 0.29008139, 0.04580467, 0.24214446])
For comparison with the vectorized result, here's the result computed using a Python loop.
In [19]: [entropy(t, q) for t in p]
Out[19]: [0.32253909299531597, 0.17897138916539493, 0.2627905326857023]
To make entropy do the vectorized calculation, the columns of the first argument must be the probability vectors, so we'll transpose p. Then, to make q compatible with p.T, we'll reshape it into a two-dimensional array with shape (5, 1) (i.e. it contains a single column):
In [20]: entropy(p.T, q.reshape(-1, 1))
Out[20]: array([0.32253909, 0.17897139, 0.26279053])
Note: It is tempting to use q.T as the second argument, but that won't work. In NumPy, the transpose operation only swaps the lengths of existing dimensions--it never creates new dimensions. So the transpose of a one-dimensional array is itself. That is, q.T is the same shape as q.
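The point about q.T can be checked directly. This is a minimal sketch with a made-up four-element vector (not the q from the session above):

```python
import numpy as np

# Hypothetical 1-D probability vector, for illustration only.
q = np.array([0.1, 0.2, 0.3, 0.4])

# Transposing a 1-D array leaves the shape unchanged.
shape_after_transpose = q.T.shape

# Either of these adds the trailing length-1 axis that makes q a column.
q_col = q.reshape(-1, 1)
q_col_alt = q[:, np.newaxis]

print(shape_after_transpose, q_col.shape, q_col_alt.shape)
```

Both reshape(-1, 1) and indexing with np.newaxis give the same (4, 1) column here; which one to use is a matter of taste.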
Older version of this answer follows...
You can use scipy.special.kl_div or scipy.special.rel_entr to do this. Here's an example.
In [17]: import numpy as np
...: from scipy.stats import entropy
...: from scipy.special import kl_div, rel_entr
Make p and q for the example.
p has shape (3, 5); the rows are the probability distributions. q is a 1-d array with length 5.
In [18]: np.random.seed(8675309)
...: p = np.random.rand(3, 5)
...: p /= p.sum(axis=1, keepdims=True)
...: q = np.random.rand(5)
...: q /= q.sum()
This is the calculation that you want, using a Python loop and scipy.stats.entropy. I include this here so the result can be compared to the vectorized calculation below.
In [19]: [entropy(t, q) for t in p]
Out[19]: [0.32253909299531597, 0.17897138916539493, 0.2627905326857023]
We have constructed p and q so that the probability vectors each sum to 1. In this case, the above result can also be computed in a vectorized calculation with scipy.special.rel_entr or scipy.special.kl_div. (I recommend rel_entr. kl_div adds and subtracts additional terms that will ultimately cancel out in the sum, so it does a bit more work than necessary.)
These functions compute only the point-wise part of the calculations; you have to sum the result to get the actual entropy or divergence.
In [20]: rel_entr(p, q).sum(axis=1)
Out[20]: array([0.32253909, 0.17897139, 0.26279053])
In [21]: kl_div(p, q).sum(axis=1)
Out[21]: array([0.32253909, 0.17897139, 0.26279053])
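For reference, the same row-wise sum can be written with plain NumPy broadcasting. This is only a sketch of the formula sum(p * log(p / q)), not how SciPy implements it, and it assumes every entry of p and q is strictly positive (it does not handle zeros the way rel_entr does). The small arrays are made up for illustration.

```python
import numpy as np

# Hypothetical data: each row of p, and q itself, sums to 1 and is strictly positive.
p = np.array([[0.5, 0.25, 0.25],
              [0.1, 0.6, 0.3]])
q = np.array([0.2, 0.3, 0.5])

# q (shape (3,)) broadcasts against each row of p (shape (2, 3)).
kl_rows = np.sum(p * np.log(p / q), axis=1)

# Same quantity with an explicit Python loop, for comparison.
kl_loop = np.array([np.sum(row * np.log(row / q)) for row in p])

# The point-wise terms x*log(x/y) - x + y (the form kl_div uses) give the
# same row sums here, because the extra terms add up to sum(q) - sum(p) = 0.
kl_div_style = np.sum(p * np.log(p / q) - p + q, axis=1)

print(kl_rows)
```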

Related

Efficient way to fill NumPy array for independent entries?

I'm currently trying to fill a matrix K where each entry in the matrix is just a function applied to two entries of an array x.
At the moment I'm using the most obvious method of running through rows and columns one at a time using a double for-loop:
K = np.zeros((x.shape[0],x.shape[0]), dtype=np.float32)
for i in range(x.shape[0]):
    for j in range(x.shape[0]):
        K[i,j] = f(x[i],x[j])
While this works fine, the resulting matrix is 10,000 by 10,000 and takes a very long time to calculate. Is there a more efficient way to do this built into NumPy?
EDIT: The function in question here is a gaussian kernel:
def gaussian(a,b,sigma):
    vec = a-b
    return np.exp(- np.dot(vec,vec)/(2*sigma**2))
where I set sigma in advance before calculating the matrix.
The array x is an array of shape (10000, 8). So the scalar product in the gaussian is between two vectors of dimension 8.
You can use a single for loop together with broadcasting. This requires changing the implementation of the gaussian function to accept 2D inputs:
def gaussian(a, b, sigma):
    vec = a - b
    return np.exp(-np.sum(vec**2, axis=-1) / (2 * sigma**2))
K = np.zeros((x.shape[0], x.shape[0]), dtype=np.float32)
for i in range(x.shape[0]):
    K[i] = gaussian(x[i:i+1], x, sigma)
In theory you could do this without any for loop, again by using broadcasting, but that creates an intermediate array with len(x)**2 * x.shape[1] elements (8 x 10^8 for your x, about 6.4 GB in float64), which might exhaust memory for your array sizes:
K = gaussian(x[None, :, :], x[:, None, :], sigma)
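Another way to avoid the large three-dimensional intermediate, which is not part of the answer above, is to expand the squared distance as ||a||^2 + ||b||^2 - 2 a.b, so that only (N, N) arrays are created. The sketch below assumes x is a 2D float array; the helper name gaussian_kernel_matrix and the test data are hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(x, sigma):
    """Hypothetical helper: K[i, j] = exp(-||x[i] - x[j]||^2 / (2 sigma^2))."""
    sq_norms = np.sum(x**2, axis=1)                       # shape (N,)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for all pairs at once.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    # Rounding can produce tiny negative values; clip them to zero.
    np.maximum(sq_dists, 0.0, out=sq_dists)
    return np.exp(-sq_dists / (2.0 * sigma**2))

# Small made-up data set so the result can be checked against the double loop.
rng = np.random.default_rng(0)
x_small = rng.random((50, 8))
sigma = 1.5
K_fast = gaussian_kernel_matrix(x_small, sigma)

K_loop = np.empty((50, 50))
for i in range(50):
    for j in range(50):
        d = x_small[i] - x_small[j]
        K_loop[i, j] = np.exp(-np.dot(d, d) / (2 * sigma**2))

print(np.allclose(K_fast, K_loop))
```

The expansion trades a little numerical accuracy (cancellation for nearly identical points) for a much smaller memory footprint, so it is worth comparing against the loop on a subset before relying on it.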

Scipy: Sparse indicator matrix from array(s)

What is the most efficient way to compute a sparse boolean matrix I from one or two arrays a,b, with I[i,j]==True where a[i]==b[j]? The following is fast but memory-inefficient:
I = a[:,None]==b
The following is slow and still memory-inefficient during creation:
I = csr((a[:,None]==b),shape=(len(a),len(b)))
The following gives at least the rows,cols for better csr_matrix initialization, but it still creates the full dense matrix and is equally slow:
z = np.argwhere((a[:,None]==b))
Any ideas?
One way to do it would be to first identify all different elements that a and b have in common using sets. This should work well if there are not very many different possibilities for the values in a and b. One then would only have to loop over the different values (below in variable values) and use np.argwhere to identify the indices in a and b where these values occur. The 2D indices of the sparse matrix can then be constructed using np.repeat and np.tile:
import numpy as np
from scipy import sparse
a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))
##identifying all values that occur both in a and b:
values = set(np.unique(a)) & set(np.unique(b))
##here we collect the indices in a and b where the respective values are the same:
rows, cols = [], []
##looping over the common values, finding their indices in a and b, and
##generating the 2D indices of the sparse matrix with np.repeat and np.tile
for value in values:
    x = np.argwhere(a==value).ravel()
    y = np.argwhere(b==value).ravel()
    ##each index in x must be paired with every index in y
    rows.append(np.repeat(x, len(y)))
    cols.append(np.tile(y, len(x)))
##concatenating the indices for different values and generating a 1D vector
##of True values for final matrix generation
rows = np.hstack(rows)
cols = np.hstack(cols)
data = np.ones(len(rows),dtype=bool)
##generating sparse matrix
I3 = sparse.csr_matrix( (data,(rows,cols)), shape=(len(a),len(b)) )
##checking that the matrix was generated correctly:
print((I1 != I3).nnz==0)
The syntax for generating the csr matrix is taken from the documentation. The test for sparse matrix equality is taken from this post.
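To see how np.repeat and np.tile combine into all (row, column) pairs for one value, here is a tiny sketch with made-up index arrays. Note that each array is repeated or tiled by the length of the other one:

```python
import numpy as np

# Hypothetical positions where some value occurs in a (x) and in b (y).
x = np.array([0, 3])
y = np.array([1, 2, 5])

# Every index in x is paired with every index in y.
rows = np.repeat(x, len(y))   # [0 0 0 3 3 3]
cols = np.tile(y, len(x))     # [1 2 5 1 2 5]

pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)  # [(0, 1), (0, 2), (0, 5), (3, 1), (3, 2), (3, 5)]
```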
Old Answer:
I don't know about performance, but at least you can avoid constructing the full dense matrix by using a simple generator expression. Here is some code that uses two 1d arrays of random integers to first generate the sparse matrix the way the OP posted, and then builds the same matrix from a generator expression that tests all pairs of elements for equality:
import numpy as np
from scipy import sparse
a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))
## matrix generation using generator
data, rows, cols = zip(
    *((True, i, j) for i,A in enumerate(a) for j,B in enumerate(b) if A==B)
)
I2 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))
##testing that matrices are equal
## from https://stackoverflow.com/a/30685839/2454357
print((I1 != I2).nnz==0) ## --> True
I think there is no way around the double loop and ideally this would be pushed into numpy, but at least with the generator the loops are somewhat optimised ...
You could use numpy.isclose with small tolerance:
np.isclose(a,b)
Or pandas.DataFrame.eq:
a.eq(b)
Note that this returns an array of True/False values.

Vectorise Python code

I have coded a kriging algorithm but I find it quite slow. In particular, do you have any idea how I could vectorise the piece of code in the cons function below:
import time
import numpy as np
B = np.zeros((200, 6))
P = np.zeros((len(B), len(B)))
def cons():
    time1 = time.time()
    for i in range(len(B)):
        for j in range(len(B)):
            P[i,j] = corr(B[i], B[j])
    time2 = time.time()
    return time2 - time1
def corr(x, x_i):
    return np.exp(-np.sum(np.abs(np.array(x) - np.array(x_i))))
time_av = 0.
for i in range(30):
    time_av += cons()
print("Average=", time_av/30.)
Edit: Bonus questions
1. What happens to the broadcasting solution if I want corr(B[i], C[j]), with C having the same dimensions as B?
2. What happens to the scipy solution if my p-norm orders are an array:
p=np.array([1.,2.,1.,2.,1.,2.])
def corr(x, x_i):
    return np.exp(-np.sum(np.abs(np.array(x) - np.array(x_i))**p))
For 2., I tried P = np.exp(-cdist(B, C,'minkowski', p)) but scipy is expecting a scalar.
Your problem seems very simple to vectorize. For each pair of rows of B you want to compute
P[i,j] = np.exp(-np.sum(np.abs(B[i,:] - B[j,:])))
You can make use of array broadcasting and introduce a third dimension, summing along the last one:
P2 = np.exp(-np.sum(np.abs(B[:,None,:] - B),axis=-1))
The idea is to reshape the first occurrence of B to shape (N,1,M) while the second B is left with shape (N,M). With array broadcasting, the latter is equivalent to (1,N,M), so
B[:,None,:] - B
is of shape (N,N,M). Summing along the last index will then result in the (N,N)-shape correlation array you're looking for.
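The shapes described above can be verified on a small random array. This is a sketch with made-up sizes N = 4 and M = 6, compared against the explicit double loop:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 6
B = rng.random((N, M))          # hypothetical data, for illustration only

diff = B[:, None, :] - B        # (N, 1, M) minus (N, M) broadcasts to (N, N, M)
P2 = np.exp(-np.sum(np.abs(diff), axis=-1))   # sum over the last axis gives (N, N)

# Reference result from the double loop.
P_loop = np.array([[np.exp(-np.sum(np.abs(B[i] - B[j]))) for j in range(N)]
                   for i in range(N)])

print(diff.shape, P2.shape, np.allclose(P2, P_loop))
```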
Note that if you are using SciPy, you can do this with scipy.spatial.distance.cdist. (A combination of scipy.spatial.distance.pdist and scipy.spatial.distance.squareform gives the same matrix without unnecessarily computing the lower triangular half of this symmetric matrix.) Using @Divakar's suggestion in the comments for the simplest solution this way:
from scipy.spatial.distance import cdist
P3 = 1/np.exp(cdist(B, B, 'minkowski', p=1))
cdist will compute the Minkowski distance in 1-norm, which is exactly the sum of the absolute values of coordinate differences.
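The broadcasting approach also covers the two bonus questions from the edit. The sketch below assumes C has the same number of columns as B (the row counts may differ) and that p holds one exponent per column; the data is made up. It is a plain NumPy route, not a cdist call.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.random((5, 6))                        # hypothetical data
C = rng.random((7, 6))                        # hypothetical second set of points
p = np.array([1., 2., 1., 2., 1., 2.])        # one exponent per column

# (5, 1, 6) minus (1, 7, 6) broadcasts to (5, 7, 6); p, with shape (6,),
# broadcasts along the last axis, so each column gets its own exponent.
P_bc = np.exp(-np.sum(np.abs(B[:, None, :] - C[None, :, :])**p, axis=-1))

# Reference result from the double loop over corr(B[i], C[j]).
P_ref = np.array([[np.exp(-np.sum(np.abs(b - c)**p)) for c in C] for b in B])

print(P_bc.shape, np.allclose(P_bc, P_ref))
```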

How to call a function with parameters as matrix?

I am trying to call scipy.stats.multivariate_normal with four different parameters for mu and sigma. And then for each generated probability density function I need to call that pdf on an array of say, 10 values.
For simplicity let's say that above mentioned function is addXY:
def addXY(x, y):
    return x+y
params=[[1,2],[1,3],[1,4],[1,5]] # mu and sigma, four versions
inputs=[1,2,3] # values, in this case 3 of them
matrix = []
for pdf_params in params:
    row = []
    for inp in inputs:
        entry = addXY(*pdf_params)
        row.append(entry*inp)
    matrix.append(row)
print(matrix)
Is this pythonic?
Is there a way to pass params and inputs and get a matrix with all combinations in it that is more pythonic/vectorized/faster?
!Important notice: Inputs in the example are scalar values (I've set scalar values to simplify problem description, I am actually using array of n-dimensional vectors and thus multivariate_normal pdf).
Hints and tips about similar operations are welcome.
Based on your description of what you are trying to compute, you don't need multivariate_normal. You are calling the PDF method with a set of scalar values for a distribution with a scalar mu and sigma. So you can use the pdf() method of scipy.stats.norm. This method will broadcast its arguments, so by passing in arrays with the proper shape, you can compute the PDF for the different values of mu and sigma in one call. Here's an example.
Here are your x values (you called them inputs), and the parameters:
In [23]: x = np.array([1, 2, 3])
In [24]: params = np.array([[1, 2], [1, 3], [1, 4], [1, 5]])
For convenience, separate the parameters into arrays of mu and sigma values.
In [25]: mu = params[:, 0]
In [26]: sig = params[:, 1]
We'll use scipy.stats.norm to compute the PDF.
In [27]: from scipy.stats import norm
This call computes the PDF for the desired combinations of x and parameters. mu.reshape(-1, 1) and sig.reshape(-1, 1) are 2D arrays with shape (4, 1). x has shape (3,), so when these arguments are broadcast, the result has shape (4, 3). Each row is the PDF evaluated at x for one of the pairs of mu and sigma.
In [28]: p = norm.pdf(x, loc=mu.reshape(-1, 1), scale=sig.reshape(-1, 1))
In [29]: p
Out[29]:
array([[ 0.19947114,  0.17603266,  0.12098536],
       [ 0.13298076,  0.12579441,  0.10648267],
       [ 0.09973557,  0.09666703,  0.08801633],
       [ 0.07978846,  0.07820854,  0.07365403]])
In other words, the rows of p are:
norm.pdf(x, loc=mu[0], scale=sig[0])
norm.pdf(x, loc=mu[1], scale=sig[1])
norm.pdf(x, loc=mu[2], scale=sig[2])
norm.pdf(x, loc=mu[3], scale=sig[3])
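The broadcasting behind that call can be seen without SciPy. The sketch below writes the normal density by hand (the function normal_pdf is a stand-in for illustration, not SciPy's implementation) and applies it to column-shaped mu and sigma:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Hand-written normal density, for illustration only."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.array([1, 2, 3])
params = np.array([[1, 2], [1, 3], [1, 4], [1, 5]])
mu = params[:, 0].reshape(-1, 1)    # shape (4, 1)
sig = params[:, 1].reshape(-1, 1)   # shape (4, 1)

# (3,) against (4, 1) broadcasts to (4, 3): one row per (mu, sigma) pair.
p_manual = normal_pdf(x, mu, sig)
print(p_manual.shape)
```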
This is just an idea for shortening the code and making more use of library functions.
Your code does not actually use numpy or scipy, so the question is whether you want a numpy.array for further data processing.
Option 1: just use a list to represent an array and a list of lists to represent a matrix:
from itertools import product
matrix_list = [sum(param)*input_x for param, input_x in product(params, inputs)]
matrix = list(zip(*[iter(matrix_list)]*len(inputs)))
print(matrix)
Credit for the zip method goes to the question "convert a flat list to list of list in python".
Option 2: use numpy.array and numpy.matrix for further processing
from itertools import product
import numpy as np
matrix_array = np.array([sum(param)*input_x for param, input_x in product(params, inputs)])
matrix = matrix_array.reshape(len(params),len(inputs))
print(matrix)
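For the simplified addXY example, the whole matrix can also be built without itertools. This is a sketch that relies on addXY being a plain sum, which holds only for the toy function in the question:

```python
import numpy as np

params = np.array([[1, 2], [1, 3], [1, 4], [1, 5]])
inputs = np.array([1, 2, 3])

# addXY(x, y) is x + y, so sum each parameter pair, then take the outer
# product with the inputs: matrix[i, j] = (x_i + y_i) * inputs[j].
matrix = np.outer(params.sum(axis=1), inputs)
print(matrix)
```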

Numpy- weight and sum rows of a matrix

Using Python & Numpy, I would like to:
- Consider each row of an (n columns x m rows) matrix as a vector
- Weight each row (scalar multiplication on each component of the vector)
- Add each row to create a final vector (vector addition).
The weights are given in a regular numpy array, n x 1, so that each vector m in the matrix should be multiplied by weight n.
Here's what I've got (with test data; the actual matrix is huge), which is perhaps very un-Numpy and un-Pythonic. Can anyone do better? Thanks!
import numpy
# test data
mvec1 = numpy.array([1,2,3])
mvec2 = numpy.array([4,5,6])
start_matrix = numpy.matrix([mvec1,mvec2])
weights = numpy.array([0.5,-1])
#computation
wmatrix = [ weights[n]*start_matrix[n] for n in range(len(weights)) ]
vector_answer = [0,0,0]
for x in wmatrix: vector_answer+=x
Even though a technically correct answer has already been given, here is my straightforward one:
from numpy import array, dot
dot(array([0.5, -1]), array([[1, 2, 3], [4, 5, 6]]))
# array([-3.5, -4. , -4.5])
This is much more in the spirit of linear algebra (and of the three requirements listed at the top of the question).
Update:
This solution is also fast: easily some 10-15x faster than the one already proposed.
It will be more convenient to use a two-dimensional numpy.array than a numpy.matrix in this case.
start_matrix = numpy.array([[1,2,3],[4,5,6]])
weights = numpy.array([0.5,-1])
final_vector = (start_matrix.T * weights).sum(axis=1)
# array([-3.5, -4. , -4.5])
The multiplication operator * does the right thing here due to NumPy's broadcasting rules.
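The broadcasting version and the dot-product version compute the same weighted sum of rows. A short sketch using the test data from the question makes the equivalence explicit:

```python
import numpy as np

start_matrix = np.array([[1, 2, 3], [4, 5, 6]])
weights = np.array([0.5, -1])

# Transposed form: weights (2,) broadcasts across the columns of the (3, 2) array.
via_broadcast = (start_matrix.T * weights).sum(axis=1)

# Same idea without the transpose: make weights a (2, 1) column and sum the rows.
via_column = (weights[:, None] * start_matrix).sum(axis=0)

# Vector-matrix product, equivalent to dot(weights, start_matrix).
via_matmul = weights @ start_matrix

print(via_matmul)  # [-3.5 -4.  -4.5]
```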
