Vectorize sampling from a multidimensional array [duplicate]

This question already has an answer here: How to draw a sample from a categorical distribution
I have a numpy array of shape D x N x K.
I need a resulting D x N array of random elements out of K classes, where for each index [d, n] there is a different probability vector for the classes, indicated by the third axis.
According to the numpy documentation, np.random.choice only allows a 1D array for the probabilities.
Can I vectorize this type of sampling, or do I have to use a for loop as follows:
# input_array of shape (D, N, K); each input_array[d, n] is a probability vector over K classes
# output_array of shape (D, N)
D, N, K = input_array.shape
output_array = np.empty((D, N), dtype=int)
for d in range(D):
    for n in range(N):
        probabilities = input_array[d, n]
        element = np.random.choice(K, p=probabilities)
        output_array[d, n] = element
I would have loved it if there were a function such as
output_array = np.random.choice(input_array, K, probability_axis=-1)
Edit: I managed to find a "hand-engineered" solution here.

Neither np.random.choice nor np.random.default_rng().choice supports broadcasting of probabilities in the way you intend. However, you can cobble together something that works similarly using np.cumsum:
sprob = input_array.cumsum(axis=-1, dtype=float)
sprob /= sprob[:, :, -1:]
output_array = (np.random.rand(D, N, 1) > sprob).argmin(-1)
Unfortunately, np.searchsorted does not support multi-dimensional lookup either (probably for related reasons).
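For reference, here is a minimal end-to-end sketch of the cumsum trick (the D, N, K values and the random input_array are made up for illustration), including a quick check that repeated draws for one cell reproduce its probability vector:
import numpy as np

# Toy setup: D, N, K chosen arbitrarily; input_array holds a probability vector per (d, n).
D, N, K = 2, 3, 4
rng = np.random.default_rng(0)
input_array = rng.random((D, N, K))
input_array /= input_array.sum(axis=-1, keepdims=True)

# One categorical draw per (d, n) cell via the cumulative-sum trick.
sprob = input_array.cumsum(axis=-1)
output_array = (rng.random((D, N, 1)) > sprob).argmin(-1)
print(output_array.shape)  # (2, 3)

# Sanity check: many draws for the (0, 0) cell give frequencies close to input_array[0, 0].
draws = (rng.random((100_000, 1)) > sprob[0, 0]).argmin(-1)
print(np.bincount(draws, minlength=K) / draws.size)
print(input_array[0, 0])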

Related

Efficient way to fill NumPy array for independent entries?

I'm currently trying to fill a matrix K where each entry in the matrix is just a function applied to two entries of an array x.
At the moment I'm using the most obvious method of running through rows and columns one at a time using a double for-loop:
K = np.zeros((x.shape[0], x.shape[0]), dtype=np.float32)
for i in range(x.shape[0]):
    for j in range(x.shape[0]):
        K[i, j] = f(x[i], x[j])
While this works fine, the resulting matrix is 10,000 by 10,000 and takes a very long time to calculate. I was wondering if there is a more efficient way to do this built into NumPy?
EDIT: The function in question here is a gaussian kernel:
def gaussian(a, b, sigma):
    vec = a - b
    return np.exp(-np.dot(vec, vec) / (2 * sigma**2))
where I set sigma in advance before calculating the matrix.
The array x is an array of shape (10000, 8). So the scalar product in the gaussian is between two vectors of dimension 8.
You can use a single for loop together with broadcasting. This requires changing the implementation of the gaussian function to accept 2D inputs:
def gaussian(a, b, sigma):
    vec = a - b
    return np.exp(-np.sum(vec**2, axis=-1) / (2 * sigma**2))

K = np.zeros((x.shape[0], x.shape[0]), dtype=np.float32)
for i in range(x.shape[0]):
    K[i] = gaussian(x[i:i+1], x, sigma)
Theoretically you could accomplish this even without any for loop, again by using broadcasting, but this creates an intermediate array of size len(x)**2 * x.shape[1], which might run out of memory for your array sizes:
K = gaussian(x[None, :, :], x[:, None, :], sigma)
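As a quick consistency check, here is a small runnable sketch (the array shape and sigma value are assumptions made for illustration) comparing the single-loop version with the fully broadcast one:
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((5, 8)).astype(np.float32)  # toy stand-in for the (10000, 8) array
sigma = 1.5

def gaussian(a, b, sigma):
    vec = a - b
    return np.exp(-np.sum(vec**2, axis=-1) / (2 * sigma**2))

# single loop, broadcasting each row against all rows
K_loop = np.zeros((x.shape[0], x.shape[0]), dtype=np.float32)
for i in range(x.shape[0]):
    K_loop[i] = gaussian(x[i:i+1], x, sigma)

# fully broadcast version (creates a (5, 5, 8) intermediate)
K_full = gaussian(x[None, :, :], x[:, None, :], sigma)

print(np.allclose(K_loop, K_full))  # True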

Summing vector pairs efficiently in pytorch

I'm trying to calculate the summation of each pair of rows in a matrix. Suppose I have an m x n matrix, say one like
[[1,2,3],
[4,5,6],
[7,8,9]]
and I want to create a matrix of the summations of all pairs of rows. So, for the above matrix, we would want
[[5,7,9],
[8,10,12],
[11,13,15]]
In general, I think the new matrix will be (m choose 2) x n. For the above example in pytorch, I ran
import torch
x = torch.tensor([[1,2,3], [4,5,6], [7,8,9]])
y = x[None] + x[:, None]
torch.cat((y[0, 1:3, :], y[1, 2:3, :]))
which manually creates the matrix I am looking for. However, I am struggling to think of a way to create the output without manually specifying indices and without using a for-loop. Is there even a way to create such a matrix for an arbitrary matrix without the use of a for-loop?
You can try using this function:
def sum_rows(x):
    y = x[None] + x[:, None]
    ind = torch.tril_indices(x.shape[0], x.shape[0], offset=-1)
    return y[ind[0], ind[1]]
Because you want the pairs sum_matrix[i, j] with i < j (i > j would also work, since addition is commutative), you can just take the lower/upper-triangle indices of the 3D tensor y. This may still use a loop internally, AFAIK, but it does the job for variable-sized inputs.
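For example, running it on the matrix from the question reproduces the desired pair sums:
import torch

def sum_rows(x):
    y = x[None] + x[:, None]
    ind = torch.tril_indices(x.shape[0], x.shape[0], offset=-1)
    return y[ind[0], ind[1]]

x = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(sum_rows(x))
# tensor([[ 5,  7,  9],
#         [ 8, 10, 12],
#         [11, 13, 15]])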

Scipy: Sparse indicator matrix from array(s)

What is the most efficient way to compute a sparse boolean matrix I from one or two arrays a,b, with I[i,j]==True where a[i]==b[j]? The following is fast but memory-inefficient:
I = a[:,None]==b
The following is slow and still memory-inefficient during creation:
I = csr((a[:,None]==b),shape=(len(a),len(b)))
The following gives at least the rows,cols for better csr_matrix initialization, but it still creates the full dense matrix and is equally slow:
z = np.argwhere((a[:,None]==b))
Any ideas?
One way to do it would be to first identify all different elements that a and b have in common using sets. This should work well if there are not very many different possibilities for the values in a and b. One then would only have to loop over the different values (below in variable values) and use np.argwhere to identify the indices in a and b where these values occur. The 2D indices of the sparse matrix can then be constructed using np.repeat and np.tile:
import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b), shape=(len(a), len(b)))

## identifying all values that occur both in a and b:
values = set(np.unique(a)) & set(np.unique(b))

## here we collect the indices in a and b where the respective values are the same:
rows, cols = [], []

## looping over the common values, finding their indices in a and b, and
## generating the 2D indices of the sparse matrix with np.repeat and np.tile
for value in values:
    x = np.argwhere(a==value).ravel()
    y = np.argwhere(b==value).ravel()
    rows.append(np.repeat(x, len(y)))
    cols.append(np.tile(y, len(x)))

## concatenating the indices for different values and generating a 1D vector
## of True values for final matrix generation
rows = np.hstack(rows)
cols = np.hstack(cols)
data = np.ones(len(rows), dtype=bool)

## generating the sparse matrix
I3 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))

## checking that the matrix was generated correctly:
print((I1 != I3).nnz == 0)
The syntax for generating the csr matrix is taken from the documentation. The test for sparse matrix equality is taken from this post.
Old Answer:
I don't know about performance, but at least you can avoid constructing the full dense matrix by using a simple generator expression. Here is some code that uses two 1D arrays of random integers to first generate the sparse matrix the way the OP posted, and then uses a generator expression to test all elements for equality:
import numpy as np
from scipy import sparse
a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))
## matrix generation using generator
data, rows, cols = zip(
*((True, i, j) for i,A in enumerate(a) for j,B in enumerate(b) if A==B)
)
I2 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))
##testing that matrices are equal
## from https://stackoverflow.com/a/30685839/2454357
print((I1 != I2).nnz==0) ## --> True
I think there is no way around the double loop and ideally this would be pushed into numpy, but at least with the generator the loops are somewhat optimised ...
You could use numpy.isclose with a small tolerance, broadcasting a against b:
np.isclose(a[:, None], b)
Or pandas.DataFrame.eq:
a.eq(b)
Note this returns a dense array of True/False values.

Why does this array indexing in numpy work?

I've got some numpy 2d arrays:
x, of shape (N, T)
W, of shape (V, D)
They are described as follows:
"Minibatches of size N where each sequence has length T. We assume a vocabulary of V words, assigning each to a vector of dimension D." (This is a question from cs231 A3.)
I want an output array of shape (N, T, D), where I can match the N elements to the desired vectors.
First I came up with a solution using a loop that runs through all the rows of x (its first axis):
for n in range(N):
    out[n, :, :] = W[x[n, :]]
Then I went on to experiment with a second solution:
out = W[x]
Both solutions gave me the right answer, but why does the second solution work? Why can I index a 2D array with a 2D array of indices and get a 3D result?
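A minimal shape check illustrating the behaviour described above (the N, T, V, D values here are arbitrary toy sizes):
import numpy as np

N, T, V, D = 2, 4, 10, 3
x = np.random.randint(0, V, size=(N, T))  # integer word indices
W = np.random.rand(V, D)                  # one D-dimensional vector per word

out = W[x]
print(out.shape)                              # (2, 4, 3), i.e. (N, T, D)
print(np.array_equal(out[1, 2], W[x[1, 2]]))  # True: each index picks out a row of W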

dot product of vectors in multidimensional matrices (python, numpy)

I have two matrices A and B, each of shape N x K x D, and I want to get a matrix C of shape N x K x D x D, where C[n, k] = A[n, k] x B[n, k].T (here "x" means the product of matrices of shapes D x 1 and 1 x D, so the result is D x D). Right now my code looks like this (here A = B = X):
def square(X):
    out = np.zeros((N, K, D, D))
    for n in range(N):
        for k in range(K):
            out[n, k] = np.dot(X[n, k, :, np.newaxis], X[n, k, np.newaxis, :])
    return out
This may be slow for large N and K because of Python's for loops. Is there some way to do this multiplication in one NumPy call?
It seems you are not using np.dot for sum-reduction, but just for expansion that results in broadcasting. So, you can simply extend the array to have one more dimension with the use of np.newaxis/None and let the implicit broadcasting help out.
Thus, an implementation would be -
X[...,None]*X[...,None,:]
More info on broadcasting, specifically how to add new axes, can be found in this other post.
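As a quick check (the N, K, D values below are made up), the broadcast expression matches the double loop; an equivalent np.einsum spelling is also shown in case that reads more clearly:
import numpy as np

N, K, D = 2, 3, 4
X = np.random.rand(N, K, D)

# reference: explicit double loop of (D x 1) times (1 x D) products
out_loop = np.zeros((N, K, D, D))
for n in range(N):
    for k in range(K):
        out_loop[n, k] = np.dot(X[n, k, :, np.newaxis], X[n, k, np.newaxis, :])

out_bcast = X[..., None] * X[..., None, :]
out_einsum = np.einsum('nki,nkj->nkij', X, X)

print(np.allclose(out_loop, out_bcast))   # True
print(np.allclose(out_loop, out_einsum))  # True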
