I want to compute a sum of terms relating to the following jagged arrays x, tokens, and phi with the same shape, and with the following forms:
x has length n, with each x[i] having its own length (in my small example n = 5, but in reality n is very large, in the millions).
tokens has the same shape as x, but each entry is an integer between 1 and 200000.
Lastly, phi also has the same shape as x and tokens.
I want to compute, for each v = 1, ..., 200000, the sum of the products x[i][j] * phi[i][j] over all i, j where tokens[i][j] == v.
Specifically, I want to get out a 200000-vector whose first entry is the sum of x[i][j] * phi[i][j] over all indices i, j where tokens[i][j] == 1, and so on.
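To pin down the output, a naive reference loop might look like this (just a sketch, assuming x, tokens and phi are lists of 1-D arrays; far too slow at this scale, but it defines exactly what I want):

import numpy as np

out = np.zeros(200000)
for i in range(len(x)):
    for j in range(len(x[i])):
        out[tokens[i][j] - 1] += x[i][j] * phi[i][j]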
Now, the only ways I could think to do this are brute-force:
1. Turn x, tokens and phi into sparse matrices of shape n x 200000 and then take the componentwise product np.multiply(x, phi). But this is way too large, since n is in the millions.
2. Manually get each location, something like t = np.argwhere(tokens[i] == v), then take the product x[i][t] * phi[i][t]. But this also seems very slow, since I would have to search a million arrays for 200000 different values.
Is there a better way, or which of these two would you pick?
I would like to generate invertible matrices (specifically those from GL(n), the general linear group of degree n) using TensorFlow and/or NumPy for use with my neural network.
How can this be done and what would be the best way of doing so?
I understand there is a way to generate symmetric invertible matrices by computing (A + A.T)/2 for arbitrary square matrices A, however, I would like mine to not just be symmetric.
I happened to have found one way which I believe can generate a large variety of random invertible matrices using diagonal dominance.
The theorem is that, given an n x n matrix, if for every row the absolute value of the diagonal element is larger than the sum of the absolute values of all the other elements in that row, then the matrix is invertible (see the corresponding Wikipedia article: https://en.wikipedia.org/wiki/Diagonally_dominant_matrix).
Therefore the following code snippet generates an arbitrary invertible matrix.
import numpy as np

n = 5  # size of invertible matrix I wish to generate
m = np.random.rand(n, n)
mx = np.sum(np.abs(m), axis=1)  # row sums of absolute values
np.fill_diagonal(m, mx)         # make each diagonal entry dominate its row
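As a quick sanity check on the snippet above, you can confirm the result really is invertible (a minimal check, reusing m and n from the snippet):

print(np.linalg.det(m))               # non-zero
print(np.linalg.matrix_rank(m) == n)  # True, i.e. m has full rank and is invertible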
I have a function which currently multiplies a matrix in scipy.sparse.csr_matrix form by a vector. I call this function many times with different values, and I would like the matrix-vector multiplication to be as efficient as possible. The matrix is an N x N matrix but contains only m x N non-zero elements, where m << N. The non-zero elements are currently scattered randomly about the matrix. I could perform row operations to get this matrix into a form where all the elements appear on only m + 2 diagonals, and then use scipy.sparse.dia_matrix instead of scipy.sparse.csr_matrix. That would take quite a bit of work, so I was wondering if anyone knows whether it would even improve computational efficiency.
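For what it's worth, one way to answer this without committing to the row operations is to benchmark both formats on a synthetic banded matrix. A rough sketch (N, m and the band layout here are made-up placeholders, not the real problem sizes):

import numpy as np
from scipy import sparse
from timeit import timeit

N, m = 100000, 5
offsets = list(range(-(m // 2), m // 2 + 1))            # m bands around the main diagonal
diagonals = [np.random.rand(N - abs(k)) for k in offsets]
A_dia = sparse.diags(diagonals, offsets, format="dia")  # banded matrix in DIA storage
A_csr = A_dia.tocsr()                                   # the same matrix in CSR storage
v = np.random.rand(N)

print("CSR:", timeit(lambda: A_csr @ v, number=100))
print("DIA:", timeit(lambda: A_dia @ v, number=100))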
I have a matrix m with 8300 columns and 18 rows. Each column represents a gene and each row a sample. I want to calculate the adjacency matrix (using Spearman correlation) and the corresponding p-value matrix.
The code I've got so far is:
import numpy as np
import scipy.stats as st

n_genes = m.shape[1]              # 8300 genes (columns of m)
W = np.zeros((n_genes, n_genes))  # correlation matrix
P = np.zeros((n_genes, n_genes))  # p-value matrix
for i in range(0, n_genes):
    for j in range(0, n_genes):
        W[i, j], P[i, j] = st.spearmanr(m[:, i], m[:, j])
This is amazingly inefficient (it takes around 11 hours to run in Google Colab using a GPU). Is there a way to vectorize this?
Thank you a lot!
https://docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.stats.spearmanr.html
It looks like with this function you can pass your entire m matrix as a single argument and it will compute the correlations and p-values between all of its columns, which it interprets as the variables (the rows being observations of those variables). It then outputs the correlations and p-values as matrices, so you can get rid of the for loops and produce the correlation and p-value matrices in one go.

Even without doing it in one pass, your current code goes through all the data twice to form a symmetric matrix; I would have written the inner loop as "for j in range(i, n_genes):" and then filled in both entries [i, j] and [j, i] in the body of the loop.
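Concretely, the whole double loop collapses to a single call (a sketch, assuming m has samples in the rows and genes in the columns as described):

import scipy.stats as st

W, P = st.spearmanr(m)  # both W and P have shape (n_genes, n_genes)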
Say you have n vectors of arbitrary (but equal) length m each. Is there a (NumPy?) function, or a quick way, of calculating all pairwise dot products between these n vectors?
My initial thought was that you could calculate AᵀA and take the upper triangular portion, but I'm not sure whether that matrix multiplication is slow/computationally intensive. Is there a quicker/more efficient way? Or should I just define a function using a for loop over all pairs?
As per @Brenila's comment, use np.tensordot:
np.tensordot(arr, arr, axes=(0, 0))
The result has shape (n, n) for n = arr.shape[-1], i.e. assuming arr stores the n vectors as the columns of an (m, n) array.
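A small check of the same idea (the array sizes here are just for illustration):

import numpy as np

arr = np.random.rand(3, 4)                  # 4 vectors of length 3, stored as columns
G = np.tensordot(arr, arr, axes=(0, 0))
print(G.shape)                              # (4, 4): all pairwise dot products
print(np.allclose(G, arr.T @ arr))          # True: identical to the Gram matrix AᵀA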
I am working on an image processing problem where I have code that looks like this (the code written below just illustrates the type of problem I want to solve):
import random
import numpy as np

Z = [[None] * 10 for _ in range(10)]  # 10x10 container, one slot per "pixel"
for i in range(0, 10):
    for j in range(0, 10):
        number_length = round(random.random() * 10)  # a random length, 0..10
        a = np.zeros(number_length)
        Z[i][j] = a
What I want to do is create some sort of 2D list or np.array (not really sure which) where I essentially index a term for every pixel in an image and store a vector/list of values for every individual pixel, whose length I cannot anticipate; moreover, the length of the vector differs from pixel to pixel. What is the best way to go about this?
In my MATLAB code the workaround is simple: I define a 2D cell and just assign any vector to any element of the 2D cell. Since cells do not require every indexed vector to have the same length, this works well. What is the equivalent optimal solution to handle this in Python?
Ideally the solution should not involve anticipating the maximum length of "a" for any pixel and padding all indexed vectors to that length, since this implies some sort of zero padding that will consume memory if the indexed vectors are high-dimensional and these high-dimensional vectors are sparse throughout the image.
A regular NumPy array won't work here because it requires fixed dimensions. You can use a 2D list (i.e. a list of lists), where each element can be an array of arbitrary length. This is analogous to your setup in MATLAB, using a 2D cell array of vectors.
Try this:
z = [[np.zeros(np.random.randint(10)+1) for j in range(10)] for i in range(10)]
This creates a 10x10 list, where z[i][j] is a NumPy array of zeros with random length (from 1 to 10).
Edit (nested loops requested in comment):
z = [[None for j in range(10)] for i in range(10)]
for i in range(len(z)):
    for j in range(len(z[i])):
        z[i][j] = np.zeros(np.random.randint(10) + 1)
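Either way, each pixel's entry keeps its own length, for example:

print(len(z[0][0]), len(z[3][7]))  # typically two different values between 1 and 10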