Iterate through numpy array testing multiple elements efficiently

I have the following code, which iterates through a 2D NumPy array named "m". It runs extremely slowly. How can I rewrite it with NumPy functions so that I avoid the for loops?
pairs = []
for i in range(size):
    for j in range(size):
        if(i >= j):
            continue
        if(m[i][j] + m[j][i] >= 0.75):
            pairs.append([i, j, m[i][j] + m[j][i]])

You can use a vectorised approach with NumPy. The idea is:
First initialize a matrix m, then create summ = m + m.T, where m.T is the matrix transpose. summ[i][j] is equal to m[i][j] + m[j][i].
np.triu(summ, k=1) returns the upper triangular part of the matrix and zeros out everything else. This is equivalent to skipping the lower part with continue in your code, and it replaces the explicit if(i >= j): check. You have to pass k=1 to exclude the diagonal elements; the default, k=0, includes them.
Then np.argwhere gives you the indices of the entries where the sum m+m.T is greater than or equal to 0.75. (Because the excluded entries are zeros, this only works for a positive threshold.)
Finally, you store those indices and the corresponding values in a list for later processing or printing.
Verifiable example (using a small 3x3 random dataset)
import numpy as np
np.random.seed(0)
m = np.random.rand(3,3)
summ = m + m.T
index = np.argwhere(np.triu(summ, k=1)>=0.75)
pairs = [(x,y, summ[x,y]) for x,y in index]
print(pairs)
# [(0, 1, 1.2600725493693163), (0, 2, 1.0403505873343364), (1, 2, 1.537667113848736)]
Further performance improvement
I just worked out an even faster way to build the final pairs list that avoids the explicit for loop:
pairs = list(zip(index[:, 0], index[:, 1], summ[index[:,0], index[:,1]]))
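A variant of the same idea, as a minimal sketch: np.triu_indices returns the (i, j) positions of the strict upper triangle directly, so no zeroed copy of the matrix is needed and the threshold may also be zero or negative. The function name pairs_above and the 4x4 test matrix are only for illustration, they are not from the question.

```python
import numpy as np

def pairs_above(m, threshold=0.75):
    """Return (i, j, m[i, j] + m[j, i]) for every i < j whose sum reaches threshold."""
    summ = m + m.T
    # Row and column indices of the strict upper triangle (k=1 skips the diagonal).
    rows, cols = np.triu_indices(m.shape[0], k=1)
    values = summ[rows, cols]
    keep = values >= threshold
    # tolist() converts NumPy scalars to plain Python ints and floats.
    return list(zip(rows[keep].tolist(), cols[keep].tolist(), values[keep].tolist()))

rng = np.random.default_rng(0)
m = rng.random((4, 4))
print(pairs_above(m))
```

The pairs come out in the same row-by-row order as the nested loops in the question.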

One way to optimize your code is to avoid the comparison if (i >= j). To traverse only the upper triangle of the array (i < j) without that comparison, start the inner loop at i + 1, one past the current value of the outer loop variable. That way, you skip the size x size comparisons and roughly half of the iterations.
import numpy as np
size = 5000
m = np.random.rand(size, size)
pairs = []
for i in range(size):
    for j in range(i + 1, size):
        if(m[i][j] + m[j][i] >= 0.75):
            pairs.append([i, j, m[i][j] + m[j][i]])

Related

Conversion of Matlab function into Python function

The following function is written in MATLAB. I need an equivalent Python function that produces the same output. Can you help me write the code?
function CORR=function_AutoCorr(tau,y)
% This function will generate a matrix, Where on-diagonal elements are autocorrelation and
% off-diagonal elements are cross-correlations
% y is the data set. e.g., a 10 by 9 Matrix.
% tau is the lag value. e.g. tau=1
Size=size(y);
N=Size(1,2); % Number of columns
T=Size(1,1); % length of the rows
for i=1:N
    for j=1:N
        temp1=0;
        for t=1:T-tau
            G=0.5*((y(t+tau,i)*y(t,j))+(y(t+tau,j)*y(t,i)));
            temp1=temp1+G;
        end
        CORR(i,j)=temp1/(T-tau);
    end
end
end

Assuming that y is a NumPy array, it would be something close to this (although I have not tested it):
import numpy as np
def function_AutoCorr(tau, y):
    Size = y.shape
    N = Size[1]
    T = Size[0]
    CORR = np.zeros(shape=(N,N))
    for i in range(N):
        for j in range(N):
            temp1 = 0
            for t in range(T - tau):
                G=0.5*((y[t+tau,i]*y[t,j])+(y[t+tau,j]*y[t,i]))
                temp1 = temp1 + G
            CORR[i, j] = temp1/(T - tau)
    return CORR
y = np.array([[1,2,3], [4,5,6], [6,7,8], [13,14,15]])
print(y)
result = function_AutoCorr(1, y)
print(result)
The resulting CORR matrix for this example is:
If you want to run the function for several tau values, you could use a list comprehension. tau has to be smaller than the number of rows of y, otherwise T - tau is zero and the division fails:
result = [function_AutoCorr(tau, y) for tau in range(1, y.shape[0])]
The result is a list of autocorrelation matrices, each of which is a NumPy array.
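If the triple loop turns out to be too slow, the same sum can probably be written as one matrix product. This is only a sketch of that idea: the name auto_corr_vectorized is made up here, and the input is assumed to be a 2D float array with time along the rows, as in the question.

```python
import numpy as np

def auto_corr_vectorized(tau, y):
    """Symmetrised lagged correlation matrix, computed without Python loops."""
    T = y.shape[0]
    if not 0 <= tau < T:
        raise ValueError("tau must satisfy 0 <= tau < number of rows")
    lagged = y[tau:]       # rows t + tau
    base = y[:T - tau]     # rows t
    # cross[i, j] = sum over t of y[t + tau, i] * y[t, j]
    cross = lagged.T @ base
    # Averaging cross with its transpose reproduces the 0.5 * (... + ...) term.
    return 0.5 * (cross + cross.T) / (T - tau)

y = np.array([[1, 2, 3], [4, 5, 6], [6, 7, 8], [13, 14, 15]], dtype=float)
print(auto_corr_vectorized(1, y))
```

The result is symmetric by construction, which matches the loop version because G is unchanged when i and j are swapped.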

You'll probably want to use NumPy. They even have a guide for Matlab users.
Here are some useful tips.
Defining a function
def auto_corr(tau, y):
    """Generate matrix of correlations"""
    # Do calculations
    return corr
Get the size of a numpy array
n_rows, n_cols = y.shape
Indexing
Indexing is 0-based and uses brackets ([]) instead of parentheses.

Quick way to divide matrix entries K_ij by K_ii*K_jj in Python

In Python, I have a matrix K of dimensions (N x N). I want to normalize K by dividing every entry K_ij by sqrt(K_(i,i)*K_(j,j)). What is a fast way to achieve this in Python without iterating through every entry?
My current solution is:
import numpy as np
K = np.random.rand(3,3)
diag = np.diag(K)
for i in range(np.shape(K)[0]):
    for j in range(np.shape(K)[1]):
        K[i,j] = K[i,j]/np.sqrt(diag[i]*diag[j])

Of course you have to iterate through every entry, at least internally. For square matrices:
K / np.sqrt(np.einsum('ii,jj->ij', K, K))
If the matrix is not square, you first have to define what should replace the "missing" values K[i,i] where i > j etc.
Alternative: use numba to leave your loop as is, get a free speedup, and even avoid the intermediate allocation:
import numpy as np
from numba import njit
@njit
def normalize(K):
    M = np.empty_like(K)
    m, n = K.shape
    for i in range(m):
        Kii = K[i,i]
        for j in range(n):
            Kjj = K[j,j]
            M[i,j] = K[i,j] / np.sqrt(Kii * Kjj)
    return M
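For a plain NumPy version without einsum, the outer product of the diagonal with itself gives the same denominator. A minimal sketch, assuming K is square with a strictly positive diagonal (the helper name normalize_by_diagonal and the test matrix are made up for illustration):

```python
import numpy as np

def normalize_by_diagonal(K):
    """Divide every K[i, j] by sqrt(K[i, i] * K[j, j]) and return a new array."""
    d = np.diag(K)
    # np.outer(d, d)[i, j] == K[i, i] * K[j, j]
    return K / np.sqrt(np.outer(d, d))

rng = np.random.default_rng(0)
K = rng.random((3, 3)) + 0.1   # the offset keeps the diagonal away from zero
K_norm = normalize_by_diagonal(K)
print(K_norm)
```

Unlike the loop in the question, this does not modify K in place, so the diagonal used for the division cannot change halfway through.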

Hierarchical agglomerative clustering: how to update distance matrix?

I would like to implement the simple hierarchical agglomerative clustering according to the pseudocode:
I got stuck at the last part where I need to update the distance matrix. So far I have:
import numpy as np
X = np.array([[1, 2],
              [0, 3],
              [2, 3],])
# Clusters
C = np.zeros((X.shape[0], X.shape[0]))
# Keeps track of active clusters
I = np.zeros(X.shape[0])
# For all n datapoints
for n in range(X.shape[0]):
    for i in range(X.shape[0]):
        # Compute the similarity of all N x N pairs of images
        C[n][i] = np.linalg.norm(X[n] - X[i])
    I[n] = 1
# Collects clustering as a sequence of merges
A = []
# In each of N iterations
for k in range(X.shape[0] - 1):
    # TODO: Find the indices of the smallest distance
    # Updated distance matrix
I would like to implement the single-linkage clustering, so I would like to find the argmin of the distance matrix. I originally thought about doing something like:
i, m = np.where(C == np.min(C[np.nonzero(C)]))
i, m = i[0], m[0]
A.append((i, m))
to find the argmin, but I think it is incorrect as it doesn't specify a condition on the active clusters in I. I am also confused because I should just be looking at the upper or lower triangle of the matrix, so if I use the above method I could get the same argmin twice due to symmetry.
I was also thinking about first creating the rows and columns of the new merged cluster:
C = np.vstack((C, np.zeros((1, C.shape[1]))))
C = np.hstack((C, np.zeros((C.shape[0], 1))))
Then somehow update it like:
for j in range(X.shape[0]):
    C[i][j] = min(C[i][j], C[m][j])
    C[j][i] = min(C[i][j], C[m][j])
I am not sure if this is the right approach. Is there a simpler way to find the argmin, merge the rows and columns and update the values?

If you are unsure how to find the row and column indices of the minimum distance:
Firstly,
To avoid getting the argmin twice due to symmetry, you can build your initial distance matrix as a lower triangular matrix.
import math
import numpy as np
def euclidean_distance(p1,p2):
    return math.sqrt((p1[0]-p2[0])**2+(p1[1]-p2[1])**2)
distance_matrix = np.zeros((X.shape[0], X.shape[0]))
for i in range(len(distance_matrix)):
    for j in range(i):
        distance_matrix[i][j] = euclidean_distance(X[i],X[j])
Secondly,
You can do your min search in the given matrix by hand if you don't like to use np tools or you are looking for a simple way.
min_value = np.inf
for i in range(len(distance_matrix)):
    for j in range(i):
        if( distance_matrix[i][j] < min_value):
            min_value = distance_matrix[i][j]
            min_i = i
            min_j = j
Finally,
Update the distance matrix and merge the clusters as follows:
# The search above yields min_i > min_j, but the update below expects min_i < min_j
min_i, min_j = min(min_i, min_j), max(min_i, min_j)
for i in range(len(distance_matrix)):
    if( i > min_i and i < min_j ):
        distance_matrix[i][min_i] = min(distance_matrix[i][min_i],distance_matrix[min_j][i])
    elif( i > min_j ):
        distance_matrix[i][min_i] = min(distance_matrix[i][min_i],distance_matrix[i][min_j])
for j in range(len(distance_matrix)):
    if( j < min_i ):
        distance_matrix[min_i][j] = min(distance_matrix[min_i][j],distance_matrix[min_j][j])
# remove one of the old clusters' data from the distance matrix
distance_matrix = np.delete(distance_matrix, min_j, axis=1)
distance_matrix = np.delete(distance_matrix, min_j, axis=0)
# A is assumed to be a list of clusters here, e.g. [[0], [1], [2]] at the start
A[min_i] = A[min_i] + A[min_j]
A.pop(min_j)
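If you do want to use the np tools, here is a minimal sketch of a different bookkeeping scheme (not the lower-triangle one above): keep the full symmetric matrix, put np.inf on the diagonal and on the rows and columns of retired clusters, and let np.argmin do the search. The function name single_linkage and the merge tuple layout are my own choices.

```python
import numpy as np

def single_linkage(X):
    """Return the merges as (kept_cluster, retired_cluster, distance) tuples."""
    n = X.shape[0]
    # Full symmetric matrix of pairwise Euclidean distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # inf on the diagonal so argmin never selects a point paired with itself.
    np.fill_diagonal(D, np.inf)
    merges = []
    for _ in range(n - 1):
        i, j = np.unravel_index(np.argmin(D), D.shape)
        i, j = int(min(i, j)), int(max(i, j))
        merges.append((i, j, float(D[i, j])))
        # Single linkage: the merged cluster keeps the smaller of the two distances.
        merged = np.minimum(D[i], D[j])
        D[i, :] = merged
        D[:, i] = merged
        D[i, i] = np.inf
        # Retire cluster j by making it unreachable for later argmin calls.
        D[j, :] = np.inf
        D[:, j] = np.inf
    return merges

X = np.array([[1, 2], [0, 3], [2, 3]])
print(single_linkage(X))
```

Because inactive clusters are masked with inf, no separate array of active flags is needed and the matrix never changes shape. Ties are resolved by np.argmin, which returns the first minimum in row-major order.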

generating random matrices in python

In the following code I have implemented Gaussian elimination with partial pivoting for a general square linear system Ax=b. I have tested my code and it produced the right output. However now I am trying to do the following but I am not quite sure how to code it, looking for some help with this!
I want to test my implementation by solving Ax=b where A is a random 100x100 matrix and b is a random 100x1 vector.
In my code I have put in the matrices
A = np.array([[3.,2.,-4.],[2.,3.,3.],[5.,-3.,1.]])
b = np.array([[3.],[15.],[14.]])
and gotten the following correct output:
[3. 1. 2.]
[3. 1. 2.]
But how do I change it to generate the random matrices?
Here is my code:
import numpy as np
def GEPP(A, b, doPricing = True):
    '''
    Gaussian elimination with partial pivoting.
    input: A is an n x n numpy matrix
           b is an n x 1 numpy array
    output: x is the solution of Ax=b
            with the entries permuted in
            accordance with the pivoting
            done by the algorithm
    post-condition: A and b have been modified.
    '''
    n = len(A)
    if b.size != n:
        raise ValueError("Invalid argument: incompatible sizes between"+
                         "A & b.", b.size, n)
    # k represents the current pivot row. Since GE traverses the matrix in the
    # upper right triangle, we also use k for indicating the k-th diagonal
    # column index.
    # Elimination
    for k in range(n-1):
        if doPricing:
            # Pivot
            maxindex = abs(A[k:,k]).argmax() + k
            if A[maxindex, k] == 0:
                raise ValueError("Matrix is singular.")
            # Swap
            if maxindex != k:
                A[[k,maxindex]] = A[[maxindex, k]]
                b[[k,maxindex]] = b[[maxindex, k]]
        else:
            if A[k, k] == 0:
                raise ValueError("Pivot element is zero. Try setting doPricing to True.")
        #Eliminate
        for row in range(k+1, n):
            multiplier = A[row,k]/A[k,k]
            A[row, k:] = A[row, k:] - multiplier*A[k, k:]
            b[row] = b[row] - multiplier*b[k]
    # Back Substitution
    x = np.zeros(n)
    for k in range(n-1, -1, -1):
        x[k] = (b[k] - np.dot(A[k,k+1:],x[k+1:]))/A[k,k]
    return x
if __name__ == "__main__":
    A = np.array([[3.,2.,-4.],[2.,3.,3.],[5.,-3.,1.]])
    b = np.array([[3.],[15.],[14.]])
    print (GEPP(np.copy(A), np.copy(b), doPricing = False))
    print (GEPP(A,b))

You're already using numpy. Have you considered np.random.rand?
np.random.rand(m, n) will get you a random matrix with values in [0, 1). You can further process it by multiplying random values or rounding.
EDIT: Something like this
if __name__ == "__main__":
    A = np.round(np.random.rand(100, 100)*10)
    b = np.round(np.random.rand(100)*10)
    print (GEPP(np.copy(A), np.copy(b), doPricing = False))
    print (GEPP(A,b))

So I would use np.random.randint for this.
numpy.random.randint(low, high=None, size=None, dtype='l')
which outputs a size-shaped array of random integers from the appropriate distribution, or a single such random int if size not provided.
low is the lower bound of the ints you want in your range
high is one greater than the upper bound in your desired range
size is the dimensions of your output array
dtype is the dtype of the result
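so if I were you I would write the following (converting to float, because GEPP updates A and b in place and integer arrays would truncate the intermediate results)
A = np.random.randint(0, 11, (100, 100)).astype(float)
b = np.random.randint(0, 11, 100).astype(float)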

Basically you could create the desired matrices filled with ones and then iterate over them, setting each value to random.randint(0, 100), for example.
A matrix filled with ones is:
one_array = np.ones((100, 100))
EDIT:
like:
import random
for x in range(one_array.shape[0]):
    for y in range(one_array.shape[1]):
        one_array[x][y] = random.randint(0, 100)

import numpy as np
A = np.random.normal(size=(100,100))
b = np.random.normal(size=(100,1))
x = np.linalg.solve(A,b)
assert max(abs(A@x - b)) < 1e-12
Clearly, you can use distributions other than the normal one, such as the uniform distribution.

You can use numpy's native rand function:
np.random.rand()
In your code just define A and b as:
A = np.random.rand(100, 100)
b = np.random.rand(100)
This will generate a 100x100 matrix and a vector of length 100 (both NumPy arrays) filled with random values in [0, 1).
See the docs for this function to learn more.
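To actually test a solver on such a random system, one option is to compare it with np.linalg.solve. The helper below is only a sketch: check_solver is a made-up name, it uses NumPy's Generator API (np.random.default_rng) so the run is reproducible, and it assumes the solver takes (A, b) and returns x as a 1-D array. Passing copies matters for a solver like GEPP that modifies its arguments.

```python
import numpy as np

def check_solver(solver, n=100, seed=0):
    """Solve a random n x n system and report (residual_ok, matches_numpy)."""
    rng = np.random.default_rng(seed)
    A = rng.random((n, n))   # uniform values in [0, 1)
    b = rng.random(n)
    # Hand over copies so an in-place solver cannot change A and b.
    x = solver(A.copy(), b.copy())
    residual_ok = np.allclose(A @ x, b)
    matches_numpy = np.allclose(x, np.linalg.solve(A, b))
    return residual_ok, matches_numpy

# Sanity check of the helper itself; replace np.linalg.solve with your own solver.
print(check_solver(np.linalg.solve))
```

np.allclose uses a tolerance, which is usually a safer check than an exact comparison or a very tight fixed threshold.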

Can I vectorize this 2d array indexing where the 2nd dimension depends on the value of the first?

In the example below I have a 2D array that has some real results that are shifted and padded. The shifts depend on the row (the padding is used to make the array rectangular as required by numpy). Is it possible to extract the real results without a Python loop?
import numpy as np
# results are 'shifted' where the shift depends on the row
shifts = np.array([0, 8, 4, 2], dtype=int)
max_shift = shifts.max()
n = len(shifts)
t = 10 # length of the real results we care about
a = np.empty((n, t + max_shift), dtype=int)
b = np.empty((n, t), dtype=int)
for i in range(n):
    a[i] = np.concatenate([[0] * shifts[i],               # shift
                           (i+1) * np.arange(1, t+1),     # real data
                           [0] * (max_shift - shifts[i])  # padding
                           ])
print("shifted and padded\n", a)
# I'd like to remove this Python loop if possible
for i in range(n):
    b[i] = a[i, shifts[i]:shifts[i] + t]
print("real data\n", b)

You can use two index arrays to get the data out:
a[np.arange(4)[:, None], shifts[:, None] + np.arange(10)]
or:
i, j = np.ogrid[:4, :10]
a[i, shifts[:, None]+j]
This is called advanced indexing in the NumPy documentation.
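The same indexing without the hard-coded 4 and 10, as a small sketch. The function name extract_shifted is made up; a, shifts and t mean the same as in the question, and the padded array is rebuilt here with zeros only so that the example runs on its own.

```python
import numpy as np

def extract_shifted(a, shifts, t):
    """Take t consecutive columns from each row of a, starting at shifts[row]."""
    rows = np.arange(a.shape[0])[:, None]    # shape (n, 1)
    cols = shifts[:, None] + np.arange(t)    # shape (n, t)
    # rows broadcasts against cols, so the result has shape (n, t).
    return a[rows, cols]

shifts = np.array([0, 8, 4, 2])
t = 10
n = len(shifts)
a = np.zeros((n, t + shifts.max()), dtype=int)
for i in range(n):
    a[i, shifts[i]:shifts[i] + t] = (i + 1) * np.arange(1, t + 1)

b = extract_shifted(a, shifts, t)
print(b)
```

Advanced indexing returns a copy, so writing to b afterwards does not change a.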
