Vectorise Python code - python

I have coded a kriging algorithm, but I find it quite slow. In particular, do you have any idea how I could vectorise the code in the cons function below?
import time
import numpy as np
B = np.zeros((200, 6))
P = np.zeros((len(B), len(B)))
def cons():
    time1 = time.time()
    for i in range(len(B)):
        for j in range(len(B)):
            P[i, j] = corr(B[i], B[j])
    time2 = time.time()
    return time2 - time1
def corr(x, x_i):
    return np.exp(-np.sum(np.abs(np.array(x) - np.array(x_i))))
n_runs = 30
time_av = 0.
for i in range(n_runs):
    time_av += cons()
# Divide by the number of runs actually performed
print("Average=", time_av / n_runs)
Edit: Bonus questions
1. What happens to the broadcasting solution if I want corr(B[i], C[j]), with C having the same dimensions as B?
2. What happens to the scipy solution if my p-norm orders are an array:
p = np.array([1., 2., 1., 2., 1., 2.])
def corr(x, x_i):
    return np.exp(-np.sum(np.abs(np.array(x) - np.array(x_i))**p))
For 2., I tried P = np.exp(-cdist(B, C, 'minkowski', p)), but scipy expects a scalar.

Your problem seems very simple to vectorize. For each pair of rows of B you want to compute
P[i,j] = np.exp(-np.sum(np.abs(B[i,:] - B[j,:])))
You can make use of array broadcasting and introduce a third dimension, summing along the last one:
P2 = np.exp(-np.sum(np.abs(B[:,None,:] - B),axis=-1))
The idea is to reshape the first occurrence of B to shape (N,1,M), while the second B is left with shape (N,M). With array broadcasting, the latter is treated as (1,N,M), so
B[:,None,:] - B
is of shape (N,N,M). Summing along the last index will then result in the (N,N)-shape correlation array you're looking for.
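As a quick sanity check, the broadcasting expression can be compared against the original double loop. The sketch below uses random illustrative data (not the all-zero B from the question) and a second array C, which is an assumption standing in for the first bonus question; the only requirement is that B and C have the same number of columns.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative data, not from the question
B = rng.random((50, 6))
C = rng.random((40, 6))          # hypothetical second array with the same number of columns

def corr(x, x_i):
    return np.exp(-np.sum(np.abs(x - x_i)))

# Reference: the double loop from the question
P_loop = np.array([[corr(b_i, b_j) for b_j in B] for b_i in B])

# Broadcasting: (N,1,M) - (N,M) -> (N,N,M), then sum over the last axis
P_bcast = np.exp(-np.sum(np.abs(B[:, None, :] - B), axis=-1))

# Same pattern with a second array: (N,1,M) - (K,M) -> (N,K,M)
P_BC = np.exp(-np.sum(np.abs(B[:, None, :] - C), axis=-1))

print(np.allclose(P_loop, P_bcast), P_BC.shape)
```

Note that the intermediate array has N*K*M elements, so for very large inputs memory can become the limiting factor.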
Note that if you are using scipy, you can do this with scipy.spatial.distance.cdist. Because the matrix is symmetric, a combination of scipy.spatial.distance.pdist and scipy.spatial.distance.squareform gives the same result without unnecessarily computing the lower triangular half. Using @Divakar's suggestion from the comments, the simplest solution this way is:
from scipy.spatial.distance import cdist
P3 = 1/np.exp(cdist(B, B, 'minkowski', p=1))
cdist will compute the Minkowski distance in 1-norm, which is exactly the sum of the absolute values of coordinate differences.
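A minimal sketch of the two scipy routes, assuming random illustrative data. It uses the 'cityblock' metric, which is the Minkowski distance with p=1. For the second bonus question (one exponent per coordinate), the sketch falls back to broadcasting instead of cdist; that is a swapped-in technique, not something scipy's 'minkowski' metric does.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

rng = np.random.default_rng(1)   # illustrative data, not from the question
B = rng.random((50, 6))
C = rng.random((40, 6))          # hypothetical second array

# Full N x N computation
P_cdist = np.exp(-cdist(B, B, 'cityblock'))

# Condensed upper triangle only, then expanded to a square matrix
# (squareform puts zeros on the diagonal, and exp(-0) = 1 as required)
P_pdist = np.exp(-squareform(pdist(B, 'cityblock')))

# Per-coordinate exponents via broadcasting: p has shape (M,) and
# broadcasts against the last axis of the (N,K,M) difference array
p = np.array([1., 2., 1., 2., 1., 2.])
P_powers = np.exp(-np.sum(np.abs(B[:, None, :] - C) ** p, axis=-1))

print(np.allclose(P_cdist, P_pdist), P_powers.shape)
```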

Related

Is there a numpy/scipy dot product for sparse matrix, calculating only the diagonal entries of the result?

Imagine having two sparse matrices:
> A, A.shape = (n,m)
> B, B.shape = (m,n)
I would like to compute the dot product A*B, but then only keep the diagonal. The matrices being big, I actually don't want to compute other values than the ones in the diagonal.
This is a variant of the question Is there a numpy/scipy dot product, calculating only the diagonal entries of the result?
Where the most relevant answer seems to be to use np.einsum:
np.einsum('ij,ji->i', A, B)
However this does not work:
ValueError: einstein sum subscripts string contains too many subscripts for operand 0
One workaround is to use todense(), but that greatly increases memory usage: np.einsum('ij,ji->i', A.todense(), B.todense())
The other solution, which I currently use, is to iterate over all the rows of A and compute each product in the loop:
for i in range(len_A):
    result = np.float32(A[i].dot(B[:, i])[0, 0])
    ...
None of these solutions seems perfect. Is there an equivalent to np.einsum that works with sparse matrices?
[sum(A[i]*B.T[i]) for i in range(min(A.shape[0], B.shape[1]))]
otherwise this is faster:
l = min(A.shape[0], B.shape[1])
(A[np.arange(l)]*B.T[np.arange(l)]).sum(axis=1)
In general you shouldn't try to use numpy functions on the scipy.sparse arrays. In your case I'd first make sure both arrays actually have a compatible shape, that is
A, A.shape = (r,m)
B, B.shape = (m,r)
where r = min(n, p), with n the number of rows of A and p the number of columns of B. Then we can compute the diagonal of the matrix product using
d = (A.multiply(B.T)).sum(axis=1)
Here we compute the entrywise row-column products and sum them up manually. This avoids all the unnecessary computations you'd get using dot/@/*. (Note that, unlike for numpy arrays, both * and @ perform matrix multiplication on scipy.sparse matrices.)
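A small sketch to check this against the full product, assuming randomly generated sparse matrices of illustrative size. The np.asarray(...).ravel() step is there because, for scipy.sparse matrix classes, sum(axis=1) returns a 2D np.matrix rather than a flat array.

```python
import numpy as np
from scipy import sparse

n, m = 200, 300   # illustrative sizes, not from the question
A = sparse.random(n, m, density=0.05, format='csr', random_state=0)
B = sparse.random(m, n, density=0.05, format='csr', random_state=1)

# Entrywise product of A with B transposed, summed along each row:
# d[i] = sum_k A[i, k] * B[k, i]
d = np.asarray(A.multiply(B.T).sum(axis=1)).ravel()

# Reference: full sparse product, then keep only the diagonal
d_ref = A.dot(B).diagonal()

print(np.allclose(d, d_ref))
```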

How to efficiently operate on sub-arrays like calculating the determinants, inverse,

I have to do multiple operations on sub-arrays, such as matrix inversions or computing determinants. Since for-loops are not very fast in Python, I wonder what the best way to do this is.
import numpy as np
n = 8
a = np.random.rand(3,3,n)
b = np.empty(n)
c = np.zeros_like(a)
for i in range(n):
    b[i] = np.linalg.det(a[:,:,i])
    c[:,:,i] = np.linalg.inv(a[:,:,i])
Those numpy.linalg functions accept n-dimensional arrays, as long as the last two axes form the 2D slices on which the function should operate. Hence, to solve our case, permute the axes to bring the axis of iteration to the front, perform the required operation and, if needed, move that axis back to its original place.
Hence, we could get those outputs, like so -
b = np.linalg.det(np.moveaxis(a,2,0))
c = np.moveaxis(np.linalg.inv(np.moveaxis(a,2,0)),0,2)
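To confirm that the stacked calls match the loop, here is a minimal comparison. The 3*I shift added to every slice is an assumption of this sketch, not part of the question: it makes each 3x3 matrix strictly diagonally dominant and therefore invertible.

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)
# Entries in [0, 1) plus 3 on the diagonal: every slice is invertible
a = rng.random((3, 3, n)) + 3.0 * np.eye(3)[:, :, None]

# Loop version from the question
b_loop = np.empty(n)
c_loop = np.empty_like(a)
for i in range(n):
    b_loop[i] = np.linalg.det(a[:, :, i])
    c_loop[:, :, i] = np.linalg.inv(a[:, :, i])

# Stacked version: move the iteration axis to the front -> shape (n, 3, 3)
a_stack = np.moveaxis(a, 2, 0)
b_vec = np.linalg.det(a_stack)
c_vec = np.moveaxis(np.linalg.inv(a_stack), 0, 2)   # back to (3, 3, n)

print(np.allclose(b_loop, b_vec), np.allclose(c_loop, c_vec))
```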

numpy - Computing "element-wise" difference between two arrays along first axis

Suppose I have two arrays A and B with dimensions (n1,m1,m2) and (n2,m1,m2), respectively. I want to compute the matrix C with dimensions (n1,n2) such that C[i,j] = sum((A[i,:,:] - B[j,:,:])^2). Here is what I have so far:
import numpy as np
A = np.array(range(1,13)).reshape(3,2,2)
B = np.array(range(1,9)).reshape(2,2,2)
C = np.zeros(shape=(A.shape[0], B.shape[0]) )
for i in range(A.shape[0]):
    for j in range(B.shape[0]):
        C[i,j] = np.sum(np.square(A[i,:,:] - B[j,:,:]))
C
What is the most efficient way to do this? In R I would use a vectorized approach, such as outer. Is there a similar method for Python?
Thanks.
You can use scipy's cdist, which is pretty efficient for such calculations after reshaping the input arrays to 2D, like so -
from scipy.spatial.distance import cdist
C = cdist(A.reshape(A.shape[0],-1),B.reshape(B.shape[0],-1),'sqeuclidean')
The above approach should be memory efficient, and thus the better choice when working with large data sizes. For small input arrays, one can also use np.einsum and leverage NumPy broadcasting, like so -
diffs = A[:,None]-B
C = np.einsum('ijkl,ijkl->ij',diffs,diffs)
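The einsum result can be checked against the loop on the small arrays from the question. The sketch also shows a NumPy-only alternative based on the expansion |a - b|^2 = |a|^2 + |b|^2 - 2 a.b on the flattened inputs, which avoids the (n1, n2, m1, m2) intermediate; the names A2, B2 and C_gram are illustrative.

```python
import numpy as np

A = np.arange(1, 13).reshape(3, 2, 2)
B = np.arange(1, 9).reshape(2, 2, 2)

# Loop version from the question
C_loop = np.zeros((A.shape[0], B.shape[0]))
for i in range(A.shape[0]):
    for j in range(B.shape[0]):
        C_loop[i, j] = np.sum(np.square(A[i] - B[j]))

# Broadcasting + einsum: diffs has shape (n1, n2, m1, m2)
diffs = A[:, None] - B
C_einsum = np.einsum('ijkl,ijkl->ij', diffs, diffs)

# Expansion on flattened inputs: no 4D intermediate is created
A2 = A.reshape(A.shape[0], -1)
B2 = B.reshape(B.shape[0], -1)
C_gram = (A2 ** 2).sum(axis=1)[:, None] + (B2 ** 2).sum(axis=1)[None, :] - 2 * A2 @ B2.T

print(C_einsum)
```

With floating-point inputs, the expansion can lose precision when the two points are nearly identical, so the direct difference is the safer choice when accuracy near zero matters.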

find 2d elements in a 3d array which are similar to 2d elements in another 3d array

I have two 3D arrays and want to identify 2D elements in one array, which have one or more similar counterparts in the other array.
This works in Python 3:
import numpy as np
import random
np.random.seed(123)
A = np.round(np.random.rand(25000,2,2),2)
B = np.round(np.random.rand(25000,2,2),2)
a_index = np.zeros(A.shape[0])
for a in range(A.shape[0]):
    for b in range(B.shape[0]):
        if np.allclose(A[a,:,:].reshape(-1, A.shape[1]), B[b,:,:].reshape(-1, B.shape[1]),
                       rtol=1e-04, atol=1e-06):
            a_index[a] = 1
            break
np.nonzero(a_index)[0]
But of course this approach is awfully slow. Please tell me that there is a more efficient way (and what it is). Thanks.
You are trying to do an all-nearest-neighbor type query. There are special O(n log n) algorithms for this, but I'm not aware of a Python implementation. However, you can use a regular nearest-neighbor search, which is also O(n log n), just a bit slower; for example, scipy.spatial.KDTree or cKDTree.
import numpy as np
import random
np.random.seed(123)
A = np.round(np.random.rand(25000,2,2),2)
B = np.round(np.random.rand(25000,2,2),2)
import scipy.spatial
tree = scipy.spatial.cKDTree(A.reshape(25000, 4))
results = tree.query_ball_point(B.reshape(25000, 4), r=1e-04, p=1)
print([r for r in results if r != []])
# [[14252], [1972], [7108], [13369], [23171]]
query_ball_point() is not an exact equivalent to allclose() but it is close enough, especially if you don't care about the rtol parameter to allclose(). You also get a choice of metric (p=1 for city block, or p=2 for Euclidean).
P.S. Consider using query_ball_tree() for very large data sets. Both A and B have to be indexed in that case.
P.S. I'm not sure what effect the 2d-ness of the elements should have; the sample code I gave treats them as 1d and that is identical at least when using city block metric.
From the docs of np.allclose, we have :
If the following equation is element-wise True, then allclose returns
True.
absolute(a - b) <= (atol + rtol * absolute(b))
Using that criterion, we can write a vectorized implementation using broadcasting, customized for the stated problem, like so -
# Setup parameters
rtol,atol = 1e-04, 1e-06
# Use np.allclose criteria to detect true/false across all pairwise elements
mask = np.abs(A[:, None] - B) <= (atol + rtol * np.abs(B))
# Use the problem context to get final output
out = np.nonzero(mask.all(axis=(2,3)).any(1))[0]
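One caveat with the broadcasting version is memory: the intermediate has shape (n1, n2, m1, m2), which for two 25000 x 2 x 2 inputs is 2.5 billion elements. A possible workaround, sketched below under that assumption, is to process A in chunks. The function name rows_with_close_match and the chunk parameter are hypothetical, introduced only for illustration.

```python
import numpy as np

def rows_with_close_match(A, B, rtol=1e-4, atol=1e-6, chunk=256):
    """Indices i such that A[i] meets the np.allclose criterion for some B[j]."""
    hits = np.zeros(A.shape[0], dtype=bool)
    tol = atol + rtol * np.abs(B)                  # (n2, m1, m2), computed once
    for start in range(0, A.shape[0], chunk):
        block = A[start:start + chunk, None]       # (c, 1, m1, m2)
        mask = np.abs(block - B) <= tol            # (c, n2, m1, m2)
        # All elements of a 2D slice must be close; any B[j] may match
        hits[start:start + chunk] = mask.all(axis=(2, 3)).any(axis=1)
    return np.nonzero(hits)[0]

# Small illustrative data; B[3] is copied from A[5] to guarantee one match
rng = np.random.default_rng(0)
A = np.round(rng.random((60, 2, 2)), 1)
B = np.round(rng.random((70, 2, 2)), 1)
B[3] = A[5]
found = rows_with_close_match(A, B, chunk=16)
print(found)
```

Peak memory then scales with chunk * n2 rather than n1 * n2, at the cost of a short Python loop over the chunks.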

Compute numpy array pairwise Euclidean distance except with self

edit: this question is not specifically about calculating distances, rather the most efficient way to loop through a numpy array, specifying that for index i all comparisons should be made with the rest of the array, as long as the second index is not i.
I have a numpy array with columns (X, Y, ID) and want to compare each element to each other element, but not itself. So, for each X, Y coordinate, I want to calculate the distance to each other X, Y coordinate, but not itself (where distance = 0).
Here is what I have - there must be a more "numpy" way to write this.
import math, arcpy
# Point feature class
fc = "MY_FEATURE_CLASS"
# Load points to numpy array: (X, Y, ID)
npArray = arcpy.da.FeatureClassToNumPyArray(fc,["SHAPE@X","SHAPE@Y","OID@"])
for row in npArray:
    for row2 in npArray:
        if row[2] != row2[2]:
            # Pythagoras's theorem
            distance = math.sqrt(math.pow((row[0]-row2[0]),2)+math.pow((row[1]-row2[1]),2))
Obviously, I'm a numpy newbie. I will not be surprised to find this a duplicate, but I don't have the numpy vocabulary to search out the answer. Any help appreciated!
Using SciPy's pdist, you could write something like
from scipy.spatial.distance import pdist, squareform
distances = squareform(pdist(npArray, lambda a,b: np.sqrt((a[0]-b[0])**2 + (a[1]-b[1])**2)))
pdist will compute the pair-wise distances using the custom metric that ignores the 3rd coordinate (which is your ID in this case). squareform turns this into a more readable matrix such that distances[0,1] gives the distance between the 0th and 1st rows.
Each row of X is a 3-dimensional data instance or point.
The output pairwisedist[i, j] is the squared Euclidean distance between X[i, :] and X[j, :]; take np.sqrt of it to get the distance itself.
X = np.array([[6,1,7],[10,9,4],[13,9,3],[10,8,15],[14,4,1]])
a = np.sum(X*X,1)
b = np.repeat(a[:,np.newaxis], X.shape[0], axis=1)
pairwisedist = b + b.T - 2*X.dot(X.T)
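The expression above relies on the identity |x - y|^2 = |x|^2 + |y|^2 - 2 x.y. A short sketch comparing it with the direct computation on the same X; the clamp with np.maximum is an added precaution of this sketch, since rounding can leave tiny negative values before the square root.

```python
import numpy as np

X = np.array([[6, 1, 7], [10, 9, 4], [13, 9, 3], [10, 8, 15], [14, 4, 1]], dtype=float)

# Squared norms of each row, combined by broadcasting instead of np.repeat
sq_norms = np.sum(X * X, axis=1)
sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T

# Clamp small negative rounding errors, then take the square root
dist = np.sqrt(np.maximum(sq_dist, 0.0))

# Direct computation for comparison: (n,1,d) - (n,d) -> (n,n,d)
direct = np.sqrt(((X[:, None, :] - X) ** 2).sum(axis=-1))

print(np.allclose(dist, direct))
```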
I wanted to point out that hand-written square roots of sums of squares are prone to overflow and underflow. The built-in math.hypot and np.hypot are much safer, with no compromise on performance.
import math
from scipy.spatial.distance import pdist, squareform
distances = squareform(pdist(npArray, lambda a,b: math.hypot(a[0]-b[0], a[1]-b[1])))
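For the "every other point, but not itself" part of the question, a NumPy-only sketch is to build the full distance matrix with np.hypot and then mask out the diagonal. The xy array below is hypothetical stand-in data; with the structured array from arcpy, the X and Y fields would first have to be extracted into plain float arrays.

```python
import numpy as np

# Hypothetical stand-in for the X, Y columns of the point array
xy = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])
n = len(xy)

# Pairwise coordinate differences via broadcasting: (n,1) - (n,) -> (n,n)
dx = xy[:, 0][:, None] - xy[:, 0]
dy = xy[:, 1][:, None] - xy[:, 1]
dist = np.hypot(dx, dy)            # zeros on the diagonal (self-distances)

# Boolean mask that is False on the diagonal, True elsewhere
off_diag = ~np.eye(n, dtype=bool)

# Row i holds the distances from point i to every other point, in index order
others = dist[off_diag].reshape(n, n - 1)

print(others[0])
```

Masking after the fact keeps the computation fully vectorised; skipping the diagonal inside a Python loop would bring back the slow pattern the question starts from.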