Efficient Numpy computation of pairwise squared differences - python

The following code does exactly what I want, which is to compute the pairwise sum of squares of differences between elements of a vector (length three in the example), of which I have a long series (limited to five here). The desired result is shown at the bottom.
But the implementation feels kludgy for two reasons:
1) the need to add a phantom dimension, changing the shape from (5, 3) to (5,1,3) to avoid broadcast problems, and
2) the apparent necessity of an explicit 'for' loop, which I'm sure is why it's taking hours to execute on my much larger data set (a million vectors of length 2904).
Is there a more efficient and/or pythonic way to achieve the same result?
import numpy as np

a = np.array([[ 4,  2, 3], [-1, -5, 4], [ 2,  1, 4], [-5, -1, 4], [ 6, -3, 3]])
a = a.reshape((5, 1, 3))
m = a.shape[0]
n = a.shape[2]
d = np.zeros((n, n))
for i in range(m):
    c = a[i, :] - np.transpose(a[i, :])
    c = c**2
    d += c
print d
[[   0.  118.  120.]
 [ 118.    0.  152.]
 [ 120.  152.    0.]]

If you don't mind the dependency on scipy, you can use functions from the scipy.spatial.distance library:
In [17]: from scipy.spatial.distance import pdist, squareform
In [18]: a = np.array([[ 4, 2, 3], [-1, -5, 4], [ 2, 1, 4], [-5, -1, 4], [6, -3, 3]])
In [19]: d = pdist(a.T, metric='sqeuclidean')
In [20]: d
Out[20]: array([ 118., 120., 152.])
In [21]: squareform(d)
Out[21]:
array([[   0.,  118.,  120.],
       [ 118.,    0.,  152.],
       [ 120.,  152.,    0.]])
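If your array still has the phantom dimension from the question (shape (5, 1, 3)), you can simply drop that axis before handing it to pdist. A minimal sketch of that, just restating the answer above in terms of the reshaped array:
import numpy as np
from scipy.spatial.distance import pdist, squareform

# The reshaped (5, 1, 3) array from the question.
a = np.array([[4, 2, 3], [-1, -5, 4], [2, 1, 4], [-5, -1, 4], [6, -3, 3]]).reshape((5, 1, 3))

# pdist expects one observation per row, so drop the phantom axis and
# transpose: each row is then the length-5 series for one vector element.
d = squareform(pdist(a[:, 0, :].T, metric='sqeuclidean'))
print(d)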

You could eliminate the for-loop by using:
In [48]: ((a - a.swapaxes(1,2))**2).sum(axis=0)
Out[48]:
array([[  0, 118, 120],
       [118,   0, 152],
       [120, 152,   0]])
Note that if a has shape (N, 1, M) then (a - a.swapaxes(1,2)) has shape (N, M, M). Make sure you have enough RAM to accommodate an array of this size. Page swapping can also slow the calculation to a crawl.
If you don't have enough memory, you will have to break up the calculation into chunks:
m, _, n = a.shape
chunksize = 10**4
d = np.zeros((n, n))
for i in range(0, m, chunksize):
    b = a[i:i+chunksize]
    d += ((b - b.swapaxes(1,2))**2).sum(axis=0)
This is a compromise between performing the calculation on the entire array and calculating row-by-row. If there are a million rows and the chunksize is 10**4, then there will be only 100 iterations of the loop instead of a million, so it should be significantly faster than calculating row-by-row. Choose the largest value of chunksize that still allows the calculation to fit in RAM.
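As a rough guide for picking chunksize: each chunk materializes a temporary of shape (chunksize, M, M), so its footprint is roughly chunksize * M * M * 8 bytes for float64 data. A small sketch of that back-of-the-envelope estimate (the 2904 comes from the question; the 4 GiB budget is just an assumed example):
# Rough estimate of the largest chunksize that keeps the temporary
# (chunksize, M, M) float64 array inside a chosen memory budget.
M = 2904                       # vector length from the question
budget_bytes = 4 * 1024**3     # assumed budget of ~4 GiB for the temporary
bytes_per_row = M * M * 8      # one (M, M) float64 slice per row in a chunk
chunksize = max(1, budget_bytes // bytes_per_row)
print(chunksize)               # ~63 for these numbers; 10**4 would need ~630 GiB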

Related

sum of squares (4 values shaping a square) within a 2d numpy array. Python

I am looking to sum each combination of 4 points that forms a 2x2 square within the matrix:
in4x4 = np.array(([1,2,3,4],[2,3,4,1],[3,4,1,2],[4,3,1,2]))
print(in4x4)
array([[1, 2, 3, 4],
       [2, 3, 4, 1],
       [3, 4, 1, 2],
       [4, 3, 1, 2]])
Expected output:
print(out3x3)
array([[ 8, 12, 12],
       [12, 12,  8],
       [14,  9,  6]])
Currently I am using numpy.diff in a several-step process. I feel like there should be a cleaner approach. Example of the calculation I currently do:
diff3x4 = np.diff(in4x4, axis=0)
a3x4 = in4x4[0:3, :] * 2 + diff3x4
diff3x3 = np.diff(a3x4)
a3x3 = a3x4[:, 0:3] * 2 + diff3x3
Is there a clean, possibly vectorized approach? Speed is a concern. I looked at scipy.convolve, but it does not seem to suit my purposes; maybe I am just configuring the kernel wrong.
You can do 2-D convolution with a kernel of ones to achieve the desired result.
Simple code that implements 2D convolution and tests it on the input you gave:
import numpy as np

def conv2d(a, f):
    s = f.shape + tuple(np.subtract(a.shape, f.shape) + 1)
    strd = np.lib.stride_tricks.as_strided
    subM = strd(a, shape=s, strides=a.strides * 2)
    return np.einsum('ij,ijkl->kl', f, subM)

in4x4 = np.array(([1,2,3,4],[2,3,4,1],[3,4,1,2],[4,3,1,2]))
k = np.ones((2,2))
print(conv2d(in4x4, k))
output:
[[ 8. 12. 12.]
 [12. 12.  8.]
 [14.  9.  6.]]
If you can use built-in functions, you can do it as follows:
import numpy as np
from scipy import signal
in4x4 = np.array(([1,2,3,4],[2,3,4,1],[3,4,1,2],[4,3,1,2]))
k = np.ones((2,2))
signal.convolve2d(in4x4, k, 'valid')
Which outputs:
array([[ 8., 12., 12.],
       [12., 12.,  8.],
       [14.,  9.,  6.]])
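For a 2x2 kernel of ones specifically, the same block sums can also be obtained with plain slicing and no convolution machinery at all. This is just an equivalent formulation, not part of the answers above:
import numpy as np

in4x4 = np.array([[1, 2, 3, 4], [2, 3, 4, 1], [3, 4, 1, 2], [4, 3, 1, 2]])

# Add the four shifted views of the array: each output element is the
# sum of one 2x2 block of the input.
out3x3 = (in4x4[:-1, :-1] + in4x4[:-1, 1:] +
          in4x4[1:, :-1] + in4x4[1:, 1:])
print(out3x3)
# [[ 8 12 12]
#  [12 12  8]
#  [14  9  6]]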

Using numpy einsum to compute inner product of column-vectors of a matrix

Suppose I have a numpy matrix like this:
[[   1    2    3]
 [  10  100 1000]]
I would like to compute the inner product of each column with itself, so the result would be:
[1*1 + 10*10, 2*2 + 100*100, 3*3 + 1000*1000] == [101, 10004, 1000009]
I would like to know if this is possible using the einsum function (and to better understand it).
So far, the closest result I could get is:
import numpy as np
arr = np.array([[1, 2, 3], [10, 100, 1000]])
res = np.einsum('ij,ik->jk', arr, arr)
# [[ 101 1002 10003]
# [ 1002 10004 100006]
# [ 10003 100006 1000009]]
The diagonal contains the expected result, but I would like to know if I can avoid the unnecessary off-diagonal calculations.
Use np.einsum, like so -
np.einsum('ij,ij->j',arr,arr)
Sample run -
In [243]: np.einsum('ij,ij->j',arr,arr)
Out[243]: array([ 101, 10004, 1000009])
Or with np.sum -
In [244]: (arr**2).sum(0)
Out[244]: array([ 101, 10004, 1000009])
Or with numexpr module -
In [248]: import numexpr as ne
In [249]: ne.evaluate('sum(arr**2,0)')
Out[249]: array([ 101, 10004, 1000009])
What you're expecting here can be understood intuitively, with one intermediate step from Divakar's einsum answer.
In [19]: arr
Out[19]:
array([[   1,    2,    3],
       [  10,  100, 1000]])
# simply take element-wise product with the array itself
In [20]: np.einsum('ij, ij -> ij', arr, arr)
Out[20]:
array([[      1,       4,       9],
       [    100,   10000, 1000000]])
But this doesn't yet give the result you expected. Looking at the intermediate result above, we just need to sum along the first dimension (i.e. axis 0). So we omit the subscript i after -> in the einsum subscripts, which asks it to sum along that axis, and that yields the expected result:
In [21]: np.einsum('ij, ij -> j', arr, arr)
Out[21]: array([ 101, 10004, 1000009])
P.S. Also, for a general understanding of np.einsum, see the detailed discussion here: understanding-numpy-einsum
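To tie the two einsum forms together, here is a quick verification sketch showing that the element-wise product summed over axis 0 matches the single call that drops the i subscript:
import numpy as np

arr = np.array([[1, 2, 3], [10, 100, 1000]])

# Intermediate step: element-wise product, then an explicit sum over axis 0...
step_by_step = np.einsum('ij,ij->ij', arr, arr).sum(axis=0)
# ...which is the same as simply omitting 'i' from the output subscripts.
one_shot = np.einsum('ij,ij->j', arr, arr)

assert np.array_equal(step_by_step, one_shot)  # both give [101, 10004, 1000009]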

How to efficiently sum and mean 2D NumPy arrays by id?

I have a 2d array a and a 1d array b. I want to compute the sum of the rows in array a grouped by each id in b. For example:
import numpy as np

a = np.array([[1,2,3],[2,3,4],[4,5,6]])
b = np.array([0,1,0])
count = len(b)
ls = list(set(b))
res = np.zeros((len(ls), a.shape[1]))
for i in ls:
    res[i] = np.array([a[x] for x in range(0, count) if b[x] == i]).sum(axis=0)
print res
I got the printed result as:
[[ 5.  7.  9.]
 [ 2.  3.  4.]]
What I want to do is: since the 1st and 3rd elements of b are 0, I compute a[0]+a[2], which gives [5, 7, 9] as one row of the result. Similarly, the 2nd element of b is 1, so I take a[1], which gives [2, 3, 4] as another row of the result.
But it seems my implementation is quite slow for large arrays. Is there any better implementation?
I know there is a bincount function in numpy, but it seems to only support 1d arrays.
Thank you all for helping me!
The numpy_indexed package (disclaimer: I am its author) was made to address problems exactly of this kind in an efficiently vectorized and general manner:
import numpy_indexed as npi
unique_b, mean_a = npi.group_by(b).mean(a)
Note that this solution is general in the sense that it provides a rich set of standard reduction functions (sum, min, mean, median, argmin, and so on), axis keywords if you need to work with different axes, and also the ability to group by more complicated things than just positive integer arrays, such as the elements of multidimensional arrays of arbitrary dtype.
import numpy_indexed as npi
# this caches the complicated O(NlogN) part of the operations
groups = npi.group_by(b)
# all these subsequent operations have the same low vectorized O(N) cost
unique_b, mean_a = groups.mean(a)
unique_b, sum_a = groups.sum(a)
unique_b, min_a = groups.min(a)
Approach #1
You can use np.add.at, which works for ndarrays of generic dimensions, unlike np.bincount, which expects only 1D arrays -
np.add.at(res, b, a)
Sample run -
In [40]: a
Out[40]:
array([[1, 2, 3],
       [2, 3, 4],
       [4, 5, 6]])
In [41]: b
Out[41]: array([0, 1, 0])
In [45]: res = np.zeros((b.max()+1, a.shape[1]), dtype=a.dtype)
In [46]: np.add.at(res, b, a)
In [47]: res
Out[47]:
array([[5, 7, 9],
       [2, 3, 4]])
To compute mean values, we need to use np.bincount to get the counts per label/tag and then divide by those along each row, like so -
In [49]: res/np.bincount(b)[:,None].astype(float)
Out[49]:
array([[ 2.5,  3.5,  4.5],
       [ 2. ,  3. ,  4. ]])
To handle values of b that are not necessarily a sequence starting from 0, we could make it generic and put it in a nice little function that handles summations and averages in a cleaner way, like so -
def groupby_addat(a, b, out="sum"):
unqb, tags, counts = np.unique(b, return_inverse=1, return_counts=1)
res = np.zeros((tags.max()+1, a.shape[1]), dtype=a.dtype)
np.add.at(res, tags, a)
if out=="mean":
return unqb, res/counts[:,None].astype(float)
elif out=="sum":
return unqb, res
else:
print "Invalid output"
return None
Sample run -
In [201]: a
Out[201]:
array([[1, 2, 3],
       [2, 3, 4],
       [4, 5, 6]])
In [202]: b
Out[202]: array([ 5, 10, 5])
In [204]: b_ids, means = groupby_addat(a, b, out="mean")
In [205]: b_ids
Out[205]: array([ 5, 10])
In [206]: means
Out[206]:
array([[ 2.5,  3.5,  4.5],
       [ 2. ,  3. ,  4. ]])
Approach #2
We could also make use of np.add.reduceat, which might be more performant -
def groupby_addreduceat(a, b, out="sum"):
sidx = b.argsort()
sb = b[sidx]
spt_idx =np.concatenate(([0], np.flatnonzero(sb[1:] != sb[:-1])+1, [sb.size]))
sums = np.add.reduceat(a[sidx],spt_idx[:-1])
if out=="mean":
counts = spt_idx[1:] - spt_idx[:-1]
return sb[spt_idx[:-1]], sums/counts[:,None].astype(float)
elif out=="sum":
return sb[spt_idx[:-1]], sums
else:
print "Invalid output"
return None
Sample run -
In [201]: a
Out[201]:
array([[1, 2, 3],
       [2, 3, 4],
       [4, 5, 6]])
In [202]: b
Out[202]: array([ 5, 10, 5])
In [207]: b_ids, means = groupby_addreduceat(a, b, out="mean")
In [208]: b_ids
Out[208]: array([ 5, 10])
In [209]: means
Out[209]:
array([[ 2.5,  3.5,  4.5],
       [ 2. ,  3. ,  4. ]])
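Regarding the remark in the question that np.bincount only supports 1-D input: it can still be used here by calling it once per column through its weights argument. A sketch of that idea, assuming (as in the question's example) that the labels in b are small non-negative integers:
import numpy as np

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
b = np.array([0, 1, 0])

# One bincount call per column of `a`, using that column as the weights.
sums = np.stack([np.bincount(b, weights=a[:, j]) for j in range(a.shape[1])], axis=1)
counts = np.bincount(b)
means = sums / counts[:, None].astype(float)

print(sums)    # [[ 5.  7.  9.]
               #  [ 2.  3.  4.]]
print(means)   # [[ 2.5  3.5  4.5]
               #  [ 2.   3.   4. ]]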

Comparing value with neighbor elements in numpy

Let's say I have a numpy array
        a  b  c
    A = i  j  k
        u  v  w
I want to compare the value of the central element with some of its eight neighboring elements (along the axes or along the diagonals). Is there any faster way than a nested for loop (it's too slow for big matrices)?
To be more specific, what I want to do is compare the value of an element with its neighbors and assign new values.
For example:
if (j == 1):
    if (j > i) & (j > k):
        j = 999
    else:
        j = 0
if (j == 2):
    if (j > c) & (j > u):
        j = 999
    else:
        j = 0
...
something like this.
Your operation contains lots of conditionals, so the most efficient way to do it in the general case (any kind of conditionals, any kind of operations) is using loops. This could be done efficiently using numba or cython. In special cases, you can implement it using higher level functions in numpy/scipy. I'll show a solution for the specific example you gave, and hopefully you can generalize from there.
Start with some fake data:
import numpy as np
import scipy.ndimage

A = np.asarray([
    [1, 1, 1, 2, 0],
    [1, 0, 2, 2, 2],
    [0, 2, 0, 1, 0],
    [1, 2, 2, 1, 0],
    [2, 1, 1, 1, 2]
])
We'll find locations in A where various conditions apply.
1a) The value is 1
1b) The value is greater than its horizontal neighbors
2a) The value is 2
2b) The value is greater than its diagonal neighbors
Find locations in A where the specified values occur:
cond1a = A == 1
cond2a = A == 2
This gives matrices of boolean values, of the same size as A. The value is true where the condition holds, otherwise false.
Find locations in A where each element has the specified relationships to its neighbors:
# condition 1b: value greater than horizontal neighbors
f1 = np.asarray([[1, 0, 1]])
cond1b = A > scipy.ndimage.maximum_filter(
    A, footprint=f1, mode='constant', cval=-np.inf)

# condition 2b: value greater than diagonal neighbors
f2 = np.asarray([
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 0]
])
cond2b = A > scipy.ndimage.maximum_filter(
    A, footprint=f2, mode='constant', cval=-np.inf)
As before, this gives matrices of boolean values indicating where the conditions are true. This code uses scipy.ndimage.maximum_filter(). This function iteratively shifts a 'footprint' to be centered over each element of A. The returned value for that position is the maximum of all elements for which the footprint is 1. The mode argument specifies how to treat implicit values outside boundaries of the matrix, where the footprint falls off the edge. Here, we treat them as negative infinity, which is the same as ignoring them (since we're using the max operation).
Set values of the result according to the conditions. The value is 999 if conditions 1a and 1b are both true, or if conditions 2a and 2b are both true. Else, the value is 0.
result = np.zeros(A.shape)
result[(cond1a & cond1b) | (cond2a & cond2b)] = 999
The result is:
[[  0,   0,   0,   0,   0],
 [999,   0,   0, 999, 999],
 [  0,   0,   0, 999,   0],
 [  0,   0, 999,   0,   0],
 [  0,   0,   0,   0, 999]]
You can generalize this approach to other patterns of neighbors by changing the filter footprint. You can generalize to other operations (minimum, median, percentiles, etc.) using other kinds of filters (see scipy.ndimage). For operations that can be expressed as weighted sums, use 2d cross correlation.
This approach should be much faster than looping in python. But, it does perform unnecessary computations (for example, it's only necessary to compute the max when the value is 1 or 2, but we're doing it for all elements). Looping manually would let you avoid these computations. Looping in python would probably be much slower than the code here. But, implementing it in numba or cython would probably be faster because these tools generate compiled code.
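To illustrate the "weighted sums via 2d cross correlation" remark above: the sum of the eight neighbors of every element can be computed with scipy.ndimage.correlate and a 3x3 kernel of ones with a zero center. This is only a sketch of that remark, not part of the original answer; it reproduces the sum_of_eight_around_me table shown in the next answer (as integers rather than floats):
import numpy as np
import scipy.ndimage

A = np.arange(25).reshape(5, 5)

# 3x3 kernel of ones with a zero center: each output element becomes the
# sum of its eight neighbors; mode='constant' treats out-of-bounds as 0.
kernel = np.array([[1, 1, 1],
                   [1, 0, 1],
                   [1, 1, 1]])
neighbor_sums = scipy.ndimage.correlate(A, kernel, mode='constant', cval=0)
print(neighbor_sums)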
I used numpy's concatenate to pad with zeroes, and dstack and roll to align things correctly. Then I apply custom_roll twice along different dimensions and subtract the original.
import numpy as np

def custom_roll(a, axis=0):
    n = 3
    a = a.T if axis==1 else a
    pad = np.zeros((n-1, a.shape[1]))
    a = np.concatenate([a, pad], axis=0)
    ad = np.dstack([np.roll(a, i, axis=0) for i in range(n)])
    a = ad.sum(2)[1:-1, :]
    a = a.T if axis==1 else a
    return a
Consider the following ndarray:
A = np.arange(25).reshape(5, 5)
A
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
sum_of_eight_around_me = custom_roll(custom_roll(A), axis=1) - A
sum_of_eight_around_me
array([[  12.,   20.,   25.,   30.,   20.],
       [  28.,   48.,   56.,   64.,   42.],
       [  53.,   88.,   96.,  104.,   67.],
       [  78.,  128.,  136.,  144.,   92.],
       [  52.,   90.,   95.,  100.,   60.]])

Python how to find unique entries and get the minimum values from a matching array

I have a numpy array, indices:
array([[ 0,  0,  0],
       [ 0,  0,  0],
       [ 2,  0,  2],
       [ 0,  0,  0],
       [ 2,  0,  2],
       [95, 71, 95]])
I have another array of the same length called distances:
array([  0.98713981,   1.04705992,   1.42340327,  74.0139111 ,
        74.4285216 ,  74.84623217])
All of the rows in indices have a match in the distances array. The problem is, there are duplicates in the indices array, and they have different values in the corresponding distances array. I would like to get the minimum distance for all triplets of indices, and discard the others. Therefore, with the inputs above, I want the output:
indicesOUT =
array([[ 0,  0,  0],
       [ 2,  0,  2],
       [95, 71, 95]])
distancesOUT =
array([  0.98713981,   1.42340327,  74.84623217])
My current strategy is as follows:
import numpy as np

indicesOUT = []
distancesOUT = []
for i in range(6):
    for j in range(6):
        for k in range(6):
            if len([s for s in indicesOUT if [i,j,k] == s]) == 0:
                current = np.array([i, j, k])
                ind = np.where((indices == current).all(-1) == True)[0]
                currentDistances = distances[ind]
                dist = np.amin(distances)
                indicesOUT.append([i, j, k])
                distancesOUT.append(dist)
The problem is, the actual arrays have about 4 million elements each, so this approach is way too slow. What is the most efficient way of doing this?
This is essentially a grouping operation, and NumPy is not well-optimized for it. Fortunately, the Pandas package has some very fast tools that can be adapted to this exact problem.
With your data above, we can do this:
import numpy as np
import pandas as pd

def drop_duplicates(indices, distances):
    data = pd.Series(distances)
    grouped = data.groupby(list(indices.T)).min().reset_index()
    return grouped.values[:, :3], grouped.values[:, 3]
And the output for your data is
array([[  0.,   0.,   0.],
       [  2.,   0.,   2.],
       [ 95.,  71.,  95.]]),
array([  0.98713981,   1.42340327,  74.84623217])
My benchmark shows that for 4,000,000 elements, this should run in about a second:
indices = np.random.randint(0, 100, size=(4000000, 3))
distances = np.random.random(4000000)
%timeit drop_duplicates(indices, distances)
# 1 loops, best of 3: 1.15 s per loop
As written above, the input order of the indices will not necessarily be preserved; keeping the original order would require a bit more thought.
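A pure-NumPy alternative is also possible, assuming a NumPy version that supports np.unique(..., axis=0): group the duplicate rows with np.unique and scatter the per-group minima with np.minimum.at. This is a sketch, not part of the answer above:
import numpy as np

indices = np.array([[0, 0, 0], [0, 0, 0], [2, 0, 2],
                    [0, 0, 0], [2, 0, 2], [95, 71, 95]])
distances = np.array([0.98713981, 1.04705992, 1.42340327,
                      74.0139111, 74.4285216, 74.84623217])

# Unique rows, plus for every original row the index of its unique row.
indicesOUT, inverse = np.unique(indices, axis=0, return_inverse=True)
inverse = inverse.ravel()  # guard against versions that return a 2-D inverse

# Scatter the per-group minimum distance into an array of +inf.
distancesOUT = np.full(len(indicesOUT), np.inf)
np.minimum.at(distancesOUT, inverse, distances)

print(indicesOUT)    # [[ 0  0  0] [ 2  0  2] [95 71 95]]
print(distancesOUT)  # [ 0.98713981  1.42340327 74.84623217]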
