I have matrices that are 2 x 4 and 3 x 4. I want to find the euclidean distance across rows, and get a 2 x 3 matrix at the end. Here is the code with one for loop that computes the euclidean distance for every row vector in a against all b row vectors. How do I do the same without using for loops?
import numpy as np
a = np.array([[1,1,1,1],[2,2,2,2]])
b = np.array([[1,2,3,4],[1,1,1,1],[1,2,1,9]])
dists = np.zeros((2, 3))
for i in range(2):
    dists[i] = np.sqrt(np.sum(np.square(a[i] - b), axis=1))
Here are the original input variables:
A = np.array([[1,1,1,1],[2,2,2,2]])
B = np.array([[1,2,3,4],[1,1,1,1],[1,2,1,9]])
A
# array([[1, 1, 1, 1],
# [2, 2, 2, 2]])
B
# array([[1, 2, 3, 4],
# [1, 1, 1, 1],
# [1, 2, 1, 9]])
A is a 2x4 array.
B is a 3x4 array.
We want to compute the Euclidean distance matrix in a single, fully vectorized operation, where dist[i,j] contains the distance between the ith instance in A and the jth instance in B. So dist is 2x3 in this example.
The distance
$$\mathrm{dist}_{ij} = \sqrt{\sum_{k} (A_{ik} - B_{jk})^2}$$
could ostensibly be written with numpy as
dist = np.sqrt(np.sum(np.square(A-B))) # DOES NOT WORK
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# ValueError: operands could not be broadcast together with shapes (2,4) (3,4)
However, as shown above, the problem is that the element-wise subtraction operation A-B involves incompatible array sizes, specifically the 2 and 3 in the first dimension.
A has dimensions 2 x 4
B has dimensions 3 x 4
In order to do element-wise subtraction, we have to pad either A or B to satisfy numpy's broadcast rules. I'll choose to pad A with an extra dimension so that it becomes 2 x 1 x 4, which allows the arrays' dimensions to line up for broadcasting. For more on numpy broadcasting, see the tutorial in the scipy manual and the final example in this tutorial.
You can perform the padding with either the np.newaxis value or the np.reshape command. I show both below:
# First approach is to add the extra dimension to A with np.newaxis
A[:,np.newaxis,:] has dimensions 2 x 1 x 4
B has dimensions 3 x 4
# Second approach is to reshape A with np.reshape
np.reshape(A, (2,1,4)) has dimensions 2 x 1 x 4
B has dimensions 3 x 4
As you can see, using either approach will allow the dimensions to line up. I'll use the first approach with np.newaxis. So now, this will work to create A-B, which is a 2x3x4 array:
diff = A[:,np.newaxis,:] - B
# Alternative approach:
# diff = np.reshape(A, (2,1,4)) - B
diff.shape
# (2, 3, 4)
Now we can put that difference expression into the dist equation statement to get the final result:
dist = np.sqrt(np.sum(np.square(A[:,np.newaxis,:] - B), axis=2))
dist
# array([[ 3.74165739, 0. , 8.06225775],
# [ 2.44948974, 2. , 7.14142843]])
Note that the sum is over axis=2, which means take the sum over the 2x3x4 array's third axis (where the axis id starts with 0).
If your arrays are small, then the above command will work just fine. However, if you have large arrays, then you may run into memory issues. Note that in the above example, numpy internally created a 2x3x4 array to perform the broadcasting. If we generalize A to have dimensions a x z and B to have dimensions b x z, then numpy will internally create an a x b x z array for broadcasting.
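To get a feel for when the broadcast intermediate becomes a problem, its size can be estimated before running anything. The helper below is a hypothetical sketch (the name broadcast_bytes and the "large" example sizes are assumptions, not part of the answer) and assumes 8-byte float64 elements:

```python
import numpy as np

def broadcast_bytes(a, b, z, dtype=np.float64):
    """Bytes needed for the a x b x z intermediate created by A[:, None, :] - B."""
    return a * b * z * np.dtype(dtype).itemsize

# The toy example from above: 2 x 3 x 4 values of 8 bytes each.
small = broadcast_bytes(2, 3, 4)

# Assumed "large" case: 500 x 5000 pairs of 1024-dimensional rows.
large = broadcast_bytes(500, 5000, 1024)
print(small, "bytes;", large / 1e9, "GB")
```

Anything in the tens of gigabytes, as in the assumed large case, is likely to exhaust memory on a typical machine.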
We can avoid creating this intermediate array with some algebra. Because the Euclidean distance is built from a sum of squared differences, we can expand the square:
$$\sum_{k} (A_{ik} - B_{jk})^2 = \sum_{k} A_{ik}^2 - 2 \sum_{k} A_{ik} B_{jk} + \sum_{k} B_{jk}^2$$
Note that the middle term involves the sum over element-wise multiplication. This sum over multiplications is better known as a dot product. Because A and B are each a matrix, this operation is actually a matrix multiplication. We can thus rewrite the above as:
$$\mathrm{dist}_{ij}^2 = \sum_{k} A_{ik}^2 - 2\,(A B^{\mathsf{T}})_{ij} + \sum_{k} B_{jk}^2$$
We can then write the following numpy code:
threeSums = np.sum(np.square(A)[:,np.newaxis,:], axis=2) - 2 * A.dot(B.T) + np.sum(np.square(B), axis=1)
dist = np.sqrt(threeSums)
dist
# array([[ 3.74165739, 0. , 8.06225775],
# [ 2.44948974, 2. , 7.14142843]])
Note that the answer above is the same as that of the previous implementation. Again, the advantage here is that we do not need to create the intermediate 2x3x4 array for broadcasting.
For completeness, let's double-check that the dimensions of each summand in threeSums allowed broadcasting.
np.sum(np.square(A)[:,np.newaxis,:], axis=2) has dimensions 2 x 1
2 * A.dot(B.T) has dimensions 2 x 3
np.sum(np.square(B), axis=1) has dimensions 3, which broadcasts as 1 x 3
So, as expected, the final dist array has dimensions 2x3.
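One practical caveat with the expanded form: floating-point round-off can leave tiny negative values where the true squared distance is zero, and np.sqrt then returns nan. A minimal sketch of a reusable function that guards against this, assuming the inputs can be converted to float (the name pairwise_euclidean is illustrative, not from any library):

```python
import numpy as np

def pairwise_euclidean(A, B):
    """Return the (a, b) matrix of Euclidean distances between rows of A and B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sq_a = np.sum(A * A, axis=1)[:, np.newaxis]   # shape (a, 1)
    sq_b = np.sum(B * B, axis=1)[np.newaxis, :]   # shape (1, b)
    sq_dist = sq_a - 2.0 * (A @ B.T) + sq_b       # shape (a, b)
    # Clip round-off negatives to zero before taking the square root.
    return np.sqrt(np.maximum(sq_dist, 0.0))

A = np.array([[1, 1, 1, 1], [2, 2, 2, 2]])
B = np.array([[1, 2, 3, 4], [1, 1, 1, 1], [1, 2, 1, 9]])
dist = pairwise_euclidean(A, B)
print(dist)
```

The largest temporary here is a x b rather than a x b x z, which is the point of the rewrite.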
This use of the dot product in lieu of sum of element-wise multiplication is also discussed in this tutorial.
I had the same problem recently while working on a deep learning course (Stanford CS231n, Assignment 1), but when I used
np.sqrt((np.square(a[:,np.newaxis]-b).sum(axis=2)))
there was an error:
MemoryError
That means I ran out of memory. (In fact, the expression produced an intermediate array of size 500*5000*1024, which is huge!)
To prevent that error, we can expand the square, $(a - b)^2 = a^2 + b^2 - 2ab$, and compute the three terms separately:
code:
import numpy as np
aSumSquare = np.sum(np.square(a), axis=1)
bSumSquare = np.sum(np.square(b), axis=1)
mul = np.dot(a, b.T)
dists = np.sqrt(aSumSquare[:, np.newaxis] + bSumSquare - 2*mul)
Simply use np.newaxis at the right place:
np.sqrt((np.square(a[:,np.newaxis]-b).sum(axis=2)))
This functionality is already included in scipy's spatial module, and I recommend using it because it is vectorized and highly optimized under the hood. But, as the other answers show, there are ways you can do this yourself.
import numpy as np
a = np.array([[1,1,1,1],[2,2,2,2]])
b = np.array([[1,2,3,4],[1,1,1,1],[1,2,1,9]])
np.sqrt((np.square(a[:,np.newaxis]-b).sum(axis=2)))
# array([[ 3.74165739, 0. , 8.06225775],
# [ 2.44948974, 2. , 7.14142843]])
from scipy.spatial.distance import cdist
cdist(a,b)
# array([[ 3.74165739, 0. , 8.06225775],
# [ 2.44948974, 2. , 7.14142843]])
Using numpy.linalg.norm also works well with broadcasting. Specifying an integer value for axis will use a vector norm, which defaults to Euclidean norm.
import numpy as np
a = np.array([[1,1,1,1],[2,2,2,2]])
b = np.array([[1,2,3,4],[1,1,1,1],[1,2,1,9]])
np.linalg.norm(a[:, np.newaxis] - b, axis = 2)
# array([[ 3.74165739, 0. , 8.06225775],
# [ 2.44948974, 2. , 7.14142843]])
I'm working on an optimization problem, but to avoid getting into the details, I'm going to provide a simple example of a bug that's been giving me headaches for a few days.
Say I have a 2D numpy array with observed x-y coordinates:
import numpy as np
from scipy.spatial import distance
x = np.array([[1,2], [2,3], [4,5], [5,6]])
I also have a list of x-y coordinates to compare to these points (y):
y = np.array([[11,13], [12, 14]])
I have a function that takes the sum of manhattan differences between a value of x and all of the values in y:
def find_sum(ref_row, comp_rows):
    modeled_counts = []
    y = ref_row * len(comp_rows)
    res = list(map(distance.cityblock, ref_row, comp_rows))
    modeled_counts.append(sum(res))
    return sum(modeled_counts)
Essentially, what I would like to do is find the sum of manhattan distances for every item in y with each item in x (so basically for each item in x, find the sum of the Manhattan distances between that (x,y) pair and every (x,y) pair in y).
I've tried this out with the following line of code:
z = list(map(find_sum, x, y))
However, z is of length 2 (like y), and not 4 like x. Is there a way to ensure that z is the result of consecutive one-to-all calculations? That is, I'd like to calculate the sum of all of the manhattan differences between x[0] and every set in y, and so on and so forth, so the length of z should be equal to the length of x.
Is there a simple way to do this without a for loop? My data is rather large (~ 4 million rows), so I'd really appreciate fast solutions. I'm fairly new to Python programming, so any explanations about why the solution works and is fast would be appreciated as well, but definitely isn't required!
Thanks!
This solution implements the distance in numpy, as I think it is a good example of broadcasting, which is a very useful thing to know if you need to use arrays and matrices.
By definition, the Manhattan distance is the sum of the absolute differences along each coordinate (column). However, the first column of x, x[:, 0], has shape (4,) and the first column of y, y[:, 0], has shape (2,), so they cannot be subtracted directly: under the broadcasting rules, shapes are compared starting with the trailing dimensions, and two dimensions are compatible when they are equal or one of them is 1. Sadly, neither condition holds for your columns.
However, you can add a new dimension of value 1 using np.newaxis, so
x[:, 0]
is array([1, 2, 4, 5]), but
x[:, 0, np.newaxis]
is
array([[1],
[2],
[4],
[5]])
and its shape is (4, 1). Now, subtracting an array of shape (2,) from a matrix of shape (4, 1) results in a matrix of shape (4, 2), by numpy's broadcasting rules:
  4 x 1
      2
= 4 x 2
You can obtain the differences for each column:
first_column_difference = x[:, 0, np.newaxis] - y[:, 0]
second_column_difference = x[:, 1, np.newaxis] - y[:, 1]
and evaluate the sum of their absolute values:
np.abs(first_column_difference) + np.abs(second_column_difference)
which results in a (4, 2) matrix. Now, you want to sum the values for each row, so that you have 4 values:
np.sum(np.abs(first_column_difference) + np.abs(second_column_difference), axis=1)
which results in array([73, 69, 61, 57]). The rule is simple: the parameter axis will eliminate that dimension from the result, therefore using axis=1 for a (4, 2) matrix generates 4 values -- if you use axis=0, it will generate 2 values.
So, this will solve your problem:
x = np.array([[1, 2], [2, 3], [4, 5], [5, 6]])
y = np.array([[11, 13], [12, 43]])
first_column_difference = x[:, 0, np.newaxis] - y[:, 0]
second_column_difference = x[:, 1, np.newaxis] - y[:, 1]
z = np.abs(first_column_difference) + np.abs(second_column_difference)
print(np.sum(z, axis=1))
You can also skip the intermediate steps for each column and evaluate everything at once (it is a little bit harder to understand, so I prefer the method described above to explain what is happening):
print(np.abs(x[:, np.newaxis] - y).sum(axis=(1, 2)))
This is the general case for an n-dimensional Manhattan distance: if x is (u, n) and y is (v, n), broadcasting (u, 1, n) against (v, n) gives a (u, v, n) array, and summing over the second and third axes then leaves u values.
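With millions of rows in x, the (u, v, n) intermediate may not fit in memory. One option, sketched here under the assumption that y is small enough to broadcast against a block of x, is to process x in chunks. The function name and the chunk size are illustrative choices, and the loop runs once per chunk rather than once per row:

```python
import numpy as np

def summed_manhattan(x, y, chunk=100_000):
    """For each row of x, sum the Manhattan distances to every row of y."""
    out = np.empty(len(x), dtype=np.result_type(x, y))
    for start in range(0, len(x), chunk):
        block = x[start:start + chunk]                 # shape (c, n)
        diff = np.abs(block[:, np.newaxis, :] - y)     # shape (c, v, n)
        out[start:start + chunk] = diff.sum(axis=(1, 2))
    return out

x = np.array([[1, 2], [2, 3], [4, 5], [5, 6]])
y = np.array([[11, 13], [12, 43]])
z = summed_manhattan(x, y, chunk=3)   # tiny chunk just to exercise the loop
print(z)
```

The peak temporary is then chunk x v x n elements, independent of the total number of rows in x.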
Here is how you can do it using numpy broadcasting, with a simplified explanation.
Adjust Shape For Broadcasting
import numpy as np
start_points = np.array([[1,2], [2,3], [4,5], [5,6]])
dest_points = np.array([[11,13], [12, 14]])
## using np.newaxis as index add a new dimension at that position
## : give all the elements on that dimension
start_points = start_points[np.newaxis, :, :]
dest_points = dest_points[:, np.newaxis, :]
## Now let's check the shape of the point arrays
print('start_points.shape: ', start_points.shape) # (1, 4, 2)
print('dest_points.shape', dest_points.shape) # (2, 1, 2)
Let's try to understand:
The last element of the shape represents the x and y of a point, so it has size 2.
We can think of start_points as having 1 row and 4 columns of points.
We can think of dest_points as having 2 rows and 1 column of points.
We can think of start_points and dest_points as matrices (tables) of points of size (1X4) and (2X1).
We can clearly see that the sizes do not match. What happens if we perform an arithmetic
operation between them? This is where a smart part of numpy, called broadcasting, comes in.
It will repeat the row of start_points to match the rows of dest_points, making a matrix of (2X4).
It will repeat the column of dest_points to match the columns of start_points, making a matrix of (2X4).
The result is the arithmetic operation between every pair of elements of start_points and dest_points.
Calculate the distance
diff_x_y = start_points - dest_points
print(diff_x_y.shape) # (2, 4, 2)
abs_diff_x_y = np.abs(start_points - dest_points)
man_distance = np.sum(abs_diff_x_y, axis=2)
print('man_distance:\n', man_distance)
sum_distance = np.sum(man_distance, axis=0)
print('sum_distance:\n', sum_distance)
Oneliner
start_points = np.array([[1,2], [2,3], [4,5], [5,6]])
dest_points = np.array([[11,13], [12, 14]])
np.sum(np.abs(start_points[np.newaxis, :, :] - dest_points[:, np.newaxis, :]), axis=(0,2))
Here is a more detailed explanation of broadcasting if you want to understand it better.
With so many rows you can make substantial savings by using a smart algorithm. Let us for simplicity assume there is just one dimension; once we have established the algorithm, getting back to the general case is a simple matter of summing over coordinates.
The naive algorithm is O(mn) where m,n are the sizes of sets X,Y. Our algorithm is O((m+n)log(m+n)) so it scales much better.
We first have to sort the union of X and Y by coordinate and then form the cumsum over Y. Next, we find for each x in X the number YbefX of y in Y to its left and use it to look up the corresponding cumsum item YbefXval. The summed distances to all y to the left of x are YbefX times the coordinate of x minus YbefXval; the summed distances to all y to the right are the sum of all y coordinates minus YbefXval minus (n - YbefX) times the coordinate of x.
Where does the saving come from? Sorting coordinates enables us to recycle the summations we have done before, instead of starting each time from scratch. This uses the fact that up to a sign we always sum the same y coordinates and going from left to right the signs flip one by one.
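For the one-dimensional case, the idea can be sketched more compactly with np.sort, np.cumsum and np.searchsorted. This is a simplified illustration, not the implementation given in this answer; the function name is hypothetical, and because it sorts only Y the cost is O((m+n) log n):

```python
import numpy as np

def summed_abs_diff_1d(x, y):
    """For each x[i], return sum_j |x[i] - y[j]| without forming all pairs."""
    y_sorted = np.sort(y)
    n = y_sorted.size
    # prefix[c] is the sum of the c smallest y values (prefix[0] == 0).
    prefix = np.concatenate(([0.0], np.cumsum(y_sorted)))
    # Number of y values that are <= each x, i.e. lying to its left.
    left = np.searchsorted(y_sorted, x, side="right")
    left_sum = prefix[left]
    right_sum = prefix[-1] - left_sum
    # Left part: left * x - left_sum; right part: right_sum - (n - left) * x.
    return (left * x - left_sum) + (right_sum - (n - left) * x)

x = np.array([1.0, 2.0, 4.0, 5.0])
y = np.array([11.0, 12.0])
print(summed_abs_diff_1d(x, y))
```

For several coordinates, the same function could be applied per column and the results added, as the answer describes.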
Code:
import numpy as np
from scipy.spatial.distance import cdist
from timeit import timeit
def pp(X,Y):
    (m,k),(n,k) = X.shape,Y.shape
    XY = np.concatenate([X.T,Y.T],1)
    idx = XY.argsort(1)
    Xmsk = idx<m
    Ymsk = ~Xmsk
    Xidx = np.arange(k)[:,None],idx[Xmsk].reshape(k,m)
    Yidx = np.arange(k)[:,None],idx[Ymsk].reshape(k,n)
    YbefX = Ymsk.cumsum(1)[Xmsk].reshape(k,m)
    YbefXval = XY[Yidx].cumsum(1)[np.arange(k)[:,None],YbefX-1]
    YbefXval[YbefX==0] = 0
    XY[Xidx] = ((2*YbefX-n)*XY[Xidx]) - 2*YbefXval + Y.sum(0)[:,None]
    return XY[:,:m].sum(0)
def summed_cdist(X,Y):
    return cdist(X,Y,"minkowski",p=1).sum(1)
# demo
m,n,k = 1000,500,10
X,Y = np.random.randn(m,k),np.random.randn(n,k)
print("same result:",np.allclose(pp(X,Y),summed_cdist(X,Y)))
print("sort :",timeit(lambda:pp(X,Y),number=1000),"ms")
print("scipy cdist:",timeit(lambda:summed_cdist(X,Y),number=100)*10,"ms")
Sample run, comparing smart algo "sort" to naive algo implemented using cdist library function:
same result: True
sort : 1.4447695480193943 ms
scipy cdist: 36.41934019047767 ms
I have a function in MATLAB
[b,g] = sgolay(k, f);
It outputs an f x f matrix.
When I run the same for the same values of k and f in Python, using:
scipy.signal.savgol_coeffs(f, k)
It outputs an entirely different array of only f elements.
The values in consideration are:
k = 4, f = 21
savgol_filter() takes three arguments, including the array, whereas sgolay() takes only two. Also, savgol_coeffs is not generating the required matrix.
What is the Python equivalent for obtaining the matrix generated by sgolay(k, f) in MATLAB?
If you inspect the matrix b returned by Matlab's sgolay function, you'll see that the center row is the same as the 1-d array returned by SciPy's savgol_coeffs. The upper and lower halves of b, with (framelen - 1)/2 rows in each part, are the coefficients of the Savitzky-Golay filters to be applied to the ends of the signal, where the filter is not symmetric. That is, for the (framelen - 1)/2 values at each end of the signal, each filtered value is computed using a different set of coefficients.
You can use savgol_coeffs to generate b by iterating over the pos argument. The following ipython session shows an example.
In [74]: import numpy as np
In [75]: from scipy.signal import savgol_coeffs
In [76]: np.set_printoptions(precision=11, linewidth=90)
In [77]: order = 3
In [78]: windowlen = 5
These are the coefficients of the symmetric (i.e. centered) Savitzky-Golay filter. The 1-d array should match the center row of the matrix returned by sgolay:
In [79]: savgol_coeffs(windowlen, order)
Out[79]: array([-0.08571428571, 0.34285714286, 0.48571428571, 0.34285714286, -0.08571428571])
If we set pos=windowlen-1, we get coefficients designed for evaluating the filter at one end of the window. These should match the first row of the array returned by sgolay:
In [80]: savgol_coeffs(windowlen, order, pos=windowlen-1)
Out[80]: array([ 0.98571428571, 0.05714285714, -0.08571428571, 0.05714285714, -0.01428571429])
Similarly, pos=0 gives the coefficients for the other end of the window. These should match the last row of the matrix returned by sgolay:
In [81]: savgol_coeffs(windowlen, order, pos=0)
Out[81]: array([-0.01428571429, 0.05714285714, -0.08571428571, 0.05714285714, 0.98571428571])
Here's the full array to match the return value of Matlab's sgolay:
In [82]: b = np.array([savgol_coeffs(windowlen, order, pos=p) for p in range(windowlen-1, -1, -1)])
In [83]: b
Out[83]:
array([[ 0.98571428571, 0.05714285714, -0.08571428571, 0.05714285714, -0.01428571429],
[ 0.05714285714, 0.77142857143, 0.34285714286, -0.22857142857, 0.05714285714],
[-0.08571428571, 0.34285714286, 0.48571428571, 0.34285714286, -0.08571428571],
[ 0.05714285714, -0.22857142857, 0.34285714286, 0.77142857143, 0.05714285714],
[-0.01428571429, 0.05714285714, -0.08571428571, 0.05714285714, 0.98571428571]])
If you compare this to the result of b = sgolay(3, 5) in Matlab, you'll see that they are the same.
To get the g matrix returned by sgolay, you'll have to call savgol_coeffs with deriv set to the values in range(order+1), reverse and transpose the array, and scale by the factorial of the derivative order. To reverse the coefficients, you could use a slice of the form ::-1, or you can use the use option of savgol_coeffs.
Here's one way to use savgol_coeffs to generate the g matrix with order=3 and windowlen=5:
In [12]: import numpy as np
In [13]: from scipy.signal import savgol_coeffs
In [14]: from scipy.special import factorial
In [15]: np.set_printoptions(precision=11, linewidth=90, suppress=True)
In [16]: order = 3
In [17]: windowlen = 5
In [18]: g = np.array([savgol_coeffs(windowlen, order, deriv=d, use='dot') for d in range(order+1)]).T / factorial(np.arange(order+1))
In [19]: g
Out[19]:
array([[-0.08571428571, 0.08333333333, 0.14285714286, -0.08333333333],
[ 0.34285714286, -0.66666666667, -0.07142857143, 0.16666666667],
[ 0.48571428571, 0. , -0.14285714286, 0. ],
[ 0.34285714286, 0.66666666667, -0.07142857143, -0.16666666667],
[-0.08571428571, -0.08333333333, 0.14285714286, 0.08333333333]])
You don't say why you want the full windowlen x windowlen array in Python. You don't need it to use savgol_filter.
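For filtering itself, savgol_filter handles the ends of the signal internally. A small sketch with made-up test data: a cubic polynomial filtered with polyorder=3 should come back unchanged up to round-off, including at the edges when mode='interp' (the default) is used:

```python
import numpy as np
from scipy.signal import savgol_filter

t = np.linspace(-1.0, 1.0, 21)
signal = 2.0 * t**3 - t + 0.5          # made-up cubic test signal

# window_length=5 and polyorder=3 mirror windowlen and order above.
smoothed = savgol_filter(signal, window_length=5, polyorder=3, mode="interp")
print(np.allclose(smoothed, signal))
```

With noisy data the output would of course differ from the input; the exact reproduction here only reflects that a degree-3 fit is exact for cubic data.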
I have two numpy arrays: 'A' of shape (w, h, 2) and 'B' of shape (n, 2).
In other words, A is a 2-dimensional array of 2D vectors while B is a 1D array of 2D vectors.
What I want as a result is an array of shape (w, h, n). The last dimension is an n-dimensional vector in which each component is the Euclidean distance between the corresponding vector from A (indexed by the first two dimensions w and h) and the nth vector of B.
I know that I can just loop through w, h and n in Python manually and calculate the distance for each element, but I would like to know if there is a smart way to do that with numpy operations to increase performance.
I found some similar questions but unfortunately all of those use input arrays of the same dimensionality.
Approach #1
You could reshape A to 2D, use Scipy's cdist that expects 2D arrays as inputs, get those euclidean distances and finally reshape back to 3D.
Thus, an implementation would be -
from scipy.spatial.distance import cdist
out = cdist(A.reshape(-1,2),B).reshape(w,h,-1)
Approach #2
Since the axis of reduction has length 2 only, we can just slice the input arrays to save memory on intermediate arrays, like so -
np.sqrt((A[...,0,None] - B[:,0])**2 + (A[...,1,None] - B[:,1])**2)
Explanation on A[...,0,None] and A[...,1,None] :
With that None we are just introducing a new axis at the end of sliced A. Well, let's take a small example -
In [54]: A = np.random.randint(0,9,(4,5,2))
In [55]: A[...,0].shape
Out[55]: (4, 5)
In [56]: A[...,0,None].shape
Out[56]: (4, 5, 1)
In [57]: B = np.random.randint(0,9,(3,2))
In [58]: B[:,0].shape
Out[58]: (3,)
So, we have :
A[...,0,None] : 4 x 5 x 1
B[:,0] : 3
That is essentially :
A[...,0,None] : 4 x 5 x 1
B[:,0] : 1 x 1 x 3
When the subtraction is performed, the singleton dims are broadcast to match the corresponding dimensions of the other array -
A[...,0,None] - B[:,0] : 4 x 5 x 3
We repeat this for the second index along the last axis. We add these two arrays after squaring and finally take the square root to get the final Euclidean distances.
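As a quick sanity check, the two approaches can be compared on random data. The shapes and the random seed below are arbitrary example values:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
w, h, n = 4, 5, 3
A = rng.random((w, h, 2))
B = rng.random((n, 2))

# Approach #1: flatten to 2D, use cdist, reshape back to (w, h, n).
out1 = cdist(A.reshape(-1, 2), B).reshape(w, h, -1)

# Approach #2: broadcast each coordinate separately.
out2 = np.sqrt((A[..., 0, None] - B[:, 0])**2 + (A[..., 1, None] - B[:, 1])**2)

print(out1.shape, np.allclose(out1, out2))
```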
In numpy, I have two vectors: say vector A is 4X1 and vector B is 1X5. If I compute AXB, the result should be a matrix of size 4X5.
But I have tried many times, with many kinds of reshape and transpose, and they all either raise an error saying the shapes are not aligned or return a single value.
How can I get the matrix product I want?
Normal matrix multiplication works as long as the vectors have the right shape. Remember that * in Numpy is elementwise multiplication, and matrix multiplication is available with numpy.dot() (or with the @ operator, in Python 3.5 and later)
>>> numpy.dot(numpy.array([[1], [2]]), numpy.array([[3, 4]]))
array([[3, 4],
[6, 8]])
This is called an "outer product." You can get it using plain vectors using numpy.outer():
>>> numpy.outer(numpy.array([1, 2]), numpy.array([3, 4]))
array([[3, 4],
[6, 8]])
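The same result can also be obtained by broadcasting, which avoids building 2-D column and row vectors by hand. A short sketch comparing the three spellings on example vectors of length 4 and 5 (the variable names are illustrative):

```python
import numpy as np

a = np.array([1, 2, 3, 4])        # length-4 vector
b = np.array([1, 1, 1, 1, 1])     # length-5 vector

via_outer = np.outer(a, b)
via_broadcast = a[:, np.newaxis] * b[np.newaxis, :]   # (4, 1) * (1, 5)
via_matmul = a.reshape(4, 1) @ b.reshape(1, 5)        # (4, 1) @ (1, 5)

print(via_outer.shape)
```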
If you are using numpy:
First, make sure you have two vectors. For example, vec1.shape = (10, ) and vec2.shape = (26, ); in numpy, a 1-D array is neither a row vector nor a column vector, so both are represented the same way.
Second, compute res_matrix = vec1.reshape(10, 1) @ vec2.reshape(1, 26).
Finally, you should have: res_matrix.shape = (10, 26).
The numpy documentation no longer recommends np.matrix and says it may be removed in the future, so it is better not to use it.
The matmul function (available since numpy 1.10) works fine:
import numpy as np
a = np.array([[1],[2],[3],[4]])
b = np.array([[1,1,1,1,1],])
ab = np.matmul(a, b)
print(ab)
print(ab.shape)
You have to declare your vectors with the right shapes. The first has to be a list of one-element lists (a column vector, with one value per row), and the second a list containing a single list (a row vector, with all values in one row), as in the example above.
Output:
[[1 1 1 1 1]
[2 2 2 2 2]
[3 3 3 3 3]
[4 4 4 4 4]]
(4, 5)
Using Python & Numpy, I would like to:
- Consider each row of an (n columns x m rows) matrix as a vector
- Weight each row (scalar multiplication on each component of the vector)
- Add each row to create a final vector (vector addition).
The weights are given in a regular numpy array, n x 1, so that each vector m in the matrix should be multiplied by weight n.
Here's what I've got (with test data; the actual matrix is huge), which is perhaps very un-Numpy and un-Pythonic. Can anyone do better? Thanks!
import numpy
# test data
mvec1 = numpy.array([1,2,3])
mvec2 = numpy.array([4,5,6])
start_matrix = numpy.matrix([mvec1,mvec2])
weights = numpy.array([0.5,-1])
#computation
wmatrix = [ weights[n]*start_matrix[n] for n in range(len(weights)) ]
vector_answer = [0,0,0]
for x in wmatrix: vector_answer+=x
Even though a 'technically' correct answer has already been given, I'll give my straightforward answer:
from numpy import array, dot
dot(array([0.5, -1]), array([[1, 2, 3], [4, 5, 6]]))
# array([-3.5, -4. , -4.5])
This one is much more in the spirit of linear algebra (and also of the three bulleted requirements at the top of the question).
Update:
And this solution is really fast: not marginally, but easily some 10-15x faster than the one already proposed!
It will be more convenient to use a two-dimensional numpy.array than a numpy.matrix in this case.
start_matrix = numpy.array([[1,2,3],[4,5,6]])
weights = numpy.array([0.5,-1])
final_vector = (start_matrix.T * weights).sum(axis=1)
# array([-3.5, -4. , -4.5])
The multiplication operator * does the right thing here due to NumPy's broadcasting rules.
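Both answers compute the same weighted sum of rows. A brief sketch checking that the dot product, the broadcasting form, and np.einsum agree on the test data (the einsum variant is an extra alternative, not taken from the answers):

```python
import numpy as np

start_matrix = np.array([[1, 2, 3], [4, 5, 6]])
weights = np.array([0.5, -1])

by_dot = weights @ start_matrix                          # (2,) @ (2, 3) -> (3,)
by_broadcast = (start_matrix.T * weights).sum(axis=1)    # (3, 2) * (2,) -> sum rows
by_einsum = np.einsum("i,ij->j", weights, start_matrix)  # sum over row index i

print(by_dot)
```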