I have two lists of coordinates:
l1 = [[x,y,z],[x,y,z],[x,y,z],[x,y,z],[x,y,z]]
l2 = [[x,y,z],[x,y,z],[x,y,z]]
I want to find the shortest pairwise distance between l1 and l2. Distance between two coordinates is simply:
numpy.linalg.norm(l1_element - l2_element)
So how do I use numpy to efficiently apply this operation to each pair of elements?
Here is a quick performance analysis of the four methods presented so far:
import numpy
import scipy
from itertools import product
from scipy.spatial.distance import cdist
from scipy.spatial import cKDTree as KDTree
n = 100
l1 = numpy.random.randint(0, 100, size=(n,3))
l2 = numpy.random.randint(0, 100, size=(n,3))
# by @Phillip
def a(l1, l2):
    return min(numpy.linalg.norm(l1_element - l2_element)
               for l1_element, l2_element in product(l1, l2))

# by @Kasra
def b(l1, l2):
    return numpy.min(numpy.apply_along_axis(
        numpy.linalg.norm,
        2,
        l1[:, None, :] - l2[None, :, :]
    ))

# mine
def c(l1, l2):
    return numpy.min(scipy.spatial.distance.cdist(l1, l2))

# just checking that numpy.min is indeed faster.
def c2(l1, l2):
    return min(scipy.spatial.distance.cdist(l1, l2).reshape(-1))

# by @BrianLarsen
def d(l1, l2):
    # make KDTrees for both sets of points
    t1 = KDTree(l1)
    t2 = KDTree(l2)
    # we need a cutoff distance to not look beyond; use real knowledge if you have it, otherwise guess
    maxD = numpy.linalg.norm(l1[0] - l2[0])  # this pair could be the closest, but anything further certainly is not
    # get a sparse matrix of all the distances below the cutoff
    ans = t1.sparse_distance_matrix(t2, maxD)
    # get the minimum distance
    minD = min(ans.values())
    return minD

for x in (a, b, c, c2, d):
    print("Timing variant", x.__name__, ':', flush=True)
    print(x(l1, l2), flush=True)
    %timeit x(l1, l2)
    print(flush=True)
For n=100
Timing variant a :
2.2360679775
10 loops, best of 3: 90.3 ms per loop
Timing variant b :
2.2360679775
10 loops, best of 3: 151 ms per loop
Timing variant c :
2.2360679775
10000 loops, best of 3: 136 µs per loop
Timing variant c2 :
2.2360679775
1000 loops, best of 3: 844 µs per loop
Timing variant d :
2.2360679775
100 loops, best of 3: 3.62 ms per loop
For n=1000 (the 0.0 minima mean the two random integer point sets happen to share a point)
Timing variant a :
0.0
1 loops, best of 3: 9.16 s per loop
Timing variant b :
0.0
1 loops, best of 3: 14.9 s per loop
Timing variant c :
0.0
100 loops, best of 3: 11 ms per loop
Timing variant c2 :
0.0
10 loops, best of 3: 80.3 ms per loop
Timing variant d :
0.0
1 loops, best of 3: 933 ms per loop
Using newaxis and broadcasting, l1[:, None, :] - l2[None, :, :] is an array of the pairwise difference vectors. You can reduce this array to an array of norms using apply_along_axis and then take the min:
numpy.min(numpy.apply_along_axis(
    numpy.linalg.norm,
    2,
    l1[:, None, :] - l2[None, :, :]
))
Of course, this only works if l1 and l2 are numpy arrays, so if your lists in the question weren't pseudo-code, you'll have to add l1 = numpy.array(l1); l2 = numpy.array(l2).
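As the benchmark above shows, apply_along_axis calls back into Python for every vector and is slow. A sketch of an equivalent pure-broadcasting version (same assumption: l1 and l2 are numpy arrays) that stays vectorized end to end:
import numpy

l1 = numpy.random.rand(5, 3)
l2 = numpy.random.rand(3, 3)

# pairwise difference vectors, shape (len(l1), len(l2), 3)
diff = l1[:, None, :] - l2[None, :, :]
# norms over the last axis (the axis argument exists since numpy 1.8), then the global min
min_dist = numpy.linalg.norm(diff, axis=2).min()
print(min_dist)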
You can use itertools.product to get all the combinations and then use min:
l1 = [[x,y,z],[x,y,z],[x,y,z],[x,y,z],[x,y,z]]
l2 = [[x,y,z],[x,y,z],[x,y,z]]
from itertools import product
min(numpy.linalg.norm(l1_element - l2_element) for l1_element,l2_element in product(l1,l2))
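Note that the generator above subtracts elements pairwise, which fails for plain Python lists; a minimal runnable sketch, assuming the coordinates are converted to numpy arrays first:
import numpy
from itertools import product

l1 = numpy.array([[0, 0, 0], [1, 2, 3], [4, 5, 6]])
l2 = numpy.array([[1, 1, 1], [7, 8, 9]])

# O(len(l1) * len(l2)) Python-level loop; fine for small inputs
print(min(numpy.linalg.norm(a - b) for a, b in product(l1, l2)))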
If you have many, many points, this is a great use case for a KDTree. It is totally overkill for this example, but it is a good learning experience, it is really fast for a certain class of problems, and it can give a bit more information, such as the number of points within a certain distance.
import numpy as np
from scipy.spatial import cKDTree as KDTree
#sample data
l1 = [[0,0,0], [4,5,6], [7,6,7], [4,5,6]]
l2 = [[100,3,4], [1,0,0], [10,15,16], [17,16,17], [14,15,16], [-34, 5, 6]]
# make them arrays
l1 = np.asarray(l1)
l2 = np.asarray(l2)
# make KDTrees for both sets of points
t1 = KDTree(l1)
t2 = KDTree(l2)
# we need a cutoff distance to not look beyond; use real knowledge if you have it, otherwise guess
maxD = np.linalg.norm(l1[-1] - l2[-1])  # this pair could be the closest, but anything further certainly is not
# get a sparse matrix of all the distances below the cutoff
ans = t1.sparse_distance_matrix(t2, maxD)
# get the minimum distance and the points involved
minD, (i, j) = min((d, ij) for ij, d in ans.items())
print("Minimum distance is {0} between l1={1} and l2={2}".format(minD, l1[i], l2[j]))
What this does is build a KDTree for each set of points, find all the distances between points closer than the guess distance, and give back the minimum distance and the points involved. This post has a writeup of how a KDTree works.
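If all you need is the single shortest pair, cKDTree.query is arguably simpler than sparse_distance_matrix and needs no cutoff guess; a sketch with the same sample data:
import numpy as np
from scipy.spatial import cKDTree as KDTree

l1 = np.asarray([[0, 0, 0], [4, 5, 6], [7, 6, 7], [4, 5, 6]])
l2 = np.asarray([[100, 3, 4], [1, 0, 0], [10, 15, 16], [17, 16, 17], [14, 15, 16], [-34, 5, 6]])

t1 = KDTree(l1)
# for each point in l2: distance to, and index of, its nearest neighbour in l1
dists, idx = t1.query(l2)
j = dists.argmin()  # position in l2 of the overall closest pair
print("Minimum distance is {0} between l1={1} and l2={2}".format(dists[j], l1[idx[j]], l2[j]))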
I have a simple numpy array (3xN) like:
v = np.array([[-3.33829, -3.42467, -3.53332],
[-2.67681, -2.6082 , -3.49502],
[-3.49497, -2.73177, -2.61499],
[-2.76056, -3.57753, -2.67334],
[-1.96801, -3.47521, -3.51974],
[-1.25571, -2.69451, -3.45554],
[-1.94568, -2.59504, -2.72568],
[-1.28991, -3.47927, -2.73176],
[-0.51201, -3.50684, -3.40448],
[ 0.22398, -2.70244, -3.43421]])
Here N = 10, but in my real case it is much larger (500+). Each row is a point given by its Euclidean coordinates.
I would like to carry out the following computation (the equation was an image in the original post; judging from the first answer below it has the form dᵢ = Σⱼ Σₖ exp(‖vᵢ − vⱼ‖² + ‖vᵢ − vₖ‖²)), where i, j and k indicate different rows from v.
How can I implement it in Python in a fast way?
You can do this using numpy broadcasting operations:
diffs = ((v[:, None] - v) ** 2).sum(-1)
d = np.exp(diffs + diffs[:, None]).sum((0, 1))
print(d)
# [3.08316899e+11 2.37020625e+07 4.05357364e+12 8.22697743e+08
# 8.85209202e+04 2.55340202e+05 7.33879459e+04 1.88175133e+05
# 8.10134295e+08 6.62122925e+12]
Even for an array of size 500, the result is computed in just a few seconds:
%%time
v = np.random.rand(500, 3)
diffs = np.sum((v[:, None] - v) ** 2, -1)
d = np.exp(diffs + diffs[:, None]).sum((0, 1))
# CPU times: user 2.74 s, sys: 5.5 ms, total: 2.75 s
# Wall time: 2.75 s
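Since exp(a + b) = exp(a) * exp(b), the double sum over j and k factorizes into the square of a single sum, avoiding the (N, N, N) intermediate entirely; a sketch of that algebraic shortcut, which gives the same d:
import numpy as np

v = np.random.rand(500, 3)
diffs = ((v[:, None] - v) ** 2).sum(-1)  # (N, N) squared pairwise distances
# d[i] = sum_{j,k} exp(diffs[j,i] + diffs[k,i]) = (sum_j exp(diffs[j,i]))**2
d = np.exp(diffs).sum(0) ** 2            # O(N^2) memory instead of O(N^3)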
IIUC, the equation suggests pairwise vector differences, and not squared distance between vectors.
The pairwise difference between N vectors will be N*N vectors.
Finally, since you are only reducing over the j and k axes, I would assume the output is of shape (10, 3) and not (10,). Do correct me if I am wrong.
import numpy as np
d = np.exp(((v[:,None]-v)**2)[:,None] + ((v[:,None]-v)**2)).sum((0,1))
print(d)
#### Stepwise breakdown
# v                               # (i, 3)       -> (10, 3)
# diff = (v[:,None] - v)**2       # (j, i, 3)    -> (10, 10, 3)
# power = diff[:,None] + diff     # (k, j, i, 3) -> (10, 10, 10, 3)
# exp = np.exp(power)             # (k, j, i, 3) -> (10, 10, 10, 3)
# d = np.sum(exp, (0, 1))         # (i, 3)       -> (10, 3)
array([[4.38558108e+11, 2.11224470e+02, 2.08153285e+02],
[6.10332697e+09, 2.42309774e+02, 2.00079357e+02],
[1.37237360e+12, 2.11552094e+02, 2.32739462e+02],
[9.98934092e+09, 2.51158071e+02, 2.16562340e+02],
[1.77827910e+08, 2.22151678e+02, 2.05163797e+02],
[1.91234145e+08, 2.19457894e+02, 1.92858561e+02],
[1.63391357e+08, 2.46419838e+02, 2.04498335e+02],
[1.67512751e+08, 2.23119070e+02, 2.03232700e+02],
[8.45322705e+09, 2.30065042e+02, 1.85024981e+02],
[1.14468558e+12, 2.17683864e+02, 1.89388595e+02]])
Benchmark -
%%timeit
np.exp(((v[:,None]-v)**2)[:,None] + ((v[:,None]-v)**2)).sum((0,1))
# 21.2 s ± 3.27 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
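The same exp(a + b) = exp(a) * exp(b) factorization applies componentwise here as well, replacing the (N, N, N, 3) intermediate with a single (N, N, 3) pass:
diff = (v[:, None] - v) ** 2   # (N, N, 3)
d = np.exp(diff).sum(0) ** 2   # (N, 3), same result as the full broadcast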
I need to compute AB⁻¹ in Python / Numpy for two matrices A and B (B being square, of course).
I know that np.linalg.inv() would allow me to compute B⁻¹, which I can then multiply with A.
I also know that B⁻¹A is actually better computed with np.linalg.solve().
Inspired by that, I decided to rewrite AB⁻¹ in terms of np.linalg.solve().
I got to a formula, based on the identity (AB)ᵀ = BᵀAᵀ, which uses np.linalg.solve() and .transpose():
np.linalg.solve(a.transpose(), b.transpose()).transpose()
that seems to be doing the job:
import numpy as np
n, m = 4, 2
np.random.seed(0)
a = np.random.random((n, n))
b = np.random.random((m, n))
print(np.matmul(b, np.linalg.inv(a)))
# [[ 2.87169378 -0.04207382 -1.10553758 -0.83200471]
# [-1.08733434 1.00110176 0.79683577 0.67487591]]
print(np.linalg.solve(a.transpose(), b.transpose()).transpose())
# [[ 2.87169378 -0.04207382 -1.10553758 -0.83200471]
# [-1.08733434 1.00110176 0.79683577 0.67487591]]
print(np.all(np.isclose(np.matmul(b, np.linalg.inv(a)), np.linalg.solve(a.transpose(), b.transpose()).transpose())))
# True
and also comes up much faster for sufficiently large inputs:
n, m = 400, 200
np.random.seed(0)
a = np.random.random((n, n))
b = np.random.random((m, n))
print(np.all(np.isclose(np.matmul(b, np.linalg.inv(a)), np.linalg.solve(a.transpose(), b.transpose()).transpose())))
# True
%timeit np.matmul(b, np.linalg.inv(a))
# 100 loops, best of 3: 13.3 ms per loop
%timeit np.linalg.solve(a.transpose(), b.transpose()).transpose()
# 100 loops, best of 3: 7.71 ms per loop
My question is: does this identity always hold, or are there corner cases I am overlooking?
In general, np.linalg.solve(B, A) is equivalent to B⁻¹A. The rest is just math.
In all cases, (AB)ᵀ = BᵀAᵀ: https://math.stackexchange.com/q/1440305/295281.
Not necessary for this case, but for invertible matrices, (AB)⁻¹ = B⁻¹A⁻¹: https://math.stackexchange.com/q/688339/295281.
For an invertible matrix, it is also the case that (A⁻¹)ᵀ = (Aᵀ)⁻¹: https://math.stackexchange.com/q/340233/295281.
From that it follows that (AB⁻¹)ᵀ = (B⁻¹)ᵀAᵀ = (Bᵀ)⁻¹Aᵀ. As long as B is invertible, you should have no issues with the transformation you propose in any case.
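Spelling the derivation out: X = AB⁻¹ is equivalent to XB = A, and transposing both sides gives BᵀXᵀ = Aᵀ, which is exactly the system the transposed solve call handles. A minimal sketch wrapping this up (rmatdiv is a hypothetical helper name, not a numpy function):
import numpy as np

def rmatdiv(A, B):
    """Compute A @ inv(B) without forming inv(B) explicitly.

    X = A B^-1  <=>  X B = A  <=>  B^T X^T = A^T,
    so X^T = solve(B^T, A^T). Raises numpy.linalg.LinAlgError
    for singular B, just as np.linalg.inv(B) would.
    """
    return np.linalg.solve(B.T, A.T).T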
I'm getting some efficiency test results that I can't explain.
I want to assemble a matrix B whose i-th entries B[i,:,:] = A[i,:,:].dot(x), where each A[i,:,:] is a 2D matrix, and so is x.
I can do this three ways. To test performance, I make random (numpy.random.randn) matrices A of shape (10, 1000, 1000) and x of shape (1000, 1200), and I get the following time results:
(1) single multidimensional dot product
B = A.dot(x)
total time: 102.361 s
(2) looping through i and performing 2D dot products
# initialize B = np.zeros([dim1, dim2, dim3])
for i in range(A.shape[0]):
    B[i,:,:] = A[i,:,:].dot(x)
total time: 0.826 s
(3) numpy.einsum
B3 = np.einsum("ijk, kl -> ijl", A, x)
total time: 8.289 s
So, option (2) is the fastest by far. But, considering just (1) and (2), I don't see the big difference between them. How can looping through and doing 2D dot products be ~ 124 times faster? They both use numpy.dot. Any insights?
I include the code used for the above results just below:
import numpy as np
import numpy.random as npr
import time
dim1, dim2, dim3 = 10, 1000, 1200
A = npr.randn(dim1, dim2, dim2)
x = npr.randn(dim2, dim3)
# consider three ways of assembling the same matrix B: B1, B2, B3
t = time.time()
B1 = np.dot(A,x)
td1 = time.time() - t
print("a single dot product of A [shape = (%d, %d, %d)] with x [shape = (%d, %d)] completes in %.3f s"
      % (A.shape[0], A.shape[1], A.shape[2], x.shape[0], x.shape[1], td1))
B2 = np.zeros([A.shape[0], x.shape[0], x.shape[1]])
t = time.time()
for i in range(A.shape[0]):
    B2[i,:,:] = np.dot(A[i,:,:], x)
td2 = time.time() - t
print("taking %d dot products of 2D dot products A[i,:,:] [shape = (%d, %d)] with x [shape = (%d, %d)] completes in %.3f s"
      % (A.shape[0], A.shape[1], A.shape[2], x.shape[0], x.shape[1], td2))
t = time.time()
B3 = np.einsum("ijk, kl -> ijl", A, x)
td3 = time.time() - t
print("using np.einsum, it completes in %.3f s" % td3)
With smaller dims 10,100,200, I get a similar ranking
In [355]: %%timeit
.....: B=np.zeros((N,M,L))
.....: for i in range(N):
B[i,:,:]=np.dot(A[i,:,:],x)
.....:
10 loops, best of 3: 22.5 ms per loop
In [356]: timeit np.dot(A,x)
10 loops, best of 3: 44.2 ms per loop
In [357]: timeit np.einsum('ijk,km->ijm',A,x)
10 loops, best of 3: 29 ms per loop
In [367]: timeit np.dot(A.reshape(-1,M),x).reshape(N,M,L)
10 loops, best of 3: 22.1 ms per loop
In [375]: timeit np.tensordot(A,x,(2,0))
10 loops, best of 3: 22.2 ms per loop
The iterative version is faster, though not by as much as in your case.
This is probably true as long as the iterating dimension is small compared to the other ones. In that case the overhead of iteration (function calls etc.) is small compared to the calculation time. And doing all the values at once uses more memory.
I tried a dot variation where I reshaped A into 2d, thinking that dot does that kind of reshaping internally. I'm surprised that it is actually the fastest. tensordot is probably doing the same reshaping (its code is readable Python).
einsum sets up a 'sum of products' iteration involving 4 variables, i, j, k and m - that is dim1*dim2*dim2*dim3 steps with the C-level nditer. So the more indices you have, the larger the iteration space.
numpy.dot only delegates to a BLAS matrix multiply when the inputs each have dimension at most 2:
#if defined(HAVE_CBLAS)
if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
(NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
return cblas_matrixproduct(typenum, ap1, ap2, out);
}
#endif
When you stick your whole 3-dimensional A array into dot, NumPy takes a slower path, going through an nditer object. It still tries to get some use out of BLAS in the slow path, but the way the slow path is designed, it can only use a vector-vector multiply rather than a matrix-matrix multiply, which doesn't give the BLAS anywhere near as much room to optimize.
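A practical workaround, which the reshape timing above already hints at: flatten the stacked axes yourself so dot sees a plain 2-D matrix product and takes the BLAS path, then restore the shape. A sketch:
import numpy as np

A = np.random.randn(10, 1000, 1000)
x = np.random.randn(1000, 1200)

# one big dgemm on a (10*1000, 1000) by (1000, 1200) product,
# then reshape back to the stacked (10, 1000, 1200) result
B = np.dot(A.reshape(-1, A.shape[-1]), x).reshape(A.shape[0], A.shape[1], -1)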
I am not too familiar with numpy's C API, but numpy.dot is one of those builtin functions that used to live under _dotblas in earlier versions.
Nevertheless, here are my thoughts.
1) numpy.dot takes different paths for 2-dimensional arrays and N-dimensional arrays. From numpy.dot's online documentation:
For 2-D arrays it is equivalent to matrix multiplication, and for 1-D arrays to inner product of vectors (without complex conjugation). For N dimensions it is a sum product over the last axis of a and the second-to-last of b
dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
So for 2-D arrays you are always guaranteed to have one call to BLAS's dgemm. However, for N-D arrays numpy may choose multiplication axes that do not correspond to the fastest-changing axis (as you can see from the excerpt I have posted), and as a result the full power of dgemm can be missed.
2) Your A array is too large to be loaded onto the CPU cache. In your example, you use A with dimensions (10, 1000, 1000), which gives
In [1]: A.nbytes
80000000
In [2]: 80000000/1024
78125
That is almost 80MB, much larger than your cache size, so again you lose most of dgemm's power right there.
3) You are also timing the functions somewhat imprecisely. Python's time function is known to be imprecise for benchmarking; use timeit instead.
So, with all the above points in mind, let's experiment with arrays small enough to be loaded onto the cache:
dim1, dim2, dim3 = 20, 20, 20
A = np.random.rand(dim1, dim2, dim2)
x = np.random.rand(dim2, dim3)
def for_dot1(A, x):
    for i in range(A.shape[0]):
        np.dot(A[i,:,:], x)

def for_dot2(A, x):
    for i in range(A.shape[0]):
        np.dot(A[:,i,:], x)

def for_dot3(A, x):
    for i in range(A.shape[0]):
        np.dot(A[:,:,i], x)
and here are the timings that I get (using numpy 1.9.2 built against OpenBLAS 0.2.14):
In [3]: %timeit np.dot(A,x)
10000 loops, best of 3: 174 µs per loop
In [4]: %timeit np.einsum("ijk, kl -> ijl", A, x)
10000 loops, best of 3: 108 µs per loop
In [5]: %timeit np.einsum("ijk, lk -> ijl", A, x)
10000 loops, best of 3: 97.1 µs per loop
In [6]: %timeit np.einsum("ikj, kl -> ijl", A, x)
1000 loops, best of 3: 238 µs per loop
In [7]: %timeit np.einsum("kij, kl -> ijl", A, x)
10000 loops, best of 3: 113 µs per loop
In [8]: %timeit for_dot1(A,x)
10000 loops, best of 3: 101 µs per loop
In [9]: %timeit for_dot2(A,x)
10000 loops, best of 3: 131 µs per loop
In [10]: %timeit for_dot3(A,x)
10000 loops, best of 3: 133 µs per loop
Notice that there is still a time difference, but not by orders of magnitude. Also note the importance of choosing the axis of multiplication. Now perhaps a numpy developer can shed some light on what numpy.dot actually does under the hood for N-D arrays.
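One more note, hedged because behaviour depends on your NumPy version: np.matmul (the @ operator, available since NumPy 1.10) treats a 3-D operand as a stack of matrices and multiplies slice by slice, which on recent builds should land much closer to the explicit loop than np.dot does here:
import numpy as np

A = np.random.randn(10, 100, 100)
x = np.random.randn(100, 120)

B = A @ x  # broadcasts as a stack of (100, 100) @ (100, 120) products
assert np.allclose(B[3], A[3] @ x)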
From this question I see how to multiply a whole numpy array by the same number (second answer, by JoshAdel). But when I change P into the maximum of a (long) array, is it better to store the maximum beforehand, or does it calculate the maximum of H just once in the second example?
import numpy as np
H = np.array([12, 12, 5, 32, 6, 0.5])
P = H.max()
S = [22, 33, 45.6, 21.6, 51.8]
SP = P * np.array(S)
or
import numpy as np
H = np.array([12, 12, 5, 32, 6, 0.5])
S = [22, 33, 45.6, 21.6, 51.8]
SP = H.max() * np.array(S)
So does it calculate H.max() for every item it has to multiply, or is it smart enough to do it just once? In my code S and H are much longer arrays than in the example.
There is little difference between the 2 methods:
In [74]:
import numpy as np
H = np.random.random(100000)
%timeit P=H.max()
S=np.random.random(100000)
%timeit SP = P*np.array(S)
%timeit SP = H.max()*np.array(S)
10000 loops, best of 3: 51.2 µs per loop
10000 loops, best of 3: 165 µs per loop
1000 loops, best of 3: 217 µs per loop
Here you can see that pre-calculating H.max() and then multiplying (51.2 µs + 165 µs ≈ 216 µs) is no different from calculating it in a single line (217 µs): H.max() is evaluated only once either way.
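To confirm the single evaluation directly rather than by timing, here is a small sketch (noisy_max is a hypothetical wrapper introduced only to count calls). Python evaluates H.max() once per statement, and the scalar result is then broadcast over the array:
import numpy as np

calls = 0

def noisy_max(arr):
    # hypothetical wrapper counting how often the maximum is computed
    global calls
    calls += 1
    return arr.max()

H = np.random.random(100000)
S = np.random.random(100000)

SP = noisy_max(H) * S  # scalar * array broadcast
print(calls)           # 1 -> the maximum is computed exactly once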