I want to calculate the squared Euclidean distance between two sets of points, inputs and testing. inputs is typically a real array of size ~(200, N), whereas testing is typically ~(1e8, N), and N is around 10. The distances should be scaled in each of the N dimensions, so I'm summing the expression scale[j]*(inputs[i,j] - testing[ii,j])**2 (where scale is the scaling vector) over the N dimensions. I am trying to make this as fast as possible, particularly as N can be large. My first test is
def old_version(inputs, testing, x0):
    nn, d1 = testing.shape
    n, d1 = inputs.shape
    b = np.zeros((n, nn))
    for d in xrange(d1):
        b += x0[d] * (((np.tile(inputs[:, d], (nn, 1)) -
                        np.tile(testing[:, d], (n, 1)).T))**2).T
    return b
Nothing too fancy. I then tried using scipy.spatial.distance.cdist, although I still have to loop over the dimensions to get the scaling right
def new_version(inputs, testing, x0):
    import scipy.spatial.distance as dist
    nn, d1 = testing.shape
    n, d1 = inputs.shape
    b = np.zeros((n, nn))
    for d in xrange(d1):
        b += x0[d] * dist.cdist(inputs[:, d][:, None],
                                testing[:, d][:, None], 'sqeuclidean')
    return b
It would appear that new_version scales better (as N > 1000), but I'm not sure that I've gone as fast as possible here. Any further ideas much appreciated!
This code gave me a factor of 10 over your implementation; give it a try:
x = np.random.randn(200, 10)
y = np.random.randn(int(1e5), 10)
scale = np.abs(np.random.randn(1, 10))
scale_sqrt = np.sqrt(scale)
dist_map = dist.cdist(x*scale_sqrt, y*scale_sqrt, 'sqeuclidean')
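This works because scale[j]*(x[j] - y[j])**2 equals (sqrt(scale[j])*x[j] - sqrt(scale[j])*y[j])**2, so pre-multiplying both point sets by scale_sqrt lets a single cdist call produce the fully weighted matrix. A minimal wrapper might look like the sketch below (the name suggested_version is my assumption, chosen to line up with the timings that follow):

import numpy as np
import scipy.spatial.distance as dist

def suggested_version(inputs, testing, x0):
    # x0 is the length-N scaling vector; reshape it so it broadcasts over rows
    scale_sqrt = np.sqrt(np.asarray(x0).reshape(1, -1))
    # a single cdist call on the pre-scaled points gives the (n, nn) weighted matrix
    return dist.cdist(inputs * scale_sqrt, testing * scale_sqrt, 'sqeuclidean')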
These are the test results:
In [135]: %timeit suggested_version(inputs, testing, x0)
1 loops, best of 3: 341 ms per loop
In [136]: %timeit op_version(inputs, testing, x00) (NOTICE: x00 is a reshape of x0)
1 loops, best of 3: 3.37 s per loop
Just make sure that when you go for the larger N you don't run low on memory. It can really slow things down.
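At the full problem size even the output alone is a (200, 1e8) array, roughly 160 GB in float64, so in practice the testing set usually has to be processed in blocks. A rough sketch under that assumption (the function name, chunk size, and generator interface are my own choices):

import numpy as np
import scipy.spatial.distance as dist

def chunked_distances(inputs, testing, x0, chunk=1000000):
    # pre-scale once; each yielded block is an (n, chunk) slice of the full matrix
    scale_sqrt = np.sqrt(np.asarray(x0).reshape(1, -1))
    xs = inputs * scale_sqrt
    for start in range(0, testing.shape[0], chunk):
        block = testing[start:start + chunk] * scale_sqrt
        yield dist.cdist(xs, block, 'sqeuclidean')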
I'm currently writing code where I need to compute as fast as possible a kind of inner product between three 2-D arrays.
Let's call them a,b,c. They all have the same size (N x M).
I want to compute the following 3-d array, op, of size (N x N x N), such that op[i, j, k] is the sum over m of the a[i, m] b[j, m] c[k, m]
(in formula form: op[i, j, k] = sum over m of a[i, m] * b[j, m] * c[k, m])
This is basically an extended version of np.inner for 3 inputs rather than 2.
In practice, the dimensions I will run into are something like N = 100 and M = 300 000. The matrices are not going to be sparse at all, so op contains about 1 million nonzero values.
So far, I've attempted two methods.
The first one uses broadcasting:
import numpy as np
N = 100
M = 300000
a = np.random.randn(N, M)
b = np.random.randn(N, M)
c = np.random.randn(N, M)
def method1(a, b, c):
    a_i = a[:, None, None, :]
    b_j = b[None, :, None, :]
    c_k = c[None, None, :, :]
    return np.sum(a_i * b_j * c_k, axis=3)
The problem with this is that it first computes a_i * b_j * c_k, which is an N x N x N x M array; for N = 100 and M = 300 000 that is roughly 100·100·100·300 000 · 8 bytes ≈ 2.4 TB in float64, so in my case it is simply too much to handle.
I've tried another method using np.einsum, and it is much faster than the previous method:
def method2(a, b, c):
    return np.einsum('im,jm,km', a, b, c)
My problem is that it is still too slow. For N = 100 and M = 30 000, it already takes 95 seconds to run on my computer, so taking M to its actual value of 300 000 is impossible.
My question is: do you know any pythonic way to solve my problem (maybe a magic numpy function?), or do I have to resort to things like cython or numba to actually make this computation feasible?
Thanks in advance for any help!
Very interesting one and related to this other problem.
Approach #1: For decent size arrays
Based on the winning approach there at the above mentioned Q&A, here's one solution -
np.tensordot(a[:,None]*b,c,axes=(2,1))
Explanation:
1) a[:,None]*b : Get a 3D array of shape (N, N, M). So, for the use case, it would be (100, 100, 30000), i.e. roughly 2.4 GB in float64, which might be a bit too much for regular systems, but might just work out given some extra system memory juice.
2) np.tensordot(..): Next up, we would sum-reduce that last axis from previous step with tensor-dot against the third array c to have a (100, 100, 100) shaped output array.
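As a quick sanity check (my own sketch, with the sizes shrunk so the (N, N, M) intermediate stays tiny), the tensordot route agrees with the einsum definition:

import numpy as np

N, M = 10, 50
a = np.random.randn(N, M)
b = np.random.randn(N, M)
c = np.random.randn(N, M)

ref = np.einsum('im,jm,km', a, b, c)                # direct definition of op
out = np.tensordot(a[:, None] * b, c, axes=(2, 1))  # Approach #1
print(np.allclose(ref, out))                        # expected: True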
Approach #2: For very large arrays and with b identical to c
out = np.zeros((N, N, N))
for i in range(N):
    for j in range(N):
        for k in range(j+1):
            out[i,j,k] = np.einsum('i,i,i->', a[i], b[j], b[k])

# With b identical to c, op[i,j,k] == op[i,k,j], so only the lower triangle
# in (j, k) was computed above; mirror it into the upper triangle.
r, c = np.triu_indices(N, 1)
out[np.arange(N)[:,None], r, c] = out[np.arange(N)[:,None], c, r]
In Python I have a three-dimensional array M3 with size (n, n, n), and a 2-D array W with size (n, k).
In linear algebra, the multilinear map defined by M3 and applied to W would, in code, be:
X3 = np.zeros((k, k, k))
for i in xrange(k):
    for j in xrange(k):
        for t in xrange(k):
            for l in xrange(n):
                for m in xrange(n):
                    for h in xrange(n):
                        X3[i, j, t] += M3[l, m, h] * W[l, i] * W[m, j] * W[h, t]
See https://en.wikipedia.org/wiki/Multilinear_map for a reference.
This is very slow. I am wondering if there exists any alternative or any prebuilt function in numpy that can speed up these operations.
Here's an approach using a series of dot-products -
# Get partial products and thus reach to final output
p1 = np.tensordot(W,M3,axes=(0,0))
p2 = np.tensordot(p1,W,axes=(1,0))
X3out = np.tensordot(p2,W,axes=(1,0))
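A small sanity check (my own sketch): on tiny sizes the chained tensordot agrees with a direct einsum translation of the loop definition, and the intermediate shapes show which index each step sums over:

import numpy as np

n, k = 6, 3
M3 = np.random.randn(n, n, n)
W = np.random.randn(n, k)

p1 = np.tensordot(W, M3, axes=(0, 0))     # shape (k, n, n): sums over l
p2 = np.tensordot(p1, W, axes=(1, 0))     # shape (k, n, k): sums over m
X3out = np.tensordot(p2, W, axes=(1, 0))  # shape (k, k, k): sums over h

ref = np.einsum('lmh,li,mj,ht->ijt', M3, W, W, W)
print(np.allclose(X3out, ref))            # expected: True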
Einstein summation convention? Use np.einsum!
X3 = np.einsum('lmh,li,mj,ht->ijt', M3, W, W, W)
EDIT:
Out of curiosity, I just ran some benchmarks comparing np.einsum to Divakar's approach. The comparison comes out hugely against np.einsum!
import numpy as np

def approach1(a, b):
    x = np.einsum('lmh,li,mj,ht->ijt', a, b, b, b)
    return x

def approach2(a, b):
    p1 = np.tensordot(b, a, axes=(0,0))
    p2 = np.tensordot(p1, b, axes=(1,0))
    x = np.tensordot(p2, b, axes=(1,0))
    return x

n = 100
k = 10
a = np.random.random((n, n, n))
b = np.random.random((n, k))

%timeit approach1(a, b)  # => 1 loop, best of 3: 26 s per loop
%timeit approach2(a, b)  # => 100 loops, best of 3: 4.23 ms per loop
There's some discussion about it in this question. It all seems to come down to the generality that np.einsum tries to achieve, which comes at the cost of being able to offload the computation to low-level linear algebra packages.
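For what it's worth, newer NumPy releases (1.12 and later) let np.einsum close much of this gap through the optimize keyword, which picks a pairwise contraction order and uses BLAS where it can. A self-contained sketch:

import numpy as np

n, k = 100, 10
a = np.random.random((n, n, n))
b = np.random.random((n, k))

# With optimize=True, einsum contracts the operands pairwise (much like the
# tensordot chain above) instead of evaluating the four-operand sum naively.
x = np.einsum('lmh,li,mj,ht->ijt', a, b, b, b, optimize=True)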
I have to compute the Kullback-Leibler Divergence (KLD) between thousands of discrete probability vectors. Currently I am using the following code but it's way too slow for my purposes. I was wondering if there is any faster way to compute KL Divergence?
import numpy as np
import scipy.stats as sc

# n is the number of data points
# distributions is an (n, num_bins) array of discrete probability vectors
kld = np.zeros((n, n))
for i in range(0, n):
    for j in range(0, n):
        if(i != j):
            kld[i, j] = sc.entropy(distributions[i, :], distributions[j, :])
Scipy's stats.entropy by default expects 1D arrays as inputs and returns a scalar, which is how it is used in the question. Internally the function also broadcasts its inputs, which we can exploit here for a vectorized solution.
From the docs -
scipy.stats.entropy(pk, qk=None, base=None)
If only probabilities pk are given, the entropy is calculated as S = -sum(pk * log(pk), axis=0).
If qk is not None, then compute the Kullback-Leibler divergence S = sum(pk * log(pk / qk), axis=0).
In our case, we are doing these entropy calculations for each row against every other row, with the two nested loops performing a sum reduction to a scalar at each iteration. Thus, the output array has shape (M, M), where M is the number of rows in the input array.
Now, the catch here is that stats.entropy() sums along axis=0, so we feed it two views of distributions, both with the row dimension moved to axis=0 for the reduction and the other two axes arranged as (M, 1) and (1, M), giving an (M, M) shaped output array via broadcasting.
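A toy shape check (my own sketch) of the broadcasting described above:

import numpy as np
from scipy import stats

distributions = np.random.rand(5, 8)  # M = 5 rows, 8 probability bins each
pk = distributions.T[:, :, None]      # shape (8, 5, 1)
qk = distributions.T[:, None, :]      # shape (8, 1, 5)
print(stats.entropy(pk, qk).shape)    # (5, 5): KL of every row against every row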
Thus, a vectorized and much more efficient way to solve our case would be -
from scipy import stats
kld = stats.entropy(distributions.T[:,:,None], distributions.T[:,None,:])
Runtime tests and verify -
In [15]: def entropy_loopy(distrib):
...: n = distrib.shape[0] #n is the number of data points
...: kld = np.zeros((n, n))
...: for i in range(0, n):
...: for j in range(0, n):
...: if(i != j):
...: kld[i, j] = stats.entropy(distrib[i, :], distrib[j, :])
...: return kld
...:
In [16]: distrib = np.random.randint(0,9,(100,100)) # Setup input
In [17]: out = stats.entropy(distrib.T[:,:,None], distrib.T[:,None,:])
In [18]: np.allclose(entropy_loopy(distrib),out) # Verify
Out[18]: True
In [19]: %timeit entropy_loopy(distrib)
1 loops, best of 3: 800 ms per loop
In [20]: %timeit stats.entropy(distrib.T[:,:,None], distrib.T[:,None,:])
10 loops, best of 3: 104 ms per loop
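One caveat (my own addition): the broadcasted call materializes an intermediate of shape (num_bins, M, M), so for many thousands of distributions it can exhaust memory. A simple work-around is to process the rows in chunks, trading a short Python loop for bounded memory; a sketch, with the function name and chunk size as my own choices:

import numpy as np
from scipy import stats

def kld_chunked(distributions, chunk=64):
    M = distributions.shape[0]
    out = np.empty((M, M))
    dT = distributions.T  # shape (num_bins, M)
    for start in range(0, M, chunk):
        stop = min(start + chunk, M)
        # each call only broadcasts a (num_bins, chunk, M) intermediate
        out[start:stop] = stats.entropy(dT[:, start:stop, None], dT[:, None, :])
    return out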
I have to apply some mathematical formula that I've written
in python as:
for s in range(tdim):
    sum1 = 0.0
    for i in range(dim):
        for j in range(dim):
            sum1 += (0.5*np.cos(theta[s]*(i-j))*(eig1[i]*eig1[j]+eig2[i]+eig2[j])
                     - 0.5*np.sin(theta[s]*(i-j))*(eig1[j]*eig2[i]-eig1[i]*eig2[j]))
    PHi2.append(sum1)
Now, this is correct but clearly inefficient; the alternative I tried is:
for i in range(dim):
    for j in range(dim):
        PHi2 = (0.5*np.cos(theta*(i-j))*(eig1[i]*eig1[j]+eig2[i]+eig2[j])
                - 0.5*np.sin(theta*(i-j))*(eig1[j]*eig2[i]-eig1[i]*eig2[j]))
However, the second example gives me the same number in all elements of PHi2, so it is faster but the answer is wrong. How can I do this correctly and more efficiently?
NOTE: eig1 and eig2 have the same dimension d, and theta and PHi2 have the same dimension D, BUT d != D.
You can use a brute-force broadcasting approach, but then you are creating an intermediate array of shape (D, d, d), which can get out of hand if your arrays are even moderately large. Furthermore, broadcasting with no refinements recomputes, in the innermost loop, quantities that only need to be computed once. If you first compute the necessary terms for all possible values of i - j and sum them together, you can reuse those values in the outer loop, e.g.:
def fast_ops(eig1, eig2, theta):
    d = len(eig1)
    d_arr = np.arange(d)
    i_j = d_arr[:, None] - d_arr[None, :]
    reidx = i_j + d - 1
    mult1 = eig1[:, None] * eig1[None, :] + eig2[:, None] + eig2[None, :]
    mult2 = eig1[None, :] * eig2[:, None] - eig1[:, None] * eig2[None, :]
    mult1_reidx = np.bincount(reidx.ravel(), weights=mult1.ravel())
    mult2_reidx = np.bincount(reidx.ravel(), weights=mult2.ravel())
    angles = theta[:, None] * np.arange(1 - d, d)
    return 0.5 * (np.einsum('ij,j->i', np.cos(angles), mult1_reidx) -
                  np.einsum('ij,j->i', np.sin(angles), mult2_reidx))
If we rewrite M4rtini's code as a function for comparison:
def fast_ops1(eig1, eig2, theta):
    d = len(eig1)
    D = len(theta)
    s = np.array(range(D))[:, None, None]
    i = np.array(range(d))[:, None]
    j = np.array(range(d))
    ret = 0.5 * (np.cos(theta[s]*(i-j))*(eig1[i]*eig1[j]+eig2[i]+eig2[j]) -
                 np.sin(theta[s]*(i-j))*(eig1[j]*eig2[i]-eig1[i]*eig2[j]))
    return ret.sum(axis=(-1, -2))
And we make up some data:
d, D = 100, 200
eig1 = np.random.rand(d)
eig2 = np.random.rand(d)
theta = np.random.rand(D)
The speed improvement is very noticeable: about 80x on top of the 115x that the broadcasting version already gives over your original code, for a whopping ~9000x overall speed-up:
In [22]: np.allclose(fast_ops1(eig1, eig2, theta), fast_ops(eig1, eig2, theta))
Out[22]: True
In [23]: %timeit fast_ops1(eig1, eig2, theta)
10 loops, best of 3: 145 ms per loop
In [24]: %timeit fast_ops(eig1, eig2, theta)
1000 loops, best of 3: 1.85 ms per loop
This works by broadcasting. For tdim = 200 and dim = 100: 14 seconds with the original, 120 ms with this version.
s = np.array(range(tdim))[:, None, None]
i = np.array(range(dim))[:, None]
j = np.array(range(dim))
PHi2 = (0.5*np.cos(theta[s]*(i-j))*(eig1[i]*eig1[j]+eig2[i]+eig2[j]) -
        0.5*np.sin(theta[s]*(i-j))*(eig1[j]*eig2[i]-eig1[i]*eig2[j])).sum(axis=2).sum(axis=1)
In the first bit of code, you have 0.5*np.cos(theta[s]*(i-j))... but in the second it's 0.5*np.cos(theta*(i-j)).... Unless you've got theta defined differently for the second bit of code, this could well be the cause of the trouble.
A numerical integration is taking exponentially longer than I expect it to. I would like to know if the way that I implement the iteration over the mesh could be a contributing factor. My code looks like this:
import numpy as np
import itertools as it

U = np.linspace(0, 2*np.pi)
V = np.linspace(0, np.pi)

for (u, v) in it.product(U, V):
    # values = computation on each grid point, does not call any outside functions
    # solution = sum(values)
return solution
I left out the computations because they are long and my question is specifically about the way that I have implemented the computation over the parameter space (u, v). I know of alternatives such as numpy.meshgrid; however, these all seem to create instances of (very large) matrices, and I would guess that storing them in memory would slow things down.
Is there an alternative to it.product that would speed up my program, or should I be looking elsewhere for the bottleneck?
Edit: Here is the for loop in question (to see if it can be vectorized).
import random
import numpy as np
import itertools as it

##########################################################################
# Initialize the inputs with random (to save space)
##########################################################################
mat1 = np.array([[random.random() for i in range(3)] for i in range(3)])
mat2 = np.array([[random.random() for i in range(3)] for i in range(3)])
a1, a2, a3 = np.array([random.random() for i in range(3)])
plane_normal = np.array([random.random() for i in range(3)])
plane_point = np.array([random.random() for i in range(3)])
d = np.dot(plane_normal, plane_point)
truthval = True

##########################################################################
# Initialize the loop
##########################################################################
N = 100
U = np.linspace(0, 2*np.pi, N + 1, endpoint = False)
V = np.linspace(0, np.pi, N + 1, endpoint = False)
U = U[1:N+1]
V = V[1:N+1]
Vsum = 0
Usum = 0

##########################################################################
# The for loop starts here
##########################################################################
for (u, v) in it.product(U, V):
    cart_point = np.array([a1*np.cos(u)*np.sin(v),
                           a2*np.sin(u)*np.sin(v),
                           a3*np.cos(v)])
    surf_normal = np.array(
        [2*x / a**2 for (x, a) in zip(cart_point, [a1, a2, a3])])
    differential_area = \
        np.sqrt((a1*a2*np.cos(v)*np.sin(v))**2 +
                a3**2*np.sin(v)**4 *
                ((a2*np.cos(u))**2 + (a1*np.sin(u))**2)) * \
        (np.pi**2 / (2*N**2))
    if (np.dot(plane_normal, cart_point) - d > 0) == truthval:
        perp_normal = plane_normal
        f = np.dot(np.dot(mat2, surf_normal), perp_normal)
        Vsum += f*differential_area
    else:
        perp_normal = -plane_normal
        f = np.dot(np.dot(mat2, surf_normal), perp_normal)
        Usum += f*differential_area

integral = abs(Vsum) + abs(Usum)
If U.shape == (nu,) and V.shape == (nv,), then the following arrays vectorize most of your calculations. With numpy you get the best speed by using arrays for the largest dimensions and looping over the small ones (e.g. 3x3).
Corrected version
A = np.cos(U)[:,None]*np.sin(V)
B = np.sin(U)[:,None]*np.sin(V)
C = np.repeat(np.cos(V)[None,:],U.size,0)
CP = np.dstack([a1*A, a2*B, a3*C])
SN = np.dstack([2*A/a1, 2*B/a2, 2*C/a3])
DA1 = (a1*a2*np.cos(V)*np.sin(V))**2
DA2 = a3*a3*np.sin(V)**4
DA3 = (a2*np.cos(U))**2 + (a1*np.sin(U))**2
DA = DA1 + DA2 * DA3[:,None]
DA = np.sqrt(DA)*(np.pi**2 / (2*Nu*Nv))  # Nu, Nv are the grid sizes, U.size and V.size
D = np.dot(CP, plane_normal)
S = np.sign(D-d)
F1 = np.dot(np.dot(SN, mat2.T), plane_normal)
F = F1 * DA
#F = F * S # apply sign
Vsum = F[S>0].sum()
Usum = F[S<=0].sum()
With the same random values, this produces the same result as the loop version. On a 100x100 case, it is 10x faster. It's been fun playing with these matrices after a year.
In IPython I did simple sum calculations on your 50 x 50 grid
In [31]: sum(u*v for (u,v) in it.product(U,V))
Out[31]: 12337.005501361698
In [33]: UU,VV = np.meshgrid(U,V); sum(sum(UU*VV))
Out[33]: 12337.005501361693
In [34]: timeit UU,VV = np.meshgrid(U,V); sum(sum(UU*VV))
1000 loops, best of 3: 293 us per loop
In [35]: timeit sum(u*v for (u,v) in it.product(U,V))
100 loops, best of 3: 2.95 ms per loop
In [38]: timeit list(it.product(U,V))
1000 loops, best of 3: 213 us per loop
In [45]: timeit UU,VV = np.meshgrid(U,V); (UU*VV).sum().sum()
10000 loops, best of 3: 70.3 us per loop
# using numpy's own sum is even better
product is slower (by a factor of 10), not because product itself is slow, but because of the point-by-point calculation. If you can vectorize your calculations so that they operate on the two (50, 50) arrays (without any explicit looping), it should speed up the overall time. That's the main reason for using numpy.
[k for k in it.product(U,V)] runs in 2 ms for me, and the itertools package is made to be efficient, e.g. it does not create a long list first (http://docs.python.org/2/library/itertools.html).
The culprit seems to be the code inside your iteration, or the fact that you are using a lot of points in linspace.