I have written a Python function that computes pairwise electromagnetic interactions between a largish number (N ~ 10^3) of particles and stores the results in a 3N x 3N complex128 ndarray (an NxN grid of 3x3 blocks). It runs, but it is the slowest part of a larger program, taking about 40 seconds when N = 900. The original code looks like this:
import numpy as np
def interaction(s, alpha, kprop):
    # s is an Nx3 real array
    # alpha is complex
    # kprop is float
    ndipoles = s.shape[0]
    Amat = np.zeros((ndipoles, 3, ndipoles, 3), dtype=np.complex128)
    I = np.array([[1,0,0],[0,1,0],[0,0,1]])
    im = complex(0,1)
    k2 = kprop*kprop
    for i in range(ndipoles):
        xi = s[i,:]
        for j in range(ndipoles):
            if i != j:
                xj = s[j,:]
                dx = xi-xj
                R = np.sqrt(dx.dot(dx))
                n = dx/R
                kR = kprop*R
                kR2 = kR*kR
                A = ((1./kR2) - im/kR)
                nxn = np.outer(n, n)
                nxn = (3*A-1)*nxn + (1-A)*I
                nxn *= -alpha*(k2*np.exp(im*kR))/R
            else:
                nxn = I
            Amat[i,:,j,:] = nxn
    return Amat.reshape((3*ndipoles, 3*ndipoles))
I had never previously used Cython, but that seemed like a good place to start in my effort to speed things up, so I pretty much blindly adapted the techniques I found in online tutorials. I got some speedup (30 seconds vs. 40 seconds), but not nearly as dramatic as I expected, so I'm wondering whether I'm doing something wrong or am missing a critical step. The following is my best attempt at cythonizing the above routine:
import numpy as np
cimport numpy as np
DTYPE = np.complex128
ctypedef np.complex128_t DTYPE_t
def interaction(np.ndarray s, DTYPE_t alpha, float kprop):
    cdef float k2 = kprop*kprop
    cdef int i, j
    cdef np.ndarray xi, xj, dx, n, nxn
    cdef float R, kR, kR2
    cdef DTYPE_t A
    cdef int ndipoles = s.shape[0]
    cdef np.ndarray Amat = np.zeros((ndipoles, 3, ndipoles, 3), dtype=DTYPE)
    cdef np.ndarray I = np.array([[1,0,0],[0,1,0],[0,0,1]])
    cdef DTYPE_t im = complex(0,1)
    for i in range(ndipoles):
        xi = s[i,:]
        for j in range(ndipoles):
            if i != j:
                xj = s[j,:]
                dx = xi-xj
                R = np.sqrt(dx.dot(dx))
                n = dx/R
                kR = kprop*R
                kR2 = kR*kR
                A = ((1./kR2) - im/kR)
                nxn = np.outer(n, n)
                nxn = (3*A-1)*nxn + (1-A)*I
                nxn *= -alpha*(k2*np.exp(im*kR))/R
            else:
                nxn = I
            Amat[i,:,j,:] = nxn
    return Amat.reshape((3*ndipoles, 3*ndipoles))
The real power of NumPy lies in performing an operation across a huge number of elements in one vectorized call rather than applying it to small chunks of data inside loops. In your case you have two nested loops and an if/else branch. I would propose extending the dimensions of the intermediate arrays so that NumPy's broadcasting comes into play and the same operations are applied to all elements in one go.
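As a quick illustration of what adding an axis with None does (a small sketch with a toy array, not part of the original routine) -
import numpy as np
s = np.random.rand(4, 3)          # toy positions
sd = s[:, None, :] - s            # (4,1,3) - (4,3) broadcasts to (4,4,3)
print(sd.shape)                   # (4, 4, 3): every pairwise difference at once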
For extending the dimensions, None/np.newaxis is used in the same way throughout. The vectorized implementation following that premise looks like this -
def vectorized_interaction(s, alpha, kprop):
    N = s.shape[0]
    im = complex(0,1)
    I = np.array([[1,0,0],[0,1,0],[0,0,1]])
    k2 = kprop*kprop
    # Vectorized calculations for dx, R, n, kR, A
    sd = s[:,None] - s
    Rv = np.sqrt((sd**2).sum(2))
    nv = sd/Rv[:,:,None]
    kRv = Rv*kprop
    Av = (1./(kRv*kRv)) - im/kRv
    # Vectorized calculation for: "nxn = np.outer(n, n)"
    nxnv = nv[:,:,:,None]*nv[:,:,None,:]
    # Vectorized calculation for: "(3*A-1)*nxn + (1-A)*I"
    P = (3*Av[:,:,None,None]-1)*nxnv + (1-Av[:,:,None,None])*I
    # Vectorized calculation for: "-alpha*(k2*np.exp(im*kR))/R"
    multv = -alpha*(k2*np.exp(im*kRv))/Rv
    # Vectorized calculation for: "nxn *= -alpha*(k2*np.exp(im*kR))/R"
    outv = P*multv[:,:,None,None]
    # Simulate the ELSE part of "if i != j:" by setting the 3x3 blocks
    # on the diagonal to I
    outv[np.eye(N, dtype=bool)] = I
    return outv.transpose(0,2,1,3).reshape(N*3,-1)
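One caveat: sd has zeros along its diagonal, so the divisions above emit divide-by-zero/invalid warnings before those diagonal 3x3 blocks are overwritten with I. If the warnings are a nuisance, a minimal sketch of silencing them at the call site (optional) -
with np.errstate(divide='ignore', invalid='ignore'):
    out = vectorized_interaction(s, alpha, kprop)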
Runtime tests and output verification -
Case #1:
In [703]: N = 10
...: s = np.random.rand(N,3) + complex(0,1)*np.random.rand(N,3)
...: alpha = 3j
...: kprop = 5.4
...:
In [704]: out_org = interaction(s,alpha,kprop)
...: out_vect = vectorized_interaction(s,alpha,kprop)
...: print np.allclose(np.real(out_org),np.real(out_vect))
...: print np.allclose(np.imag(out_org),np.imag(out_vect))
...:
True
True
In [705]: %timeit interaction(s,alpha,kprop)
100 loops, best of 3: 7.6 ms per loop
In [706]: %timeit vectorized_interaction(s,alpha,kprop)
1000 loops, best of 3: 304 µs per loop
Case #2:
In [707]: N = 100
...: s = np.random.rand(N,3) + complex(0,1)*np.random.rand(N,3)
...: alpha = 3j
...: kprop = 5.4
...:
In [708]: out_org = interaction(s,alpha,kprop)
...: out_vect = vectorized_interaction(s,alpha,kprop)
...: print np.allclose(np.real(out_org),np.real(out_vect))
...: print np.allclose(np.imag(out_org),np.imag(out_vect))
...:
True
True
In [709]: %timeit interaction(s,alpha,kprop)
1 loops, best of 3: 826 ms per loop
In [710]: %timeit vectorized_interaction(s,alpha,kprop)
100 loops, best of 3: 14 ms per loop
Case #3:
In [711]: N = 900
...: s = np.random.rand(N,3) + complex(0,1)*np.random.rand(N,3)
...: alpha = 3j
...: kprop = 5.4
...:
In [712]: out_org = interaction(s,alpha,kprop)
...: out_vect = vectorized_interaction(s,alpha,kprop)
...: print np.allclose(np.real(out_org),np.real(out_vect))
...: print np.allclose(np.imag(out_org),np.imag(out_vect))
...:
True
True
In [713]: %timeit interaction(s,alpha,kprop)
1 loops, best of 3: 1min 7s per loop
In [714]: %timeit vectorized_interaction(s,alpha,kprop)
1 loops, best of 3: 1.59 s per loop
My problem is the following. I have two arrays X and Y of shape (n, p), where p >> n (e.g. n = 50, p = 10000).
I also have a mask, mask (a 1-d boolean array of size p, i.e. along the p dimension), of small density (e.g. np.mean(mask) is 0.05).
I want to compute, as fast as possible, the inner product of X and Y with respect to mask: the output inner is an array of shape (n, n) such that inner[i, j] = np.sum(X[i, np.logical_not(mask)] * Y[j, np.logical_not(mask)]).
I have tried using the numpy.ma library, but it is quite slow for my use:
import numpy as np
import numpy.ma as ma
n, p = 50, 10000
density = 0.05
mask = np.array(np.random.binomial(1, density, size=p), dtype=np.bool_)
mask_big = np.ones(n)[:, None] * mask[None, :]
X = np.random.randn(n, p)
Y = np.random.randn(n, p)
X_ma = ma.array(X, mask=mask_big)
Y_ma = ma.array(Y, mask=mask_big)
But then, on my machine, X_ma.dot(Y_ma.T) is about 5 times slower than X.dot(Y.T)...
To begin with, I think part of the problem is that .dot does not know that the mask only applies along p, but I don't know if it's possible to use this information.
I'm looking for a way to perform the computation without being much slower than the naive dot.
Thanks a lot !
We can use matrix multiplication on both the full arrays and the masked columns: subtracting the masked-columns product from the full product yields the desired output -
inner = X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
Or simply use the inverted mask directly, though that is slower for a sparse mask -
inner = X[:,~mask].dot(Y[:,~mask].T)
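As a quick sanity check that the two expressions agree (a small sketch with arbitrary shapes) -
import numpy as np
n, p = 6, 40
X = np.random.randn(n, p)
Y = np.random.randn(n, p)
mask = np.random.rand(p) > 0.9
a = X.dot(Y.T) - X[:, mask].dot(Y[:, mask].T)
b = X[:, ~mask].dot(Y[:, ~mask].T)
print(np.allclose(a, b))   # expected: True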
Timings -
In [34]: np.random.seed(0)
...: p,n = 10000,50
...: X = np.random.rand(n,p)
...: Y = np.random.rand(n,p)
...: mask = np.random.rand(p)>0.95
In [35]: mask.mean()
Out[35]: 0.0507
In [36]: %timeit X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
100 loops, best of 3: 2.54 ms per loop
In [37]: %timeit X[:,~mask].dot(Y[:,~mask].T)
100 loops, best of 3: 4.1 ms per loop
In [39]: %%timeit
...: inner = np.empty((n,n))
...: for i in range(X.shape[0]):
...:     for j in range(X.shape[0]):
...:         inner[i, j] = np.sum(X[i, ~mask] * Y[j, ~mask])
1 loop, best of 3: 302 ms per loop
I want to implement the following problem in numpy, and here is my code.
I've tried the numpy code below, which uses one for loop. I am wondering if there is a more efficient way of doing this calculation? I really appreciate any help!
k, d = X.shape
m = Y.shape[0]
c1 = 2.0*sigma**2
c2 = 0.5*np.log(np.pi*c1)
c3 = np.log(1.0/k)
L_B = np.zeros((m,))
for i in xrange(m):
    if i % 100 == 0:
        print i
    L_B[i] = np.log(np.sum(np.exp(np.sum(-np.divide(
        np.power(X-Y[i,:],2), c1)-c2,1)+c3)))
print np.mean(L_B)
I've thought of np.expand_dims(X, 2).repeat(Y.shape[0], 2) - Y, creating a 3D tensor so that the subsequent calculation can be done by broadcasting, but that would waste a lot of memory when m is large.
I also believe that np.einsum() uses nothing but for loops internally, so it might not be that efficient; correct me if I am wrong.
Any thought?
Optimization Stage #1
My first level of optimization is a direct translation of the loopy code to a broadcasting-based one, introducing a new axis, and as such not very memory efficient, as listed below -
p1 = (-((X[:,None] - Y)**2)/c1)-c2
p11 = p1.sum(2)
p2 = np.exp(p11+c3)
out = np.log(p2.sum(0)).mean()
Optimization Stage #2
Bringing in a few optimizations, keeping in mind that we intend to separate out the operations on the constants, I ended up with the following -
c10 = -c1
c20 = X.shape[1]*c2
subs = (X[:,None] - Y)**2
p00 = subs.sum(2)
p10 = p00/c10
p11 = p10-c20
p2 = np.exp(p11+c3)
out = np.log(p2.sum(0)).mean()
Optimization Stage #3
Going further and looking at the places where the operations could be optimized, I ended up using Scipy's cdist to replace the heavyweight work of the squaring and sum-reduction. This should be pretty memory efficient, and it gives us the final implementation, as shown below -
from scipy.spatial.distance import cdist
# Setup constants
c10 = -c1
c20 = X.shape[1]*c2
c30 = c20-c3
c40 = np.exp(c30)
c50 = np.log(c40)
# Get stagewise operations corresponding to loopy ones
p1 = cdist(X, Y, 'sqeuclidean')
p2 = np.exp(p1/c10).sum(0)
out = np.log(p2).mean() - c50
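As a quick sanity check (a small sketch with arbitrary shapes), cdist's 'sqeuclidean' metric matches the explicit broadcasted squared-distance sum used in the earlier stages -
import numpy as np
from scipy.spatial.distance import cdist
Xs = np.random.rand(5, 4)
Ys = np.random.rand(7, 4)
d1 = cdist(Xs, Ys, 'sqeuclidean')            # (5, 7) pairwise squared distances
d2 = ((Xs[:, None, :] - Ys)**2).sum(-1)      # same thing via broadcasting
print(np.allclose(d1, d2))                   # expected: True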
Runtime test
Approaches -
def loopy_app(X, Y, sigma):
    k, d = X.shape
    m = Y.shape[0]
    c1 = 2.0*sigma**2
    c2 = 0.5*np.log(np.pi*c1)
    c3 = np.log(1.0/k)
    L_B = np.zeros((m,))
    for i in xrange(m):
        L_B[i] = np.log(np.sum(np.exp(np.sum(-np.divide(
            np.power(X-Y[i,:],2), c1)-c2,1)+c3)))
    return np.mean(L_B)
def vectorized_app(X, Y, sigma):
    # Setup constants
    k, d = X.shape
    c1 = 2.0*sigma**2
    c2 = 0.5*np.log(np.pi*c1)
    c3 = np.log(1.0/k)
    c10 = -c1
    c20 = X.shape[1]*c2
    c30 = c20-c3
    c40 = np.exp(c30)
    c50 = np.log(c40)
    # Get stagewise operations corresponding to the loopy ones
    p1 = cdist(X, Y, 'sqeuclidean')
    p2 = np.exp(p1/c10).sum(0)
    out = np.log(p2).mean() - c50
    return out
Timings and verification -
In [294]: # Setup inputs with m (= Y.shape[0]) being a large number
...: X = np.random.randint(0,9,(100,10))
...: Y = np.random.randint(0,9,(10000,10))
...: sigma = 2.34
...:
In [295]: np.allclose(loopy_app(X, Y, sigma),vectorized_app(X, Y, sigma))
Out[295]: True
In [296]: %timeit loopy_app(X, Y, sigma)
1 loops, best of 3: 225 ms per loop
In [297]: %timeit vectorized_app(X, Y, sigma)
10 loops, best of 3: 23.6 ms per loop
In [298]: # Setup inputs with m (= Y.shape[0]) being a much larger number
...: X = np.random.randint(0,9,(100,10))
...: Y = np.random.randint(0,9,(100000,10))
...: sigma = 2.34
...:
In [299]: np.allclose(loopy_app(X, Y, sigma),vectorized_app(X, Y, sigma))
Out[299]: True
In [300]: %timeit loopy_app(X, Y, sigma)
1 loops, best of 3: 2.27 s per loop
In [301]: %timeit vectorized_app(X, Y, sigma)
1 loops, best of 3: 243 ms per loop
Around 10x speedup there!
I have a vector, a, which I wish to cross with every point in a defined 3D space.
import numpy as np
# Grid
x = np.arange(-4,4,0.1)
y = np.arange(-4,4,0.1)
z = np.arange(-4,4,0.1)
a = [1,0,0]
result = [[] for i in range(3)]
for j in range(len(x)):          # loop on x coords
    for k in range(len(y)):      # loop on y coords
        for l in range(len(z)):  # loop on z coords
            r = [x[j], y[k], z[l]]
            result[0].append(np.cross(a, r)[0])
            result[1].append(np.cross(a, r)[1])
            result[2].append(np.cross(a, r)[2])
This produces an array which holds the cross product of a with every point in space. However, the process takes far too long, due to the nested loops. Is there any way to exploit vectors (meshgrid perhaps?) to make this process faster?
Here's one vectorized approach -
np.cross(a, np.array(np.meshgrid(x,y,z)).transpose(2,1,3,0)).reshape(-1,3).T
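For reference, original_app used in the sample run below is assumed to be the question's triple loop wrapped in a function returning a (3, len(x)*len(y)*len(z)) array; a minimal sketch -
def original_app(x, y, z, a):
    # The question's loop, wrapped so it can be timed and compared
    result = [[] for _ in range(3)]
    for xj in x:
        for yk in y:
            for zl in z:
                c = np.cross(a, [xj, yk, zl])
                for axis in range(3):
                    result[axis].append(c[axis])
    return np.array(result)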
Sample run -
In [403]: x = np.random.rand(4)
...: y = np.random.rand(5)
...: z = np.random.rand(6)
...:
In [404]: result = original_app(x,y,z,a)
In [405]: out = np.cross(a, np.array(np.meshgrid(x,y,z)).\
transpose(2,1,3,0)).reshape(-1,3).T
In [406]: np.allclose(result[0], out[0])
Out[406]: True
In [407]: np.allclose(result[1], out[1])
Out[407]: True
In [408]: np.allclose(result[2], out[2])
Out[408]: True
Runtime test -
# Original setup used in the question
In [393]: # Grid
...: x = np.arange(-4,4,0.1)
...: y = np.arange(-4,4,0.1)
...: z = np.arange(-4,4,0.1)
...:
# Original approach
In [397]: %timeit original_app(x,y,z,a)
1 loops, best of 3: 21.5 s per loop
# #Denziloe's soln
In [395]: %timeit [np.cross(a, r) for r in product(x, y, z)]
1 loops, best of 3: 7.34 s per loop
# Proposed in this post
In [396]: %timeit np.cross(a, np.array(np.meshgrid(x,y,z)).\
transpose(2,1,3,0)).reshape(-1,3).T
100 loops, best of 3: 16 ms per loop
More than 1000x speedup over the original approach and more than 450x over the loopy approach from the other post.
This takes a couple of seconds to run on my machine:
from itertools import product
result = [np.cross(a, r) for r in product(x, y, z)]
I don't know if that's fast enough for you, but there are a lot of calculations involved. It's certainly cleaner, and it removes at least some redundancy (the original calculates np.cross(a, r) three times per point). It also gives the result in a slightly different format, but this is the natural way to store the result and is hopefully fine for your purposes.
I have a code running operations on numpy arrays.
While linear algebra operations seem fast, I now am finding a bottleneck in a different issue: the summation of two distinct arrays.
In the example below, wE3 and T1 are two 1000x1000x1000 arrays.
First I calculate wE3 using a numpy operation, then I sum the two arrays.
import numpy as np
import scipy as sp
import time
N = 100
n = 1000
X = np.random.uniform(size = (N,n))
wE = np.mean(X,0)
wE3 = np.einsum('i,j,k->ijk', wE, wE, wE) #22 secs
T1 = np.random.uniform(size = (n,n,n))
a = wE3 + T1 #115 secs
The calculation of wE3 takes about 22 seconds, while the addition of wE3 and T1 takes 115 seconds.
Is there any known reason why the summation of those arrays is so much slower than the calculation of wE3? They should have more or less the same complexity.
Is there a way to speed up that code?
Is there any known reason why the summation of those arrays is so much slower than the calculation of wE3?
The arrays wE3, T1 and a each require 8 gigabytes of memory. You are probably running out of physical memory, and swap memory access is killing your performance.
Is there a way to speed up that code?
Get more physical memory (i.e. RAM).
If that is not possible, take a look at what you are going to do with these arrays, and see if you can work in batches such that the total memory required when processing a batch remains within the limits of your physical memory.
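A minimal sketch of what that batching could look like, assuming the downstream work only needs one slab of wE3 + T1 at a time (the slab size and the processing step are placeholders) -
batch = 50                            # rows of the (n, n, n) cube per step; tune to available RAM
for start in range(0, n, batch):
    stop = min(start + batch, n)
    # build only this slab of wE3, then add the matching slab of T1
    slab = wE[start:stop, None, None] * wE[None, :, None] * wE[None, None, :]
    slab += np.random.uniform(size=(stop - start, n, n))   # stands in for T1[start:stop]
    # ... process `slab` here, then let it go out of scope ...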
That np.einsum('i,j,k->ijk', wE, wE, wE) part isn't doing any sum-reduction and is essentially just broadcasted elementwise multiplication. So, we can replace that with something like this -
wE[:,None,None] * wE[:,None] * wE
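A quick check on a tiny vector that the two forms agree (a small sketch) -
wE_small = np.random.rand(5)
a = np.einsum('i,j,k->ijk', wE_small, wE_small, wE_small)
b = wE_small[:,None,None] * wE_small[:,None] * wE_small
print(np.allclose(a, b))   # expected: True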
Runtime test -
In [9]: # Setup inputs at 1/5th of original dataset sizes
...: N = 20
...: n = 200
...: X = np.random.uniform(size = (N,n))
...: wE = np.mean(X,0)
...:
In [10]: %timeit np.einsum('i,j,k->ijk', wE, wE, wE)
10 loops, best of 3: 45.7 ms per loop
In [11]: %timeit wE[:,None,None] * wE[:,None] * wE
10 loops, best of 3: 26.1 ms per loop
Next up, we have wE3 + T1 with T1 = np.random.uniform(size = (n,n,n)). That part doesn't look like it can be helped in a big way, as we have to create T1 anyway and the rest is just element-wise addition. We can, however, use np.add, which lets us write the result back into one of the existing arrays, wE3 or T1. Let's say we choose T1, assuming it's okay to modify it. This brings a slight memory benefit, as we avoid allocating another array of that size.
Thus, we could do -
np.add(wE3,T1,out=T1)
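Equivalently, the in-place operator form does the same thing -
T1 += wE3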
Runtime test -
In [58]: def func1(wE3):
...:     T1 = np.random.uniform(size = (n,n,n))
...:     return wE3 + T1
...:
...: def func2(wE3):
...:     T1 = np.random.uniform(size = (n,n,n))
...:     np.add(wE3,T1,out=T1)
...:     return T1
...:
In [59]: # Setup inputs at 1/4th of original dataset sizes
...: N = 25
...: n = 250
...: X = np.random.uniform(size = (N,n))
...: wE = np.mean(X,0)
...: wE3 = np.einsum('i,j,k->ijk', wE, wE, wE)
...:
In [60]: %timeit func1(wE3)
1 loops, best of 3: 390 ms per loop
In [61]: %timeit func2(wE3)
1 loops, best of 3: 363 ms per loop
Following @Aaron's suggestion, we can use a loop over the first axis and, assuming that writing the results back into wE3 is okay, do -
wE3 = wE[:,None,None] * wE[:,None] * wE
for x in wE3:
    np.add(x, np.random.uniform(size = (n,n)), out=x)
Final results
Thus, putting together all the suggested improvements, the final runtime test results were -
In [97]: def func1(wE):
...:     wE3 = np.einsum('i,j,k->ijk', wE, wE, wE)
...:     T1 = np.random.uniform(size = (n,n,n))
...:     return wE3 + T1
...:
...: def func2(wE):
...:     wE3 = wE[:,None,None] * wE[:,None] * wE
...:     for x in wE3:
...:         np.add(x, np.random.uniform(size = (n,n)), out=x)
...:     return wE3
...:
In [98]: # Setup inputs at 1/3rd of original dataset sizes
...: N = 33
...: n = 330
...: X = np.random.uniform(size = (N,n))
...: wE = np.mean(X,0)
...:
In [99]: %timeit func1(wE)
1 loops, best of 3: 1.09 s per loop
In [100]: %timeit func2(wE)
1 loops, best of 3: 879 ms per loop
You should really use Numba's JIT (just-in-time compiler) for this. Your code is a purely numpy pipeline, which is perfect for Numba.
All you have to do is throw the above code into a function and put an @jit decorator on top. It gets speedups close to Cython.
However, as others have pointed out, it appears you're trying to work with data too large for your local machine, and Numba would not solve that problem.
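For completeness, a minimal sketch of what the @jit suggestion might look like (assuming numba is installed; the function shown is illustrative, not the exact code from the question) -
import numpy as np
from numba import jit

@jit(nopython=True)
def add_outer_cube(wE, T1):
    # Builds wE3 on the fly and adds T1, element by element
    n = wE.shape[0]
    out = np.empty_like(T1)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                out[i, j, k] = wE[i] * wE[j] * wE[k] + T1[i, j, k]
    return out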
A numerical integration is taking much longer than I expect it to. I would like to know whether the way I implement the iteration over the mesh could be a contributing factor. My code looks like this:
import numpy as np
import itertools as it
U = np.linspace(0, 2*np.pi)
V = np.linspace(0, np.pi)
for (u, v) in it.product(U, V):
    # values = computation on each grid point, does not call any outside functions
    # solution = sum(values)
return solution  # (the loop lives inside a larger function)
I left out the computations because they are long and my question is specifically about the way that I have implemented the computation over the parameter space (u, v). I know of alternatives such as numpy.meshgrid; however, these all seem to create instances of (very large) matrices, and I would guess that storing them in memory would slow things down.
Is there an alternative to it.product that would speed up my program, or should I be looking elsewhere for the bottleneck?
Edit: Here is the for loop in question (to see if it can be vectorized).
import random
import numpy as np
import itertools as it
##########################################################################
# Initialize the inputs with random (to save space)
##########################################################################
mat1 = np.array([[random.random() for i in range(3)] for i in range(3)])
mat2 = np.array([[random.random() for i in range(3)] for i in range(3)])
a1, a2, a3 = np.array([random.random() for i in range(3)])
plane_normal = np.array([random.random() for i in range(3)])
plane_point = np.array([random.random() for i in range(3)])
d = np.dot(plane_normal, plane_point)
truthval = True
##########################################################################
# Initialize the loop
##########################################################################
N = 100
U = np.linspace(0, 2*np.pi, N + 1, endpoint = False)
V = np.linspace(0, np.pi, N + 1, endpoint = False)
U = U[1:N+1]
V = V[1:N+1]
Vsum = 0
Usum = 0
##########################################################################
# The for loops starts here
##########################################################################
for (u, v) in it.product(U, V):
    cart_point = np.array([a1*np.cos(u)*np.sin(v),
                           a2*np.sin(u)*np.sin(v),
                           a3*np.cos(v)])
    surf_normal = np.array(
        [2*x / a**2 for (x, a) in zip(cart_point, [a1, a2, a3])])
    differential_area = (np.sqrt((a1*a2*np.cos(v)*np.sin(v))**2 +
                                 a3**2*np.sin(v)**4 *
                                 ((a2*np.cos(u))**2 + (a1*np.sin(u))**2)) *
                         (np.pi**2 / (2*N**2)))
    if (np.dot(plane_normal, cart_point) - d > 0) == truthval:
        perp_normal = plane_normal
        f = np.dot(np.dot(mat2, surf_normal), perp_normal)
        Vsum += f*differential_area
    else:
        perp_normal = -plane_normal
        f = np.dot(np.dot(mat2, surf_normal), perp_normal)
        Usum += f*differential_area
integral = abs(Vsum) + abs(Usum)
If U.shape == (nu,) and V.shape == (nv,), then the following arrays vectorize most of your calculations. With numpy you get the best speed by using arrays for the largest dimensions and looping over the small ones (e.g. 3x3).
Corrected version
A = np.cos(U)[:,None]*np.sin(V)
B = np.sin(U)[:,None]*np.sin(V)
C = np.repeat(np.cos(V)[None,:],U.size,0)
CP = np.dstack([a1*A, a2*B, a3*C])
SN = np.dstack([2*A/a1, 2*B/a2, 2*C/a3])
DA1 = (a1*a2*np.cos(V)*np.sin(V))**2
DA2 = a3*a3*np.sin(V)**4
DA3 = (a2*np.cos(U))**2 + (a1*np.sin(U))**2
DA = DA1 + DA2 * DA3[:,None]
DA = np.sqrt(DA)*(np.pi**2 / (2*U.size*V.size))
D = np.dot(CP, plane_normal)
S = np.sign(D-d)
F1 = np.dot(np.dot(SN, mat2.T), plane_normal)
F = F1 * DA
#F = F * S # apply sign
Vsum = F[S>0].sum()
Usum = F[S<=0].sum()
With the same random inputs, this produces the same values as the loop. On a 100x100 case, it is 10x faster. It's been fun playing with these matrices after a year.
In IPython I did some simple sum calculations on your 50 x 50 grid:
In [31]: sum(u*v for (u,v) in it.product(U,V))
Out[31]: 12337.005501361698
In [33]: UU,VV = np.meshgrid(U,V); sum(sum(UU*VV))
Out[33]: 12337.005501361693
In [34]: timeit UU,VV = np.meshgrid(U,V); sum(sum(UU*VV))
1000 loops, best of 3: 293 us per loop
In [35]: timeit sum(u*v for (u,v) in it.product(U,V))
100 loops, best of 3: 2.95 ms per loop
In [38]: timeit list(it.product(U,V))
1000 loops, best of 3: 213 us per loop
In [45]: timeit UU,VV = np.meshgrid(U,V); (UU*VV).sum().sum()
10000 loops, best of 3: 70.3 us per loop
# using numpy's own sum is even better
product is slower (by a factor of 10), not because product itself is slow, but because of the point-by-point calculation. If you can vectorize your calculations so they operate on the two (50,50) arrays (without any sort of looping), it should speed up the overall time. That's the main reason for using numpy.
[k for k in it.product(U,V)] runs in 2 ms for me, and the itertools package is built to be efficient, e.g. it does not create an intermediate list first (http://docs.python.org/2/library/itertools.html).
The culprit seems to be your code inside the iteration, or that you're using a lot of points in linspace.
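To make the vectorization advice concrete for the integration question above: build the (U, V) grid once and evaluate the integrand on the whole grid at once. A minimal sketch with a placeholder integrand -
UU, VV = np.meshgrid(U, V)
values = np.sin(VV)*np.cos(UU)   # placeholder for the real per-point computation f(u, v)
solution = values.sum()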