Optimize Python: Large arrays, memory problems - python

I'm having a speed problem with some Python/NumPy code. I don't know how to make it faster; maybe someone else does?
Assume there is a surface with two triangulations, one fine (..._fine) with M points and one coarse with N points. There is also data on the coarse mesh at every point (N floats). I'm trying to do the following:
For every point on the fine mesh, find the k closest points on the coarse mesh and take the mean of their values. In short: interpolate data from coarse to fine.
My code currently looks like this. With large data (in my case M = 2e6, N = 1e4) it runs for about 25 minutes, presumably because the explicit for loop never drops into NumPy. Any ideas how to solve this with smart indexing? M x N arrays blow up the RAM.
import numpy as np

# p_fine:      (m, 3) array of fine-mesh points
# p:           (n, 3) array of coarse-mesh points
# data_coarse: (n,)   array of values on the coarse mesh
data_fine = np.empty((m,))
for i, ps in enumerate(p_fine):
    data_fine[i] = np.mean(data_coarse[np.argsort(np.linalg.norm(ps - p, axis=1))[:k]])
Cheers!

First of all, thanks for the detailed help.
First, Divakar, your solutions gave a substantial speed-up. With my data, the code ran for just below 2 minutes, depending a bit on the chunk size.
I also tried my way around sklearn and ended up with
def sklearnSearch_v3(p, p_fine, k):
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(p)
    return data_coarse[neigh.kneighbors(p_fine)[1]].mean(axis=1)
which ended up being quite fast. For my data sizes,
import numpy as np
from sklearn.neighbors import NearestNeighbors
m,n = 2000000,20000
p_fine = np.random.rand(m,3)
p = np.random.rand(n,3)
data_coarse = np.random.rand(n)
k = 3
yields
%timeit sklearnSearch_v3(p, p_fine, k)
1 loop, best of 3: 7.46 s per loop
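For completeness, a small variant of the same idea could skip computing the distance array altogether, since only the neighbor indices are used; kneighbors accepts return_distance=False for that. A sketch (the helper name is illustrative and this is not benchmarked here):
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sklearn_search_idx_only(p, p_fine, data_coarse, k):
    # fit the neighbor structure on the coarse points, then query indices only
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(p)
    idx = neigh.kneighbors(p_fine, return_distance=False)  # shape (m, k)
    return data_coarse[idx].mean(axis=1)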

Approach #1
We are working with large datasets and memory is an issue, so I will try to optimize the computations within the loop. We can use np.einsum to replace the np.linalg.norm part and np.argpartition in place of a full sort with np.argsort, like so -
out = np.empty((m,))
for i, ps in enumerate(p_fine):
    subs = ps - p
    sq_dists = np.einsum('ij,ij->i', subs, subs)
    out[i] = data_coarse[np.argpartition(sq_dists, k)[:k]].sum()
out = out/k
Approach #2
Now, as another approach we can also use Scipy's cdist for a fully vectorized solution, like so -
from scipy.spatial.distance import cdist
out = data_coarse[np.argpartition(cdist(p_fine,p),k,axis=1)[:,:k]].mean(1)
But, since we are memory bound here, we can perform these operations in chunks. Basically, we would take chunks of rows from that tall array p_fine (which has millions of rows), run cdist on each chunk, and thus at each iteration get a chunk of output elements instead of just one scalar. This cuts the loop count by the length of the chunk.
So, finally we would have an implementation like so -
out = np.empty((m,))
L = 10  # length of each chunk (to be used as a param)
for j in range(0, m, L):  # the last chunk is simply shorter if L does not divide m
    p_fine_slice = p_fine[j:j+L]
    out[j:j+L] = data_coarse[np.argpartition(cdist(p_fine_slice, p),
                                             k, axis=1)[:, :k]].mean(1)
Runtime test
Setup -
# Setup inputs
m,n = 20000,100
p_fine = np.random.rand(m,3)
p = np.random.rand(n,3)
data_coarse = np.random.rand(n)
k = 5
def original_approach(p, p_fine, m, n, k):
    data_fine = np.empty((m,))
    for i, ps in enumerate(p_fine):
        data_fine[i] = np.mean(data_coarse[np.argsort(np.linalg.norm(ps - p, axis=1))[:k]])
    return data_fine
def proposed_approach(p, p_fine, m, n, k):
    out = np.empty((m,))
    for i, ps in enumerate(p_fine):
        subs = ps - p
        sq_dists = np.einsum('ij,ij->i', subs, subs)
        out[i] = data_coarse[np.argpartition(sq_dists, k)[:k]].sum()
    return out/k
def proposed_approach_v2(p, p_fine, m, n, k, len_per_iter):
    L = len_per_iter
    out = np.empty((m,))
    for j in range(0, m, L):  # the last chunk is simply shorter if L does not divide m
        p_fine_slice = p_fine[j:j+L]
        out[j:j+L] = data_coarse[np.argpartition(cdist(p_fine_slice, p),
                                                 k, axis=1)[:, :k]].sum(1)
    return out/k
Timings -
In [134]: %timeit original_approach(p,p_fine,m,n,k)
1 loops, best of 3: 1.1 s per loop
In [135]: %timeit proposed_approach(p,p_fine,m,n,k)
1 loops, best of 3: 539 ms per loop
In [136]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=100)
10 loops, best of 3: 63.2 ms per loop
In [137]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=1000)
10 loops, best of 3: 53.1 ms per loop
In [138]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=2000)
10 loops, best of 3: 63.8 ms per loop
So, there's about a 2x improvement with the first proposed approach and 20x over the original approach with the second one, at the sweet spot with the len_per_iter param set at 1000. Hopefully this will bring your 25-minute runtime down to a little over a minute. Not bad, I guess!

Related

Numpy searchsorted along many dimensions? [duplicate]

Assume that I have two arrays A and B, where both A and B are m x n. My goal is now, for each row of A and B, to find where I should insert the elements of row i of A in the corresponding row of B. That is, I wish to apply np.digitize or np.searchsorted to each row of A and B.
My naive solution is to simply iterate over the rows. However, this is far too slow for my application. My question is therefore: is there a vectorized implementation of either algorithm that I haven't managed to find?
We can add to each row an offset relative to the previous row. We would use the same offset for both arrays. The idea is to then use np.searchsorted on the flattened versions of the input arrays, so that each row of b is restricted to finding sorted positions within the corresponding row of a. Additionally, to make it work for negative numbers too, we just need to offset by the minimum values as well.
So, we would have a vectorized implementation like so -
def searchsorted2d(a, b):
    m, n = a.shape
    max_num = np.maximum(a.max() - a.min(), b.max() - b.min()) + 1
    r = max_num*np.arange(a.shape[0])[:,None]
    p = np.searchsorted( (a+r).ravel(), (b+r).ravel() ).reshape(m,-1)
    return p - n*(np.arange(m)[:,None])
Runtime test -
In [173]: def searchsorted2d_loopy(a,b):
     ...:     out = np.zeros(a.shape,dtype=int)
     ...:     for i in range(len(a)):
     ...:         out[i] = np.searchsorted(a[i],b[i])
     ...:     return out
     ...:
In [174]: # Setup input arrays
...: a = np.random.randint(11,99,(10000,20))
...: b = np.random.randint(11,99,(10000,20))
...: a = np.sort(a,1)
...: b = np.sort(b,1)
...:
In [175]: np.allclose(searchsorted2d(a,b),searchsorted2d_loopy(a,b))
Out[175]: True
In [176]: %timeit searchsorted2d_loopy(a,b)
10 loops, best of 3: 28.6 ms per loop
In [177]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 13.7 ms per loop
The solution provided by @Divakar is ideal for integer data, but beware of precision issues for floating point values, especially if they span multiple orders of magnitude (e.g. [[1.0, 2.0, 3.0, 1.0e+20],...]). In some cases r may be so large that applying a+r and b+r wipes out the original values you're trying to run searchsorted on, and you're just comparing r to r.
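A quick sketch of that failure mode, with illustrative values chosen here: once the offset r reaches ~1e20, float64 can no longer distinguish 1e20 + 1.0 from 1e20 + 3.0, so the small values are lost.
import numpy as np

a = np.array([[1.0, 2.0, 3.0, 1.0e+20],
              [1.0, 2.0, 3.0, 1.0e+20]])
max_num = a.max() - a.min() + 1               # ~1e20
r = max_num * np.arange(a.shape[0])[:, None]  # per-row offsets, as in searchsorted2d
shifted = a + r
# In the offset row the three small values all collapse to the same float64 number:
print(shifted[1, 0] == shifted[1, 2])   # True -- 1e20 + 1.0 == 1e20 + 3.0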
To make the approach more robust for floating-point data, you could embed the row information into the arrays as part of the values (as a structured dtype), and run searchsorted on these structured dtypes instead.
import numpy as np

def searchsorted_2d(a, v, side='left', sorter=None):
    # Make sure a and v are numpy arrays.
    a = np.asarray(a)
    v = np.asarray(v)
    # Augment a with row id
    ai = np.empty(a.shape, dtype=[('row', int), ('value', a.dtype)])
    ai['row'] = np.arange(a.shape[0]).reshape(-1, 1)
    ai['value'] = a
    # Augment v with row id
    vi = np.empty(v.shape, dtype=[('row', int), ('value', v.dtype)])
    vi['row'] = np.arange(v.shape[0]).reshape(-1, 1)
    vi['value'] = v
    # Perform searchsorted on the augmented arrays.
    # The row information is embedded in the values, so only the equivalent rows
    # between a and v are considered.
    result = np.searchsorted(ai.flatten(), vi.flatten(), side=side, sorter=sorter)
    # Restore the original shape and decode the searchsorted indices so they apply to the original data.
    result = result.reshape(vi.shape) - vi['row']*a.shape[1]
    return result
Edit: The timing on this approach is abysmal!
In [21]: %timeit searchsorted_2d(a,b)
10 loops, best of 3: 92.5 ms per loop
You would be better off just using map over the array:
In [22]: %timeit np.array(list(map(np.searchsorted,a,b)))
100 loops, best of 3: 13.8 ms per loop
For integer data, @Divakar's approach is still the fastest:
In [23]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 7.26 ms per loop

Third order moment calculation - numpy

In python, I have an array X with N rows (the number of examples) and n columns (the number of features).
If I want to calculate the second order moment matrix C
C[i,j] = E(x_i x_j)
then I have two possibilities:
First, do the loop:
for i in range(N):
    for j in range(n):
        for k in range(n):
            C[j,k] = C[j,k] + X[i,j]*X[i,k]/N
Second, and simpler, use a numpy matrix product:
import numpy as np
C = np.transpose(X).dot(X)/N
In practice, the second version is dramatically faster.
If now I want to calculate the third order moment matrix T
T[i,j,k] = E(x_i x_j x_k)
then the loop alternative is easy:
for i in range(N):
    for j in range(n):
        for k in range(n):
            for m in range(n):
                T[j,k,m] = T[j,k,m] + X[i,j]*X[i,k]*X[i,m]/N
Is there a fast way using numpy libraries to calculate this last tensor, like for the second order moment?
You can use NumPy's einsum notation to solve both your second and third order cases.
Second order :
np.einsum('ij,ik->jk',X,X)/N
Third order :
np.einsum('ij,ik,il->jkl',X,X,X)/N
As can be seen, it is easy and intuitive to extend this to higher-order cases.
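A small sanity-check sketch (toy sizes chosen here) comparing the einsum expression with the explicit loop from the question:
import numpy as np

N, n = 100, 4
X = np.random.rand(N, n)

# einsum version of the third-order moment tensor
T_einsum = np.einsum('ij,ik,il->jkl', X, X, X) / N

# explicit-loop version, as in the question
T_loop = np.zeros((n, n, n))
for i in range(N):
    for j in range(n):
        for k in range(n):
            for m in range(n):
                T_loop[j, k, m] += X[i, j]*X[i, k]*X[i, m]/N

print(np.allclose(T_einsum, T_loop))  # expected: True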
I know it is not perfect in terms of speed, but why not use np.power(x, 3).sum() / N? It is slower than the dot product, but faster than looping.
In [1]: import numpy as np
In [2]: x = np.random.rand(10000)
In [3]: x.dot(x.T)
Out[3]: 3373.6189738897856
In [4]: %timeit(x.dot(x.T))
The slowest run took 48.74 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.39 µs per loop
In [5]: %timeit(np.power(x, 2).sum())
The slowest run took 4.14 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 140 µs per loop
In [6]: np.power(x, 2).sum()
Out[6]: 3373.6189738897865
Btw, that's how I calculate the moments...

Vectorize mask of squared euclidean distance in Python

I'm running code to generate a mask of locations in B closer than some distance D to locations in A.
N = [[0 for j in range(length_B)] for i in range(length_A)]
dSquared = D*D
for i in range(length_A):
    for j in range(length_B):
        if ((A[j][0]-B[i][0])**2 + (A[j][1]-B[i][1])**2) <= dSquared:
            N[i][j] = 1
For lists of A and B that are tens of thousands of locations long, this code takes a while. I'm pretty sure there's a way to vectorize this though to make it run much faster. Thank you.
It's easier to visualize this code with 2d array indexing:
for j in range(length_A):
    for i in range(length_B):
        dist = (A[j,0] - B[i,0])**2 + (A[j,1] - B[i,1])**2
        if dist <= dSquared:
            N[i, j] = 1
That dist expression looks like
((A[j,:] - B[i,:])**2).sum(axis=-1)
With 2 elements this array expression might not be faster, but it should help us rethink the problem.
We can perform the i,j outer computation with broadcasting:
A[:,None,:] - B[None,:,:] # 3d difference array
dist=((A[:,None,:] - B[None,:,:])**2).sum(axis=-1) # (lengthA,lengthB) array
Compare this to dSquared, and use the resulting boolean array as a mask for setting elements of N to 1:
N = np.zeros((lengthA,lengthB))
N[dist <= dSquared] = 1
I haven't tested this code, so there may be bugs, but I think the basic idea is there, and it may be enough of the thought process to let you work out the details for other cases.
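Assembled into a runnable sketch (small random inputs, purely for illustration):
import numpy as np

lengthA, lengthB = 1000, 800
A = np.random.rand(lengthA, 2)
B = np.random.rand(lengthB, 2)
D = 0.1
dSquared = D*D

# (lengthA, lengthB) array of squared distances via broadcasting
dist = ((A[:, None, :] - B[None, :, :])**2).sum(axis=-1)

# the boolean comparison gives the mask directly; keep it as int if 0/1 values are wanted
N = np.zeros((lengthA, lengthB), dtype=int)
N[dist <= dSquared] = 1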
You can use scipy's cdist that is supposedly pretty efficient for such distance calculations, like so -
from scipy.spatial.distance import cdist
N = (cdist(A,B,'sqeuclidean') <= dSquared).astype(int)
As suggested in @hpaulj's solution, one can also use broadcasting. From the posted code in the question, it looks like we are dealing with Nx2 shaped arrays. So, we can slice out the first and second columns and perform broadcasted subtractions on them separately. The benefit is that we never go 3D, which keeps it memory efficient and might also translate into a performance boost. Thus, the squared Euclidean distances would be calculated like so -
sq_eucl_dist = (A[:,None,0] - B[:,0])**2 + (A[:,None,1] - B[:,1])**2
Let's time all three approaches for the squared Euclidean distance calculation.
Runtime test -
In [75]: # Input arrays
...: A = np.random.rand(200,2)
...: B = np.random.rand(200,2)
...:
In [76]: %timeit ((A[:,None,:] - B[None,:,:])**2).sum(axis=-1) # #hpaulj's solution
1000 loops, best of 3: 1.9 ms per loop
In [77]: %timeit (A[:,None,0] - B[:,0])**2 + (A[:,None,1] - B[:,1])**2
1000 loops, best of 3: 401 µs per loop
In [78]: %timeit cdist(A,B,'sqeuclidean')
1000 loops, best of 3: 249 µs per loop
I second the suggestions to use Numpy above. The looping code is also doing a lot more indexing into A than it needs to. You could use something like:
import numpy as np

dimension = 10000
A = np.random.rand(dimension, 2) + 0.0
B = np.random.rand(dimension, 2) + 1.0
N = []
d = 1.0
for i in range(len(A)):
    distances = np.linalg.norm(B - A[i,:], axis=1)
    for j in range(len(distances)):
        if distances[j] <= d:
            N.append((i,j))
print(len(N))
It is going to be pretty hard to get decent performance for this out of pure Python. I would also point out that the more-dimensional array solutions will require a... lot... of memory (for instance, a 50,000 x 50,000 array of float64 distances is already about 20 GB).
Insofar as your matrix N is likely to be sparse, scipy.spatial.cKDTree will give much better time complexity than any approach based on computing all distances brute force:
cKDTree(A).sparse_distance_matrix(cKDTree(B), max_distance=D)
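For illustration, a sketch of turning that result into the 0/1 mask from the question. sparse_distance_matrix returns a dok_matrix of distances whose stored keys are the pairs within max_distance (pairs at exactly zero distance are an edge case ignored here):
import numpy as np
from scipy.spatial import cKDTree

A = np.random.rand(1000, 2)
B = np.random.rand(800, 2)
D = 0.1

# dok_matrix of shape (len(A), len(B)); only pairs with distance <= D are stored
sparse_dists = cKDTree(A).sparse_distance_matrix(cKDTree(B), max_distance=D)

# turn the stored (i, j) keys into a dense 0/1 mask (keep it sparse for huge inputs)
N = np.zeros((len(A), len(B)), dtype=int)
for (i, j) in sparse_dists.keys():
    N[i, j] = 1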

Increasing value of top k elements in sparse matrix

I am trying to find an efficient way that lets me increase the top k values of a sparse matrix by some constant value. I am currently using the following code, which is quite slow for very large matrices:
from itertools import izip  # Python 2; on Python 3 use the built-in zip
from scipy.sparse import csr_matrix

a = csr_matrix((2,2))  # just some sample data
a[1,1] = 3.
a[0,1] = 2.

y = a.tocoo()
idx = y.data.argsort()[::-1][:1]  # k is 1
for i, j in izip(y.row[idx], y.col[idx]):
    a[i,j] += 1
Actually, the sorting seems to be fast; the problem lies in my final loop, where I increase the values by indexing via the sorted indices. I hope someone has an idea of how to speed this up.
You could probably speed things up quite a lot by directly modifying a.data rather than iterating over row/column indices and modifying individual elements:
idx = a.data.argsort()[::-1][:1] #k is 1
a.data[idx] += 1
This also saves converting from CSR --> COO.
Update
As @WarrenWeckesser rightly points out, since you're only interested in the indices of the k largest elements and you don't care about their order, you can use argpartition rather than argsort. This can be quite a lot faster when a.data is large.
For example:
import numpy as np
from scipy import sparse
# a random sparse array with 1 million non-zero elements
a = sparse.rand(10000, 10000, density=0.01, format='csr')
# find the indices of the 100 largest non-zero elements
k = 100
# using argsort:
%timeit a.data.argsort()[-k:]
# 10 loops, best of 3: 135 ms per loop
# using argpartition:
%timeit a.data.argpartition(-k)[-k:]
# 100 loops, best of 3: 13 ms per loop
# test correctness:
np.all(a.data[a.data.argsort()[-k:]] ==
       np.sort(a.data[a.data.argpartition(-k)[-k:]]))
# True
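Putting the two ideas together, a minimal sketch of the whole operation (increase the k largest stored values of a CSR matrix in place):
import numpy as np
from scipy import sparse

a = sparse.rand(10000, 10000, density=0.01, format='csr')
k = 100
increment = 1.0

# a.data holds only the stored (non-zero) entries, so the zeros are never touched
idx = a.data.argpartition(-k)[-k:]   # indices of the k largest stored values, unordered
a.data[idx] += increment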

Two similar implementations, quite dramatic difference in times to run

I've tried the basic cython tutorial here to see how significant the speed up is.
I've also made two different Python implementations which differ quite significantly in runtime. I've tested the run times of their individual differences, and as far as I can see, they do not explain the overall runtime difference.
The code is calculating the first kmax primes:
def pyprimes1(kmax):
    p = []
    result = []
    if kmax > 1000:
        kmax = 1000
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p.append(n)
            k = k + 1
            result.append(n)
        n = n + 1
    return result
from numpy import zeros

def pyprimes2(kmax):
    p = zeros(kmax)
    result = []
    if kmax > 1000:
        kmax = 1000
        p = zeros(kmax)
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
            result.append(n)
        n = n + 1
    return result
As you can see, the only difference between the two implementations is in the usage of the p variable: in the first it is a Python list, in the other it is a numpy array. I used IPython's %timeit magic to test the timings. Who do you think performed better? Here is what I got:
%timeit pyprimes1(1000)
10 loops, best of 3: 79.4 ms per loop
%timeit pyprimes2(1000)
1 loops, best of 3: 1.14 s per loop
That was strange and surprising, as I thought a pre-allocated, C-implemented numpy array would be much faster.
I've also tested:
array assignment:
%timeit p[100]=5
10000000 loops, best of 3: 116 ns per loop
array selection:
%timeit p[100]
1000000 loops, best of 3: 252 ns per loop
which was twice as slow... I didn't expect that either.
array initialization:
%timeit zeros(1000)
1000000 loops, best of 3: 1.65 µs per loop
list appending:
%timeit p.append(1)
10000000 loops, best of 3: 164 ns per loop
list selection:
%timeit p[100]
10000000 loops, best of 3: 56 ns per loop
So it seems list selection is 5 times faster than array selection.
I can't see how these numbers add up to the more-than-10x overall time difference: we do one selection in each iteration, and it is only 5 times faster.
I would appreciate an explanation of the timing differences between arrays and lists, and of the overall time difference between the two implementations. Or am I using %timeit wrong by measuring time on a list of increasing length?
BTW, the cython code did best at 3.5ms.
The 1000th prime number is 7919. So if on average the inner loop iterates kmax/2 times (very roughly), your program performs approx. 7919 * (1000/2) ≈ 4×10⁶ selections from the array/list. If a single selection from a list takes 56 ns, as in your first measurement, even the selections alone wouldn't fit into 79 ms (0.056 µs × 4×10⁶ ≈ 0.22 sec).
Probably these nanosecond times are not very accurate.
By the way, performance of append depends on size of the list. In some cases it can lead to reallocation, but in most the list has enough free space and it's lightning fast.
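A back-of-the-envelope check of that arithmetic (the counts are the rough estimates above, not measurements):
# rough estimate of list/array selections performed by pyprimes1(1000)
outer_iterations = 7919        # n runs from 2 up to the 1000th prime
avg_inner_probes = 1000 / 2    # very rough: inner loop does ~kmax/2 probes on average
selections = outer_iterations * avg_inner_probes   # ~4e6

time_per_selection = 56e-9     # 56 ns, the measured list-selection time
print(selections * time_per_selection)   # ~0.22 s, already more than the measured 79 ms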
Numpy's main use case is to perform operations on whole arrays and slices, not single elements. Those operations are implemented in C and therefore much faster than the equivalent Python code. For example,
c = a + b
will be much faster than
for i in xrange(len(a)):
    c[i] = a[i] + b[i]
even if the variables are numpy arrays in both cases.
However, single-element operations like the ones you are testing may well be worse than with Python lists. Python lists are plain C arrays of pointers to Python objects, which are quite simple to access.
On the other hand, accessing an element in a numpy array comes with lots of overhead to support multiple raw data formats and advanced indexing options, among other reasons.
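For a self-contained check of the single-element gap, one could use the timeit module directly (absolute numbers vary by machine; only the relative gap matters):
import timeit

setup = "import numpy as np; lst = list(range(1000)); arr = np.arange(1000)"

# single-element reads: plain list indexing vs numpy's richer indexing machinery
t_list = timeit.timeit("lst[100]", setup=setup, number=1_000_000)
t_array = timeit.timeit("arr[100]", setup=setup, number=1_000_000)
print(t_list, t_array)   # the array access is expected to be noticeably slower per element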
