Two similar implementations with dramatically different run times - python

I've tried the basic Cython tutorial here to see how significant the speed-up is.
I've also made two different Python implementations which differ quite significantly in runtime. I've timed the individual operations that differ between them, and as far as I can see, those timings do not explain the overall runtime difference.
The code calculates the first kmax primes:
def pyprimes1(kmax):
    p = []
    result = []
    if kmax > 1000:
        kmax = 1000
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p.append(n)
            k = k + 1
            result.append(n)
        n = n + 1
    return result
from numpy import zeros  # needed for pyprimes2

def pyprimes2(kmax):
    p = zeros(kmax)
    result = []
    if kmax > 1000:
        kmax = 1000
        p = zeros(kmax)
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
            result.append(n)
        n = n + 1
    return result
As you can see, the only difference between the two implementations is in the usage of the p variable: in the first it is a Python list, in the second it is a NumPy array. I used the IPython %timeit magic to test the timings. Which do you think performed better? Here is what I got:
%timeit pyprimes1(1000)
10 loops, best of 3: 79.4 ms per loop
%timeit pyprimes2(1000)
1 loops, best of 3: 1.14 s per loop
That was strange and surprising, as I thought a NumPy array, pre-allocated and implemented in C, would be much faster.
I've also tested:
array assignment:
%timeit p[100]=5
10000000 loops, best of 3: 116 ns per loop
array selection:
%timeit p[100]
1000000 loops, best of 3: 252 ns per loop
which was twice as slow. I also didn't expect that.
array initialization:
%timeit zeros(1000)
1000000 loops, best of 3: 1.65 µs per loop
list appending:
%timeit p.append(1)
10000000 loops, best of 3: 164 ns per loop
list selection:
%timeit p[100]
10000000 loops, best of 3: 56 ns per loop
So it seems list selection is 5 times faster than array selection.
I can't see how these numbers add up to the more than 10x time difference. While we do a selection in each iteration, it is only 5 times faster.
I would appreciate an explanation of the timing differences between arrays and lists, and also of the overall time difference between the two implementations. Or am I using %timeit wrong by measuring the time on a list that keeps growing?
BTW, the Cython code did best at 3.5 ms.

The 1000th prime number is 7919. So if on average the inner loop iterates kmax/2 times (very roughly), your program performs approximately 7919 * (1000/2) ≈ 4*10^6 selections from the array/list. If a single selection from a list in the first version takes 56 ns, even the selections alone wouldn't fit into 79 ms (0.056 µs * 4*10^6 ≈ 0.22 sec).
Probably these nanosecond times are not very accurate.
By the way, performance of append depends on size of the list. In some cases it can lead to reallocation, but in most the list has enough free space and it's lightning fast.
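For what it's worth, one way to check this (a minimal sketch of my own, not from the original post) is to time the element access plus the modulo operation inside an actual Python loop, which is closer to what pyprimes1/pyprimes2 do than a bare %timeit on a single expression:

import timeit

setup = """
import numpy as np
lst = [3] * 1000            # Python list of small ints
arr = np.full(1000, 3.0)    # NumPy array of the same size
"""
stmt_list = "for i in range(1000): 7919 % lst[i]"   # mimics n % p[i] with a list
stmt_arr = "for i in range(1000): 7919 % arr[i]"    # mimics n % p[i] with a NumPy array

print("list :", timeit.timeit(stmt_list, setup=setup, number=1000))
print("numpy:", timeit.timeit(stmt_arr, setup=setup, number=1000))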

Numpy's main use case is to perform operations on whole arrays and slices, not single elements. Those operations are implemented in C and therefore much faster than the equivalent Python code. For example,
c = a + b
will be much faster than
for i in xrange(len(a)):
    c[i] = a[i] + b[i]
even if the variables are numpy arrays in both cases.
However, single-element operations like the ones you are testing may well be slower than with Python lists. Python lists are plain C arrays of pointers to Python objects, which are quite simple to access.
On the other hand, accessing an element of a numpy array comes with a lot of overhead to support multiple raw data formats and advanced indexing options, among other things.
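To make the contrast concrete, here is a minimal sketch of my own (not from the answer) that times a whole-array addition against the equivalent element-by-element Python loop; the function names are just for illustration:

import numpy as np
import timeit

a = np.random.rand(10000)
b = np.random.rand(10000)

def vectorized():
    # one call, the loop runs in C inside NumPy
    return a + b

def element_wise():
    # same arithmetic, but a Python-level loop with one scalar access per step
    c = np.empty_like(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c

print("vectorized :", timeit.timeit(vectorized, number=100))
print("elementwise:", timeit.timeit(element_wise, number=100))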

Related

Python Best Practice for Looping Backwards in For Loop without reversing the array

In doing problems on LeetCode etc. it's often required to iterate from the end of the array to the front, and I'm used to more traditional programming languages where the for loop is less awkward, i.e. for(int i = n; i >= 0; i--) where n is the last index of the array. In Python I find myself writing something like for i in range(n, -1, -1), which looks a bit awkward, so I just wanted to know if there is something more elegant. I know that I can reverse the array with array[::-1] and then loop as usual with for/range, but that's not really what I want to do since it adds extra work to the problem.
Use reversed, which doesn't create a new list but instead returns a reverse iterator, allowing you to iterate in reverse:
a = [1, 2, 3, 4, 5]
for n in reversed(a):
    print(n)
Just a comparison of three methods.
array = list(range(100000))

def go_by_index():
    for i in range(len(array) - 1, -1, -1):
        array[i]

def revert_array_directly():
    for n in array[::-1]:
        n

def reversed_fn():
    for n in reversed(array):
        n

%timeit go_by_index()
%timeit revert_array_directly()
%timeit reversed_fn()
Outputs
100 loops, best of 3: 4.84 ms per loop
100 loops, best of 3: 2.01 ms per loop
1000 loops, best of 3: 1.49 ms per loop
The time difference is visible, but as you can see, the second and third options are not that different, especially if the array of interest is small or medium-sized.
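If the index itself is needed as well (common in these problems), reversed() also works on a range object without materializing a new list. A small sketch of my own:

a = [1, 2, 3, 4, 5]
for i in reversed(range(len(a))):
    # i runs 4, 3, 2, 1, 0
    print(i, a[i])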

Optimize Python: Large arrays, memory problems

I'm having a speed problem running a Python / NumPy code. I don't know how to make it faster; maybe someone else does?
Assume there is a surface with two triangulations, one fine (..._fine) with M points, and one coarse with N points. Also, there is data on the coarse mesh at every point (N floats). I'm trying to do the following:
For every point on the fine mesh, find the k closest points on the coarse mesh and take the mean of their data values. In short: interpolate data from coarse to fine.
My code right now goes like the snippet below. With large data (in my case M = 2e6, N = 1e4) the code runs for about 25 minutes, I guess because the explicit for loop doesn't go into NumPy. Any ideas how to solve this with smart indexing? M x N arrays blow up the RAM..
import numpy as np

# p_fine.shape == (m, 3)
# p.shape == (n, 3)
# data_coarse.shape == (n,)   (data on the coarse mesh)
data_fine = np.empty((m,))
for i, ps in enumerate(p_fine):
    data_fine[i] = np.mean(data_coarse[np.argsort(np.linalg.norm(ps - p, axis=1))[:k]])
Cheers!
First of all thanks for the detailed help.
First, Divakar, your solutions gave a substantial speed-up. With my data, the code ran for just under 2 minutes, depending a bit on the chunk size.
I also tried my way around sklearn and ended up with
def sklearnSearch_v3(p, p_fine, k):
    neigh = NearestNeighbors(k)
    neigh.fit(p)
    return data_coarse[neigh.kneighbors(p_fine)[1]].mean(axis=1)
which ended up being quite fast; for my data sizes, I get the following:
import numpy as np
from sklearn.neighbors import NearestNeighbors
m,n = 2000000,20000
p_fine = np.random.rand(m,3)
p = np.random.rand(n,3)
data_coarse = np.random.rand(n)
k = 3
yields
%timeit sklearnSearch_v3(p, p_fine, k)
1 loop, best of 3: 7.46 s per loop
Approach #1
We are working with large sized datasets and memory is an issue, so I will try to optimize the computations within the loop. Now, we can use np.einsum to replace np.linalg.norm part and np.argpartition in place of actual sorting with np.argsort, like so -
out = np.empty((m,))
for i, ps in enumerate(p_fine):
    subs = ps - p
    sq_dists = np.einsum('ij,ij->i', subs, subs)
    out[i] = data_coarse[np.argpartition(sq_dists, k)[:k]].sum()
out = out / k
Approach #2
Now, as another approach we can also use Scipy's cdist for a fully vectorized solution, like so -
from scipy.spatial.distance import cdist
out = data_coarse[np.argpartition(cdist(p_fine,p),k,axis=1)[:,:k]].mean(1)
But, since we are memory bound here, we can perform these operations in chunks. Basically, we would get chunks of rows from that tall array p_fine that has millions of rows and use cdist and thus at each iteration get chunks of output elements instead of just one scalar. With this, we would cut the loop count by the length of that chunk.
So, finally we would have an implementation like so -
out = np.empty((m,))
L = 10  # Length of chunk (to be used as a param)
num_iter = m // L
for j in range(num_iter):
    p_fine_slice = p_fine[L*j:L*j+L]
    out[L*j:L*j+L] = data_coarse[np.argpartition(cdist(
        p_fine_slice, p), k, axis=1)[:, :k]].mean(1)
Runtime test
Setup -
# Setup inputs
m, n = 20000, 100
p_fine = np.random.rand(m, 3)
p = np.random.rand(n, 3)
data_coarse = np.random.rand(n)
k = 5

def original_approach(p, p_fine, m, n, k):
    data_fine = np.empty((m,))
    for i, ps in enumerate(p_fine):
        data_fine[i] = np.mean(data_coarse[np.argsort(np.linalg.norm(
            ps - p, axis=1))[:k]])
    return data_fine

def proposed_approach(p, p_fine, m, n, k):
    out = np.empty((m,))
    for i, ps in enumerate(p_fine):
        subs = ps - p
        sq_dists = np.einsum('ij,ij->i', subs, subs)
        out[i] = data_coarse[np.argpartition(sq_dists, k)[:k]].sum()
    return out / k

def proposed_approach_v2(p, p_fine, m, n, k, len_per_iter):
    L = len_per_iter
    out = np.empty((m,))
    num_iter = m // L
    for j in range(num_iter):
        p_fine_slice = p_fine[L*j:L*j+L]
        out[L*j:L*j+L] = data_coarse[np.argpartition(cdist(
            p_fine_slice, p), k, axis=1)[:, :k]].sum(1)
    return out / k
Timings -
In [134]: %timeit original_approach(p,p_fine,m,n,k)
1 loops, best of 3: 1.1 s per loop
In [135]: %timeit proposed_approach(p,p_fine,m,n,k)
1 loops, best of 3: 539 ms per loop
In [136]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=100)
10 loops, best of 3: 63.2 ms per loop
In [137]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=1000)
10 loops, best of 3: 53.1 ms per loop
In [138]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=2000)
10 loops, best of 3: 63.8 ms per loop
So, there's about a 2x improvement with the first proposed approach and a 20x improvement over the original approach with the second one at its sweet spot, with the len_per_iter param set to 1000. Hopefully this will bring your 25-minute runtime down to a little over a minute. Not bad, I guess!

Third order moment calculation - numpy

In python, I have an array X with N rows (the number of examples) and n columns (the number of features).
If I want to calculate the second order moment matrix C
C[i,j] = E(x_i x_j)
then I have two possibilities:
First, do the loop:
for i in range(N):
    for j in range(n):
        for k in range(n):
            C[j,k] = C[j,k] + X[i,j]*X[i,k]/N
Second, and much simpler, use the numpy matrix product:
import numpy as np
C = np.transpose(X).dot(X)/N
In practice, the second version is much faster.
If now I want to calculate the third order moment matrix T
T[i,j,k] = E(x_i x_j x_k)
then the loop alternative is easy:
for i in range(N):
    for j in range(n):
        for k in range(n):
            for m in range(n):
                T[j,k,m] = T[j,k,m] + X[i,j]*X[i,k]*X[i,m]/N
Is there a fast way using numpy libraries to calculate this last tensor, like for the second order moment?
You can use NumPy's einsum notation to solve both your second and third order cases.
Second order :
np.einsum('ij,ik->jk',X,X)/N
Third order :
np.einsum('ij,ik,il->jkl',X,X,X)/N
As can be seen, it would be easier/intuitive to extend this to higher order cases with it.
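As a quick sanity check (my own sketch, not part of the original answer), the einsum expression can be verified against the explicit triple loop on a small random input:

import numpy as np

N, n = 50, 4
X = np.random.rand(N, n)

# einsum version of the third order moment tensor
T_einsum = np.einsum('ij,ik,il->jkl', X, X, X) / N

# explicit loop version from the question
T_loop = np.zeros((n, n, n))
for i in range(N):
    for j in range(n):
        for k in range(n):
            for m in range(n):
                T_loop[j, k, m] += X[i, j] * X[i, k] * X[i, m] / N

print(np.allclose(T_einsum, T_loop))  # expected: True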
I know it is not perfect in terms of speed, but why not use np.power(x, 3).sum() / N? It is slower than the dot product, but faster than looping.
In [1]: import numpy as np
In [2]: x = np.random.rand(10000)
In [3]: x.dot(x.T)
Out[3]: 3373.6189738897856
In [4]: %timeit(x.dot(x.T))
The slowest run took 48.74 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.39 µs per loop
In [5]: %timeit(np.power(x, 2).sum())
The slowest run took 4.14 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 140 µs per loop
In [6]: np.power(x, 2).sum()
Out[6]: 3373.6189738897865
Btw, that's how I calculate the moments...
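To be precise about what this computes: for a 1-D sample, the k-th raw moment is just the mean of the k-th powers, a scalar, whereas the einsum answer above produces the full moment matrix/tensor. A small sketch of my own to make that distinction explicit:

import numpy as np

x = np.random.rand(10000)
second_moment = np.mean(x ** 2)   # E[x^2], a scalar
third_moment = np.mean(x ** 3)    # E[x^3], a scalar
print(second_moment, third_moment)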

deque.popleft() and list.pop(0). Is there performance difference?

deque.popleft() and list.pop(0) seem to return the same result. Is there any performance difference between them and why?
deque.popleft() is faster than list.pop(0), because the deque has been optimized to do popleft() approximately in O(1), while list.pop(0) takes O(n) (see deque objects).
Comments and code in _collectionsmodule.c for deque and listobject.c for list provide implementation insights that explain the performance difference. Namely, a deque object "is composed of a doubly-linked list", which effectively optimizes appends and pops at both ends, while list objects are not even singly-linked lists but C arrays of pointers to their elements (see Python 2.7 listobject.h#l22 and Python 3.5 listobject.h#l23), which makes them good for fast random access of elements but requires O(n) time to reposition all remaining elements after removal of the first.
For Python 2.7 and 3.5, the URLs of these source code files are:
https://hg.python.org/cpython/file/2.7/Modules/_collectionsmodule.c
https://hg.python.org/cpython/file/2.7/Objects/listobject.c
https://hg.python.org/cpython/file/3.5/Modules/_collectionsmodule.c
https://hg.python.org/cpython/file/3.5/Objects/listobject.c
Using %timeit, the performance difference between deque.popleft() and list.pop(0) is about a factor of 4 when both the deque and the list have the same 52 elements and grows to over a factor of 1000 when their lengths are 10**8. Test results are given below.
import string
from collections import deque
%timeit d = deque(string.letters); d.popleft()
1000000 loops, best of 3: 1.46 µs per loop
%timeit d = deque(string.letters)
1000000 loops, best of 3: 1.4 µs per loop
%timeit l = list(string.letters); l.pop(0)
1000000 loops, best of 3: 1.47 µs per loop
%timeit l = list(string.letters);
1000000 loops, best of 3: 1.22 µs per loop
d = deque(range(100000000))
%timeit d.popleft()
10000000 loops, best of 3: 90.5 ns per loop
l = range(100000000)
%timeit l.pop(0)
10 loops, best of 3: 93.4 ms per loop
Is there performance difference?
Yes. deque.popleft() is O(1), a constant-time operation, while list.pop(0) is O(n), a linear-time operation: the larger the list, the longer it takes.
Why?
The CPython list implementation is array-based. pop(0) removes the first item from the list, which requires shifting the remaining len(lst) - 1 items to the left to fill the gap.
The deque() implementation uses a doubly linked list of fixed-size blocks. No matter how large the deque, deque.popleft() requires only a constant (bounded) number of operations.
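A minimal timing sketch of my own (assuming CPython and the standard timeit module) that makes the constant-time vs. linear-time behaviour visible as the container grows:

import timeit

for size in (10**4, 10**5, 10**6):
    # pop `size` items from containers holding 2*size items, so they never empty
    d_time = timeit.timeit(
        'd.popleft()',
        setup='from collections import deque; d = deque(range(%d))' % (2 * size),
        number=size)
    l_time = timeit.timeit(
        'l.pop(0)',
        setup='l = list(range(%d))' % (2 * size),
        number=size)
    print(size, d_time, l_time)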
Yes, and it's considerable if you have a long list or deque. All elements of a list are placed contiguously in memory, so if you remove any element, all subsequent elements must be shifted one position to the left; therefore, the time required to remove or insert an element at the start of a list is proportional to the length of the list. A deque, on the other hand, is specifically constructed to allow fast insertions or removals at either end (typically by keeping "empty" memory locations before the beginning of the deque, or by wrapping around so that the end of the memory segment occupied by the deque can contain elements that are actually considered to be at its beginning).
Compare the performance of these two snippets:
d = deque([0] * 1000000)
while d:
    d.popleft()
    if len(d) % 100 == 0:
        print(len(d))

lst = [0] * 1000000
while lst:
    lst.pop(0)
    if len(lst) % 100 == 0:
        print(len(lst))

Store Values through Array or Use Function? (to optimize)

In Python, I have a variable x (dependent on another variable i) that takes the following values:
x = 0.5 for i=11 or i=111
x = 1 for 12<=i<=100 or 112<=i<=200
x = 0 for the rest of the values of i (i takes integer values from 1 to 300)
I wish to use the value of x many times inside a loop (where i is NOT the iterator). Which would be the better (computation-time-saving) way to store the values of x: an array or a function?
I can store it as an array of length 300 and assign the above values, or I can have a function get_x that takes the value of i as input and returns the above value according to if conditions.
I want to optimize my code for time. Which way would be better? (I am trying to implement this in Python and in MATLAB as well.)
The answer is entirely dependent on your application, and anyway I would tend towards the philosophy of avoiding premature optimization. Probably just implement it in whichever way looks cleanest or makes the most sense to you, and if it ends up being too slow try something different.
But if you really do insist upon seeing real results, let's take a look. Here's the code I used to run this:
import time
import random

def get_x(i):
    if i == 11 or i == 111:
        return 0.5
    if (12 <= i and i <= 100) or (112 <= i and i <= 200):
        return 1
    return 0

x_array = [get_x(i) for i in range(300)]
i_array = [i % 300 for i in range(1000000)]

print "Sequential access"
start = time.time()
for i in i_array:
    x = get_x(i)
end = time.time()
print "get_x method:", end - start

start = time.time()
for i in i_array:
    x = x_array[i]
end = time.time()
print "Array method:", end - start
print

random.seed(123)
i_array = [random.randint(0, 299) for i in range(1000000)]

print "Random access"
start = time.time()
for i in i_array:
    x = get_x(i)
end = time.time()
print "get_x method:", end - start

start = time.time()
for i in i_array:
    x = x_array[i]
end = time.time()
print "Array method:", end - start
Here's what it prints out:
Sequential access
get_x method: 0.264999866486
Array method: 0.108999967575
Random access
get_x method: 0.263000011444
Array method: 0.108999967575
Overall neither method is very slow; this is for 10^6 accesses and both easily complete within a quarter of a second. The get_x method does appear to be about twice as slow as the array method. However, this will not be the slow part of your loop logic! Anything else you put in that loop will almost certainly dominate your program's execution time. You should definitely choose the method that makes your code easier to maintain, which is probably the get_x method.
I would prefer the function. Maybe it is a little slower, since a function call creates a new scope on the stack, but it will be more readable, and as the Zen of Python says: "Simple is better than complex."
Precomputing x and then doing lookups will be faster per lookup, but it pays for itself only if enough lookups are done to outweigh the cost of precomputing all the values each time the program is run. The break-even point can be found by benchmarking, but it's perhaps not worth the effort. Another strategy is to skip precomputation and instead memoize or cache the result of each computation, as sketched below. Some caching back-ends such as Redis and memcached allow persistence between program runs.
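As a sketch of the memoization idea (my own example, assuming Python 3 and reusing the get_x logic from the answer above), functools.lru_cache from the standard library caches results within a single run:

from functools import lru_cache

@lru_cache(maxsize=None)
def get_x(i):
    if i == 11 or i == 111:
        return 0.5
    if 12 <= i <= 100 or 112 <= i <= 200:
        return 1
    return 0

# The first call with a given i computes the value; repeated calls hit the cache.
print(get_x(50), get_x(50))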
Based on testing with timeit, list access is between 3.5 and 7.3 times faster than computing the value, depending on the value of i. Below are some test results.
def f(i):
if i == 11 or i == 111:
return .5
else:
if i >= 12 and i <= 100 or i >= 112 and i <= 200:
return 1
return 0
timeit f(11)
10000000 loops, best of 3: 139 ns per loop
timeit f(24)
1000000 loops, best of 3: 215 ns per loop
timeit f(150)
1000000 loops, best of 3: 249 ns per loop
timeit f(105)
1000000 loops, best of 3: 267 ns per loop
timeit f(237)
1000000 loops, best of 3: 289 ns per loop
x = range(300)
timeit x[150]
10000000 loops, best of 3: 39.5 ns per loop
timeit x[1]
10000000 loops, best of 3: 39.7 ns per loop
timeit x[299]
10000000 loops, best of 3: 39.7 ns per loop
