Is there a way the code below can be parallelized? I looked into Cython's prange, but couldn't figure out how it works. Does prange parallelize the internal loops on different cores? For the code below, how can I parallelize it?
import numpy as np
cimport cython

@cython.boundscheck(False)
def gs_iterate_once(double[:,:] doc_topic,
                    double[:,:] topic_word,
                    double[:] topic_distribution,
                    double[:] topic_probabilities,
                    unsigned int[:,:] doc_word_topic,
                    int num_topics):
    cdef unsigned int doc_id
    cdef unsigned int word_id
    cdef unsigned int topic_id
    cdef unsigned int new_topic
    for i in xrange(doc_word_topic.shape[0]):
        doc_id = doc_word_topic[i, 0]
        word_id = doc_word_topic[i, 1]
        topic_id = doc_word_topic[i, 2]
        doc_topic[doc_id, topic_id] -= 1
        topic_word[topic_id, word_id] -= 1
        topic_distribution[topic_id] -= 1
        for j in xrange(num_topics):
            topic_probabilities[j] = (doc_topic[doc_id, j] * topic_word[j, word_id]) / topic_distribution[j]
        new_topic = draw_topic(np.asarray(topic_probabilities))
        doc_topic[doc_id, new_topic] += 1
        topic_word[new_topic, word_id] += 1
        topic_distribution[new_topic] += 1
        # Set the new topic
        doc_word_topic[i, 2] = new_topic
prange uses OpenMP, which is indeed shared-memory parallelism. So, on a single computer it will create threads that run on the different cores available, all with access to the same pool of memory.
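To illustrate, a minimal prange loop looks like the sketch below, assuming the .pyx file is compiled with OpenMP enabled (the loop body must be GIL-free, C-level code):

from cython.parallel import prange

def square_sum(double[:] x):
    cdef int i
    cdef double total = 0
    # Iterations are split across OpenMP threads; Cython recognizes the
    # in-place addition and turns total into an OpenMP reduction.
    for i in prange(x.shape[0], nogil=True):
        total += x[i] * x[i]
    return total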
For the routine that you show, the first step is to understand what part can be parallelized. Typically, if the loop over index i operates only on element i and not on, say, element i-1 or i+1, the problem is parallelizable. That is not the case here, so you need to find a way to make the computation more independent.
Actually, finding the specific parallel pattern is beyond the scope of a SO answer, but I'll mention a few tips:
Everything inside the prange must be cythonized; Python calls are not possible inside a thread. Update, as suggested by @DavidW: Python calls are possible when they are part of a with gil block.
A typical piece of advice here: once your code has been made independent of the loop ordering, check whether your results are the same when running the index from n-1 to 0 instead of from 0 to n-1.
A few commented and illustrative examples:
- https://homes.cs.washington.edu/~jmschr/lectures/Parallel_Processing_in_Python.html
- Cython prange slower for 4 threads then with range
- http://nealhughes.net/parallelcomp2/
- http://www.perrygeo.com/parallelizing-numpy-array-loops-with-cython-and-mpi.html
@PierredeBuyl's answer gives a good outline of what prange does and how to use it. What follows are a few more specific comments relating to your code:
You can't parallelize the outer loop, because of lines like:

doc_topic[doc_id, topic_id] -= 1

and the similar ones for the other variables and for += 1. These modify a variable that is shared between all the loop iterations, and they are going to cause inconsistent results.
A similar problem exists with topic_probabilities[j] = ... if you're parallelizing the outer loop.
You could easily parallelize the inner loop for j in xrange(num_topics): - this only modifies data that depends on the index j, so there's no issue with the threads fighting to modify the same elements. (However, there's a performance cost each time you start a multithreaded region, so you usually try to parallelize the outer loop instead to avoid this - depending on the size of the arrays you may not gain much.)
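As a sketch (keeping the rest of the function unchanged, and with prange imported via from cython.parallel import prange), the parallelized inner loop could look like this; j has to be a typed C variable for prange to work, and the loop body is pure memoryview arithmetic, so it can run without the GIL:

    cdef int j  # declared at the top of the function
    ...
    # Each thread writes only to topic_probabilities[j], so there is no data race
    for j in prange(num_topics, nogil=True):
        topic_probabilities[j] = (doc_topic[doc_id, j] * topic_word[j, word_id]) / topic_distribution[j]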
Now I may be nitpicking here, but I wanted to know which is computationally more efficient over a large number of iterations (assuming no restrictions on memory), purely in terms of the time taken to run the overall program. I am asking mainly in terms of two languages, Python and C, since these are the two I use most.
For example, in C, something like:
int count = 0, sum = 0;
while (count < 99999) {
    if (((int)pow(count, 2) / 10) % 10 == 4) {  // just some random operation
        sum += (int)pow(count, 2);
    }
    count++;
}
or
int count = 0, sum = 0, temp = 0;
while (count < 99999) {
    temp = (int)pow(count, 2);
    if ((temp / 10) % 10 == 4) {  // just some random operation
        sum += temp;
    }
    count++;
}
In Python, something like:
for i in range(99999):
    n = len(str(i))
    print "length of string", str(i), "is", n
or
for i in range(99999):
    temp = str(i)
    n = len(temp)
    print "length of string", temp, "is", n
Now these are just random operations I thought of on the fly; my main question is whether it is better to create a new variable or just repeat the operation. I know the computing power required will change as i and count increase, but generalizing over a large number of iterations, which is better?
I have tested both your code snippets in Python with timeit, and the result was 1.0221271729969885 seconds for the first option and 1.0154028950055363 seconds for the second.
I ran this test with only one run of each example and, as you can see, the results are extremely close to each other - too close to be reliable after only one run. If you want to learn more, I suggest you do these tests yourself with timeit, for example as follows.
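A minimal sketch of such a comparison, using Python translations of the two C snippets above:

import timeit

recompute = """
count, total = 0, 0
while count < 99999:
    if (count ** 2 / 10) % 10 == 4:
        total += count ** 2
    count += 1
"""

cached = """
count, total = 0, 0
while count < 99999:
    temp = count ** 2
    if (temp / 10) % 10 == 4:
        total += temp
    count += 1
"""

# Run each snippet 10 times and print the total wall-clock time
print timeit.timeit(recompute, number=10)
print timeit.timeit(cached, number=10)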
However, note that the change you made here replaces a single variable assignment to temp with two calls to str(i), not one, and the difference can of course vary wildly for other functions that are more or less complicated and time-consuming than str is.
Caching does improve performance and readability of the code most of the time. There may be corner cases for some super-fast operations, but I'd say the style used in the second examples is always better.
The cached value will most probably be stored in a register, so only a mov will be involved. According to this document, there are not many operations cheaper than that.
As for readability: consider bigger function calls with more arguments. While reviewing code, it takes more brainwork to check that the two calls are identical; moreover, if the function call changes, it may introduce bugs where one call has been modified and the other one has been forgotten (true mostly for Python).
All in all, it should be readability driving your choices in such cases rather than performance gains - unless a profiler points at this piece of code.
I've started messing around with parallel programming and cython/openmp, and I have a simple program that sums over an array using prange:
import numpy as np
from cython.parallel import prange
from cython import boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def parallel_summation(double[:] vec):
    cdef int n = vec.shape[0]
    cdef double total
    cdef int i
    for i in prange(n, nogil=True):
        total += vec[i]
    return total
It seems to work OK with a setup.py file. However, I was wondering if it is possible to adjust this function and have a little more control over what the processors are doing.
Let's say I have 4 processors: I want to split the vector to be summed into 4 parts, and then have each processor locally add the elements inside. Then at the end, I can combine the results from each processor to get the total sum. From the cython documentation, I wasn't able to gather whether something like this is possible or not (the documentation is a little sparse).
I'd appreciate it if someone could explain if/how something like this is done using cython/openmp, or maybe help locate some relevant examples (it's been surprisingly hard to find simple ones online).
I want to split the vector to be summed into 4 parts, and then have each processor locally add the elements inside. Then at the end, I can combine the results from each processor to get the total sum.
That's exactly what's happening here already. Cython infers from your in-place operation that you want to do a reduction: OpenMP will implement a parallel loop with private (zero-initialized) copies of the total variable and add them all to total at the end of the loop.
In the generated C, this looks like this:
#pragma omp parallel
{
    #pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) reduction(+:__pyx_v_total)
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
        {
            __pyx_v_i = (int)(0 + 1 * __pyx_t_2);
            __pyx_t_4 = __pyx_v_i;
            __pyx_v_total = (__pyx_v_total + (*((double *) ( /* dim=0 */ (__pyx_v_vec.data + __pyx_t_4 * __pyx_v_vec.strides[0]) ))));
        }
    }
}
You just need to enable OpenMP as described here.
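For reference, a typical setup.py enabling OpenMP might look like the sketch below, assuming GCC or clang, where the flag is -fopenmp (MSVC uses /openmp instead); the module and file names here are placeholders:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

ext = Extension(
    "parallel_summation",
    ["parallel_summation.pyx"],
    extra_compile_args=["-fopenmp"],  # compile with OpenMP support
    extra_link_args=["-fopenmp"],     # link against the OpenMP runtime
)

setup(ext_modules=cythonize([ext]))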
The one thing that you should change in your code is to initialize total = 0; otherwise it's just an uninitialized C variable which may contain garbage.
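For completeness, the function with that one fix applied:

@boundscheck(False)
@wraparound(False)
def parallel_summation(double[:] vec):
    cdef int n = vec.shape[0]
    cdef double total = 0  # explicitly initialized
    cdef int i
    for i in prange(n, nogil=True):
        total += vec[i]
    return total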
Coming from the beautiful world of C, I am trying to understand this behavior:
In [1]: dataset = sqlContext.read.parquet('indir')
In [2]: sizes = dataset.mapPartitions(lambda x: [len(list(x))]).collect()
In [3]: for item in sizes:
   ...:     if(item == min(sizes)):
   ...:         count = count + 1
   ...:
would not even finish after 20 minutes, and I know that the list sizes is not that big, less than 205k in length. However, this executed instantly:
In [8]: min_item = min(sizes)
In [9]: for item in sizes:
   ...:     if(item == min_item):
   ...:         count = count + 1
   ...:
So what happened?
My guess: Python could not understand that min(sizes) will always be constant, and thus could not replace it after the first few calls with its return value, since Python uses an interpreter...
The reference for min() doesn't say anything that would explain the matter to me. What I was thinking is that it might need to look at the partitions to do that, but that shouldn't be the case, since sizes is a list, not an RDD!
Edit:
Here is the source of my confusion: I wrote a similar program in C:

for(i = 0; i < SIZE; ++i)
    if(i == mymin(array, SIZE))
        ++count;
and got these timings:
C02QT2UBFVH6-lm:~ gsamaras$ gcc -Wall main.c
C02QT2UBFVH6-lm:~ gsamaras$ ./a.out
That took 98.679177000 seconds wall clock time.
C02QT2UBFVH6-lm:~ gsamaras$ gcc -O3 -Wall main.c
C02QT2UBFVH6-lm:~ gsamaras$ ./a.out
That took 0.000000000 seconds wall clock time.
and for the timings, I used Nominal Animal's approach from my Time measurements question.
I'm by no means an expert on the inner workings of python, but from my understanding thus far you'd like to compare the speed of
for item in sizes:
    if(item == min(sizes)):
        count = count + 1
and
min_item = min(sizes)
for item in sizes:
    if(item == min_item):
        count = count + 1
Now, someone correct me if I have any of this wrong, but: in Python, lists are mutable and do not have a fixed length, and are treated as such, while in C an array has a fixed size. From this question:
Python lists are very flexible and can hold completely heterogeneous, arbitrary data, and they can be appended to very efficiently, in amortized constant time. If you need to shrink and grow your array time-efficiently and without hassle, they are the way to go. But they use a lot more space than C arrays.
Now take this example:

for item in sizes:
    if(item == min(sizes)):
        new_item = item - 1
        sizes.append(new_item)
Then the value of item == min(sizes) would be different on the next iteration. Python doesn't cache the resulting value of min(sizes), since that would break the above example, or require some logic to check whether the list has been changed. Instead it leaves that up to you. By defining min_item = min(sizes) you are essentially caching the result yourself.
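You can verify that min(sizes) really is re-evaluated on every pass with a small instrumented stand-in for min (a toy sketch; counting_min is a hypothetical wrapper introduced here for illustration):

calls = [0]

def counting_min(seq):
    # Count how often the minimum is actually computed
    calls[0] += 1
    return min(seq)

sizes = range(1000)
count = 0
for item in sizes:
    if item == counting_min(sizes):
        count = count + 1

print calls[0]  # 1000: one full O(n) scan per iteration, O(n^2) overall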
Now, since the array has a fixed size in C, it can find the min value with less overhead than a Python list, which is why I think there's no problem in C (C also being a much lower-level language).
Again, I don't fully understand the underlying code and compilation for Python, but I'm certain that if you analyzed the process of the loops in Python, you'd see Python repeatedly computing min(sizes), causing the extreme amount of lag. I'd love to learn more about the inner workings of Python (for example, are any methods cached in a loop, or is everything computed again for each iteration?), so if anyone has more info and/or corrections, let me know!
I have a 2D cost matrix M, perhaps 400x400, and I'm trying to calculate the optimal path through it. As such, I have a function like:
M[i,j] = M[i,j] + min(M[i-1,j-1],M[i-1,j]+P1,M[i,j-1]+P1)
which is obviously recursive. P1 is some additive constant. My code, which works more or less, is:
from numpy import array, inf

def optimalcost(cost, P1=10):
    width1, width2 = cost.shape
    M = array(cost)
    for i in range(0, width1):
        for j in range(0, width2):
            try:
                M[i,j] = M[i,j] + min(M[i-1,j-1], M[i-1,j]+P1, M[i,j-1]+P1)
            except:
                M[i,j] = inf
    return M
Now I know looping in Numpy is a terrible idea, and for things like the calculation of the initial cost matrix I've been able to find shortcuts to cutting the time down. However, as I need to evaluate potentially the entire matrix I'm not sure how else to do it. This takes around 3 seconds per call on my machine and must be applied to around 300 of these cost matrices. I'm not sure where this time comes from, as profiling says the 200,000 calls to min only take 0.1s - maybe memory access?
Is there a way to do this in parallel somehow? I assume there may be, but to me it seems each iteration is dependent unless there's a smarter way to memoize things.
There are parallels to this question: Can I avoid Python loop overhead on dynamic programming with numpy?
I'm happy to switch to C if necessary, but I like the flexibility of Python for rapid testing and the lack of faff with file IO. Off the top of my head, is something like the following code likely to be significantly faster?
#define P1 10

double min(double a, double b){
    return a < b ? a : b;
}

void optimalcost(double **costin, double **costout){
    /*
      We assume that costout is initially
      filled with costin's values.
    */
    double a, b, c, prevcost = 0.0;
    int i, j;
    /* Loops start at 1 so the i-1/j-1 accesses stay in bounds */
    for(i = 1; i < 400; i++){
        for(j = 1; j < 400; j++){
            a = prevcost + P1;
            b = costout[i][j-1] + P1;
            c = costout[i-1][j-1];
            costout[i][j] += min(a, min(b, c));
            prevcost = costout[i][j];
        }
    }
    return;
}
Update:
I'm on Mac, and I don't want to install a whole new Python toolchain, so I used Homebrew.
> brew install llvm --rtti
> LLVM_CONFIG_PATH=/usr/local/opt/llvm/bin/llvm-config pip install llvmpy
> pip install numba
New "numba'd" code:
from numba import autojit, jit
import time
import numpy as np

@autojit
def cost(left, right):
    height, width = left.shape
    cost = np.zeros((height, width, width))
    for row in range(height):
        for x in range(width):
            for y in range(width):
                cost[row, x, y] = abs(left[row, x] - right[row, y])
    return cost

@autojit
def optimalcosts(initcost):
    costs = np.zeros_like(initcost)
    for row in range(initcost.shape[0]):
        costs[row, :, :] = optimalcost(initcost[row])
    return costs

@autojit
def optimalcost(cost):
    width1, width2 = cost.shape
    P1 = 10
    prevcost = 0.0
    M = np.array(cost)
    for i in range(1, width1):
        for j in range(1, width2):
            M[i, j] += min(M[i-1, j-1], prevcost + P1, M[i, j-1] + P1)
            prevcost = M[i, j]
    return M

prob_size = 400
left = np.random.rand(prob_size, prob_size)
right = np.random.rand(prob_size, prob_size)

print '---------- Numba Time ----------'
t = time.time()
c = cost(left, right)
optimalcost(c[100])
print time.time() - t

print '---------- Native python Time --'
t = time.time()
c = cost.py_func(left, right)
optimalcost.py_func(c[100])
print time.time() - t
It's interesting writing code in Python that is so un-Pythonic. A note for anyone interested in writing Numba code: you need to express the loops explicitly in your code. Before, I had the neat Numpy one-liner,
abs(left[row,:][:,newaxis] - right[row,:])
to calculate the cost. That took around 7 seconds with Numba. Writing out the loops properly gives 0.5s.
It's an unfair comparison to compare it to native Python code, because Numpy can do that pretty quickly, but:
Numba compiled: 0.509318113327s
Native: 172.70626092s
I'm impressed both by the numbers and how utterly simple the conversion is.
If it's not hard for you to switch to the Anaconda distribution of Python, you can try using Numba, which for this particular simple dynamic algorithm would probably offer a lot of speedup without making you leave Python.
Numpy is usually not very good at iterative jobs (though it does have some commonly used iterative functions such as np.cumsum, np.cumprod, np.linalg.*, etc.). But for simple tasks like finding the shortest path (or lowest-energy path) above, you can vectorize the problem by thinking about what can be computed at the same time (and also try to avoid making copies):
Suppose we are finding the shortest path in the "row" direction (i.e. horizontally). We can first create our algorithm's input:
# The problem, 300 400*400 matrices
# Create infinitely high boundary so that we dont need to handle indexing "-1"
a = np.random.rand(300, 400, 402).astype('f')
a[:,:,::a.shape[2]-1] = np.inf
then prepare some utility arrays which we will use later (creation takes constant time):
# Create self-overlapping view for 3-way minimize
# This is the input in each iteration
# The shape is (400, 300, 400, 3), separately standing for row, batch, column, left-middle-right
A = np.lib.stride_tricks.as_strided(a, (a.shape[1],len(a),a.shape[2]-2,3), (a.strides[1],a.strides[0],a.strides[2],a.strides[2]))
# Create view for output, this is basically for convenience
# The shape is (399, 300, 400). 399 comes from the fact that first row is never modified
B = a[:,1:,1:-1].swapaxes(0, 1)
# Create a temporary array in advance (try to avoid cache miss)
T = np.empty((len(a), a.shape[2]-2), 'f')
and finally do the computation and timeit:
%%timeit
for i in np.arange(a.shape[1]-1):
A[i].min(2, T)
B[i] += T
The timing result on my (super old laptop) machine is 1.78 s, which is already way faster than the 3 minutes. I believe you can improve it even more (while sticking to numpy) by optimizing the memory layout and alignment (somehow). Or, you can simply use multiprocessing.Pool. It is easy to use, and this problem is trivial to split into smaller problems (by dividing on the batch axis).
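For instance, a minimal multiprocessing sketch (assuming optimalcost is the single-matrix function from the question, defined at module level so it can be pickled) could split the 300 matrices across workers like this:

import numpy as np
from multiprocessing import Pool, cpu_count

def process_batch(batch):
    # Each worker handles one chunk of the 300 cost matrices
    return [optimalcost(m) for m in batch]

if __name__ == '__main__':
    matrices = np.random.rand(300, 400, 400)
    pool = Pool(processes=cpu_count())
    chunks = np.array_split(matrices, cpu_count())   # divide on the batch axis
    results = pool.map(process_batch, chunks)
    costs = [c for batch in results for c in batch]  # flatten back to 300 results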
I have a piece of code that does the following:
for each file (already read into RAM):
    call a function and obtain a result
add the results up and display them
Each file can be analyzed in parallel. The function that analyzes each file is as following:
# Complexity = 1000*19*19 units of work
def fun(args):
    (a, b, p) = args
    for itr in range(1000):
        for i in range(19):
            for j in range(19):
                # The following random number generated depends on
                # latest values in (i-1, j), (i+1, j), (i, j-1) & (i, j+1)
                # cells of latest a and b arrays
                u = np.random.rand()
                if (u < p):
                    a[i, j] += -1
                else:
                    b[i, j] += 1
    return a + b
I am using multiprocessing package to achieve parallelism:
import numpy as np
import time
from multiprocessing import Pool, cpu_count

if __name__ == '__main__':
    t = time.time()
    pool = Pool(processes=cpu_count())
    args = [None] * 100
    for i in range(100):
        a = np.random.randint(2, size=(19, 19))
        b = np.random.randint(2, size=(19, 19))
        p = np.random.rand()
        args[i] = (a, b, p)
    result = pool.map(fun, args)
    for i in range(1, 100):
        result[0] += result[i]
    print result[0]
    print time.time() - t
I have written equivalent MATLAB code that uses parfor and calls fun in each iteration of parfor:
tic
args = cell(100, 1);
r = cell(100, 1);
parfor i = 1:100
    a = randi(2, 19, 19);
    b = randi(2, 19, 19);
    p = rand();
    args{i}.a = a;
    args{i}.b = b;
    args{i}.p = p;
    r{i} = fun(args{i});
end
for i = 2:100
    r{1} = r{1} + r{i};
end
disp(r{1});
toc
The implementation of fun is as follows:
function [ ret ] = fun( args )
    a = args.a;
    b = args.b;
    p = args.p;
    for itr = 1:1000
        for i = 1:19
            for j = 1:19
                u = rand();
                if (u < p)
                    a(i, j) = a(i, j) + -1;
                else
                    b(i, j) = b(i, j) + 1;
                end
            end
        end
    end
    ret = a + b;
end
I find that MATLAB is blazingly fast: it takes around 1.5 seconds on a dual-core processor, compared to around 33-34 seconds taken by the Python program. Why is this so?
EDIT: A lot of answers suggested that I should vectorize the random number generation. Actually, the way it works is that each random number generated depends on the latest a and b 2D arrays. I just put in a simple rand() call to keep the program simple and readable. In my actual program, the random number is always generated by looking at certain horizontally and vertically neighbouring cells of the (i, j) cell. So it is not possible to vectorize that.
Have you benchmarked both implementations of fun in a non-parallel context? One might just be a lot faster. In particular, those nested loops in the Python fun look like they might turn into a faster vectorized solution in Matlab, or may be optimized by Matlab's JIT.
Stick both implementations in profilers to see where they're spending their time. Convert both implementations to non-parallel and profile those first, to make sure they're equivalent in performance before introducing the complexities of the parallelization stuff.
And one last check: you're setting up Matlab's Parallel Computing Toolbox with a local pool of workers, right, and not hooking into a remote machine or picking up some other resources? How many workers on the Matlab side?
I did some tests on your Python code, though without the multiprocessing part, and achieved a roughly 25x speedup by making the following changes:
Used Python lists instead of a NumPy array, because the latter is really slow when you need to do a lot of indexing. My timings include the time it takes to do ndarray.tolist(), so this might actually be a viable option, as long as the arrays are not too big, I think.
Ran it in PyPy instead of the regular Python interpreter, because PyPy has a JIT compiler, to make the comparison with MATLAB a bit more fair. Regular CPython doesn't have such a feature.
Made the "random" call local to the function, i.e. do rand = np.random.rand and later u = rand(), because in Python lookups in the local namespace are faster, which can be significant in tight loops such as this.
Used Python's random.random instead of np.random.rand (also bound to a name local to the function).
Exchanged range for the generator-based xrange.
(This list is sorted from most significant speedup to only very small gain; a sketch combining these changes is shown below.)
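A sketch combining those changes (apart from PyPy, which is a matter of which interpreter you run the script with; fun_optimized is a hypothetical rewrite of the original fun):

import random
import numpy as np

def fun_optimized(args):
    a, b, p = args
    a = a.tolist()        # plain nested lists index faster than ndarrays in tight loops
    b = b.tolist()
    rand = random.random  # bind the lookup to a local name
    for itr in xrange(1000):
        for i in xrange(19):
            for j in xrange(19):
                if rand() < p:
                    a[i][j] -= 1
                else:
                    b[i][j] += 1
    return np.array(a) + np.array(b)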
Of course there's also the parallel computing aspect. With multiprocessing all the Python objects that are passed around between processes are "pickled", meaning they have to be serialized and de-serialized on top of being copied between processes. MATLAB also copies data between processes (or threads?), but probably does so in a less wasteful way. Next to that, setting up a multiprocessing.Pool also takes a short while, which may not be fair to your MATLAB benchmark, but I'm not sure about this.
Based on your timings and mine, I would say that Python and MATLAB could be about equally fast for this specific task. But, unfortunately, you have to jump through some hoops to get that speed with Python. Maybe using the @autojit capability of Numba could be interesting in this regard, if you have that available.
Try this version of fun and see if it gives you a speedup.
def fun(args):
    a, b, p = args
    n = 1000
    u = np.random.random((n, 19, 19))
    msk = u < p
    msk_sum = msk.sum(0)
    a -= msk_sum
    b += (n - msk_sum)
    return a + b
This is the more efficient way of implementing this kind of function using numpy.
These kinds of nested loops can have quite high overhead in interpreted languages like MATLAB and Python, but I suspect the JIT is compensating, at least partially, in MATLAB, so the performance difference between the vectorized and loop implementations will be smaller there. CPython does not currently have any optimizations for these kinds of loops (that I know of), but at least one Python implementation, PyPy, does have a JIT. Unfortunately, PyPy currently only has limited numpy support.
Update:
It seems like you have an iterative algorithm, and, at least in my experience, those are the hardest to optimize with numpy/CPython. Consider using Cython to write your nested loops (this tutorial might also be useful). Other people might have other suggestions, but that's the best I can think of.
Given the available information, you would probably benefit from using Cython, which lets you also use some parallelism. Depending on the random numbers you need, you could for instance use GSL to generate them.
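As a rough sketch of what such a Cython kernel could look like (the types are assumptions, and libc's rand is only a stand-in for a proper generator such as GSL's):

from libc.stdlib cimport rand, RAND_MAX
import numpy as np
cimport numpy as np

def fun_cython(np.int64_t[:, :] a, np.int64_t[:, :] b, double p):
    cdef int itr, i, j
    cdef double u
    for itr in range(1000):
        for i in range(19):
            for j in range(19):
                # C-level random draw, no Python call in the inner loop
                u = rand() / <double>RAND_MAX
                if u < p:
                    a[i, j] -= 1
                else:
                    b[i, j] += 1
    return np.asarray(a) + np.asarray(b)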
Original question
There's no need to use multiprocessing, because fun can be easily vectorized, which yields a massive speed up (over 50x).
I'm not sure why MATLAB is not hit as hard as numpy, but its JIT may be saving it. Python does not like doing two . lookups in the inner loop, nor calling expensive functions there.
def fun_fast(args):
    a, b, p = args
    for i in xrange(19):
        for j in xrange(19):
            u = np.random.rand(1000)
            msk = u < p
            msk_sum = msk.sum()
            a[i, j] -= msk_sum
            b[i, j] += msk.size - msk_sum
    return a + b