Local parallel computations for a summing operation - python

I've started messing around with parallel programming and cython/openmp, and I have a simple program that sums over an array using prange:
import numpy as np
from cython.parallel import prange
from cython import boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def parallel_summation(double[:] vec):
    cdef int n = vec.shape[0]
    cdef double total
    cdef int i

    for i in prange(n, nogil=True):
        total += vec[i]
    return total
It seems to work OK with a setup.py file. However, I was wondering if it is possible to adjust this function and have a little more control over what the processors are doing.
Let's say I have 4 processors: I want to split the vector to be summed into 4 parts, and then have each processor locally add the elements inside. Then at the end, I can combine the results from each processor to get the total sum. From the cython documentation, I wasn't able to gather whether something like this is possible or not (the documentation is a little sparse).
I'd appreciate it if someone could explain if/how something like this is done using cython/openmp, or maybe help me locate some relevant examples (it's been surprisingly hard to find simple ones online).

I want to split the vector to be summed into 4 parts, and then have each processor locally add the elements inside. Then at the end, I can combine the results from each processor to get the total sum.
That's exactly what's happening here already. Cython infers from your in-place operation that you want to do a reduction. OpenMP will implement a parallel loop with private (zero-initialized) copies of the total variable and add them all to total at the end of the loop.
In the generated C, this looks like this:
#pragma omp parallel
{
    #pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) reduction(+:__pyx_v_total)
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
        {
            __pyx_v_i = (int)(0 + 1 * __pyx_t_2);
            __pyx_t_4 = __pyx_v_i;
            __pyx_v_total = (__pyx_v_total + (*((double *) ( /* dim=0 */ (__pyx_v_vec.data + __pyx_t_4 * __pyx_v_vec.strides[0]) ))));
        }
    }
}
You just need to enable OpenMP as described here.
The one thing that you should change in your code is to initialize total = 0; otherwise it's just an uninitialized C variable which may contain garbage.
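Putting the two points together, a minimal sketch might look like the following; the file names (parallel_sum.pyx, setup.py) and the GCC-style -fopenmp flags are illustrative assumptions, not taken from the original post.

# parallel_sum.pyx -- the question's function, with total initialised to 0
import numpy as np
from cython.parallel import prange
from cython import boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def parallel_summation(double[:] vec):
    cdef int n = vec.shape[0]
    cdef double total = 0.0   # initialised, so it no longer starts as garbage
    cdef int i

    for i in prange(n, nogil=True):
        total += vec[i]
    return total

And a corresponding setup.py that compiles and links the extension with OpenMP enabled:

# setup.py -- builds the extension with OpenMP enabled (GCC-style flags)
from setuptools import Extension, setup
from Cython.Build import cythonize

ext = Extension(
    "parallel_sum",
    sources=["parallel_sum.pyx"],
    extra_compile_args=["-fopenmp"],
    extra_link_args=["-fopenmp"],
)

setup(ext_modules=cythonize([ext]))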

Related

How to produce random numbers in an efficient way with C++

I was using boost and QuantLib to produce an 'Array' containing random numbers (standard normally distributed). However, I noticed that the computational performance was not very desirable, and the speed was much slower than simply using numpy in Python. Can anyone give me some suggestions?
Many thanks.
Here is my c++ code:
using namespace QuantLib;

Array generateRandNumbers(unsigned long seed, Size n) {
    Array res(n);
    boost::mt19937 rnd(seed);
    boost::normal_distribution<> normDist(0, 1);
    boost::variate_generator<boost::mt19937&, boost::normal_distribution<>> generator_norm(rnd, normDist);
    BOOST_FOREACH(Real& x, res) x = generator_norm();
    return res;
}
int main()
{
    unsigned long seed = 1;
    Size n = 1e6;

    boost::timer timer;
    Array randNumbers = generateRandNumbers(seed, n);
    std::cout << timer.elapsed() << std::endl;

    return 0;
}
And this is my python code:
import numpy as np
import time
ts = time.time()
res = np.random.normal(0, 1, 1000000);
print(time.time() - ts)
You asked:
Can anyone give me some suggestions?
The programming language isn't likely to play a major factor here. Therefore this question comes down to other parameters:
Random number algorithm used (e.g. Mersenne Twister)
Implementation details of the algorithm that may affect performance
How the code is compiled, linked and run (release, debugging session, compiler optimizations)
As for the last point: make sure that when making the comparison you always use an optimized release build.
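As a rough illustration of the first point: numpy's legacy RandomState generator is also based on the Mersenne Twister, so a Python-side measurement on the same algorithmic footing as the boost::mt19937 code might look like this sketch (the seed and sample size are taken from the question; absolute timings depend entirely on the build and machine):

import time
import numpy as np

# Seeded Mersenne Twister, comparable to boost::mt19937 in the C++ version
rng = np.random.RandomState(seed=1)

ts = time.time()
res = rng.standard_normal(1000000)
print(time.time() - ts)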

Parallelizing using cython

Is there a way the code below can be parallelized? I looked into cython's prange, but couldn't figure out how it works. Does prange parallelize the internal loops on different cores? For the code below, how can I parallelize it?
import numpy as np
cimport cython

@cython.boundscheck(False)
def gs_iterate_once(double[:,:] doc_topic,
                    double[:,:] topic_word,
                    double[:] topic_distribution,
                    double[:] topic_probabilities,
                    unsigned int[:,:] doc_word_topic,
                    int num_topics):
    cdef unsigned int doc_id
    cdef unsigned int word_id
    cdef unsigned int topic_id
    cdef unsigned int new_topic

    for i in xrange(doc_word_topic.shape[0]):
        doc_id = doc_word_topic[i, 0]
        word_id = doc_word_topic[i, 1]
        topic_id = doc_word_topic[i, 2]

        doc_topic[doc_id, topic_id] -= 1
        topic_word[topic_id, word_id] -= 1
        topic_distribution[topic_id] -= 1

        for j in xrange(num_topics):
            topic_probabilities[j] = (doc_topic[doc_id, j] * topic_word[j, word_id]) / topic_distribution[j]

        new_topic = draw_topic(np.asarray(topic_probabilities))

        doc_topic[doc_id, new_topic] += 1
        topic_word[new_topic, word_id] += 1
        topic_distribution[new_topic] += 1

        # Set the new topic
        doc_word_topic[i, 2] = new_topic
prange uses OpenMP, which is indeed shared-memory parallelism. So, on a single computer it will create threads that run on the different cores available, with access to the same pool of memory.
For the routine that you show, the first step is to understand what part can be parallelized. Typically, when the data uses i as its first index and iteration i operates only on element i (and not, say, i-1 or i+1), the problem is parallelizable. That is not the case here, so you need to find a way to make the computation more independent.
Actually finding the specific parallel pattern is beyond the scope of an SO answer, but here are a few tips:
Everything inside the prange must be cythonized; Python calls are not possible in a nogil thread. (As @DavidW suggests, Python calls are possible when they are part of a with gil block.)
Typical advice here is to check, once your code has been made independent of the loop ordering, whether your results are the same when running the index from n-1 down to 0 instead of from 0 to n-1.
A few commented and illustrative examples:
https://homes.cs.washington.edu/~jmschr/lectures/Parallel_Processing_in_Python.html
Cython prange slower for 4 threads then with range
http://nealhughes.net/parallelcomp2/
http://www.perrygeo.com/parallelizing-numpy-array-loops-with-cython-and-mpi.html
@PierredeBuyl's answer gives a good outline of what prange does and how to use it.
These are a few more specific comments relating to your code:
You can't parallelize the outer loop because of statements like
doc_topic[doc_id, topic_id] -= 1
and the similar ones for the other variables and for the += 1 updates. These modify a variable that is shared between all the loop iterations and will cause inconsistent results.
A similar problem exists with topic_probabilities[j] = ... if you're parallelizing the outer loop.
You could easily parallelize the inner loop for j in xrange(num_topics): this only modifies data that depends on the index j, so there's no issue with the threads fighting to modify the same elements. (However, there's a performance cost each time you start a multithreaded region, so you usually try to parallelize the outer loop instead to avoid this; depending on the size of the arrays you may not gain much.)
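As a hedged sketch of that last option, the inner loop could be pulled into its own nogil prange loop. The helper name below is made up for illustration, it assumes an OpenMP-enabled build, and the draw_topic call and the count updates stay in the serial outer loop:

# inner_loop.pyx -- illustrative only, not from the original post
from cython.parallel import prange
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def compute_topic_probabilities(double[:, :] doc_topic,
                                double[:, :] topic_word,
                                double[:] topic_distribution,
                                double[:] topic_probabilities,
                                int doc_id, int word_id, int num_topics):
    cdef int j
    # Each iteration writes only to topic_probabilities[j], so the threads
    # never touch the same element and the loop is safe to run in parallel.
    for j in prange(num_topics, nogil=True):
        topic_probabilities[j] = (doc_topic[doc_id, j] * topic_word[j, word_id]) / topic_distribution[j]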

Converting Algorithm from Python to C: Suggestions for Using bin() in C?

So essentially, I have a homework problem to write in C, and instead of taking the easy route, I thought that I would implement a little algorithm and get some coding practice to impress my professor. The question is meant to help us pick up C (or review it; the former is the case for me), and it asks us to return all of the integers that divide a given integer (such that there is no remainder).
What I did in Python was to create an is_prime() method, a pool_of_primes() method, and a combinations() method. So far, I have everything done in C up to the combinations() method. The problem that I am running into now is some syntax errors (e.g. not being able to alter a string by declaration) and mainly the binary string that I was using to track what would be included in my list of combinations. Without being able to alter my string by declaration, the Python approach is kind of broken...
Here is the python code:
def combinations(aList):
    '''
    The idea is to provide a list of ints and combinations will provide
    all of the combinations of that list using binary.

    To track the combinations, we use a string representation of binary
    and count down from there. Each spot in the binary represents an
    on/off (included/excluded) indicator for the numbers.
    '''
    length = len(aList)  # Have this figured out
    s = ""
    canidates = 0
    nList = []
    if (length >= 21):
        print("\nToo many possible canidates for integers that divide our number.\n")
        return False
    for i in range(0, length):
        s += "1"
        canidates += pow(2, i)
    # We now have a string for on/off switch of the elements in our
    # new list. Canidates is the size of the new list.
    nList.append(1)
    while (canidates != 0):
        x = 1
        for i in range(0, length):
            if (int(s[i]) == 1):
                x = x * aList[i]
        nList.append(x)
        canidates -= 1
        s = ''
        temp = bin(canidates)
        for i in range(2, len(temp)):
            s = s + temp[i]
        if (len(s) != length):
            # This part is needed in cases of [1...000 - 1 = 0...111]
            while (len(s) != length):
                s = '0' + s
    return nList
Sorry if the entire code is too lengthy or not optimized to a specific liking. But it works, and it works well :)
Again, I currently have everything that aList would contain stored as a singly-linked list in C (which I am able to print/use). I also have a little macro I included in C to convert binary to an integer:
#define B(x) S_to_binary_(#x)

static inline unsigned long long S_to_binary_(const char *s)
{
    unsigned long long i = 0;
    while (*s) {
        i <<= 1;
        i += *s++ - '0';
    }
    return i;
}
This may be Coder's Block setting in, but I am not seeing how I can change the binary in the same way that I did in Python... Any help would be greatly appreciated! Also, as a note, what is typically the best way to return a finalized code in C?
EDIT:
Accidentally took credit for the macro above.
UPDATE
I just finished the code, and I uploaded it onto GitHub. I would like to thank @nneonneo for providing the step that I needed to finish it with exemplary code. If anyone has any further suggestions about the code, I would be happy to see their ideas on GitHub!
Why use a string at all? Keep it simple: use an integer, and use bitwise math to work with the number. Then you don't have to do any conversions back and forth. It will also be loads faster.
You can use a uint32_t to store the "bits", which is enough to hold 32 bits (since you max out at 21, this should work great).
For example, you can loop over the bits that are set by using a loop like this:
uint32_t my_number = ...;

for (int i = 0; i < 32; i++) {
    if (my_number & (1 << i)) {
        /* bit i is set */
    }
}
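For comparison with the string-based Python version in the question, here is a hedged Python sketch of the same bitmask idea; the function name and the way results are collected are illustrative, not from the original post:

def combinations_bitmask(a_list):
    """Product of every subset of a_list, enumerated with an integer bitmask
    instead of a binary string (illustrative sketch)."""
    n = len(a_list)
    products = []
    for mask in range(1 << n):        # 2**n subsets, including the empty one
        x = 1
        for i in range(n):
            if mask & (1 << i):       # bit i set -> include a_list[i]
                x *= a_list[i]
        products.append(x)
    return products

# Example: divisors built from the prime factors [2, 3, 5]
# sorted(set(combinations_bitmask([2, 3, 5])))  ->  [1, 2, 3, 5, 6, 10, 15, 30]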

Improve cython array indexing speed

I have a pretty simple function which I need to speed up. Essentially I have a big array of 16-bit numbers with some holes in it (about 10%). I need to traverse the array, find areas where there are two 0's in a row, then fill them in with the average of the previous and next elements. This takes only a few milliseconds in C, but Python is doing way worse.
I've switched from regular python arrays to numpy arrays, and then compiled my code using cython, but I'm still really far from my target. I was hoping someone with more experience might look at what I'm doing and give me some feedback.
My regular python code looks like this:
self.rawData = numpy.fromfile(ql, numpy.uint16, 50000)

[snip]

def fixZeroes(self):
    for x in range(2, len(self.rawData)):
        if self.rawData[x] == 0 and self.rawData[x-1] == 0:
            self.rawData[x] = (self.rawData[x-2] + self.rawData[x+2]) / 2
            self.rawData[x-1] = (self.rawData[x-3] + self.rawData[x+1]) / 2
My Cython code looks very similar:
import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.uint16
ctypedef np.uint16_t DTYPE_t

@cython.boundscheck(False)
def fix_zeroes(np.ndarray[DTYPE_t, ndim=1] raw):
    assert raw.dtype == DTYPE
    cdef int len = 50000

    for x in range(2, len):
        if raw[x] == 0 and raw[x-1] == 0:
            raw[x] = (raw[x-2] + raw[x+2]) / 2
            raw[x-1] = (raw[x-3] + raw[x+1]) / 2
    return raw
When I run this code, the performance is still way slower than I'd like:
Starting cython zero fix
Finished: 0:00:36.983681
starting python zero fix
Finished: 0:00:41.434476
I really think I must be doing something wrong. Almost every article I've seen talks about the huge performance gains numpy and cython add, but I'm barely breaking 10%.
You should declare the x variable that you are using to index the raw array:
cdef int x
You can also use other directives that usually provide a performance boost:
@cython.wraparound(False)
@cython.cdivision(True)
@cython.nonecheck(False)
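Put together, a minimal sketch of the function with those suggestions applied could look like this. It assumes cimport cython for the decorators, and the loop bound is tightened to n - 2 so the raw[x+2] access stays in range; that bound is my own adjustment, not part of the original:

import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.uint16
ctypedef np.uint16_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
@cython.nonecheck(False)
def fix_zeroes(np.ndarray[DTYPE_t, ndim=1] raw):
    assert raw.dtype == DTYPE
    cdef int n = raw.shape[0]
    cdef int x                      # typed loop index keeps the loop in pure C

    for x in range(2, n - 2):       # stop early so raw[x+2] stays in bounds
        if raw[x] == 0 and raw[x-1] == 0:
            raw[x] = (raw[x-2] + raw[x+2]) / 2
            raw[x-1] = (raw[x-3] + raw[x+1]) / 2
    return raw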

Why is cffi so much quicker than numpy?

I have been playing around with writing cffi modules in python, and their speed is making me wonder if I'm using standard python correctly. It's making me want to switch to C completely! Truthfully there are some great python libraries I could never reimplement myself in C so this is more hypothetical than anything really.
This example shows the sum function in Python being used with a numpy array, and how slow it is in comparison with a C function. Is there a quicker Pythonic way of computing the sum of a numpy array?
import numpy as np
from cffi import FFI

def cast_matrix(matrix, ffi):
    ap = ffi.new("double* [%d]" % (matrix.shape[0]))
    ptr = ffi.cast("double *", matrix.ctypes.data)
    for i in range(matrix.shape[0]):
        ap[i] = ptr + i*matrix.shape[1]
    return ap

ffi = FFI()
ffi.cdef("""
double sum(double**, int, int);
""")
C = ffi.verify("""
double sum(double** matrix, int x, int y){
    int i, j;
    double sum = 0.0;
    for (i=0; i<x; i++){
        for (j=0; j<y; j++){
            sum = sum + matrix[i][j];
        }
    }
    return(sum);
}
""")
m = np.ones(shape=(10,10))
print 'numpy says', m.sum()
m_p = cast_matrix(m, ffi)
sm = C.sum(m_p, m.shape[0], m.shape[1])
print 'cffi says', sm
just to show the function works:
numpy says 100.0
cffi says 100.0
now if I time this simple function I find that numpy is really slow!
Am I using numpy in the correct way? Is there a faster way to calculate the sum in python?
import time
n = 1000000
t0 = time.time()
for i in range(n): C.sum(m_p, m.shape[0], m.shape[1])
t1 = time.time()
print 'cffi', t1-t0
t0 = time.time()
for i in range(n): m.sum()
t1 = time.time()
print 'numpy', t1-t0
times:
cffi 0.818415880203
numpy 5.61657714844
Numpy is slower than C for two reasons: the Python overhead (probably similar to cffi) and generality. Numpy is designed to deal with arrays of arbitrary dimensions, in a bunch of different data types. Your example with cffi was made for a 2D array of floats. The cost was writing several lines of code instead of .sum(): six characters, to save less than 5 microseconds. (But of course, you already knew this.) I just want to emphasize that CPU time is cheap, much cheaper than developer time.
Now, if you want to stick to Numpy and get better performance, your best option is to use Bottleneck. It provides a few functions optimised for 1D and 2D arrays of floats and doubles, and they are blazing fast. In your case, about 16 times faster, which would put the execution time at about 0.35 s, roughly twice as fast as cffi.
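A hedged sketch of the Bottleneck suggestion, assuming the bottleneck package is installed (nansum is one of its optimised reductions and, for an array without NaNs, returns the same value as m.sum()):

import numpy as np
import bottleneck as bn

m = np.ones(shape=(10, 10))
print(bn.nansum(m))   # 100.0, computed by Bottleneck's optimised C code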
For other functions that Bottleneck does not have, you can use Cython. It helps you write C code with a more Pythonic syntax. Or, if you will, progressively convert Python into C until you are happy with the speed.
