I am a college freshman and Python newbie so bear with me. I am trying to parallelize some matrix operations. Here is my attempt using the ParallelPython module:
def testfunc(connectionMatrix, qCount, iCount, Htry, tStepCount):
    test = connectionMatrix[0:qCount, 0:iCount].dot(Htry[tStepCount-1, 0:iCount])
    return test
f1 = job_server.submit(testfunc, (self.connectionMatrix, self.qCount, self.iCount, self.iHtry, self.tStepCount), modules = ("scipy.sparse",))
f2 = job_server.submit(testfunc, (self.connectionMatrix, self.qCount, self.iCount, self.didtHtry, self.tStepCount), modules = ("scipy.sparse",))
r1 = f1()
r2 = f2()
self.qHtry[self.tStepCount, 0:self.qCount] = self.qHtry[self.tStepCount-1, 0:self.qCount] + self.delT * r1 + 0.5 * (self.delT**2) * r2
Plotting matrix size on the x-axis against percent speed-up on the y-axis gives something like a bell curve, capping out at about a 30% speed increase for 100x100 matrices. Smaller and larger matrices see less of an increase, and with small enough or large enough matrices the serial code is faster. My guess is that the problem lies in the passing of the arguments: the overhead of copying the large matrix is actually taking longer than the job itself. What can I do to get around this? Is there some way to incorporate memory sharing and pass the matrix by reference? As you can see, none of the arguments are modified, so it could be read-only access.
Thanks.
Well, the point of ParallelPython is that you can write code that doesn't care whether it's distributed across threads, processes, or even multiple computers, and using memory sharing would break that abstraction.
One option is to use something like a file on a shared filesystem, where you mmap that file in each worker. Of course that's more complicated, and whether it's better or worse will depend on a lot of details about the filesystem, sharing protocol, and network, but it's an option.
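For illustration, a minimal sketch of the mmap idea, assuming every worker can see the same path (local or on a shared filesystem); the path and shape below are made up for the example:

import mmap
import numpy as np

SHAPE = (1000, 1000)                         # assumed matrix shape, example only
PATH = "/tmp/connection_matrix.dat"          # hypothetical path visible to all workers

def save_matrix(matrix, path=PATH):
    # write the matrix once as raw float64 bytes so workers can map it read-only
    matrix.astype(np.float64).tofile(path)

def load_matrix(path=PATH, shape=SHAPE):
    # each worker maps the file instead of receiving a pickled copy of the matrix
    with open(path, "rb") as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return np.frombuffer(buf, dtype=np.float64).reshape(shape)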
If you're willing to give up the option of distributed processing, you can use multiprocessing.Array (or multiprocessing.Value, or multiprocessing.sharedctypes) to access shared memory. But at that point, you might want to consider just using multiprocessing instead of ParallelPython for the job distribution, since multiprocessing is part of the standard library, has a more powerful API, and you're explicitly giving up the one major advantage of ParallelPython.
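If you do go the multiprocessing route, a rough sketch of the shared-memory approach might look like the following (assuming a Unix fork start method; the shapes and function names are placeholders, not your code). The matrix is copied into the shared block once, and each job only pickles the small per-call arguments:

import multiprocessing as mp
import numpy as np

_shared = None          # filled in per worker by the pool initializer

def _init(shared):
    global _shared
    _shared = shared

def dot_slice(args):
    rows, cols, shape, vector = args
    # re-wrap the shared buffer as an ndarray view; the big matrix is never copied
    m = np.frombuffer(_shared, dtype=np.float64).reshape(shape)
    return m[:rows, :cols].dot(vector)

if __name__ == '__main__':
    matrix = np.random.rand(100, 100)          # stand-in for connectionMatrix
    shared = mp.RawArray('d', matrix.size)     # one shared block, written once
    np.frombuffer(shared, dtype=np.float64)[:] = matrix.ravel()

    vec = np.random.rand(100)
    jobs = [(100, 100, matrix.shape, vec)] * 2
    with mp.Pool(2, initializer=_init, initargs=(shared,)) as pool:
        r1, r2 = pool.map(dot_slice, jobs)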
Or you could combine the two options, for the worst of both worlds in many ways, but maybe the best in terms of how little you need to change your existing code: Just use a local file and mmap it.
However, before you do any of this, you may want to consider profiling to see if copying the matrix really is the bottleneck. And, if it is, you may want to consider whether there's an algorithmic fix, just copying the part each job needs instead of copying the entire matrix. (Whether that makes sense depends on whether the part each job needs is significantly less than the whole thing, of course.)
I have a complex nested data structure. I iterate through it and perform some calculations on each possible unique pair of elements. It's all in-memory mathematical functions. I don't read from files or do networking.
It takes a few hours to run, with do_work() being called 25,000 times. I am looking for ways to speed it up.
Although Pool.map() seems useful for my lists, it's proving to be difficult because I need to pass extra arguments into the function being mapped.
I thought using the Python multiprocessing library would help, but when I use Pool.apply_async() to call do_work(), it actually takes longer.
I did some googling and a blogger says "Use sync for in-memory operations — async is a complete waste when you aren’t making blocking calls." Is this true? Can someone explain why? Do the RAM read & write operations interfere with each other? Why does my code take longer with async calls? do_work() writes calculation results to a database, but it doesn't modify my data structure.
Surely there is a way to utilize my processor cores instead of just linearly iterating through my lists.
My starting point, doing it synchronously:
main_list = [ [ [a,b,c,[x,y,z], ... ], ... ], ... ] # list of identical structures
helper_list = [1,2,3]
z = 2
for i_1 in range(0, len(main_list)):
    for i_2 in range(0, len(main_list)):
        if i_1 < i_2: # only unique combinations
            for m in range(0, len(main_list[i_1])):
                for h, helper in enumerate(helper_list):
                    do_work(
                        main_list[i_1][m][0], main_list[i_2][m][0], # unique combo
                        main_list[i_1][m][1], main_list[i_1][m][2],
                        main_list[i_1][m][3][z], main_list[i_2][m][3][h],
                        helper_list[h]
                    )
Variable names have been changed to make it more readable.
This is just a general answer, but too long for a comment...
First of all, I think your biggest bottleneck at this very moment is Python itself. I don't know what do_work() does, but if it's CPU intensive, you have the GIL, which completely prevents effective parallelisation inside one process. No matter what you do, threads will fight for the GIL and eventually make your code even slower. Remember: Python has real threading, but the CPU is shared inside a single process.
I recommend checking out David M. Beazley's page: http://dabeaz.com/GIL/gilvis - he has put a lot of effort into visualising the GIL's behaviour in Python.
On the other hand, the module multiprocessing allows you to run multiple processes and "circumvent" the GIL downsides, but it will be tricky to get access to the same memory locations without bigger penalties or trade-offs.
Second: if you use heavy nested loops, you should think about using numba and trying to fit your data structures into numpy (structured) arrays. This can quite easily give you an order-of-magnitude speed-up. Python is slow as hell for such things, but luckily there are ways to squeeze out a lot when using the appropriate libraries.
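As a rough illustration only (I don't know what do_work() actually computes, so the kernel below is a placeholder), a numba-compiled pairwise loop over a plain numpy array might look like this:

import numpy as np
from numba import njit

@njit
def pairwise_work(values):
    # values: 2-D float64 array, one row per element of the outer list
    n = values.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):                   # only unique pairs
            diff = values[i] - values[j]
            total += np.sqrt(np.sum(diff * diff))   # placeholder calculation
    return total

values = np.random.rand(500, 4)
print(pairwise_work(values))

The first call pays a compilation cost; after that, the nested loop runs as compiled machine code rather than in the interpreter.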
To sum up, I think the code you are running could be orders of magnitude faster with numba and numpy structures.
Alternatively, you can try to rewrite the code in a language like Julia (very similar syntax to Python, and the community is extremely helpful) and quickly check how fast it is, in order to explore the limits of the performance. It's always a good idea to get a feeling for how fast something (or parts of a code) can be in a language that doesn't have Python's performance quirks.
Your task is CPU bound rather than I/O bound. Asynchronous execution makes sense when you have long I/O operations, e.g. sending/receiving something over the network.
What you can do is split the task into chunks and use multiprocessing so the chunks run on different CPU cores.
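Since passing extra arguments to the mapped function was the sticking point, here is a minimal sketch (the do_work() body is a stand-in) of pre-building one argument tuple per unique pair and using Pool.starmap():

from itertools import combinations
from multiprocessing import Pool

def do_work(a, b, extra):
    # placeholder for the real calculation
    return (a + b) * extra

def build_jobs(main_list, extra):
    # one argument tuple per unique pair; the extra argument travels inside the tuple
    return [(a, b, extra) for a, b in combinations(main_list, 2)]

if __name__ == '__main__':
    main_list = list(range(100))
    jobs = build_jobs(main_list, extra=3)
    with Pool() as pool:
        results = pool.starmap(do_work, jobs)
    print(sum(results))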
I have implemented the Sieve of Atkin and it works great up to primes nearing 100,000,000 or so. Beyond that, it breaks down because of memory problems.
In the algorithm, I want to replace the memory-based array with a hard-drive-based array. Python's "wb" file functions and seek functions may do the trick. Before I go off inventing new wheels, can anyone offer advice? Two issues appear at the outset:
Is there a way to "chunk" the Sieve of Atkin to work on segments in memory, and
is there a way to suspend the activity and come back to it later, suggesting I could serialize the memory variables and restore them?
Why am I doing this? An old geezer looking for entertainment and to keep the noodle working.
Implementing the SoA in Python sounds fun, but note it will probably be slower than the SoE in practice. For some good monolithic SoE implementations, see RWH's StackOverflow post. These can give you some idea of the speed and memory use of very basic implementations. The numpy version will sieve to over 10,000M on my laptop.
What you really want is a segmented sieve. This lets you constrain memory use to some reasonable limit (e.g. 1M + O(sqrt(n)), and the latter can be reduced if needed). A nice discussion and code in C++ is shown at primesieve.org. You can find various other examples in Python. primegen, Bernstein's implementation of SoA, is implemented as a segmented sieve (Your question 1: Yes the SoA can be segmented). This is closely related (but not identical) to sieving a range. This is how we can use a sieve to find primes between 10^18 and 10^18+1e6 in a fraction of a second -- we certainly don't sieve all numbers to 10^18+1e6.
Involving the hard drive is, IMO, going the wrong direction. We ought to be able to sieve faster than we can read values from the drive (at least with a good C implementation). A ranged and/or segmented sieve should do what you need.
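To make the segmented idea concrete, here is a minimal sketch of a segmented Sieve of Eratosthenes (not Atkin) in pure Python; the segment size is arbitrary, math.isqrt needs Python 3.8+, and the code favours clarity over speed:

import math

def segmented_sieve(limit, segment_size=1 << 16):
    # yields all primes below limit using roughly segment_size + sqrt(limit) memory
    root = math.isqrt(limit) + 1
    base = [True] * root
    base[0:2] = [False, False]
    for i in range(2, math.isqrt(root) + 1):
        if base[i]:
            base[i*i::i] = [False] * len(base[i*i::i])
    small_primes = [i for i, is_p in enumerate(base) if is_p]
    for p in small_primes:
        yield p
    # sieve the rest of the range one segment at a time
    for lo in range(root, limit, segment_size):
        hi = min(lo + segment_size, limit)
        seg = [True] * (hi - lo)
        for p in small_primes:
            start = max(p * p, ((lo + p - 1) // p) * p)
            seg[start - lo::p] = [False] * len(seg[start - lo::p])
        for i, is_p in enumerate(seg):
            if is_p:
                yield lo + i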
There are better ways to do storage, which will help some. My SoE, like a few others, uses a mod-30 wheel, so has 8 candidates per 30 integers, hence uses a single byte per 30 values. It looks like Bernstein's SoA does something similar, using 2 bytes per 60 values. RWH's Python implementations aren't quite there, but are close enough at 10 bits per 30 values. Unfortunately it looks like Python's native list of bools uses about 10 bytes per flag, and a numpy bool array a full byte per flag. Either use a segmented sieve and don't worry about it too much, or find a way to be more efficient in the Python storage.
First of all, you should make sure that you store your data in an efficient manner. You could easily store a flag for every number up to 100,000,000 in 12.5 MB of memory by using a bitmap, and by skipping obvious non-primes (even numbers and so on) you could make the representation even more compact. This also helps when storing the data on the hard drive. That you're getting into trouble at 100,000,000 suggests that you're not storing the data efficiently.
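A minimal sketch of that kind of bit packing with a plain bytearray (one bit per odd number; the helper names are made up):

def make_bitmap(limit):
    # one bit per odd number below limit; even numbers are never stored
    return bytearray(limit // 16 + 1)

def set_composite(bits, n):
    idx = n // 2
    bits[idx >> 3] |= 1 << (idx & 7)

def is_prime(bits, n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    idx = n // 2
    return not (bits[idx >> 3] & (1 << (idx & 7)))

Used with an ordinary marking loop, this keeps flags for all odd numbers below 100,000,000 in about 6 MB.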
Some hints if you don't receive a better answer.
1.Is there a way to "chunk" the Sieve of Atkin to work on segment in memory
Yes, for the Eratosthenes-like part what you could do is run through multiple elements of the sieve list in "parallel" (one block at a time) and that way minimise the disk accesses.
The first part is somewhat more tricky: what you would want to do is process the 4*x**2+y**2, 3*x**2+y**2 and 3*x**2-y**2 in a more sorted order. One way is to first compute them and then sort the numbers; there are sorting algorithms that work well on drive storage (still being O(N log N)), but that would hurt the time complexity. A better way is to iterate over x and y so that you work on one block at a time; since a block is determined by an interval, you could for example simply iterate over all x and y such that lo <= 4*x**2+y**2 <= hi.
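To make the block idea concrete, a rough sketch (the bounds are arbitrary, and only the first quadratic form is shown; the other two would be handled analogously) of generating just the 4*x**2 + y**2 candidates that fall inside one block:

import math

def block_candidates_4x2_y2(lo, hi):
    # yields every n = 4*x*x + y*y with lo <= n < hi, without touching other blocks
    x = 1
    while 4 * x * x < hi:
        base = 4 * x * x
        # smallest y >= 1 such that base + y*y >= lo
        y = 1 if base >= lo else math.isqrt(lo - base - 1) + 1
        while base + y * y < hi:
            yield base + y * y
            y += 1
        x += 1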
2.is there a way to suspend the activity and come back to it later - suggesting I could serialize the memory variables and restore them
In order to achieve this (no matter how and when the program is terminated) you first have to journal the disk accesses (e.g. use an SQL database to keep the data, though with care you could do it yourself).
Second, since the operations in the first part are not idempotent, you have to make sure that you don't repeat those operations. However, since you would be running that part block by block, you could simply detect which block was the last one processed and resume there (if you can end up with a partially processed block, you'd just discard it and redo that block). For the Eratosthenes part it's idempotent, so you could just run through all of it, but for increased speed you could store a list of produced primes after the sieving of them has been done (so you would resume with sieving after the last produced prime).
As a by-product, you should even be able to construct the program in a way that makes it possible to keep the data from the first step even while the second step is running, and thereby extend the limit at a later moment by continuing the first step and then running the second step again. Perhaps even have two programs, where you terminate the first when you've grown tired of it and then feed its output to the Eratosthenes part (thereby not having to define a limit).
You could try using a signal handler to catch when your application is terminated. This could then save your current state before terminating. The following script shows a simple number count continuing when it is restarted.
import signal, os, cPickle

class MyState:
    def __init__(self):
        self.count = 1

def stop_handler(signum, frame):
    global running
    running = False

signal.signal(signal.SIGINT, stop_handler)
running = True
state_filename = "state.txt"

if os.path.isfile(state_filename):
    with open(state_filename, "rb") as f_state:
        my_state = cPickle.load(f_state)
else:
    my_state = MyState()

while running:
    print my_state.count
    my_state.count += 1

with open(state_filename, "wb") as f_state:
    cPickle.dump(my_state, f_state)
As for improving disk writes, you could try experimenting with increasing Python's own file buffering to a 1 MB or larger buffer, e.g. open('output.txt', 'w', 2**20). Using a with statement should also ensure your file gets flushed and closed.
There is a way to compress the array. It may cost some efficiency depending on the python interpreter, but you'll be able to keep more in memory before having to resort to disk. If you search online, you'll probably find other sieve implementations that use compression.
Neglecting compression though, one of the easier ways to persist memory to disk would be through a memory mapped file. Python has an mmap module that provides the functionality. You would have to encode to and from raw bytes, but it is fairly straightforward using the struct module.
>>> import struct
>>> struct.pack('H', 0xcafe)
b'\xfe\xca'
>>> struct.unpack('H', b'\xfe\xca')
(51966,)
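For example, here is a rough sketch (with an arbitrary file name and limit) of keeping one byte per candidate in a memory-mapped file, so the operating system pages it in and out for you:

import mmap
import os

LIMIT = 10**8
PATH = "sieve.dat"          # hypothetical scratch file

# create a file big enough for one flag byte per number (sparse on most filesystems)
if not os.path.exists(PATH) or os.path.getsize(PATH) < LIMIT:
    with open(PATH, "wb") as f:
        f.truncate(LIMIT)

f = open(PATH, "r+b")
flags = mmap.mmap(f.fileno(), LIMIT)

flags[10] = 1                # mark 10 as composite; the mmap acts like a big bytearray
print(flags[10], flags[11])
flags.flush()
flags.close()
f.close()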
I wrote a machine learning Expectation Maximization algorithm in Python, basically an implementation of IBM Model 1 for doing machine translation (here is my GitHub if you want to look at the code) and it works, but reeeaaaaallly sloowwwlly. I'm taking a class now in parallel computing and I was wondering if I could use Python multiprocessing to reach convergence faster. Can anyone give me any pointers or tips? I don't even know where to start.
EDIT: I was reading around and found this paper on using EM with MapReduce to do parallelization -- maybe this is a better idea?
Most of your problem is that Python is really slow. Remember, your code is executing in an interpreter. When you write code (such as your line 82) that performs a numerical computation one element at a time, you pay for that one computation plus all of the overhead of the Python interpreter.
The first thing you will want to do is vectorize your code with numpy. Unlike your normal Python code, numpy calls out to precompiled, efficient binary code. The more work you can push into numpy, the less time you will waste in the interpreter.
Once you vectorize your code, you can then start profiling it if it's still too slow. You should be able to find a lot of simple examples on how to vectorize Python, and some of the alternative options.
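As a generic illustration (not tied to your model's actual update equations), this is the kind of change vectorization means:

import numpy as np

counts = np.random.rand(1000, 1000)      # stand-in for a table of co-occurrence counts

# element-at-a-time: every iteration pays interpreter overhead
row_totals = [0.0] * counts.shape[0]
for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        row_totals[i] += counts[i][j]

# vectorized: one call into numpy's compiled code
row_totals = counts.sum(axis=1)

# normalising each row (a typical M-step style update) is also a single expression
probs = counts / row_totals[:, np.newaxis]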
EDIT: Let me clarify: parallelizing inherently slow code is mostly pointless. First, there is the issue that parallelizing slow code gives the false impression that you have made an improvement. The "scaling up" of parallel code should always be done against the fastest possible single-threaded version of the same code (within reason; no need to write everything in assembly before starting any parallel code). For example, consider a lock under contention. The more threads fighting for the lock, the slower the code will run, and you will get no (or negative) performance gains. One way to reduce contention for the lock is to simply slow down the code competing for it. This makes it appear as if there is no overhead from lock contention, when in actuality you have no improvement, because the fastest single-threaded version of your code will outperform your parallel code.
Also, Python really isn't a great language to learn how to write parallel code in. Python has the GIL, which essentially forces all multithreaded code in Python to run as if there was but one CPU core. This means bizarre hacks (such as the one you linked) must be done, which have their own additional drawbacks and issues (there are times where such tricks are needed/used, but they shouldn't be the default for running code on a single machine). Don't expect what you learn writing any parallel Python code to carry over to other languages or help you with your course.
I think you will have some good success depending on where your bottleneck is. One caveat: when I do code optimization I always like to profile the code, even informally, to get an idea of where the bottlenecks are. This will help identify where the time is being spent, e.g. file I/O, network latency, resource contention, not enough CPU cycles, etc.
For others who may not be familiar with the Expectation Maximization algorithm a very nice introduction is in Motion Segmentation using EM - a short tutorial, by Yair Weiss. Let us assume we have M data points and N classes/models.
In the EM algorithm there are two steps: Computing the distance between data points and models and Updating our model weights using weighted least squares.
Step 1 - Expectation stage
for data_point in M:
    for current_model in N:
        compute distance or residual between data_point and current_model
Step 2 - Maximization stage
for each model, compute weighted least squares solving for the model parameters
This requires solving N weighted least squares problems whose size depends on the number of parameters in the model being solved for.
Your bottleneck may be in the stage of computing the residuals or distances between the data points and the models (stage 1, the E-step). In this stage the computations are all independent. I would consider the first stage embarrassingly parallel and quite amenable to parallel computation using a parallel map/reduce or some other tools in Python. I have had good success using IPython for such tasks, but there are other good Python packages as well.
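A minimal sketch of what parallelising the E-step might look like with a plain multiprocessing.Pool (the distance function and data here are stand-ins, not your model):

import numpy as np
from multiprocessing import Pool

def residuals_for_chunk(chunk, models):
    # distance from every point in the chunk to every model; rows are independent
    return np.linalg.norm(chunk[:, None, :] - models[None, :, :], axis=2)

if __name__ == '__main__':
    models = np.random.rand(10, 3)          # N models, stand-in parameters
    data = np.random.rand(100000, 3)        # M data points
    chunks = np.array_split(data, 8)
    with Pool(8) as pool:
        parts = pool.starmap(residuals_for_chunk, [(c, models) for c in chunks])
    residuals = np.vstack(parts)
    print(residuals.shape)                  # (M, N)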
I'm using python to set up a computationally intense simulation, then running it in a custom built C-extension and finally processing the results in python. During the simulation, I want to store a fixed-length number of floats (C doubles converted to PyFloatObjects) representing my variables at every time step, but I don't know how many time steps there will be in advance. Once the simulation is done, I need to pass back the results to python in a form where the data logged for each individual variable is available as a list-like object (for example a (wrapper around a) continuous array, piece-wise continuous array or column in a matrix with a fixed stride).
At the moment I'm creating a dictionary mapping the name of each variable to a list containing PyFloatObject objects. This format is perfect for working with in the post-processing stage but I have a feeling the creation stage could be a lot faster.
Time is quite crucial since the simulation is a computationally heavy task already. I expect that a combination of (a) buying lots of memory and (b) setting up the experiment wisely will allow the entire log to fit in RAM. However, with my current dict-of-lists solution, keeping every variable's log in a contiguous section of memory would require a lot of copying and overhead.
My question is: What is a clever, low-level way of quickly logging gigabytes of doubles in memory with minimal space/time overhead, that still translates to a neat python data structure?
Clarification: when I say "logging", I mean storing until after the simulation. Once that's done a post-processing phase begins and in most cases I'll only store the resulting graphs. So I don't actually need to store the numbers on disk.
Update: In the end, I changed my approach a little and added the log (as a dict mapping variable names to sequence types) to the function parameters. This allows you to pass in objects such as lists or array.arrays or anything that has an append method. It adds a little time overhead because I'm using the PyObject_CallMethodObjArgs function to call the append method instead of PyList_Append or similar. Using arrays lets you reduce the memory load, which appears to be the best I can do short of writing my own expanding storage type. Thanks everyone!
You might want to consider doing this in Cython, instead of as a C extension module. Cython is smart, and lets you do things in a pretty pythonic way, while at the same time letting you use C datatypes as well as Python datatypes.
Have you checked out the array module? It allows you to store lots of scalar, homogeneous types in a single collection.
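For instance, a tiny sketch of what the log could look like with array instead of lists of PyFloatObjects (the variable names are made up):

from array import array

log = {"velocity": array('d'), "pressure": array('d')}

# inside the simulation loop: appending stores raw C doubles, not PyFloatObjects
log["velocity"].append(1.25)
log["pressure"].append(101.3)

print(len(log["velocity"]), log["velocity"][0])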
If you're truly "logging" these, and not just returning them to CPython, you might try opening a file and fprintf'ing them.
BTW, realloc might be your friend here, whether you go with a C extension module or Cython.
This is going to be more a huge dump of ideas rather than a consistent answer, because it sounds like that's what you're looking for. If not, I apologize.
The main thing you're trying to avoid here is storing billions of PyFloatObjects in memory. There are a few ways around that, but they all revolve around storing billions of plain C doubles instead, and finding some way to expose them to Python as if they were sequences of PyFloatObjects.
To make Python (or someone else's module) do the work, you can use a numpy array, a standard library array, a simple hand-made wrapper on top of the struct module, or ctypes. (It's a bit odd to use ctypes to deal with an extension module, but there's nothing stopping you from doing it.) If you're using struct or ctypes, you can even go beyond the limits of your memory by creating a huge file and mmapping windows into it as needed.
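As a sketch of the "plain C doubles exposed to Python" idea, using ctypes and numpy (the buffer size is arbitrary, and the loop just stands in for what the C extension would write):

import ctypes
import numpy as np

N = 1000000
raw = (ctypes.c_double * N)()          # one contiguous block of C doubles

# the C extension would fill this buffer; simulate that here
for i in range(10):
    raw[i] = i * 0.5

view = np.frombuffer(raw, dtype=np.float64)   # zero-copy view, no PyFloatObjects stored
print(view[:10])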
To make your C module do the work, instead of actually returning a list, return a custom object that implements the sequence protocol, so that when someone calls, say, foo.__getitem__(i), you convert _array[i] to a PyFloatObject on the fly.
Another advantage of mmap is that, if you're creating the arrays iteratively, you can create them by just streaming to a file, and then use them by mmapping the resulting file back as a block of memory.
Otherwise, you need to handle the allocations. If you're using the standard array, it takes care of auto-expanding as needed, but otherwise, you're doing it yourself. The code to do a realloc and copy if necessary isn't that difficult, and there's lots of sample code online, but you do have to write it. Or you may want to consider building a strided container that you can expose to Python as if it were contiguous even though it isn't. (You can do this directly via the complex buffer protocol, but personally I've always found that harder than writing my own sequence implementation.) If you can use C++, vector is an auto-expanding array, and deque is a strided container (and if you've got the SGI STL rope, it may be an even better strided container for the kind of thing you're doing).
As the other answer pointed out, Cython can help for some of this. Not so much for the "exposing lots of floats to Python" part; you can just move pieces of the Python part into Cython, where they'll get compiled into C. If you're lucky, all of the code that needs to deal with the lots of floats will work within the subset of Python that Cython implements, and the only things you'll need to expose to actual interpreted code are higher-level drivers (if even that).
I have written an algorithm that takes geospatial data and performs a number of steps. The input data are a shapefile of polygons and covariate rasters for a large raster study area (~150 million pixels). The steps are as follows:
Sample points from within polygons of the shapefile
For each sampling point, extract values from the covariate rasters
Build a predictive model on the sampling points
Extract covariates for target grid points
Apply predictive model to target grid
Write predictions to a set of output grids
The whole process needs to be iterated a number of times (say 100), but each iteration currently takes more than an hour when processed in series. For each iteration, the most time-consuming parts are steps 4 and 5. Because the target grid is so large, I've been processing it a block (say 1000 rows) at a time.
I have a 6-core CPU with 32 GB of RAM, so within each iteration, I had a go at using Python's multiprocessing module with a Pool object to process a number of blocks simultaneously (steps 4 and 5) and then write the output (the predictions) to the common set of output grids using a callback function that calls a global output-writing function. This seems to work, but is no faster (actually, it's probably slower) than processing each block in series.
So my question is, is there a more efficient way to do it? I'm interested in the multiprocessing module's Queue class, but I'm not really sure how it works. For example, I'm wondering if it's more efficient to have a queue that carries out steps 4 and 5 then passes the results to another queue that carries out step 6. Or is this even what Queue is for?
Any pointers would be appreciated.
The current state of Python's multiprocessing capabilities is not great for CPU-bound processing. I'm afraid to tell you that there is no way to make it run faster using the multiprocessing module, nor is your use of multiprocessing the problem.
The real problem is that Python is still bound by the rules of the Global Interpreter Lock (GIL) (I highly suggest the slides). There have been some exciting theoretical and experimental advances on working around the GIL. Python 3.2 even contains a new GIL which solves some of the issues, but introduces others.
For now, it is faster to execute many Python processes, each with a single serial thread, than to attempt to run many threads within one process. This allows you to avoid the issue of acquiring the GIL between threads (by effectively having more GILs). This, however, is only beneficial if the IPC overhead between your Python processes doesn't eclipse the benefits of the processing.
Eli Bendersky wrote a decent overview article on his experiences with attempting to make a CPU bound process run faster with multiprocessing.
It is worth noting that PEP 371 had the desire to 'side-step' the GIL with the introduction of the multiprocessing module (previously a non-standard package named pyProcessing). However, the GIL still seems to play too large a role in the Python interpreter to make it work well with CPU-bound algorithms. Many different people have worked on removing/rewriting the GIL, but nothing has made enough traction to make it into a Python release.
Some of the multiprocessing examples at python.org are not very clear IMO, and it's easy to start off with a flawed design. Here's a simplistic example I made to get me started on a project:
import os, time, random, multiprocessing

def busyfunc(runseconds):
    starttime = int(time.clock())
    while 1:
        for randcount in range(0, 100):
            testnum = random.randint(1, 10000000)
            newnum = testnum / 3.256
        newtime = int(time.clock())
        if newtime - starttime > runseconds:
            return

def main(arg):
    print 'arg from init:', arg
    print "I am " + multiprocessing.current_process().name
    busyfunc(15)

if __name__ == '__main__':
    p = multiprocessing.Process(name="One", target=main, args=('passed_arg1',))
    p.start()
    p = multiprocessing.Process(name="Two", target=main, args=('passed_arg2',))
    p.start()
    p = multiprocessing.Process(name="Three", target=main, args=('passed_arg3',))
    p.start()
    time.sleep(5)
This should exercise 3 processors for 15 seconds. It should be easy to modify it for more. Maybe this will help to debug your current code and ensure you are really generating multiple independent processes.
If you must share data due to RAM limitations, then I suggest this:
http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes
As Python is not really meant to do intensive number-crunching, I typically start converting time-critical parts of a Python program to C/C++, which speeds things up a lot.
Also, Python multithreading is not very good. Python keeps using a global semaphore for all kinds of things, so even when you use the threads that Python offers, things won't get faster. The threads are useful for applications where threads typically wait for things like IO.
When making a C module, you can manually release the global semaphore when processing your data (then, of course, do not access the python values anymore).
It takes some practice using the C API, but it's clearly structured and much easier to use than, for example, the Java native API.
See 'extending and embedding' in the python documentation.
This way you can make the time critical parts in C/C++, and the slower parts with faster programming work in python...
I recommend you first check which aspects of your code are taking the most time, so you're going to have to profile it. I've used http://packages.python.org/line_profiler/#line-profiler with much success, though it does require Cython.
As for Queues, they're mostly used for sharing data and synchronizing, though I've rarely used them. I do use multiprocessing all the time.
I mostly follow the map-reduce philosophy, which is simple and clean, but it has some major overhead, since values have to be packed into dictionaries and copied to each process when applying the map function...
You can try segmenting your file and applying your algorithm to different sets.