Where to write parallelized program output to? - python

I have a program that is using pool.map() to get the values using ten parallel workers. I'm having trouble wrapping my head around how I am suppose to stitch the values back together to make use of it at the end.
What I have is structured like this:
initial_input = get_initial_values()
pool.map(function, initial_input)
pool.close()
pool.join()
# now how would I get the output?
send_ftp_of_output(output_data)
Would I write the function to a log file? If so, if there are (as a hypothetical) a million processes trying to write to the same file, would things overwrite each other?

pool.map(function,input)
returns a list.
You can get the output by doing:
output_data = pool.map(function,input)
pool.map simply runs the map function in paralell, but it still only returns a single list. If you're not outputting anything in the function you are mapping (and you shouldn't), then it simply returns a list. This is the same as map() would do, except it is executed in paralell.

In regards to the log file, yes, having multiple threads right to the same place would interleave within the log file. You could have the thread log the file before the write, which would ensure that something wouldn't get interrupted mid-entry, but it would still interleave things chronologically amongst all the threads. Locking the log file each time also would significantly slow down logging due to the overhead involved.
You can also have, say, the thread number -- %(thread)d -- or some other identifying mark in the logging Formatter output that would help to differentiate, but it could still be hard to follow, especially for a bunch of threads.
Not sure if this would work in your specific application, as the specifics in your app may preclude it, however, I would strongly recommend considering GNU Parallel (http://www.gnu.org/software/parallel/) to do the parallelized work. (You can use, say, subprocess.check_output to call into it).
The benefit of this is several fold, chiefly that you can easily vary the number of parallel workers -- up to having parallel use one worker per core on the machine -- and it will pipeline the items accordingly. The other main benefit, and the one more specifically related to your question -- is that it will stitch the output of all of these parallel workers together as if they had been invoked serially.
If your program wouldn't work so well having, say, a single command line piped from a file within the app and parallelized, you could perhaps make your Python code single-worker and then as the commands piped to parallel, make it a number of permutations of your Python command line, varying the target each time, and then have it output the results.
I use GNU Parallel quite often in conjunction with Python, often to do things, like, say, 6 simultaneous Postgres queries using psql from a list of 50 items.

Using Tritlo's suggestion, here is what worked for me:
def run_updates(input_data):
# do something
return {data}
if __name__ == '__main__':
item = iTunes()
item.fetch_itunes_pulldowns_to_do()
initial_input_data = item.fetched_update_info
pool = Pool(NUM_IN_PARALLEL)
result = pool.map(run_updates, initial_input_data)
pool.close()
pool.join()
print result
And this gives me a list of results

Related

Extracting outputs from a multiprocessed python function

I am wondering how to extract outputs from a multiprocessed function in Python. I am new to multiprocessing and have limited understanding of how it all works (not for lack of trying though).
I need to run the optimization with 31 different inputs for InfForecast and InitialStorage (for now... could be up to 10,000 inputs and independent optimizations being performed). I was hoping I could speed things up using multiprocessing to process more than one of these independent optimizations at a time. What I want is for the outputs (5 values for each optimization) to be put into the array "Nextday" which should have dimensions of (5,31). It seems the output Nextday as I've got the code written is either empty or not accessible. How do I extract/access the values and place them into Nextday?
Note: The function main(...) is a highly complex optimization problem. I hope the problem is easy enough to understand without providing it. It works when I loop over it and call it for each i in range(31).
from multiprocessing.pool import ThreadPool as Pool
Nextday=np.zeros((5,31))
pool_size = 4 # Should I set this to the number of cores my machine has?
pool = Pool(pool_size)
def optimizer(InfForecast, InitialStorage):
O=main(InfForecast,InitialStorage)
return [O[0][0], O[0][1], O[0][2], O[0][3], O[0][4]]
for i in range(31):
pool.apply_async(optimizer, (InfForecast[i],InitialStorage[i]))
pool.close()
Nextday=pool.join()
In addition to this, I'm not sure whether this is the best way to do things. If it's working (which I'm not sure it is) it sure seems slow. I was reading that it may be better to do multiprocessing vs threading and this seems to be threading? Forgive me if I'm wrong.
I am also curious about how to select pool_size as you can see in my comment in the code. I may be running this on a cloud server eventually, so I expect the pool_size I would want to use there would be slightly different than the number I will be using on my own machine. Is it just the number of cores?
Any advice would be appreciated.
You should use
from multiprocessing.pool import Pool
if you want to do multiprocessing.
Pool size should start out as multiprocessing.cpu_count() if you have the machine to yourself, and adjusted manually for best effect. If your processes are cpu-bound, then leaving a core available will make your machine more responsive -- if your code is not cpu-bound you can have more processes than cores (tuning this is finicky though, you'll just have to try).
You shouldn't have any code at the top-most level in your file when doing multiprocessing (or any other time really). Put everything in functions and call the start function from:
if __name__ == "__main__":
my_start_function()
(digression: using capital oh as a variable name is really bad, and you get statements that are almost unreadable in certain fonts like O[0][0]).
In regular python, the map function is "defined" by this equality:
map(fn, lst) == [fn(item) for item in lst]
so the Pool methods (imap/imap_unordered/map/map_async) has similar semantics, and in your case you would call them like:
def my_start_function():
...
results = pool.map(optimizer, zip(InfForecast, InitialStorage))
Since the map-functions take a function and a list, I've used the zip function to creates a list where each item has one element from each of its arguments (it functions as like a zipper).

Is it possible to avoid locking overhead when sharing dicts between threads in Python?

I have a multi-threaded application in Python where threads are reading very large (so I cannot copy them to thread-local storage) dicts (read from disk and never modified). Then they process huge amounts of data using the dicts as read-only data:
# single threaded
d1,d2,d3 = read_dictionaries()
while line in stdin:
stdout.write(compute(line,d1,d2,d3)+line)
I am trying to speed this up by using threads, which would then each read its own input and write its own output, but since the dicts are huge, I want the threads to share the storage.
IIUC, every time a thread reads from the dict, it has to lock it, and that imposes a performance cost on the application. This data locking is not necessary because the dicts are read-only.
Does CPython actually lock the data individually or does it just use the GIL?
If, indeed, there is per-dict locking, is there a way to avoid it?
Multithreading processing in python is useless. It's better to use multiprocessing module. Because multithreading can give positive effort only in lower number of cases.
Python implementation detail: In CPython, due to the Global
Interpreter Lock, only one thread can execute Python code at once
(even though certain performance-oriented libraries might overcome
this limitation). If you want your application to make better use of
the computational resources of multi-core machines, you are advised to
use multiprocessing. However, threading is still an appropriate model
if you want to run multiple I/O-bound tasks simultaneously.
Official documentation.
Without any code examples from your side I can only recommend to split your big dictionary on several parts and process every part using Pool.map. And merge results in main process.
Unfortunately, it's impossible to share a lot of memory between different python processes effective (we are not talking about shared memory pattern based on mmap). But you can read different parts of your dictionary in different processes. Or just read entire dictionary in main process and give a small chunks to child processes.
Also, I should warn you that you should be very carefully with multiprocessing algorithms. Because every extra megabytes will be multiplied on number of process.
So, based on your pseudocode example I can assume two possible algorithm based on compute function:
# "Stateless"
for line in stdin:
res = compute_1(line) + compute_2(line) + compute_3(line)
print res, line
# "Shared" state
for line in stdin:
res = compute_1(line)
res = compute_2(line, res)
res = compute_3(line, res)
print res, line
In first case, you can create a several workers, read each dictionary in separate worker based on Process class (it's good idea to decrease memory usage for each process), and compute it like a production line.
In second case, you have a shared state. For each next worker you need a result of previous one. It's worst case for multithreading/multiprocessing programming. But you can write algorithm there several workers are using same Queue and pushing result to it without waiting finish of all cycle. And you just share a Queue instance between your processes.

parallel python, or MPI?

I have a code with heavy symbolic calculations (many multiple symbolic integrals). Also I have access to both an 8-core cpu computer (with 18 GB RAM) and a small 32 cpu cluster. I prefer to remain on my professor's 8-core pc rather than to go to another professor's lab using his cluster in a more limited time, however, I'm not sure it will work on the SMP system, so I am looking for a parallel tool in Python that can be used on both SMP and Clusters and of course prefer the codes on one system to be easily and with least effort modifiable for use on the other system.
So far, I have found Parallel Python (PP) promising for my need, but I have recently told that MPI also does the same (pyMPI or MPI4py). I couldn't approve this as seemingly very little is discussed about this on the web, only here it is stated that MPI (both pyMPI or MPI4py) is usable for clusters only, if I am right about that "only"!
Is "Parallel Python" my only choice, or I can also happily use MPI based solutions? Which one is more promising for my needs?
PS. It seems none of them have very comprehensive documentations so if you know some links to other than their official websites that can help a newbie in parallel computation I will be so grateful if you would also mention them in your answer :)
Edit.
My code has two loops one inside the other, the outer loop cannot be parallelized as it is an iteration method (a recursive solution) each step depending on the values calculated within its previous step. The outer loop contains the inner loop alongside 3 extra equations whose calculations depend on the whole results of the inner loop. However, the inner loop (which contains 9 out of 12 equations computable at each step) can be safely parallelized, all 3*3 equations are independent w.r.t each other, only depending on the previous step. All my equations are so computationally heavy as each contains many multiple symbolic integrals. Seemingly I can parallelize both the inner loop's 9 equations and the integration calculations in each of these 9 equation separately, and also parallelize all the integrations in other 3 equations alongside the inner loop. You can find my code here if it can help you better understand my need, it is written inside SageMath.
I would look in to multiprocessing (doc) which provides a bunch of nice tools for spawning and working with sub-processes.
To quote the documentation:
multiprocessing is a package that supports spawning processes using an
API similar to the threading module. The multiprocessing package
offers both local and remote concurrency, effectively side-stepping
the Global Interpreter Lock by using subprocesses instead of threads.
From the comments I think the Pool and it's map would serve your purposes (doc).
def work_done_in_inner_loop(arg):
# put your work code here
pass
p = Pool(9)
for o in outer_loop:
# what ever else you do
list_of_args = [...] # what your inner loop currently loops over
res = p.map(work_done_in_inner_loop,list_of_args])
# rest of code
It seems like there are a few reasonable ways to design this.
Let me refer to your jobs as the main job, the 9 intermediate jobs, and the many inner jobs the intermediate jobs can spin off. I'm assuming the intermediate jobs have a "merge" step after the inner jobs all finish, and the same for the outer job.
The simplest design is that the main job fires off the intermediate jobs and then waits for them all to finish before doings its merge step. Then intermediate jobs then fire off the inner jobs and wait for them all to finish before doing their merge steps.
This can work with a single shared queue, but you need a queue that doesn't block the worker pool while waiting, and I don't think multiprocessing's Pool and Queue can do that out of the box. As soon as you've got all of your processes waiting to join their children, nothing gets done.
One way around that is to change to a continuation-passing style. If you know which one of the intermediate jobs will finish last, you can pass it the handles to the other intermediate jobs and have it join on them and do the merge, instead of the outer job. And the intermediate similarly pass off the merge to their last inner job.
The problem is that you usually have no way of knowing what's going to finish last, even without scheduling issues. So that means you need some form of either sharing (e.g., a semaphore) or message passing between the jobs to negotiate that among themselves. You can do that on top of multiprocessing. The only problem is that it destroys the independence of your jobs, and you're suddenly dealing with all the annoying problems of shared concurrency.
A different alternative is to have separate pools and queues for each intermediate job, and some kind of load balancing between the pools that can ensure that each core is running one active process.
Or, of course, a single pool with a more complicated implementation than multiprocessing's, which does either load balancing or cooperative scheduling, so a joiner doesn't block a core.
Or a super-simple solution: Overschedule, and pay a little cost in context switching for simplicity. For example, you can run 32 workers even though you've only got 8 cores, so you've got 22 active workers and 10 waiting. Each core has 2 or 3 active workers, which will slow things down a bit, but maybe not too badly—and at least nobody's idle, and you didn't have to write any code beyond passing a different parameter to the multiprocessing.Pool constructor.
At any rate, multiprocessing is very simple, and it has almost no extra concepts that won't apply to other solutions. So it may take less time to play with it until you run into a brick wall or don't, than to try to figure out in advance whether it'll work for you.
I recently ran into a similar problem. However, the following solution is only valid if (1) you wish to run the python script individually on a group of files, AND (2) each invocation of the script is independent of the others.
If the above applies to you, the simplest solution is to write a wrapper in bash along the lines of:
for a_file in $list_of_files
do
python python_script.py a_file &
done
The '&' will run the preceding command as a sub-process. The advantage is that bash will not wait for the python script to finish before continuing with the for loop.
You may want to place a cap on the number of processes running simultaneously, since this code will use all available resources.

updating a shelve dictionary in python parallely

I have a program that takes a very huge input file and makes a dict out of it. Since there is no way this is going to fit in memory, I Decided to use shelve to write it to my disk. Now I need to take advantage of the multiple cores available in my system (8 of them) so that I can speed up my parsing. The most obvious way to do this I thought was to split my input file into 8 parts and run the code on all 8 parts concurrently. The problem is that I need only 1 dictionary in the end. Not 8 of them. So how do I use shelve to update one single dictionary parallely?
I gave a pretty detailed answer here on Processing single file from multiple processes in python
Don't try to figure out how you can have many processes write to a shelve at once. Think about how you can have a single process deliver results to the shelve.
The idea is that you have a single process producing the input to a queue. Then you have as many workers as you want receiving queued items and doing the work. When they are done, they place the result into a result queue for the sink to read. The benefit is that you do not have to manually split up your work ahead of time. Just produce the "input" and let whatever worker is read take it and work on it.
With this pattern, you can scale up or down the workers based on the system capabilities.
shelve doesn't support concurrent access. There are a few options for accomplishing what you want:
Make one shelf per process and then merge at the end.
Have worker processes send their results back to the master process over eg multiprocessing.Pipe; the master then stores them in the shelf.
I think you can get bsddb to work with concurrent access in a shelve-like API, but I've never had the need to do so.

How to have multiple python programs append rows to the same file?

I've got multiple python processes (typically 1 per core) transforming large volumes of data that they are each reading from dedicated sources, and writing to a single output file that each opened in append mode.
Is this a safe way for these programs to work?
Because of the tight performance requirements and large data volumes I don't think that I can have each process repeatedly open & close the file. Another option is to have each write to a dedicated output file and a single process concatenate them together once they're all done. But I'd prefer to avoid that.
Thanks in advance for any & all answers and suggestions.
Have you considered using the multiprocessing module to coordinate between the running programs in a thread-like manner? See in particular the queue interface; you can place each completed work item on a queue when completed, and have a single process reading off the queue and writing to your output file.
Alternately, you can have each subprocess maintain a separate pipe to a parent process which does a select() call from all of them, and copies data to the output file when appropriate. Of course, this can be done "by hand" (without the multiprocessing module) as well as with it.
Alternately, if the reason you're avoiding threads is to avoid the global interpreter lock, you might consider a non-CPython implementation (such as Jython or IronPython).
Your procedure is "safe" in that no crashes will result, but data coming (with very unlucky timing) from different processes could get mixed up -- e.g., process 1 is appending a long string of as, process 2 a long string of b, you could end up in the file with lots of as then the bs then more as (or other combinations / mixings).
Problem is, .write is not guaranteed to be atomic for sufficiently long string arguments. If you have a tight boundary on the arguments, less than your fs/os's blocksize, you might be lucky. Otherwise, try using the logging module, which does take more precautions (but perhaps those precautions might slow you down... you'll need to benchmark) exactly because it targets "log files" that are often being appended to by multiple programs.

Categories