Extracting outputs from a multiprocessed python function

Extracting outputs from a multiprocessed python function - python

I am wondering how to extract outputs from a multiprocessed function in Python. I am new to multiprocessing and have limited understanding of how it all works (not for lack of trying though).
I need to run the optimization with 31 different inputs for InfForecast and InitialStorage (for now... could be up to 10,000 inputs and independent optimizations being performed). I was hoping I could speed things up using multiprocessing to process more than one of these independent optimizations at a time. What I want is for the outputs (5 values for each optimization) to be put into the array "Nextday" which should have dimensions of (5,31). It seems the output Nextday as I've got the code written is either empty or not accessible. How do I extract/access the values and place them into Nextday?
Note: The function main(...) is a highly complex optimization problem. I hope the problem is easy enough to understand without providing it. It works when I loop over it and call it for each i in range(31).
from multiprocessing.pool import ThreadPool as Pool
Nextday=np.zeros((5,31))
pool_size = 4 # Should I set this to the number of cores my machine has?
pool = Pool(pool_size)
def optimizer(InfForecast, InitialStorage):
O=main(InfForecast,InitialStorage)
return [O[0][0], O[0][1], O[0][2], O[0][3], O[0][4]]
for i in range(31):
pool.apply_async(optimizer, (InfForecast[i],InitialStorage[i]))
pool.close()
Nextday=pool.join()
In addition to this, I'm not sure whether this is the best way to do things. If it's working (which I'm not sure it is) it sure seems slow. I was reading that it may be better to do multiprocessing vs threading and this seems to be threading? Forgive me if I'm wrong.
I am also curious about how to select pool_size as you can see in my comment in the code. I may be running this on a cloud server eventually, so I expect the pool_size I would want to use there would be slightly different than the number I will be using on my own machine. Is it just the number of cores?
Any advice would be appreciated.

You should use
from multiprocessing.pool import Pool
if you want to do multiprocessing.
Pool size should start out as multiprocessing.cpu_count() if you have the machine to yourself, and adjusted manually for best effect. If your processes are cpu-bound, then leaving a core available will make your machine more responsive -- if your code is not cpu-bound you can have more processes than cores (tuning this is finicky though, you'll just have to try).
You shouldn't have any code at the top-most level in your file when doing multiprocessing (or any other time really). Put everything in functions and call the start function from:
if __name__ == "__main__":
my_start_function()
(digression: using capital oh as a variable name is really bad, and you get statements that are almost unreadable in certain fonts like O[0][0]).
In regular python, the map function is "defined" by this equality:
map(fn, lst) == [fn(item) for item in lst]
so the Pool methods (imap/imap_unordered/map/map_async) has similar semantics, and in your case you would call them like:
def my_start_function():
...
results = pool.map(optimizer, zip(InfForecast, InitialStorage))
Since the map-functions take a function and a list, I've used the zip function to creates a list where each item has one element from each of its arguments (it functions as like a zipper).

Related

Python Multithreading a for loop with limited Threads

i am just learning Python and dont have much expierence with Multithreading. I am trying to send some json via the Requests session.post Method. This is called in the function at the bottem of the many for loops i need to run through the dictionary.
Is there a way to let this run in paralell?
I also have to limit my numbers of Threads, otherwise the post calls get blocked because they are to fast after each other. Help would be much appreciated.
def doWork(session, List, RefHashList):
for itemRefHash in RefHashList:
for equipment in res['Response']['data']['items']:
if equipment['itemHash'] == itemRefHash:
if equipment['characterIndex'] != 0:
SendJsonViaSession(session, getCharacterIdFromIndex(res, equipment['characterIndex']), itemRefHash, equipment['quantity'])

First, structuring your code differently might improve the speed without the added complexity of threading.
def doWork(session, res, RefHashList):
for equipment in res['Response']['data']['items']:
i = equipment['itemHash']
k = equipment['characterIndex']
if i in RefHashList and k != 0:
SendJsonViaSession(session, getCharacterIdFromIndex(res, k), i, equipment['quantity'])
To start with, we will look up equipment['itemHash'] and equipment['characterIndex'] only once.
Instead of explicitly looping over RefHashList, you could use the in operator. This moves the loop into the Python virtual machine, which is faster.
And instead of a nested if-conditional, you could use a single conditional using and.
Note: I have removed the unused parameter List, and replaced it with res. It is generally good practice to write functions that only act on parameters that they are given, not global variables.
Second, how much extra performance do you need? How much time is there on average between the SendJsonViaSession calls, and how small can this this time become before calls get blocked? If the difference between those numbers is small, it is probably not worth to implement a threaded sender.
Third, a design feature of the standard Python implementation is that only one thread at a time can be executing Python bytecode. So it is not certain that threading will improve performance.
Edit:
There are several ways to run stuff in parallel in Python. There is multiprocessing.Pool which uses processes, and multiprocessing.dummy.ThreadPool which uses threads. And from Python 3.2 onwards there is concurrent.futures, which can use processes or threads.
The thing is, neither of them has rate limiting. So you could get blocked for making too many calls.
Every time you call SendJsonViaSession you'd have to save the current time somehow so that all processes or threads can use it. And before every call, you would have to read that time and wait if it is too close to the last call.
Edit2:
If a call to SendJsonViaSession only takes 0.3 seconds, you should be able to do 3 calls/second sequentially. But your code only does 1 call/second. This implies that the speed restriction is somewhere else. You'd have to profile your code to see where the problem lies.

Why does threading sorting functions work slower than sequentially running them in python? [duplicate]

I have decided to learn how multi-threading is done in Python, and I did a comparison to see what kind of performance gain I would get on a dual-core CPU. I found that my simple multi-threaded code actually runs slower than the sequential equivalent, and I cant figure out why.
The test I contrived was to generate a large list of random numbers and then print the maximum
from random import random
import threading
def ox():
print max([random() for x in xrange(20000000)])
ox() takes about 6 seconds to complete on my Intel Core 2 Duo, while ox();ox() takes about 12 seconds.
I then tried calling ox() from two threads to see how fast that would complete.
def go():
r = threading.Thread(target=ox)
r.start()
ox()
go() takes about 18 seconds to complete, with the two results printing within 1 second of eachother. Why should this be slower?
I suspect ox() is being parallelized automatically, because I if look at the Windows task manager performance tab, and call ox() in my python console, both processors jump to about 75% utilization until it completes. Does Python automatically parallelize things like max() when it can?

Python has the GIL. Python bytecode will only be executed by a single processor at a time. Only certain C modules (which don't manage Python state) will be able to run concurrently.
The Python GIL has a huge overhead in locking the state between threads. There are fixes for this in newer versions or in development branches - which at the very least should make multi-threaded CPU bound code as fast as single threaded code.
You need to use a multi-process framework to parallelize with Python. Luckily, the multiprocessing module which ships with Python makes that fairly easy.
Very few languages can auto-parallelize expressions. If that is the functionality you want, I suggest Haskell (Data Parallel Haskell)

The problem is in function random()
If you remove random from you code.
Both cores try to access to shared state of the random function.
Cores work consequentially and spent a lot of time on caches synchronization.
Such behavior is known as false sharing.
Read this article False Sharing

As Yann correctly pointed out, the Python GIL prevents parallelization from happening in this example. You can either use the python multiprocessing module to fix that or if you are willing to use other open source libraries, Ray is also a great option to get around the GIL problem and is easier to use and has more features than the Python multiprocessing library.
This is how you can parallelize your code example with Ray:
from random import random
import ray
ray.init()
#ray.remote
def ox():
print(max([random() for x in range(20000000)]))
%time x = ox.remote(); y = ox.remote(); ray.get([x, y])
On my machine, the single threaded ox() code you posted takes 1.84s and the two invocations with ray take 1.87s combined, so we get almost perfect parallelization here.
Ray also makes it very efficient to share data between tasks, on a single machine it will use shared memory under the hood, see https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html.
You can also run the same program across different machines on your cluster or the cloud without having to modify the program, see the documentation (https://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html and https://ray.readthedocs.io/en/latest/autoscaling.html).
Disclaimer: I'm one of the Ray developers.

Python multiprocessing.Pool() doesn't use 100% of each CPU

I am working on multiprocessing in Python.
For example, consider the example given in the Python multiprocessing documentation (I have changed 100 to 1000000 in the example, just to consume more time). When I run this, I do see that Pool() is using all the 4 processes but I don't see each CPU moving upto 100%. How to achieve the usage of each CPU by 100%?
from multiprocessing import Pool
def f(x):
return x*x
if __name__ == '__main__':
pool = Pool(processes=4)
result = pool.map(f, range(10000000))

It is because multiprocessing requires interprocess communication between the main process and the worker processes behind the scene, and the communication overhead took more (wall-clock) time than the "actual" computation (x * x) in your case.
Try "heavier" computation kernel instead, like
def f(x):
return reduce(lambda a, b: math.log(a+b), xrange(10**5), x)
Update (clarification)
I pointed out that the low CPU usage observed by the OP was due to the IPC overhead inherent in multiprocessing but the OP didn't need to worry about it too much because the original computation kernel was way too "light" to be used as a benchmark. In other words, multiprocessing works the worst with such a way too "light" kernel. If the OP implements a real-world logic (which, I'm sure, will be somewhat "heavier" than x * x) on top of multiprocessing, the OP will achieve a decent efficiency, I assure. My argument is backed up by an experiment with the "heavy" kernel I presented.
#FilipMalczak, I hope my clarification makes sense to you.
By the way there are some ways to improve the efficiency of x * x while using multiprocessing. For example, we can combine 1,000 jobs into one before we submit it to Pool unless we are required to solve each job in real time (ie. if you implement a REST API server, we shouldn't do in this way).

You're asking wrong kind of question. multiprocessing.Process represents process as understood in your operating system. multiprocessing.Pool is just a simple way to run several processes to do your work. Python environment has nothing to do with balancing load on cores/processors.
If you want to control how will processor time be given to processes, you should try tweaking your OS, not python interpreter.
Of course, "heavier" computations will be recognised by system, and may look like they do just what you want to do, but in fact, you have almost no control on process handling.
"Heavier" functions will just look heavier to your OS, and his usual reaction will be assigning more processor time to your processes, but that doesn't mean you did what you wanted to - and that's good, because that the whole point of languages with VM - you specify logic, and VM takes care of mapping this logic onto operating system.

Is CCKeyDerivationPBKDF thread safe?

I'm using CCKeyDerivationPBKDF to generate and verify password hashes in a concurrent environment and I'd like to know whether it it thread safe. The documentation of the function doesn't mention thread safety at all, so I'm currently using a lock to be on the safe side but I'd prefer not to use a lock if I don't have to.

After going through the source code of the CCKeyDerivationPBKDF() I find it to be "thread unsafe". While the code for CCKeyDerivationPBKDF() uses many library functions which are thread-safe(eg: bzero), most user-defined function(eg:PRF) and the underlying functions being called from those user-defined functions, are potentially thread-unsafe. (For eg. due to use of several pointers and unsafe casting of memory eg. in CCHMac). I would suggest unless they make all the underlying functions thread-safe or have some mechanism to alteast make it conditionally thread-safe, stick with your approach, or modify the commoncrypto code to make it thread-safe and use that code.
Hope it helps.

Lacking documentation or source code, one option is to build a test app with say 10 threads looping on calls to CCKeyDerivationPBKDF with a random selection from say 10 different sets of arguments with 10 known results.
Each thread checks the result of a call to make sure it is what is expected. Each thread should also have a usleep() call for some random amount of time (bell curve sitting on say 10% of the time each call to CCKeyDerivationPBKDF takes) in this loop in order to attempt to interleave operations as much as possible.
You'll probably want to instrument it with debugging that keeps track of how much concurrency you are able to generate. With a 10% sleep time and 10 threads, you should be able to keep 9 threads concurrent.
If it makes it through an aggregate of say 100,000,000 calls without an error, I'd assume it was thread safe. Of course you could run it for much longer than that to get greater assurances.

parallel python, or MPI?

I have a code with heavy symbolic calculations (many multiple symbolic integrals). Also I have access to both an 8-core cpu computer (with 18 GB RAM) and a small 32 cpu cluster. I prefer to remain on my professor's 8-core pc rather than to go to another professor's lab using his cluster in a more limited time, however, I'm not sure it will work on the SMP system, so I am looking for a parallel tool in Python that can be used on both SMP and Clusters and of course prefer the codes on one system to be easily and with least effort modifiable for use on the other system.
So far, I have found Parallel Python (PP) promising for my need, but I have recently told that MPI also does the same (pyMPI or MPI4py). I couldn't approve this as seemingly very little is discussed about this on the web, only here it is stated that MPI (both pyMPI or MPI4py) is usable for clusters only, if I am right about that "only"!
Is "Parallel Python" my only choice, or I can also happily use MPI based solutions? Which one is more promising for my needs?
PS. It seems none of them have very comprehensive documentations so if you know some links to other than their official websites that can help a newbie in parallel computation I will be so grateful if you would also mention them in your answer :)
Edit.
My code has two loops one inside the other, the outer loop cannot be parallelized as it is an iteration method (a recursive solution) each step depending on the values calculated within its previous step. The outer loop contains the inner loop alongside 3 extra equations whose calculations depend on the whole results of the inner loop. However, the inner loop (which contains 9 out of 12 equations computable at each step) can be safely parallelized, all 3*3 equations are independent w.r.t each other, only depending on the previous step. All my equations are so computationally heavy as each contains many multiple symbolic integrals. Seemingly I can parallelize both the inner loop's 9 equations and the integration calculations in each of these 9 equation separately, and also parallelize all the integrations in other 3 equations alongside the inner loop. You can find my code here if it can help you better understand my need, it is written inside SageMath.

I would look in to multiprocessing (doc) which provides a bunch of nice tools for spawning and working with sub-processes.
To quote the documentation:
multiprocessing is a package that supports spawning processes using an
API similar to the threading module. The multiprocessing package
offers both local and remote concurrency, effectively side-stepping
the Global Interpreter Lock by using subprocesses instead of threads.
From the comments I think the Pool and it's map would serve your purposes (doc).
def work_done_in_inner_loop(arg):
# put your work code here
pass
p = Pool(9)
for o in outer_loop:
# what ever else you do
list_of_args = [...] # what your inner loop currently loops over
res = p.map(work_done_in_inner_loop,list_of_args])
# rest of code

It seems like there are a few reasonable ways to design this.
Let me refer to your jobs as the main job, the 9 intermediate jobs, and the many inner jobs the intermediate jobs can spin off. I'm assuming the intermediate jobs have a "merge" step after the inner jobs all finish, and the same for the outer job.
The simplest design is that the main job fires off the intermediate jobs and then waits for them all to finish before doings its merge step. Then intermediate jobs then fire off the inner jobs and wait for them all to finish before doing their merge steps.
This can work with a single shared queue, but you need a queue that doesn't block the worker pool while waiting, and I don't think multiprocessing's Pool and Queue can do that out of the box. As soon as you've got all of your processes waiting to join their children, nothing gets done.
One way around that is to change to a continuation-passing style. If you know which one of the intermediate jobs will finish last, you can pass it the handles to the other intermediate jobs and have it join on them and do the merge, instead of the outer job. And the intermediate similarly pass off the merge to their last inner job.
The problem is that you usually have no way of knowing what's going to finish last, even without scheduling issues. So that means you need some form of either sharing (e.g., a semaphore) or message passing between the jobs to negotiate that among themselves. You can do that on top of multiprocessing. The only problem is that it destroys the independence of your jobs, and you're suddenly dealing with all the annoying problems of shared concurrency.
A different alternative is to have separate pools and queues for each intermediate job, and some kind of load balancing between the pools that can ensure that each core is running one active process.
Or, of course, a single pool with a more complicated implementation than multiprocessing's, which does either load balancing or cooperative scheduling, so a joiner doesn't block a core.
Or a super-simple solution: Overschedule, and pay a little cost in context switching for simplicity. For example, you can run 32 workers even though you've only got 8 cores, so you've got 22 active workers and 10 waiting. Each core has 2 or 3 active workers, which will slow things down a bit, but maybe not too badly—and at least nobody's idle, and you didn't have to write any code beyond passing a different parameter to the multiprocessing.Pool constructor.
At any rate, multiprocessing is very simple, and it has almost no extra concepts that won't apply to other solutions. So it may take less time to play with it until you run into a brick wall or don't, than to try to figure out in advance whether it'll work for you.

I recently ran into a similar problem. However, the following solution is only valid if (1) you wish to run the python script individually on a group of files, AND (2) each invocation of the script is independent of the others.
If the above applies to you, the simplest solution is to write a wrapper in bash along the lines of:
for a_file in $list_of_files
do
python python_script.py a_file &
done
The '&' will run the preceding command as a sub-process. The advantage is that bash will not wait for the python script to finish before continuing with the for loop.
You may want to place a cap on the number of processes running simultaneously, since this code will use all available resources.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.