I am just learning Python and don't have much experience with multithreading. I am trying to send some JSON via the Requests session.post method. This is called in the function at the bottom of the many for loops I need to run through the dictionary.
Is there a way to let this run in parallel?
I also have to limit my number of threads, otherwise the post calls get blocked because they come too fast after each other. Help would be much appreciated.
def doWork(session, List, RefHashList):
    for itemRefHash in RefHashList:
        for equipment in res['Response']['data']['items']:
            if equipment['itemHash'] == itemRefHash:
                if equipment['characterIndex'] != 0:
                    SendJsonViaSession(session, getCharacterIdFromIndex(res, equipment['characterIndex']), itemRefHash, equipment['quantity'])
First, structuring your code differently might improve the speed without the added complexity of threading.
def doWork(session, res, RefHashList):
    for equipment in res['Response']['data']['items']:
        i = equipment['itemHash']
        k = equipment['characterIndex']
        if i in RefHashList and k != 0:
            SendJsonViaSession(session, getCharacterIdFromIndex(res, k), i, equipment['quantity'])
To start with, we will look up equipment['itemHash'] and equipment['characterIndex'] only once.
Instead of explicitly looping over RefHashList, you could use the in operator. This moves the loop into the Python virtual machine, which is faster.
And instead of a nested if-conditional, you could use a single conditional using and.
Note: I have removed the unused parameter List, and replaced it with res. It is generally good practice to write functions that only act on parameters that they are given, not global variables.
Second, how much extra performance do you need? How much time is there on average between the SendJsonViaSession calls, and how small can this interval become before calls get blocked? If the difference between those numbers is small, it is probably not worth implementing a threaded sender.
Third, a design feature of the standard Python implementation is that only one thread at a time can be executing Python bytecode. So it is not certain that threading will improve performance.
Edit:
There are several ways to run stuff in parallel in Python. There is multiprocessing.Pool which uses processes, and multiprocessing.dummy.ThreadPool which uses threads. And from Python 3.2 onwards there is concurrent.futures, which can use processes or threads.
The thing is, none of them does rate limiting, so you could still get blocked for making too many calls.
Every time you call SendJsonViaSession you'd have to save the current time somehow so that all processes or threads can use it. And before every call, you would have to read that time and wait if it is too close to the last call.
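For illustration only, here is a minimal sketch of that idea (not a drop-in implementation): a lock plus a shared timestamp enforce a minimum gap between posts, no matter which worker thread gets there first. MIN_INTERVAL and the pool size of 4 are placeholder values to tune; session, res, RefHashList, SendJsonViaSession and getCharacterIdFromIndex are the names from your code.
import time
import threading
from multiprocessing.dummy import Pool as ThreadPool

rate_lock = threading.Lock()
last_call = [0.0]        # wrapped in a list so the worker threads can update it
MIN_INTERVAL = 0.5       # minimum seconds between posts; tune to what the server tolerates

def rate_limited_send(args):
    session, character_id, item_ref_hash, quantity = args
    with rate_lock:
        # wait until enough time has passed since the previous post
        # (time.monotonic needs Python 3.3+; use time.time() on older versions)
        wait = MIN_INTERVAL - (time.monotonic() - last_call[0])
        if wait > 0:
            time.sleep(wait)
        last_call[0] = time.monotonic()
    return SendJsonViaSession(session, character_id, item_ref_hash, quantity)

# build the argument tuples first, then hand them to a small thread pool
jobs = [(session, getCharacterIdFromIndex(res, e['characterIndex']), e['itemHash'], e['quantity'])
        for e in res['Response']['data']['items']
        if e['itemHash'] in RefHashList and e['characterIndex'] != 0]

pool = ThreadPool(4)     # keep the pool small; the rate limit is the real throttle
pool.map(rate_limited_send, jobs)
pool.close()
pool.join()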
Edit2:
If a call to SendJsonViaSession only takes 0.3 seconds, you should be able to do 3 calls/second sequentially. But your code only does 1 call/second. This implies that the speed restriction is somewhere else. You'd have to profile your code to see where the problem lies.
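For what it's worth, a quick way to do that with the standard-library profiler (assuming doWork is the entry point you want to measure):
import cProfile
import pstats

# profile a single run of doWork and show the 10 most expensive calls by cumulative time
cProfile.run('doWork(session, res, RefHashList)', 'dowork.prof')
pstats.Stats('dowork.prof').sort_stats('cumulative').print_stats(10)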
I am wondering how to extract outputs from a multiprocessed function in Python. I am new to multiprocessing and have limited understanding of how it all works (not for lack of trying though).
I need to run the optimization with 31 different inputs for InfForecast and InitialStorage (for now... it could be up to 10,000 inputs and independent optimizations being performed). I was hoping I could speed things up by using multiprocessing to run more than one of these independent optimizations at a time. What I want is for the outputs (5 values for each optimization) to be put into the array Nextday, which should have dimensions of (5, 31). As I've written the code, the output Nextday seems to be either empty or not accessible. How do I extract/access the values and place them into Nextday?
Note: The function main(...) is a highly complex optimization problem. I hope the problem is easy enough to understand without providing it. It works when I loop over it and call it for each i in range(31).
from multiprocessing.pool import ThreadPool as Pool
import numpy as np

Nextday = np.zeros((5, 31))

pool_size = 4  # Should I set this to the number of cores my machine has?
pool = Pool(pool_size)

def optimizer(InfForecast, InitialStorage):
    O = main(InfForecast, InitialStorage)
    return [O[0][0], O[0][1], O[0][2], O[0][3], O[0][4]]

for i in range(31):
    pool.apply_async(optimizer, (InfForecast[i], InitialStorage[i]))

pool.close()
Nextday = pool.join()
In addition to this, I'm not sure whether this is the best way to do things. If it's working (which I'm not sure it is) it sure seems slow. I was reading that it may be better to do multiprocessing vs threading and this seems to be threading? Forgive me if I'm wrong.
I am also curious about how to select pool_size as you can see in my comment in the code. I may be running this on a cloud server eventually, so I expect the pool_size I would want to use there would be slightly different than the number I will be using on my own machine. Is it just the number of cores?
Any advice would be appreciated.
You should use
from multiprocessing.pool import Pool
if you want to do multiprocessing.
Pool size should start out as multiprocessing.cpu_count() if you have the machine to yourself, and adjusted manually for best effect. If your processes are cpu-bound, then leaving a core available will make your machine more responsive -- if your code is not cpu-bound you can have more processes than cores (tuning this is finicky though, you'll just have to try).
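For example, a minimal sketch of that starting point (the exact number is something you tune by measuring):
import multiprocessing
from multiprocessing.pool import Pool

# one worker per core is a sane default; subtract one if the workers are
# CPU-bound and you still want a responsive machine
pool_size = multiprocessing.cpu_count()
pool = Pool(pool_size)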
You shouldn't have any code at the top-most level in your file when doing multiprocessing (or any other time really). Put everything in functions and call the start function from:
if __name__ == "__main__":
    my_start_function()
(digression: using capital oh as a variable name is really bad; in certain fonts you get statements like O[0][0] that are almost unreadable).
In regular python, the map function is "defined" by this equality:
map(fn, lst) == [fn(item) for item in lst]
so the Pool methods (imap/imap_unordered/map/map_async) have similar semantics, and in your case you would call them like:
def my_start_function():
    ...
    results = pool.map(optimizer, zip(InfForecast, InitialStorage))
Since the map-functions take a function and a list, I've used the zip function to create a list where each item has one element from each of its arguments (it works like a zipper).
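Putting it together, a sketch of the whole thing could look like the code below. Note that pool.map hands optimizer a single zipped (forecast, storage) tuple at a time, so optimizer has to unpack it (alternatively, pool.starmap keeps the two-argument signature). main, InfForecast and InitialStorage are the names from your question, and the pool size of 4 is only a placeholder.
from multiprocessing.pool import Pool
import numpy as np

def optimizer(args):
    # pool.map passes one item at a time, so unpack the (forecast, storage) tuple here
    inf_forecast, initial_storage = args
    out = main(inf_forecast, initial_storage)   # main() is your optimization routine
    return [out[0][0], out[0][1], out[0][2], out[0][3], out[0][4]]

def my_start_function():
    pool = Pool(4)                              # start from multiprocessing.cpu_count() and tune
    try:
        results = pool.map(optimizer, zip(InfForecast, InitialStorage))
    finally:
        pool.close()
        pool.join()
    # results is a list of 31 five-element lists; transpose it into the (5, 31) array
    return np.array(results).T

if __name__ == "__main__":
    Nextday = my_start_function()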
I am making a webcrawler, and I have some "sleep" functions that make the crawl quite long.
For now I am doing:
for speciality in range(1, 25):
    for year in range(1997, 2017):
        for quarter in [1, 2]:
            deal_with(driver, year, quarter, speciality, ok)
The deal_with function opens several web pages, waiting a few seconds for the complete HTML to download before moving on. The execution time is therefore very long: there are 24 * 20 * 2 = 960 loops, each taking no less than a minute.
I would like to use my 4 physical cores (8 threads) to take advantage of parallelism.
I read about tornado, multiprocessing, joblib... and can't really make up my mind about an easy solution to adapt to my code.
Any insight welcome :-)
tl;dr Investing in any choice without fully understanding the bottlenecks you are facing will not help you.
At the end of the day, there are only two fundamental approaches to scaling out a task like this:
Multiprocessing
You launch a number of Python processes, and distribute tasks to each of them. This is the approach you think will help you right now.
Some sample code for how this works, though you could use any appropriate wrapper:
import multiprocessing

# general rule of thumb: launch twice as many processes as cores
process_pool = multiprocessing.Pool(8)  # launches 8 processes

# generate a list of all inputs you wish to feed to this pool
inputs = []
for speciality in range(1, 25):
    for year in range(1997, 2017):
        for quarter in [1, 2]:
            inputs.append((driver, year, quarter, speciality, ok))

# feed your list of inputs to your process_pool and print the results when done
# (starmap unpacks each input tuple into deal_with's arguments; Python 3.3+)
print(process_pool.starmap(deal_with, inputs))
If this is all you wanted, you can stop reading now.
Asynchronous Execution
Here, you are content with a single thread or process, but you don't want it to be sitting idle waiting for stuff like network reads or disk seeks to come back - you want it to go on and do other, more important things while it's waiting.
True native asynchronous I/O support is provided in Python 3 and does not exist in Python 2.7 outside of the Twisted networking library.
import concurrent.futures

# generate a list of all inputs you wish to feed to this pool
inputs = []
for speciality in range(1, 25):
    for year in range(1997, 2017):
        for quarter in [1, 2]:
            inputs.append((driver, year, quarter, speciality, ok))

# produce a pool of processes, and make sure they don't block each other
# - get back an object representing something yet to be resolved, that will
# only be updated when data comes in.
with concurrent.futures.ProcessPoolExecutor() as executor:
    outputs = [executor.submit(deal_with, *input_tuple) for input_tuple in inputs]

    # wait for all of them to finish - not ideal, since it defeats the purpose
    # in production, but sufficient for an example
    for future_object in concurrent.futures.as_completed(outputs):
        # do something with future_object.result()
        print(future_object.result())
So What's the Difference?
My main point here is to emphasise that choosing from a list of technologies isn't as hard as figuring out where the real bottleneck is.
In the examples above, there isn't any difference. Both follow a simple pattern:
Have a lot of workers
Allow these workers to pick something from a queue of tasks right away
When one is free, set them to work on the next one right away.
Thus, there is no conceptual difference at all if you follow these examples verbatim, even though they use entirely different technologies and claim to use entirely different techniques.
Any technology you pick will be for naught if you write it in this pattern - even though you'll get some speedup, you will be sorely disappointed if you expected a massive performance boost.
Why is this pattern bad? Because it doesn't solve your problem.
Your problem is simple: you have to wait. While your process is waiting for something to come back, it can't do anything else! It can't fetch more pages for you. It can't process an incoming task. All it can do is wait.
Having more processes that ultimately wait is not the true solution. An army of troops that has to march to Waterloo will not be faster if you split it into regiments - each regiment eventually has to sleep, though they may sleep at different times and for different lengths, and the result is that all of them will arrive at roughly the same time.
What you need is an army that never sleeps.
So What Should You Do?
Abstract all I/O bound tasks into something non-blocking. This is your true bottleneck. If you're waiting for a network response, don't let the poor process just sit there - give it something to do.
Your task is made somewhat difficult in that, by default, reading from a socket is blocking. That's the way operating systems are. Thankfully, you don't need Python 3 to solve it (though that is always the preferred solution) - the asyncore library (though Twisted is superior to it in every way) already exists in Python 2.7 and lets you do network reads and writes truly in the background.
There is one and only one case where true multiprocessing needs to be used in Python, and that's if you are doing CPU-bound or CPU-intensive work. From your description, it doesn't sound like that's the case.
In short, you should edit your deal_with function to avoid the blocking wait. Make that wait happen in the background, if needed, using a suitable abstraction from Twisted or asyncore. But don't let it consume your process completely.
If you're using Python 3, I would check out the asyncio module. I believe you can just decorate deal_with with @asyncio.coroutine. You will likely have to adjust what deal_with does so it works properly with the event loop as well.
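As a rough illustration (assuming Python 3.5+, where the async/await syntax replaces the @asyncio.coroutine decorator mentioned above), a hypothetical async_deal_with rewrite could be driven like this; the crucial change inside it is replacing any time.sleep with await asyncio.sleep so other crawls can run during the wait:
import asyncio

async def async_deal_with(driver, year, quarter, speciality, ok):
    # hypothetical rewrite of deal_with: do the page fetch, then
    # "await asyncio.sleep(...)" instead of time.sleep(...)
    ...

async def crawl_everything():
    tasks = [async_deal_with(driver, year, quarter, speciality, ok)
             for speciality in range(1, 25)
             for year in range(1997, 2017)
             for quarter in (1, 2)]
    # run them concurrently; a semaphore inside async_deal_with could cap how
    # many pages are open at once if the site objects
    await asyncio.gather(*tasks)

asyncio.run(crawl_everything())   # Python 3.7+; on 3.5/3.6 use an event loop's run_until_complete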
I'm using CCKeyDerivationPBKDF to generate and verify password hashes in a concurrent environment and I'd like to know whether it is thread-safe. The documentation of the function doesn't mention thread safety at all, so I'm currently using a lock to be on the safe side, but I'd prefer not to use a lock if I don't have to.
After going through the source code of CCKeyDerivationPBKDF(), I find it to be thread-unsafe. While the code for CCKeyDerivationPBKDF() uses many library functions that are thread-safe (e.g. bzero), most of the user-defined functions (e.g. PRF), and the underlying functions called from those user-defined functions, are potentially thread-unsafe (for example, due to the use of several pointers and unsafe casting of memory, e.g. in CCHMac). I would suggest that unless they make all the underlying functions thread-safe, or at least have some mechanism to make it conditionally thread-safe, you stick with your approach, or modify the CommonCrypto code to make it thread-safe and use that.
Hope it helps.
Lacking documentation or source code, one option is to build a test app with say 10 threads looping on calls to CCKeyDerivationPBKDF with a random selection from say 10 different sets of arguments with 10 known results.
Each thread checks the result of a call to make sure it is what is expected. Each thread should also have a usleep() call for some random amount of time (bell curve sitting on say 10% of the time each call to CCKeyDerivationPBKDF takes) in this loop in order to attempt to interleave operations as much as possible.
You'll probably want to instrument it with debugging that keeps track of how much concurrency you are able to generate. With a 10% sleep time and 10 threads, you should be able to keep 9 threads concurrent.
If it makes it through an aggregate of say 100,000,000 calls without an error, I'd assume it was thread safe. Of course you could run it for much longer than that to get greater assurances.
I have code with heavy symbolic calculations (many multiple symbolic integrals). I also have access both to an 8-core CPU computer (with 18 GB RAM) and to a small 32-CPU cluster. I would prefer to stay on my professor's 8-core PC rather than go to another professor's lab and use his cluster in a more limited time; however, I'm not sure the code will work on the SMP system, so I am looking for a parallel tool in Python that can be used on both SMP machines and clusters, and I would of course prefer that code written for one system be easily modifiable, with the least effort, for the other.
So far, I have found Parallel Python (PP) promising for my needs, but I have recently been told that MPI (pyMPI or MPI4py) does the same. I couldn't confirm this, as very little seems to be discussed about it on the web; only here it is stated that MPI (both pyMPI and MPI4py) is usable for clusters only, if I am right about that "only"!
Is Parallel Python my only choice, or can I also happily use MPI-based solutions? Which one is more promising for my needs?
PS. It seems none of them has very comprehensive documentation, so if you know of links other than their official websites that could help a newbie in parallel computation, I would be grateful if you also mentioned them in your answer :)
Edit.
My code has two loops, one inside the other. The outer loop cannot be parallelized, since it is an iterative (recursive) method, each step depending on the values calculated in the previous step. The outer loop contains the inner loop alongside 3 extra equations whose calculations depend on the whole results of the inner loop. However, the inner loop (which contains 9 of the 12 equations computable at each step) can be safely parallelized: all 3*3 equations are independent of each other, depending only on the previous step. All my equations are computationally heavy, as each contains many multiple symbolic integrals. Seemingly I can parallelize both the inner loop's 9 equations and the integration calculations inside each of those 9 equations separately, and also parallelize all the integrations in the other 3 equations alongside the inner loop. You can find my code here if it can help you better understand my need; it is written in SageMath.
I would look into multiprocessing (doc), which provides a bunch of nice tools for spawning and working with sub-processes.
To quote the documentation:
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads.
From the comments I think Pool and its map would serve your purposes (doc).
from multiprocessing import Pool

def work_done_in_inner_loop(arg):
    # put your work code here
    pass

p = Pool(9)
for o in outer_loop:
    # whatever else you do
    list_of_args = [...]  # what your inner loop currently loops over
    res = p.map(work_done_in_inner_loop, list_of_args)
    # rest of code
It seems like there are a few reasonable ways to design this.
Let me refer to your jobs as the main job, the 9 intermediate jobs, and the many inner jobs the intermediate jobs can spin off. I'm assuming the intermediate jobs have a "merge" step after the inner jobs all finish, and the same for the outer job.
The simplest design is that the main job fires off the intermediate jobs and then waits for them all to finish before doing its merge step. The intermediate jobs then fire off the inner jobs and wait for them all to finish before doing their merge steps.
This can work with a single shared queue, but you need a queue that doesn't block the worker pool while waiting, and I don't think multiprocessing's Pool and Queue can do that out of the box. As soon as you've got all of your processes waiting to join their children, nothing gets done.
One way around that is to change to a continuation-passing style. If you know which one of the intermediate jobs will finish last, you can pass it the handles to the other intermediate jobs and have it join on them and do the merge, instead of the outer job. And the intermediate similarly pass off the merge to their last inner job.
The problem is that you usually have no way of knowing what's going to finish last, even without scheduling issues. So that means you need some form of either sharing (e.g., a semaphore) or message passing between the jobs to negotiate that among themselves. You can do that on top of multiprocessing. The only problem is that it destroys the independence of your jobs, and you're suddenly dealing with all the annoying problems of shared concurrency.
A different alternative is to have separate pools and queues for each intermediate job, and some kind of load balancing between the pools that can ensure that each core is running one active process.
Or, of course, a single pool with a more complicated implementation than multiprocessing's, which does either load balancing or cooperative scheduling, so a joiner doesn't block a core.
Or a super-simple solution: Overschedule, and pay a little cost in context switching for simplicity. For example, you can run 32 workers even though you've only got 8 cores, so you've got 22 active workers and 10 waiting. Each core has 2 or 3 active workers, which will slow things down a bit, but maybe not too badly—and at least nobody's idle, and you didn't have to write any code beyond passing a different parameter to the multiprocessing.Pool constructor.
At any rate, multiprocessing is very simple, and it has almost no extra concepts that won't apply to other solutions. So it may take less time to play with it until you run into a brick wall or don't, than to try to figure out in advance whether it'll work for you.
I recently ran into a similar problem. However, the following solution is only valid if (1) you wish to run the python script individually on a group of files, AND (2) each invocation of the script is independent of the others.
If the above applies to you, the simplest solution is to write a wrapper in bash along the lines of:
for a_file in $list_of_files
do
    python python_script.py "$a_file" &
done
The '&' runs the preceding command in the background as a separate sub-process. The advantage is that bash will not wait for the python script to finish before continuing with the for loop.
You may want to place a cap on the number of processes running simultaneously, since this code will use all available resources.
I need to dynamically load code (comes as source), run it and get the results. The code that I load always includes a run method, which returns the needed results. Everything looks ridiculously easy, as usual in Python, since I can do
exec(source) #source includes run() definition
result = run(params)
#do stuff with result
The only problem is, the run() method in the dynamically generated code can potentially not terminate, so I need to run it for at most x seconds. I could spawn a new thread for this and specify a timeout for the .join() method, but then I cannot easily get the result out of it (or can I?). Performance is also an issue to consider, since all of this is happening inside a long while loop.
Any suggestions on how to proceed?
Edit: to clear things up per dcrosta's request: the loaded code is not untrusted, but generated automatically on the machine. The purpose for this is genetic programming.
The only "really good" solutions -- imposing essentially no overhead -- are going to be based on SIGALRM, either directly or through a nice abstraction layer; but as already remarked Windows does not support this. Threads are no use, not because it's hard to get results out (that would be trivial, with a Queue!), but because forcibly terminating a runaway thread in a nice cross-platform way is unfeasible.
This leaves high-overhead multiprocessing as the only viable cross-platform solution. You'll want a process pool to reduce process-spawning overhead (since presumably the need to kill a runaway function is only occasional, most of the time you'll be able to reuse an existing process by sending it new functions to execute). Again, Queue (the multiprocessing kind) makes getting results back easy (albeit with a modicum more caution than for the threading case, since in the multiprocessing case deadlocks are possible).
If you don't need to strictly serialize the executions of your functions, but rather can arrange your architecture to try two or more of them in parallel, AND are running on a multi-core machine (or multiple machines on a fast LAN), then suddenly multiprocessing becomes a high-performance solution, easily paying back for the spawning and IPC overhead and more, exactly because you can exploit as many processors (or nodes in a cluster) as you can use.
You could use the multiprocessing library to run the code in a separate process, and call .join() on the process to wait for it to finish, with the timeout parameter set to whatever you want. The library provides several ways of getting data back from another process - using a Value object (seen in the Shared Memory example on that page) is probably sufficient. You can use the terminate() call on the process if you really need to, though it's not recommended.
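A hedged sketch of that approach, assuming the generated source defines run() as described in the question (run_with_timeout and _worker are illustrative names, not library functions; a Queue is used instead of the Value object mentioned above because it handles any picklable result):
import multiprocessing

def _worker(source, params, result_queue):
    # runs in the child process: build run() from the generated source and send the result back
    namespace = {}
    exec(source, namespace)
    result_queue.put(namespace['run'](params))

def run_with_timeout(source, params, timeout=5.0):
    result_queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_worker, args=(source, params, result_queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        # run() overran its time budget: kill the child and report failure
        proc.terminate()
        proc.join()
        return None
    return result_queue.get() if not result_queue.empty() else None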
You could also use Stackless Python, as it allows for cooperative scheduling of microthreads. Here you can specify a maximum number of instructions to execute before returning. Setting up the routines and getting the return value out is a little more tricky though.
I could spawn a new thread for this, and specify a time for .join() method, but then I cannot easily get the result out of it
If the timeout expires, that means the method didn't finish, so there's no result to get. If you have incremental results, you can store them somewhere and read them out however you like (keeping threadsafety in mind).
Using SIGALRM-based systems is dicey, because it can deliver async signals at any time, even during an except or finally handler where you're not expecting one. (Other languages deal with this better, unfortunately.) For example:
try:
    ...  # code
finally:
    cleanup1()
    cleanup2()
    cleanup3()
A signal passed up via SIGALRM might happen during cleanup2(), which would cause cleanup3() to never be executed. Python simply does not have a way to terminate a running thread in a way that's both uncooperative and safe.
You should just have the code check the timeout on its own.
import threading
from datetime import datetime, timedelta

local = threading.local()

class ExecutionTimeout(Exception):
    pass

def start(max_duration=timedelta(seconds=1)):
    local.start_time = datetime.now()
    local.max_duration = max_duration

def check():
    if datetime.now() - local.start_time > local.max_duration:
        raise ExecutionTimeout()

def do_work():
    start()
    while True:
        check()
        # do stuff here
        return 10

try:
    print(do_work())
except ExecutionTimeout:
    print("Timed out")
(Of course, this belongs in a module, so the code would actually look like "timeout.start()"; "timeout.check()".)
If you're generating code dynamically, then generate a timeout.check() call at the start of each loop.
Consider using the stopit package, which can be useful in cases where you need timeout control. Its documentation emphasizes the limitations.
https://pypi.python.org/pypi/stopit
A quick google for "python timeout" reveals a TimeoutFunction class.
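Such a class is typically a thin wrapper around signal.alarm, roughly along the lines below (a reconstruction, not the exact class; it is Unix-only and subject to the SIGALRM caveats discussed in the previous answer):
import signal

class TimeoutFunctionException(Exception):
    pass

class TimeoutFunction:
    def __init__(self, function, timeout):
        self.function = function
        self.timeout = timeout              # whole seconds, since signal.alarm takes an int

    def _handle_timeout(self, signum, frame):
        raise TimeoutFunctionException()

    def __call__(self, *args, **kwargs):
        old_handler = signal.signal(signal.SIGALRM, self._handle_timeout)
        signal.alarm(self.timeout)          # SIGALRM fires after self.timeout seconds
        try:
            return self.function(*args, **kwargs)
        finally:
            signal.alarm(0)                 # cancel any pending alarm
            signal.signal(signal.SIGALRM, old_handler)

# usage: safe_run = TimeoutFunction(run, 5); result = safe_run(params)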
Executing untrusted code is dangerous, and should usually be avoided unless it's impossible to do so. I think you're right to be worried about the time of the run() method, but the run() method could do other things as well: delete all your files, open sockets and make network connections, begin cracking your password and email the result back to an attacker, etc.
Perhaps if you can give some more detail on what the dynamically loaded code does, the SO community can help suggest alternatives.