I'm trying to parallelize some Python code using processes and concurrent.futures. It looks like I can execute a function multiple times in parallel either by submitting calls and then calling Future.result() on the futures, or by using Executor.map().
I'm wondering if the latter is just a syntactic sugar for the former and if there's any difference performance-wise. It doesn't seem immediately clear from the documentation.
It will allow you to execute a function multiple times concurrently instead of true parallel execution.
Performance-wise, I recently found that ProcessPoolExecutor.submit() and ProcessPoolExecutor.map() consumed the same amount of compute time to complete the same task. Note: .submit() returns a future object (let's call it f) and you need to call its f.result() method to see its result. On the other hand, .map() directly returns an iterator.
When converting their results into an ordered list using the sorted method, I have found that the compute time of the entire .map() code can be faster than the entire .submit() code in certain scenarios.
When converting their results into an unordered list using the list method, the compute time of the entire .submit() and .map() codes are the same. Also, these codes performed faster than the codes using the sorted method.
You can read the details in my answer. There, I have also shared my codes where you can see how they work. I hope they can be helpful to you.
I have not used ThreadPoolExecutor so I can't comment in detail. However, I have read that it is implemented the same way as the ProcessPoolExecutor and is better suited to I/O-bound tasks than to CPU-bound tasks. Before Python 3.5 you had to specify its max_workers argument, i.e. the maximum number of threads; from 3.5 onwards it is optional there too. In the ProcessPoolExecutor, max_workers is an optional argument which defaults to the number of CPUs returned by os.cpu_count().
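To make the submit()/map() difference concrete, here is a minimal sketch. It uses ThreadPoolExecutor only so that it runs without an `if __name__ == '__main__':` guard; ProcessPoolExecutor exposes the same submit() and map() methods. The function work and the inputs are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(n):
    # Hypothetical task: larger inputs take longer to finish.
    time.sleep(0.05 * n)
    return n * n


inputs = [3, 1, 2]

with ThreadPoolExecutor(max_workers=3) as executor:
    # submit(): one Future per call; result() blocks until that call is done.
    futures = [executor.submit(work, n) for n in inputs]
    via_submit = [f.result() for f in futures]

    # map(): returns an iterator that yields results in input order.
    via_map = list(executor.map(work, inputs))

    # as_completed(): yields futures as they finish, so the order of the
    # results follows completion time rather than input order.
    futures = [executor.submit(work, n) for n in inputs]
    by_completion = [f.result() for f in as_completed(futures)]

print(via_submit, via_map, by_completion)
```

Collecting result() from the futures in submission order gives the same list as map(). The practical difference is that submit() hands you the futures, so you can also consume them with as_completed() when order does not matter.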
The Dask distributed library documentation says:
By default, distributed assumes that all functions are pure.
[...]
The scheduler avoids redundant computations. If the result is already in memory from a previous call then that old result will be used rather than recomputing it.
When benchmarking function runtimes, this caching behavior will get in the way, as we want to call the same function multiple times with the same inputs.
So is there a way to completely disable it?
I know that submit and map accept an argument for this (pure=False). But for computations on dask collections I have not found a solution.
After some digging in the source code of distributed, I believe I have found an answer myself, although someone might correct me if I came to the wrong conclusion.
Short answer
It is not possible to globally disable the purity assumption in distributed.
However, for dask collections it is possible to separate computations from precomputed results with dask.graph_manipulation.clone().
Long answer
Internally, dask splits its computation up into labelled tasks.
A task label is called a "key" and is used to identify results from a computation (an execution of a task). Keys are used to identify dependencies between tasks and are therefore essential for how dask works.
When we submit a new computation graph, which is basically a list of tasks with their dependencies, to the scheduler in distributed, the scheduler checks whether some tasks have already been computed by checking their keys against the keys of the finished tasks, which the scheduler still holds.
This happens near the beginning of Scheduler.update_graph(), which is the method called by the client when it wants to start a new computation.
There is no switch in the current implementation to disable this. The calls to plugin.update_graph() for the registered scheduler plugins also happen after this optimization phase, so we cannot regulate this behavior through plugins either.
So what can we do?
By manually modifying the keys of the individual tasks in the graph, we can trick the scheduler into thinking that we have not yet computed this task.
Task keys usually have the format prefix-token, where the prefix is the original task name (e.g. function name) and the token is a hash built from the arguments of the task.
Distributed uses the task prefix to group tasks together and get an estimate for future runtimes.
The token is primarily used to identify different executions of the same task with different arguments.
So we can just adjust the token of the key to let dask think that we are running the task with different arguments.
This is in principle what dask.graph_manipulation.clone() does for us. It copies a dask collection and returns a new one such that the keys of the tasks in the internal graph are rewritten.
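The key mechanism can be illustrated with a toy model. This is not dask's or distributed's actual code; ToyScheduler, make_key and retoken are hypothetical names that only mimic the prefix-token idea described above.

```python
import hashlib
import uuid


def make_key(func, *args):
    """Build a 'prefix-token' key: function name plus a hash of the arguments."""
    token = hashlib.md5(repr(args).encode()).hexdigest()
    return f"{func.__name__}-{token}"


def retoken(key):
    """Keep the prefix but swap in a fresh random token."""
    prefix, _, _ = key.rpartition("-")
    return f"{prefix}-{uuid.uuid4().hex}"


class ToyScheduler:
    """Remembers finished results by key and skips keys it has already seen."""

    def __init__(self):
        self.finished = {}   # key -> result
        self.executions = 0  # how often a task function was actually called

    def compute(self, key, func, *args):
        if key not in self.finished:
            self.executions += 1
            self.finished[key] = func(*args)
        return self.finished[key]


def inc(x):
    return x + 1


scheduler = ToyScheduler()
key = make_key(inc, 1)
scheduler.compute(key, inc, 1)           # computed
scheduler.compute(key, inc, 1)           # same key: served from memory
scheduler.compute(retoken(key), inc, 1)  # new token: computed again
print(scheduler.executions)
```

Because the prefix is preserved, anything grouped by prefix (such as runtime estimates) still sees the re-tokenized task as the same kind of task.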
I am just learning Python and don't have much experience with multithreading. I am trying to send some JSON via the Requests session.post method. This is called in the function at the bottom of the many for loops I need to run through the dictionary.
Is there a way to let this run in parallel?
I also have to limit my number of threads, otherwise the post calls get blocked because they come too fast after each other. Help would be much appreciated.
def doWork(session, List, RefHashList):
    for itemRefHash in RefHashList:
        for equipment in res['Response']['data']['items']:
            if equipment['itemHash'] == itemRefHash:
                if equipment['characterIndex'] != 0:
                    SendJsonViaSession(session, getCharacterIdFromIndex(res, equipment['characterIndex']), itemRefHash, equipment['quantity'])
First, structuring your code differently might improve the speed without the added complexity of threading.
def doWork(session, res, RefHashList):
    for equipment in res['Response']['data']['items']:
        i = equipment['itemHash']
        k = equipment['characterIndex']
        if i in RefHashList and k != 0:
            SendJsonViaSession(session, getCharacterIdFromIndex(res, k), i, equipment['quantity'])
To start with, we will look up equipment['itemHash'] and equipment['characterIndex'] only once.
Instead of explicitly looping over RefHashList, you could use the in operator. This moves the loop out of Python bytecode and into the interpreter's own C code, which is faster.
And instead of a nested if-conditional, you could use a single conditional using and.
Note: I have removed the unused parameter List, and replaced it with res. It is generally good practice to write functions that only act on parameters that they are given, not global variables.
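As a further possible refinement (my assumption, not something the question requires): if RefHashList is long, converting it to a set once makes each in test a hash lookup instead of a scan. The sketch below only selects the matching items; the sample data and the name matching_items are made up.

```python
def matching_items(items, ref_hashes):
    """Return (itemHash, characterIndex, quantity) for items that should be sent."""
    wanted = set(ref_hashes)  # built once; membership tests are then O(1) on average
    return [
        (item['itemHash'], item['characterIndex'], item['quantity'])
        for item in items
        if item['itemHash'] in wanted and item['characterIndex'] != 0
    ]


# Hypothetical data shaped like res['Response']['data']['items'].
items = [
    {'itemHash': 10, 'characterIndex': 1, 'quantity': 5},
    {'itemHash': 11, 'characterIndex': 0, 'quantity': 2},
    {'itemHash': 12, 'characterIndex': 2, 'quantity': 7},
    {'itemHash': 99, 'characterIndex': 1, 'quantity': 1},
]
to_send = matching_items(items, [10, 11, 12])
print(to_send)
```

Separating the selection from the sending also makes it easy to hand the resulting list to a thread pool later.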
Second, how much extra performance do you need? How much time is there on average between the SendJsonViaSession calls, and how small can this time become before calls get blocked? If the difference between those numbers is small, it is probably not worth implementing a threaded sender.
Third, a design feature of the standard Python implementation is that only one thread at a time can be executing Python bytecode. So it is not certain that threading will improve performance.
Edit:
There are several ways to run stuff in parallel in Python. There is multiprocessing.Pool which uses processes, and multiprocessing.dummy.ThreadPool which uses threads. And from Python 3.2 onwards there is concurrent.futures, which can use processes or threads.
The thing is, none of them has rate limiting. So you could get blocked for making too many calls.
Every time you call SendJsonViaSession you'd have to save the current time somehow so that all processes or threads can use it. And before every call, you would have to read that time and wait if it is too close to the last call.
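A minimal sketch of that idea with threads, assuming a fixed minimum interval between calls is an acceptable rate limit. RateLimiter, send_all and the default values are hypothetical; send stands in for SendJsonViaSession.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Allow at most one call every min_interval seconds across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = time.monotonic()

    def wait(self):
        # Reserve the next free time slot while holding the lock, then sleep
        # outside the lock so other threads can reserve their own slots.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def send_all(send, jobs, max_workers=4, min_interval=0.05):
    """Call send(*job) for every job from a bounded, rate-limited thread pool."""
    limiter = RateLimiter(min_interval)

    def limited_send(job):
        limiter.wait()
        return send(*job)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(limited_send, jobs))
```

Here max_workers caps the number of threads and min_interval spaces out the start of the calls; both would have to be tuned against whatever the server tolerates.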
Edit2:
If a call to SendJsonViaSession only takes 0.3 seconds, you should be able to do 3 calls/second sequentially. But your code only does 1 call/second. This implies that the speed restriction is somewhere else. You'd have to profile your code to see where the problem lies.
I am confused about the Python multiprocessing module. Suppose we write the code like this:
pool = Pool()
for i in range(len(tasks)):
    pool.apply(task_function, (tasks[i],))
First i = 0, and the first subprocess is created and executes the first task. Since we are using apply instead of apply_async, the main process is blocked, so there is no chance that i gets incremented and the second task is executed. So by doing it this way, we are actually writing serial code that does not run in multiple processes? And is the same true when we use map instead of map_async? No wonder the results of these tasks come back in order. If this is the truth, we don't even bother to use multiprocessing's map and apply function. Correct me if I am wrong.
According to the documentation:
apply(func[, args[, kwds]])
Equivalent of the apply() built-in function. It blocks until
the result is ready, so apply_async() is better suited for
performing work in parallel. Additionally, func is only executed
in one of the workers of the pool.
So yes, if you want to delegate work to another process and return control to your main process, you have to use apply_async.
Regarding your statement:
If this is the truth, we don't even bother to use
multiprocessing's map and apply function
Depends on what you want to do. For example map will split the arguments into chunks and apply the function for each chunk in the different processes of the pool, so you are achieving parallelism. This would work for your example:
pool.map(task_function, tasks)
It will split tasks into pieces, and then call task_function on each process from the pool with the different pieces of tasks. So for example you could have Process1 running task_function(task1), Process2 running task_function(task2) all at the same time.
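The difference can be measured with a small sketch. It uses multiprocessing.pool.ThreadPool, which has the same apply, apply_async and map methods as Pool, only so that it runs without an `if __name__ == '__main__':` guard; task_function here is a made-up stand-in that just sleeps.

```python
import time
from multiprocessing.pool import ThreadPool


def task_function(task):
    time.sleep(0.1)  # stand-in for real work
    return task * 2


def timed(run):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = run()
    return result, time.perf_counter() - start


def run_apply(pool, tasks):
    # apply() blocks on every call, so the tasks run one after another.
    return [pool.apply(task_function, (t,)) for t in tasks]


def run_apply_async(pool, tasks):
    # apply_async() returns immediately; get() collects the results afterwards.
    pending = [pool.apply_async(task_function, (t,)) for t in tasks]
    return [p.get() for p in pending]


tasks = [1, 2, 3, 4]
with ThreadPool(4) as pool:
    serial, t_apply = timed(lambda: run_apply(pool, tasks))
    parallel, t_async = timed(lambda: run_apply_async(pool, tasks))
    mapped, t_map = timed(lambda: pool.map(task_function, tasks))

print(f"apply: {t_apply:.2f}s  apply_async: {t_async:.2f}s  map: {t_map:.2f}s")
```

All three variants return the same list in input order. map blocks the caller until everything is done, but the individual tasks still run concurrently in the workers, which is why it is much faster than the apply loop.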
I have a program that is using pool.map() to get the values using ten parallel workers. I'm having trouble wrapping my head around how I am supposed to stitch the values back together to make use of them at the end.
What I have is structured like this:
initial_input = get_initial_values()
pool.map(function, initial_input)
pool.close()
pool.join()
# now how would I get the output?
send_ftp_of_output(output_data)
Would I have the function write to a log file? If so, if there are (as a hypothetical) a million processes trying to write to the same file, would things overwrite each other?
pool.map(function,input)
returns a list.
You can get the output by doing:
output_data = pool.map(function,input)
pool.map simply runs the map function in parallel, but it still only returns a single list. If you're not outputting anything in the function you are mapping (and you shouldn't), then it simply returns a list. This is the same as map() would do, except it is executed in parallel.
In regards to the log file, yes, having multiple threads write to the same place would interleave within the log file. You could have the thread lock the file before the write, which would ensure that an entry doesn't get interrupted midway, but it would still interleave things chronologically amongst all the threads. Locking the log file each time would also significantly slow down logging due to the overhead involved.
You can also have, say, the thread number -- %(thread)d -- or some other identifying mark in the logging Formatter output that would help to differentiate, but it could still be hard to follow, especially for a bunch of threads.
Not sure if this would work in your specific application, as the specifics in your app may preclude it, however, I would strongly recommend considering GNU Parallel (http://www.gnu.org/software/parallel/) to do the parallelized work. (You can use, say, subprocess.check_output to call into it).
The benefit of this is several fold, chiefly that you can easily vary the number of parallel workers -- up to having parallel use one worker per core on the machine -- and it will pipeline the items accordingly. The other main benefit, and the one more specifically related to your question -- is that it will stitch the output of all of these parallel workers together as if they had been invoked serially.
If your program wouldn't work so well having, say, a single command line piped from a file within the app and parallelized, you could perhaps make your Python code single-worker and then as the commands piped to parallel, make it a number of permutations of your Python command line, varying the target each time, and then have it output the results.
I use GNU Parallel quite often in conjunction with Python, often to do things, like, say, 6 simultaneous Postgres queries using psql from a list of 50 items.
Using Tritlo's suggestion, here is what worked for me:
def run_updates(input_data):
    # do something
    return {data}

if __name__ == '__main__':
    item = iTunes()
    item.fetch_itunes_pulldowns_to_do()
    initial_input_data = item.fetched_update_info
    pool = Pool(NUM_IN_PARALLEL)
    result = pool.map(run_updates, initial_input_data)
    pool.close()
    pool.join()
    print(result)
And this gives me a list of results.
I'm using CCKeyDerivationPBKDF to generate and verify password hashes in a concurrent environment and I'd like to know whether it is thread safe. The documentation of the function doesn't mention thread safety at all, so I'm currently using a lock to be on the safe side but I'd prefer not to use a lock if I don't have to.
After going through the source code of CCKeyDerivationPBKDF(), I find it to be "thread unsafe". While the code for CCKeyDerivationPBKDF() uses many library functions which are thread-safe (e.g. bzero), most user-defined functions (e.g. PRF) and the underlying functions called from them are potentially thread-unsafe (e.g. due to the use of several pointers and unsafe casting of memory, as in CCHMac). Unless they make all the underlying functions thread-safe, or add some mechanism to at least make it conditionally thread-safe, I would suggest sticking with your approach, or modifying the CommonCrypto code to make it thread-safe and using that code.
Hope it helps.
Lacking documentation or source code, one option is to build a test app with say 10 threads looping on calls to CCKeyDerivationPBKDF with a random selection from say 10 different sets of arguments with 10 known results.
Each thread checks the result of a call to make sure it is what is expected. Each thread should also have a usleep() call for some random amount of time (bell curve sitting on say 10% of the time each call to CCKeyDerivationPBKDF takes) in this loop in order to attempt to interleave operations as much as possible.
You'll probably want to instrument it with debugging that keeps track of how much concurrency you are able to generate. With a 10% sleep time and 10 threads, you should be able to keep 9 threads concurrent.
If it makes it through an aggregate of say 100,000,000 calls without an error, I'd assume it was thread safe. Of course you could run it for much longer than that to get greater assurances.
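The shape of such a test harness might look like the sketch below. It is written in Python with hashlib.pbkdf2_hmac standing in for CCKeyDerivationPBKDF, so it only illustrates the structure of the test and says nothing about CommonCrypto itself. The argument sets, sleep times and call counts are made-up values, and a real test would call the C function from native threads.

```python
import hashlib
import random
import threading
import time

# Hypothetical argument sets; the expected results are computed once,
# single-threaded, before any concurrent calls are made.
CASES = [(f"password{i}".encode(), f"salt{i}".encode(), 1000) for i in range(10)]
EXPECTED = [hashlib.pbkdf2_hmac("sha256", p, s, n) for p, s, n in CASES]


def worker(calls, errors, lock):
    rng = random.Random()
    for _ in range(calls):
        i = rng.randrange(len(CASES))
        password, salt, rounds = CASES[i]
        if hashlib.pbkdf2_hmac("sha256", password, salt, rounds) != EXPECTED[i]:
            with lock:
                errors.append(i)
        # Short random sleep (roughly bell-shaped) to vary how calls interleave.
        time.sleep(max(0.0, rng.gauss(0.0001, 0.00003)))


def stress(threads=10, calls_per_thread=100):
    """Run the workers concurrently and return the list of mismatching cases."""
    errors, lock = [], threading.Lock()
    pool = [
        threading.Thread(target=worker, args=(calls_per_thread, errors, lock))
        for _ in range(threads)
    ]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return errors
```

An empty list from stress() means no mismatch was observed in that run. As noted above, that is evidence rather than proof, and the number of calls would need to be far larger than in this sketch.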