I have a lot of tasks (independent of each other, represented by some code in Python) that need to be executed. Their execution time varies. I also have limited resources so at most N tasks can be running at the same time. The goal is to finish executing the whole stack of tasks as fast as possible.
It seems that I am looking for some kind of manager that starts new tasks when the resource gets available and collects finished tasks.
Are there any already-made solutions or should I code it myself?
Are there any caveats that I should keep in mind?
as far as I can tell your main would just become:
def main():
    tasks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    with multiprocessing.Pool(POOL_SIZE) as pool:
        pool.map(sleep, tasks)
i.e. you've just reimplemented a pool, but less efficiently (Pool reuses processes where possible) and not as safely: Pool goes to a lot of effort to clean up on exceptions
Here is a simple code snippet that should fit the requirements:
import multiprocessing
import time
POOL_SIZE = 4
STEP = 1
def sleep(seconds: int):
    time.sleep(seconds)


def main():
    tasks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    pool = [None] * POOL_SIZE

    while tasks or [item for item in pool if item is not None]:
        for i in range(len(pool)):
            if pool[i] is not None and not pool[i].is_alive():
                # Finished task. Clear the resource.
                pool[i] = None
            if pool[i] is None:
                # Free resource. Start new task if any are left.
                if tasks:
                    task = tasks.pop(0)
                    pool[i] = multiprocessing.Process(target=sleep, args=(task,))
                    pool[i].start()
        time.sleep(STEP)


if __name__ == '__main__':
    main()
The manager has a tasks list of arbitrary length; here, for simplicity, the tasks are represented by integers that are passed as arguments to a sleep function. It also has a pool list, initially filled with None placeholders, representing the available resource slots.
The manager periodically visits all currently running processes and checks whether they have finished. It also starts new processes when a slot becomes available. The whole cycle is repeated until there are no tasks and no running processes left. The STEP value is there to save computing power: you generally don't need to check the running processes every millisecond.
As for the caveats, keep the programming guidelines from the multiprocessing documentation in mind, e.g. protect the entry point with if __name__ == '__main__' and make sure the objects you pass to the workers are picklable.
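Regarding the "already-made solutions" part of the question: the standard library also ships concurrent.futures. Below is a minimal sketch of the same workload (re-declaring the sleep task from above), where ProcessPoolExecutor caps concurrency at POOL_SIZE workers and does the bookkeeping for you:

from concurrent.futures import ProcessPoolExecutor, as_completed
import time

POOL_SIZE = 4

def sleep(seconds: int) -> int:
    time.sleep(seconds)
    return seconds

if __name__ == '__main__':
    tasks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    # at most POOL_SIZE tasks run at the same time; a finished task frees a slot
    with ProcessPoolExecutor(max_workers=POOL_SIZE) as executor:
        futures = [executor.submit(sleep, t) for t in tasks]
        for future in as_completed(futures):
            print('finished task that slept for', future.result(), 's')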
Related
I understand this is a slightly vague and open-ended question, but I need some help in this area as a quick Google/Stack Overflow search hasn't yielded useful information.
The basic idea is to use multiple processes to speed up an expensive computation that currently gets executed sequentially in a loop. The caveat being that I have 2 significant data structures that are accessed by the expensive function:
one data structure will be read by all processes but is not ever modified by a process (so could be copied to each process, assuming memory size isn't an issue, which, in this case, it isn't)
the other data structure will spend most of the time being read by processes, but will occasionally be written to, and this update needs to be propagated to all processes from that point onwards
Currently the program works very basically like so:
def do_all_the_things(self):
    read_only_obj = {...}
    read_write_obj = {...}
    output = []
    for i in range(4):
        for j in range(4):
            output.append(do_expensive_operation(read_only_obj, read_write_obj))
    return output
In a uniprocessor world, this is fine as any changes made to read_write_obj are accessed sequentially.
What I am looking to do is to run each instance of do_expensive_operation in a separate process so that a multiprocessor can be fully utilised.
The two things I am looking to understand are:
How does the whole multiprocessing thing work? I have seen Queues and Pools and don't understand which I should be using in this situation.
I have a feeling sharing memory (read_only_obj and read_write_obj) is going to be complicated. Is this possible? Advisable? And how do I go about it?
Thank you for your time!
Disclaimer: I will help you and provide you with a working example, but I am not an expert in this topic.
Point 1 has been answered here to some extent.
Point 2 has been answered here to some extent.
I used different options in the past for CPU-bound tasks in Python, and here is one toy example for you to follow:
from multiprocessing import Process, Queue
import time, random


def do_something(n_order, x, queue):
    time.sleep(5)
    queue.put((n_order, x))


def main():
    data = [1, 2, 3, 4, 5]
    queue = Queue()
    processes = [Process(target=do_something, args=(n, x, queue)) for n, x in enumerate(data)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    unsorted_result = [queue.get() for _ in processes]
    result = [i[1] for i in sorted(unsorted_result)]
    print(result)


if __name__ == '__main__':
    main()
You can write the same thing with a plain loop instead of processes and queues, check the time consumed (in this silly case it is the sleep, for testing purposes), and you will realize that you shorten the time approximately by the number of processes that you run, as expected.
In fact, these are the results on my computer for the exact script that I provide you with (first the multiprocessing version, then the loop):
[1, 2, 3, 4, 5]
real 0m5.240s
user 0m0.397s
sys 0m0.260s
[1, 4, 9, 16, 25]
real 0m25.104s
user 0m0.051s
sys 0m0.030s
With respect to read_only or read and write objects, I will need more information to provide help. What type of objects are those? Are they indexed?
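If, for example, the read/write object happens to be a plain dict, one possible approach (a sketch only; the names and the update rule here are made up) is a multiprocessing.Manager dict, whose proxy forwards every read and write to a single manager process, so all workers see updates:

from multiprocessing import Manager, Pool

def do_expensive_operation(args):
    read_only_obj, read_write_obj, key = args
    # writes go through the proxy, so every worker sees them afterwards
    read_write_obj[key] = read_only_obj['scale'] * key
    return read_write_obj[key]

if __name__ == '__main__':
    read_only_obj = {'scale': 10}        # copied into each worker, never modified
    with Manager() as manager:
        read_write_obj = manager.dict()  # shared through the manager process
        with Pool(4) as pool:
            output = pool.map(do_expensive_operation,
                              [(read_only_obj, read_write_obj, k) for k in range(8)])
        print(output)
        print(read_write_obj.copy())

Note that every access to the proxy involves inter-process communication, so it is noticeably slower than a local dict.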
I'm using Python 2.7's multiprocessing.Pool to manage a pool of 3 workers. Each worker is fairly complicated and there's a resource leak (presumably) in some third-party code that causes problems after 6-8 hours of continuous runtime. So I'd like to use maxtasksperchild to have workers refreshed periodically.
I'd also like each worker to write to its own separate log file. Without maxtasksperchild I use a shared multiprocessing.Value to assign an integer (0, 1, or 2) to each worker, then use the integer to name the log file.
With maxtasksperchild I'd like to reuse log files once a worker is done. So if this whole thing runs for a month, I only want three log files, not one log file for each worker that was spawned.
If I could pass a callback (e.g. a finalizer to go along with the initializer currently supported), this would be straightforward. Without that, I can't see a robust and simple way to do it.
That's AFAIK undocumented, but multiprocessing.util has a Finalize class, "which supports object finalization using weakrefs". You could use it to register a finalizer within your initializer.
I don't see multiprocessing.Value as a helpful synchronization choice in this case, though. Multiple workers could exit simultaneously, and signaling which file-integers are free is more than a (locked) counter can provide.
I would suggest using multiple bare multiprocessing.Lock objects, one per file, instead:
from multiprocessing import Pool, Lock, current_process
from multiprocessing.util import Finalize


def f(n):
    global fileno
    for _ in range(int(n)):  # xrange for Python 2
        pass
    return fileno


def init_fileno(file_locks):
    for i, lock in enumerate(file_locks):
        if lock.acquire(False):  # non-blocking attempt
            globals()['fileno'] = i
            print("{} using fileno: {}".format(current_process().name, i))
            Finalize(lock, lock.release, exitpriority=15)
            break


if __name__ == '__main__':
    n_proc = 3
    file_locks = [Lock() for _ in range(n_proc)]

    pool = Pool(
        n_proc, initializer=init_fileno, initargs=(file_locks,),
        maxtasksperchild=2
    )
    print(pool.map(func=f, iterable=[50e6] * 18))
    pool.close()
    pool.join()
    # all locks should be available if all finalizers did run
    assert all(lock.acquire(False) for lock in file_locks)
Output:
ForkPoolWorker-1 using fileno: 0
ForkPoolWorker-2 using fileno: 1
ForkPoolWorker-3 using fileno: 2
ForkPoolWorker-4 using fileno: 0
ForkPoolWorker-5 using fileno: 1
ForkPoolWorker-6 using fileno: 2
[0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]
Process finished with exit code 0
Note that with Python 3 you can't reliably use Pool's context manager instead of the old way of doing it shown above. Pool's context manager (unfortunately) calls terminate(), which might kill worker processes before the finalizer has had a chance to run.
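For reference, a small self-contained sketch of that shutdown order (the work function here is just a stand-in):

from multiprocessing import Pool
import os

def work(x):
    return (os.getpid(), x * x)

if __name__ == '__main__':
    pool = Pool(3, maxtasksperchild=2)
    try:
        print(pool.map(work, range(6)))
    finally:
        pool.close()  # stop accepting work; workers exit after their current task
        pool.join()   # wait for the workers, giving Finalize callbacks a chance to run
    # avoid `with Pool(...) as pool:` here, because __exit__ calls terminate()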
I ended up going with the following. It assumes that PIDs aren't recycled very quickly (true on Ubuntu for me, but not in general on Unix). I don't think it makes any other assumptions, but I'm really just interested in Ubuntu so I didn't look at other platforms such as Windows carefully.
The code uses an array to keep track of which PIDs have claimed which index. When a new worker starts, it looks for any PID that is no longer in use. If it finds one, it assumes this is because that worker has completed its work (or been terminated for another reason). If it doesn't find one, then we're out of luck! So this isn't perfect, but I think it's simpler than anything I've seen so far or considered.
import os
from multiprocessing import Array, Pool


def run_pool():
    child_pids = Array('i', 3)
    pool = Pool(3, initializer=init_worker, initargs=(child_pids,), maxtasksperchild=1000)


def init_worker(child_pids):
    with child_pids.get_lock():
        available_index = None
        for index, pid in enumerate(child_pids):
            # PID 0 means unallocated (this happens when our pool is started), we reclaim PIDs
            # which are no longer in use. We also reclaim the lucky case where a PID was recycled
            # but assigned to one of our workers again, so we know we can take it over
            if not pid or not _is_pid_in_use(pid) or pid == os.getpid():
                available_index = index
                break

        if available_index is not None:
            child_pids[available_index] = os.getpid()
        else:
            # This is unexpected - it means all of the PIDs are in use so we have a logical error
            # or a PID was recycled before we could notice and reclaim its index
            pass


def _is_pid_in_use(pid):
    try:
        os.kill(pid, 0)
        return True
    except OSError:
        return False
To get a better understanding of parallelism, I am comparing a set of different pieces of code.
Here is the basic one (code_piece_1).
for loop
import time

# setup
problem_size = 1e7
items = range(9)


# serial
def counter(num=0):
    junk = 0
    for i in range(int(problem_size)):
        junk += 1
        junk -= 1
    return num


def sum_list(args):
    print("sum_list fn:", args)
    return sum(args)


start = time.time()
summed = sum_list([counter(i) for i in items])
print(summed)
print('for loop {}s'.format(time.time() - start))
This code ran a time-consuming function in a serial style (a for loop) and produced this result:
sum_list fn: [0, 1, 2, 3, 4, 5, 6, 7, 8]
36
for loop 8.7735116481781s
multiprocessing
Could the multiprocessing style be viewed as a way to implement parallel computing?
I assume yes, since the docs say so.
Here is code_piece_2
import multiprocessing
start = time.time()
pool = multiprocessing.Pool(len(items))
num_to_sum = pool.map(counter, items)
print(sum_list(num_to_sum))
print('pool.map {}s'.format(time.time() - start))
This code ran the same time-consuming function in multiprocessing style and produced this result:
sum_list fn: [0, 1, 2, 3, 4, 5, 6, 7, 8]
36
pool.map 1.6011056900024414s
Obviously, the multiprocessing version is faster than the serial one in this particular case.
Dask
Dask is a flexible library for parallel computing in Python.
This code (code_piece_3) ran the same time-consuming function with Dask (I am not sure whether I am using Dask the right way):
from dask import delayed


@delayed
def counter(num=0):
    junk = 0
    for i in range(int(problem_size)):
        junk += 1
        junk -= 1
    return num


@delayed
def sum_list(args):
    print("sum_list fn:", args)
    return sum(args)


start = time.time()
summed = sum_list([counter(i) for i in items])
print(summed.compute())
print('dask delayed {}s'.format(time.time() - start))
I got
sum_list fn: [0, 1, 2, 3, 4, 5, 6, 7, 8]
36
dask delayed 10.288054704666138s
My CPU has 6 physical cores.
Question
Why is Dask so much slower while multiprocessing is so much faster?
Am I using Dask the wrong way? If yes, what is the right way?
Note: Please discuss with this particular case or other specific and concrete cases. Please do NOT talk generally.
In your example, Dask is slower than Python multiprocessing because you don't specify a scheduler, so Dask uses the default multithreading backend. As mdurant has pointed out, your code does not release the GIL, therefore multithreading cannot execute the task graph in parallel.
Have a look here for a good overview over the topic: https://docs.dask.org/en/stable/scheduler-overview.html
For your code, you could switch to the multiprocessing backend by calling:
.compute(scheduler='processes').
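For illustration, here is a sketch of how the snippet above could look with the processes scheduler (assuming dask is installed; the main guard is needed because worker processes get spawned):

from dask import delayed

@delayed
def counter(num=0):
    junk = 0
    for i in range(int(1e7)):
        junk += 1
        junk -= 1
    return num

@delayed
def sum_list(args):
    return sum(args)

if __name__ == '__main__':
    summed = sum_list([counter(i) for i in range(9)])
    # run the task graph with worker processes instead of threads
    print(summed.compute(scheduler='processes'))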
If you use the multiprocessing backend, all communication between processes still needs to pass through the main process. You therefore might also want to check out the distributed scheduler, where worker processes can communicate with each other directly, which is beneficial especially for complex task graphs. The distributed scheduler also supports work-stealing to balance work between processes and has a web interface providing diagnostic information about running tasks. It often makes sense to use the distributed scheduler rather than the multiprocessing scheduler even if you only want to compute on a local machine.
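A sketch of the distributed scheduler on a single machine (it requires the separate distributed package; everything else here is illustrative):

from dask import delayed
from dask.distributed import Client

@delayed
def busy(num):
    junk = 0
    for i in range(int(1e7)):
        junk += 1
        junk -= 1
    return num

if __name__ == '__main__':
    client = Client()  # starts a local cluster of worker processes
    total = delayed(sum)([busy(i) for i in range(9)])
    print(total.compute())  # with an active Client, this runs on the cluster workers
    client.close()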
Q : Why did parallel computing take longer than a serial one?
Because many more instructions are loaded onto the CPU to be executed ("awfully" many even before the first step of the intended block of calculations reaches the CPU) than in a pure-[SERIAL] case, where no add-on costs are added to the flow of execution.
These add-on operations (hidden from the source code) cost you both in the [TIME]-domain (the duration of such "preparations") and in the [SPACE]-domain (allocating more RAM to hold all the structures needed for the [PARALLEL]-operated code, which most often is still just [CONCURRENT]-operated code, if we are pedantic and accurate in terminology). That RAM again costs you in [TIME], as each and every RAM-I/O costs you about 300~380 [ns], more than 1/3 of a [us].
The result?
Unless your workload package has a "sufficient enough" amount of work that can be executed in parallel (non-blocking, having no locks, no mutexes, no sharing, no dependencies, no I/O, indeed independent and with minimum RAM-I/O re-fetches), it is very easy to "pay way more than you ever get back".
For details on the add-on costs and the things that have such a strong effect on the resulting speedup, start by reading the criticism of blindly using the original, overhead-naive formulation of Amdahl's law here.
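As a rough illustration of that point, here is a toy comparison of the classic Amdahl speedup against a variant that charges an add-on cost for spawning and communication (the numbers are made up; only the shape of the effect matters):

def amdahl_speedup(p, n):
    # p: parallelizable fraction of the work, n: number of workers
    return 1.0 / ((1.0 - p) + p / n)

def overhead_aware_speedup(p, n, setup_cost):
    # setup_cost: add-on costs (process spawn, pickling, RAM-I/O) expressed
    # as a fraction of the original serial runtime
    return 1.0 / ((1.0 - p) + p / n + setup_cost)

print(amdahl_speedup(0.95, 8))               # ~5.9x in the overhead-naive model
print(overhead_aware_speedup(0.95, 8, 0.5))  # ~1.5x once the add-on costs are paid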
The code you have requires the GIL, so only one task is running at a time, and all you are getting is extra overhead. If you use, for example, the distributed scheduler with processes, then you get much better performance.
I am trying to use multiprocessing for the first time and having some fairly basic issues. I have a toy example below, where two processes are adding data to a list:
from multiprocessing import Process


def add_process(all_nums_class, numbers_to_add):
    for number in numbers_to_add:
        all_nums_class.all_nums_list.append(number)


class AllNumsClass:
    def __init__(self):
        self.all_nums_list = []


all_nums_class = AllNumsClass()

p1 = Process(target=add_process, args=(all_nums_class, [1, 3, 5]))
p1.start()

p2 = Process(target=add_process, args=(all_nums_class, [2, 4, 6]))
p2.start()

all_nums_class.all_nums_list
I'd like to have the all_nums_class shared between these processes so that they can both add to its all_nums_list - so the result should be
[1,2,3,4,5,6]
instead of what I'm currently getting which is just good old
[]
Could anybody please advise? I have played around with namespace a bit but I haven't yet made it work here.
I feel I'd better mention (in case it makes a difference) that I'm doing this on Jupyter notebook.
You can either use a multiprocessing Queue or a Pipe to share data between processes. Queues are both thread and process safe. You will have to be more careful when using a Pipe as the data in a pipe may become corrupted if two processes (or threads) try to read from or write to the same end of the pipe at the same time. Of course there is no risk of corruption from processes using different ends of the pipe at the same time.
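For completeness, a small sketch of the Pipe variant (the names here are illustrative; a sentinel value tells the reading end that the writer is done):

from multiprocessing import Process, Pipe

def add_numbers(conn, numbers_to_add):
    for number in numbers_to_add:
        conn.send(number)
    conn.send(None)  # sentinel so the reader knows we are done
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=add_numbers, args=(child_conn, [1, 3, 5]))
    p.start()
    received = []
    while True:
        item = parent_conn.recv()
        if item is None:
            break
        received.append(item)
    p.join()
    print(received)  # [1, 3, 5]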
Currently, your implementation spawns two separate processes, each with its own self.all_nums_list. So you're essentially creating three AllNumsClass objects: one in your main program, one in p1, and one in p2. Since processes are independent and don't share the same memory space, each one appends correctly, but it appends to its own self.all_nums_list. That's why when you print all_nums_class.all_nums_list in your main program, you're printing the main process's self.all_nums_list, which is an empty list. To share the data and have the processes append to the same list, I would recommend using a Queue.
Example using Queue and Process
import multiprocessing as mp


def add_process(queue, numbers_to_add):
    for number in numbers_to_add:
        queue.put(number)


class AllNumsClass:
    def __init__(self):
        self.queue = mp.Queue()

    def get_queue(self):
        return self.queue


if __name__ == '__main__':
    all_nums_class = AllNumsClass()

    processes = []
    p1 = mp.Process(target=add_process, args=(all_nums_class.get_queue(), [1, 3, 5]))
    p2 = mp.Process(target=add_process, args=(all_nums_class.get_queue(), [2, 4, 6]))

    processes.append(p1)
    processes.append(p2)

    for p in processes:
        p.start()
    for p in processes:
        p.join()

    output = []
    while all_nums_class.get_queue().qsize() > 0:
        output.append(all_nums_class.get_queue().get())

    print(output)
This implementation is asynchronous in the sense that the numbers are not put on the queue in a fixed sequential order. Every time you run it, you may get different outputs.
Example outputs
[1, 2, 3, 5, 4, 6]
[1, 3, 5, 2, 4, 6]
[2, 4, 6, 1, 3, 5]
[2, 1, 4, 3, 5, 6]
A simpler way to maintain an ordered or unordered list of results is to use the mp.Pool class, specifically the Pool.apply and Pool.apply_async functions. Pool.apply blocks the main program until each submitted call has finished, which is quite useful if we want to obtain results in a particular order for certain applications. In contrast, Pool.apply_async submits all the calls at once and lets us retrieve the results as soon as they are finished. An additional difference is that we need to call the get method on the result of Pool.apply_async in order to obtain the return values of the finished processes.
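A short sketch of the apply_async route (the helper function and the chunks are made up for illustration; each AsyncResult is collected with get() in submission order, so the combined output order is deterministic):

import multiprocessing as mp

def add_numbers(numbers_to_add):
    # runs in a worker process and simply returns its chunk
    return numbers_to_add

if __name__ == '__main__':
    with mp.Pool(2) as pool:
        async_results = [pool.apply_async(add_numbers, (chunk,))
                         for chunk in ([1, 3, 5], [2, 4, 6])]
        # get() blocks until each result is ready; iterating the AsyncResults
        # in submission order keeps the combined output ordered
        output = [n for res in async_results for n in res.get()]
    print(output)  # [1, 3, 5, 2, 4, 6]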
I have code that makes unique combinations of elements. There are 6 types, and there are about 100 of each. So there are 100^6 combinations. Each combination has to be calculated, checked for relevance and then either be discarded or saved.
The relevant bit of the code looks like this:
def modconffactory():
    for transmitter in totaltransmitterdict.values():
        for reciever in totalrecieverdict.values():
            for processor in totalprocessordict.values():
                for holoarray in totalholoarraydict.values():
                    for databus in totaldatabusdict.values():
                        for multiplexer in totalmultiplexerdict.values():
                            newconfiguration = [transmitter, reciever, processor, holoarray, databus, multiplexer]
                            data_I_need = dosomethingwith(newconfiguration)
                            saveforlateruse_if_useful(data_I_need)
Now this takes a long time, and that is fine, but I realize this process (making the configurations and then doing the calculations for later use) only uses 1 of my 8 processor cores at a time.
I've been reading up about multithreading and multiprocessing, but I only see examples of different processes, not how to multithread one process. In my code I call two functions: 'dosomethingwith()' and 'saveforlateruse_if_useful()'. I could make those into separate processes and have them run concurrently with the for-loops, right?
But what about the for-loops themselves? Can I speed up that one process? Because that is where the time consumption is. (<-- This is my main question)
Is there a cheat? For instance, compiling to C so that the OS multithreads it automatically?
I only see examples of different processes, not how to multithread one process
There is multithreading in Python, but it is very ineffective because of the GIL (Global Interpreter Lock). So if you want to use all of your processor cores, if you want concurrency, you have no other choice than to use multiple processes, which can be done with the multiprocessing module (well, you could also use another language that doesn't have this problem).
Approximate example of multiprocessing usage for your case:
import multiprocessing

WORKERS_NUMBER = 8


def modconffactoryProcess(generator, step, offset, conn):
    """
    Function to be invoked by every worker process.

    generator: iterable object, the very top one of all you are iterating over,
    in your case, totaltransmitterdict.values()

    We are passing a whole iterable object to every worker, they all will iterate
    over it. To ensure they will not waste time by doing the same things
    concurrently, we will assume this: each worker will process only each stepTH
    item, starting with offsetTH one. step must be equal to the WORKERS_NUMBER,
    and offset must be a unique number for each worker, varying from 0 to
    WORKERS_NUMBER - 1

    conn: a multiprocessing.Connection object, allowing the worker to communicate
    with the main process
    """
    for i, transmitter in enumerate(generator):
        if i % step == offset:
            for reciever in totalrecieverdict.values():
                for processor in totalprocessordict.values():
                    for holoarray in totalholoarraydict.values():
                        for databus in totaldatabusdict.values():
                            for multiplexer in totalmultiplexerdict.values():
                                newconfiguration = [transmitter, reciever, processor, holoarray, databus, multiplexer]
                                data_I_need = dosomethingwith(newconfiguration)
                                saveforlateruse_if_useful(data_I_need)
    conn.send('done')


def modconffactory():
    """
    Function to launch all the worker processes and wait until they all complete
    their tasks
    """
    processes = []
    generator = totaltransmitterdict.values()
    for i in range(WORKERS_NUMBER):
        conn, childConn = multiprocessing.Pipe()
        process = multiprocessing.Process(target=modconffactoryProcess, args=(generator, WORKERS_NUMBER, i, childConn))
        process.start()
        processes.append((process, conn))
    # Here we have created, started and saved to a list all the worker processes
    working = True
    finishedProcessesNumber = 0
    try:
        while working:
            for process, conn in processes:
                if conn.poll():  # Check if any messages have arrived from a worker
                    message = conn.recv()
                    if message == 'done':
                        finishedProcessesNumber += 1
                        if finishedProcessesNumber == WORKERS_NUMBER:
                            working = False
    except KeyboardInterrupt:
        print('Aborted')
You can adjust WORKERS_NUMBER to your needs.
Same with multiprocessing.Pool:
import multiprocessing

WORKERS_NUMBER = 8


def modconffactoryProcess(transmitter):
    for reciever in totalrecieverdict.values():
        for processor in totalprocessordict.values():
            for holoarray in totalholoarraydict.values():
                for databus in totaldatabusdict.values():
                    for multiplexer in totalmultiplexerdict.values():
                        newconfiguration = [transmitter, reciever, processor, holoarray, databus, multiplexer]
                        data_I_need = dosomethingwith(newconfiguration)
                        saveforlateruse_if_useful(data_I_need)


def modconffactory():
    pool = multiprocessing.Pool(WORKERS_NUMBER)
    pool.map(modconffactoryProcess, totaltransmitterdict.values())
You would probably like to use .map_async instead of .map (there is a short sketch of that below).
Both snippets do the same thing, but I would say the first one gives you more control over the program.
I suppose the second one is the easiest, though :)
But the first one should give you an idea of what is happening in the second one.
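As for the .map_async variant mentioned above, a minimal self-contained sketch (the work function is just a stand-in for the expensive inner loops):

import multiprocessing
import time

def work(x):
    time.sleep(0.1)  # stand-in for the expensive calculations
    return x * x

if __name__ == '__main__':
    pool = multiprocessing.Pool(8)
    async_result = pool.map_async(work, range(16))  # returns immediately
    # the main process is free to do other things while the workers run
    print(async_result.get())  # blocks until all results are ready
    pool.close()
    pool.join()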
multiprocessing docs: https://docs.python.org/3/library/multiprocessing.html
You can run your function in this way:
from multiprocessing import Pool


def f(x):
    return x*x


if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))
https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers