I'm a beginner in this area.
I am trying to run one function in multiple processes so I can process a huge file in a shorter time.
I tried:
from multiprocessing import Process

for file_chunk in file_chunks:
    p = Process(target=my_func, args=(file_chunk, my_arg2))
    p.start()
    # no .join() here; otherwise the main process would have to wait
    # for proc1 to finish before it could start proc2
but it did not seem fast enough.
Now I wonder whether it is really running the jobs in parallel. I also thought about Pool, but I am using Python 2 and it is ugly to make it map two arguments to the function.
Am I missing something in my code above, or do processes created this way really run in parallel?
The speedup scales with the number of CPU cores your PC has, not with the number of chunks.
Ideally, if you have 4 CPU cores, you would see close to a 4x speedup. In practice, other factors such as IPC overhead reduce the improvement.
Spawning too many processes will also hurt performance, as they compete with each other for the CPU.
I'd recommend using a multiprocessing.Pool to handle most of the logic. If you have multiple arguments, just use the apply_async method.
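from multiprocessing import Pool

pool = Pool()
for file_chunk in file_chunks:
    pool.apply_async(my_func, args=(file_chunk, arg1, arg2))
pool.close()  # no more tasks will be submitted
pool.join()   # wait for all submitted tasks to finish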
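To get the return values back from apply_async, keep the AsyncResult objects and call get() on each. The following is only a minimal sketch, assuming Python 3 (for the `with Pool(...)` form); `count_lines`, `needle` and `ignore_case` are hypothetical stand-ins for `my_func` and its extra arguments, not anything from the question.

```python
from multiprocessing import Pool


def count_lines(chunk, needle, ignore_case):
    """Hypothetical stand-in for my_func: count lines in a chunk that contain needle."""
    if ignore_case:
        needle = needle.lower()
        return sum(1 for line in chunk if needle in line.lower())
    return sum(1 for line in chunk if needle in line)


def count_in_parallel(chunks, needle, ignore_case=True, processes=None):
    """Submit one task per chunk and gather the results in submission order."""
    with Pool(processes) as pool:
        pending = [
            pool.apply_async(count_lines, (chunk, needle, ignore_case))
            for chunk in chunks
        ]
        # get() blocks until that task is done and re-raises worker exceptions
        return [result.get() for result in pending]


if __name__ == "__main__":
    # Made-up sample data standing in for chunks of a large file
    chunks = [["Error: disk", "ok"], ["ok", "ok"], ["error again", "ERROR"]]
    print(count_in_parallel(chunks, "error"))
```

Leaving `processes=None` lets the pool default to the number of CPUs, which matches the advice above about not creating more workers than cores.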
I am not an expert either, but you could try joblib's Parallel:
from joblib import Parallel, delayed
import multiprocessing as mp

def random_function(args):
    pass

proc = mp.cpu_count()
Parallel(n_jobs=proc)(delayed(random_function)(args) for args in args_list)
This runs random_function across n_jobs workers, here one per available CPU.
Feel free to read the docs!
I am trying to use multithreading and/or multiprocessing to speed up my script somewhat. Essentially I have a list of 10,000 subnets I read in from CSV, that I want to convert into an IPv4 object and then store in an array.
My base code is as follows and executes in roughly 300ms:
import ipaddress

aclsConverted = []

def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))

for y in acls:
    convertToIP(y['srcSubnet'])
If I try with concurrent.futures threads, it works but is 3-4x slower, as follows:
aclsConverted = []

def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    for y in acls:
        executor.submit(convertToIP, y['srcSubnet'])
Then if I try with concurrent.futures processes, it is 10-15x slower and the array is empty. The code is as follows:
aclsConverted = []

def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))

with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
    for y in acls:
        executor.submit(convertToIP, y['srcSubnet'])
The server I am running this on has 28 physical cores.
Any suggestions as to what I might be doing wrong will be gratefully received!
If tasks are too small, then the overhead of managing multiprocessing / multithreading is often more expensive than the benefit of running tasks in parallel.
You might try the following:
Create just two processes (not threads!), one handling the first 5000 subnets and the other handling the remaining 5000.
There you might see some performance improvement, but the tasks you perform are not that CPU- or IO-intensive, so I'm not sure it will work.
Multithreading in Python, on the other hand, gives no performance improvement at all for tasks that do no IO and are pure Python code.
The reason is the infamous GIL (global interpreter lock). In Python, two threads can never execute Python bytecode in parallel within the same process.
Multithreading in Python still makes sense for tasks that perform IO (such as network access), that sleep, or that call modules implemented in C which release the GIL. NumPy, for example, releases the GIL for many operations and is thus a good candidate for multithreading.
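The two-process split suggested above could look roughly like this. It is a sketch under assumptions, not a tested fix for the question: `convert_batch`, `split_in_two` and `convert_parallel` are hypothetical names, and the sample subnets are made up. Each worker returns its converted batch to the parent, because a module-level list appended to inside a worker process lives in that worker's memory, which is why the list in the question stayed empty.

```python
import ipaddress
from concurrent.futures import ProcessPoolExecutor


def convert_batch(subnets):
    """Convert a whole batch in one task and return it to the parent."""
    return [ipaddress.ip_network(s) for s in subnets]


def split_in_two(items):
    """Split a list into a first half and a second half."""
    middle = len(items) // 2
    return [items[:middle], items[middle:]]


def convert_parallel(subnets):
    """Convert the subnets with two worker processes, one per half."""
    converted = []
    with ProcessPoolExecutor(max_workers=2) as executor:
        # map() yields the batches in input order
        for batch in executor.map(convert_batch, split_in_two(subnets)):
            converted.extend(batch)
    return converted


if __name__ == "__main__":
    # Made-up input standing in for the CSV column
    subnets = ["10.0.%d.0/24" % i for i in range(256)]
    networks = convert_parallel(subnets)
    print(len(networks), networks[0])
```

Sending two large batches instead of 10,000 single items keeps the per-task overhead small, but for a job that already finishes in about 300 ms the process start-up cost may still outweigh any gain.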
I have found the following lines of code to compute the array mx by the repeated calling of a function called fun.
However, I would like to understand better what it does.
Also, I assigned 16 cores to the parallel pool; however, I noticed that no more than 2 cores are busy at the same time during the computation.
Could someone explain what this code does and why only some of the workers are doing any work?
Thank you!
from tqdm import tqdm
from multiprocessing import Pool
from functools import partial

with Pool(processes=16) as p_mx:
    mx = tqdm(p_mx.imap(partial(fun, L), nodes), total=n)
See "multiprocessing.Pool() slower than just using ordinary functions":
The function you are trying to parallelize doesn't require enough CPU resources (i.e. CPU time) to rationalize parallelization!
It may also be caused by the way Python handles multithreading and multiprocessing with the GIL; see "When to use threading and how many threads to use".
Look into the GIL and you will have a better understanding of why.
If you want concurrent code in Python 3.8 and have CPU-bound concurrency problems, then this could be the ticket!
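As for what the snippet in the question does: partial(fun, L) fixes the first argument, so the pool calls fun(L, node) for each node, and imap hands back an iterator over the results in input order. Below is a minimal sketch without tqdm; the body of `fun` and the name `compute_mx` are hypothetical, since the original function is not shown.

```python
from functools import partial
from multiprocessing import Pool


def fun(L, node):
    """Hypothetical stand-in for the question's fun: scale a node by L."""
    return L * node


def compute_mx(L, nodes, processes=4):
    """Evaluate fun(L, node) for every node in a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # imap returns an iterator, not a list of results
        lazy_results = pool.imap(partial(fun, L), nodes)
        # Consume it here: leaving the with block calls pool.terminate()
        return list(lazy_results)


if __name__ == "__main__":
    print(compute_mx(3, range(5)))
```

By default imap sends items to the workers one at a time, so when each call to fun is cheap, most workers may spend their time waiting for the next item. Passing a larger chunksize to imap is one thing to try, though whether it helps depends on how expensive fun really is.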
I have 24 cores on my machine, but I just can't get them all running. When I run top, only 3 processes are running, and usually only one hits 100% CPU while the other two sit at ~30%.
I've read all the related threads on this site, but still can't figure out what's wrong with my code.
Pseudocode of how I used the pool is as follows:
import multiprocessing as mp

def Foo():
    pool = mp.Pool(mp.cpu_count())
    def myCallbackFun():
        pool.map(myFunc_wrapper, myArgs)
    optimization(callback=myCallbackFun)  # scipy optimization that has a callback function
Using pdb, I stopped before optimization, and checked I indeed have 24 workers.
But when I resume the program, top tells me I only have three Python processes running. Another thing: when I press Ctrl-C to terminate my program, it has so many workers to interrupt (e.g., PoolWorker-367). I've been pressing Ctrl-C for minutes, but there are still workers out there. Shouldn't there be just 24 workers?
How to make my program use all CPUs?
With multiprocessing, Python starts new processes. With a script like yours, it will fork infinitely. You need to wrap the script part of your module like this:
import multiprocessing as mp

if __name__ == '__main__':
    pool = mp.Pool(24)
    pool.map(myFunc_wrapper, myArgs)
For future readers:
As @mata correctly points out,
"You may be running into an IO bottleneck if your involved arguments are very big."
This is indeed my case. Try to minimize the size of the arguments passed to each process.
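One way to keep per-task arguments small is to hand the large object to each worker once, through the pool's initializer, and then send only small index ranges as tasks. This is a sketch under assumptions: `init_worker`, `lookup_sum`, `make_ranges` and `parallel_sum` are hypothetical names, and a plain list stands in for whatever large data is involved.

```python
from multiprocessing import Pool

_shared_table = None  # set once in each worker by init_worker


def init_worker(table):
    """Runs once per worker process and stores the large object there."""
    global _shared_table
    _shared_table = table


def lookup_sum(index_range):
    """Each task receives only a (start, stop) pair, not the table itself."""
    start, stop = index_range
    return sum(_shared_table[start:stop])


def make_ranges(total, step):
    """Cut range(total) into consecutive (start, stop) pairs of at most step items."""
    return [(i, min(i + step, total)) for i in range(0, total, step)]


def parallel_sum(table, step=1000, processes=None):
    """Sum the table in parallel while sending it to each worker only once."""
    with Pool(processes, initializer=init_worker, initargs=(table,)) as pool:
        return sum(pool.map(lookup_sum, make_ranges(len(table), step)))


if __name__ == "__main__":
    print(parallel_sum(list(range(10000))))
```

The table is still transferred (or inherited) once per worker, so this reduces the traffic per task rather than removing it entirely.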
Is there any difference at all (in any way) between creating a pool of processes, or simply looping over a process to create more processes?
What's the difference between this?:
pool = multiprocessing.Pool(5)
pool.apply_async(worker)
pool.close()  # join() requires close() or terminate() to be called first
pool.join()
and this?:
procs = []
for j in range(5):
    p = multiprocessing.Process(target=worker)
    p.start()
    procs.append(p)
for p in procs:
    p.join()
Will pool be more likely to use more cores/processors?
The apply_async method of a pool will only run the worker function once, on an arbitrarily selected process from the pool, so your two code examples won't do exactly the same thing. To really be equivalent, you'd need to call apply_async five times.
I think which of the approaches is more appropriate for a given task depends a bit on what you are doing. multiprocessing.Pool allows you to do multiple jobs per process, which may make it easier to parallelize your program. For instance, if you have a million items that need individual processing, you can create a pool with a reasonable number of processes (perhaps as many as you have CPU cores) and then pass the list of the million items to pool.map. The pool will distribute them to the various worker processes and collect the return values to hand back to the parent process. Launching a million separate processes would be much less practical (it would probably break your OS).
On the other hand, if you have a small number of jobs to do in parallel, and you only need each job done once, it may be perfectly reasonable to use a separate multiprocessing.Process for each job, rather than setting up a pool, launching the jobs then tearing down the pool.
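The three variants discussed in this answer can be put side by side. This is only an illustrative sketch: the body of `worker` and the `run_*` helper names are hypothetical. Note that a bare Process discards the return value of its target, whereas the pool sends return values back to the parent.

```python
import multiprocessing


def worker(job_id=0):
    """Hypothetical job: square the job id."""
    return job_id * job_id


def run_with_processes(n):
    """One Process per job; only exit codes are available afterwards."""
    procs = [multiprocessing.Process(target=worker, args=(j,)) for j in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]


def run_with_pool(n, processes=5):
    """The pool equivalent: call apply_async once per job."""
    with multiprocessing.Pool(processes) as pool:
        pending = [pool.apply_async(worker, (j,)) for j in range(n)]
        return [result.get() for result in pending]


def run_many_with_map(items, processes=5):
    """Many items, few workers: the pool distributes the items itself."""
    with multiprocessing.Pool(processes) as pool:
        return pool.map(worker, items)


if __name__ == "__main__":
    print(run_with_processes(5))
    print(run_with_pool(5))
    print(run_many_with_map(range(20)))
```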
I have a Python program that takes around 10 minutes to execute. So I use Pool from multiprocessing to speed things up:
from multiprocessing import Pool
p = Pool(processes = 6) # I have an 8 thread processor
results = p.map( function, argument_list ) # distributes work over 6 processes!
It runs much quicker, just from that. God bless Python! And so I thought that would be it.
However I've noticed that each time I do this, the processes and their considerably sized state remain, even when p has gone out of scope; effectively, I've created a memory leak. The processes show up in my System Monitor application as Python processes, which use no CPU at this point, but considerable memory to maintain their state.
Pool has functions close, terminate, and join, and I'd assume one of these will kill the processes. Does anyone know which is the best way to tell my pool p that I am finished with it?
Thanks a lot for your help!
From the Python docs, it looks like you need to do:
p.close()
p.join()
after the map() to indicate that the workers should terminate and then wait for them to do so.
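A small sketch of that clean-up pattern, wrapped in try/finally so the workers are released even if map() raises. The names `square` and `map_and_clean_up` are hypothetical and only stand in for the question's function and call site.

```python
from multiprocessing import Pool


def square(x):
    """Hypothetical stand-in for the real work function."""
    return x * x


def map_and_clean_up(values, processes=6):
    """Run the map, then shut the pool down so no idle workers linger."""
    pool = Pool(processes=processes)
    try:
        return pool.map(square, values)
    finally:
        pool.close()  # tell the workers no more tasks are coming
        pool.join()   # wait for the worker processes to exit


if __name__ == "__main__":
    print(map_and_clean_up(range(10)))
```

In Python 3 the pool can also be used as a context manager (`with Pool(...) as p:`), which calls terminate() on exit; that is fine once map() has already returned its results.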