Python multiprocessing only uses 3 cores, totaling 130% CPU?

I have 24 cores on my machine, but I just can't get them all running. When I run top, only 3 Python processes are active, and usually only one hits 100% CPU while the other two sit at ~30%.
I've read all the related threads on this site, but still can't figure out what's wrong with my code.
Pseudocode of how I use the pool is as follows:

import multiprocessing as mp

def Foo():
    pool = mp.Pool(mp.cpu_count())

    def myCallbackFun(xk):
        pool.map(myFunc_wrapper, myArgs)

    optimization(callback=myCallbackFun)  # scipy optimization that takes a callback function
Using pdb, I stopped before the optimization and checked that I indeed have 24 workers.
But when I resume the program, top tells me I only have three Python processes running. Another thing: when I Ctrl-C to terminate my program, it has so many workers to interrupt (e.g., PoolWorker-367) that I've been pressing Ctrl-C for minutes, yet there are still workers out there. Shouldn't there be just 24 workers?
How can I make my program use all CPUs?

With multiprocessing, Python starts new processes. With a script like yours it will keep forking endlessly. You need to wrap the script part of your module like this:

import multiprocessing as mp

if __name__ == '__main__':
    pool = mp.Pool(24)
    pool.map(myFunc_wrapper, myArgs)

For future readers:
As @mata correctly points out,

    You may be running into an IO bottleneck if your involved arguments are very big

This was indeed my case. Try to minimize the size of the arguments passed to each process.
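One common way to do that, sketched minimally below: hand the large object to each worker once via the Pool's initializer, and pass only small indices per task. The names init_worker and my_func and the data layout are hypothetical stand-ins.

import multiprocessing as mp

_big_data = None  # populated once per worker, instead of pickled into every task

def init_worker(big_data):
    # Runs once in each worker process when the pool starts.
    global _big_data
    _big_data = big_data

def my_func(index):
    # Hypothetical task: work on one small slice, addressed by index.
    return len(_big_data[index])

if __name__ == '__main__':
    big_data = [list(range(1000)) for _ in range(100)]  # stand-in for the real dataset
    with mp.Pool(mp.cpu_count(), initializer=init_worker, initargs=(big_data,)) as pool:
        results = pool.map(my_func, range(len(big_data)))
    print(results[:5])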

Related

Multithreading CPU load

I'm trying to run a program external to Python with multithreading using this code:
import multiprocessing.pool
from typing import Callable

def handle_multiprocessing_pool(num_threads: int, partial: Callable, variable: list) -> list:
    progress_bar = TqdmBar(len(variable))  # custom tqdm progress-bar wrapper
    with multiprocessing.pool.ThreadPool(num_threads) as pool:
        jobs = [
            pool.apply_async(partial, (value,), callback=progress_bar.update_progress_bar)
            for value in variable
        ]
        pool.close()
        processing_results = []
        for job in jobs:
            processing_results.append(job.get())
        pool.join()
    return processing_results
The Callable being called here loads an external program (with a C++ back-end), runs it, and then extracts some data. Inside its GUI, the external program has an option to run cases in parallel, each case assigned to a thread, which is why I assumed it would be best to work with multithreading (instead of multiprocessing).
The script runs without issues, but I cannot quite manage to utilize the CPU power of our machine efficiently. The machine has 64 cores with 2 threads each. Here are some of my findings about the CPU utilisation:
1. When I run the cases from the GUI, it manages to utilize 100% CPU power.
2. When I run the script on 120 threads, it seems like only half of the threads are properly engaged (screenshot omitted).
3. The external program allows me to run on two threads; however, if I run 60 parallel processes on 2 threads each, the utilisation looks similar.
4. When I run two similar scripts on 60 threads each, the full CPU power is properly used (screenshot omitted).
I have read about the Global Interpreter Lock in Python, but the multiprocessing package should circumvent this, right? Before test #4, I assumed that for some reason the processes were still running on single cores and the two threads on each were not able to run concurrently (this seems suggested here: multiprocessing.Pool vs multiprocessing.pool.ThreadPool), but the behaviour from #4 above especially puzzles me.
I have tried the suggestions in Why does multiprocessing use only a single core after I import numpy?, which unfortunately did not solve the problem.
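If any meaningful part of each job runs in Python rather than in the external program, a ThreadPool serializes that part on the GIL; a process-based Pool is the usual way to rule that out. A minimal sketch of the same structure with multiprocessing.Pool (the task callable must be a module-level, picklable function):

import multiprocessing

def handle_process_pool(num_procs, func, values):
    # Same shape as the ThreadPool version, but with real processes,
    # so Python-side work is not serialized by the GIL.
    with multiprocessing.Pool(num_procs) as pool:
        jobs = [pool.apply_async(func, (value,)) for value in values]
        pool.close()
        results = [job.get() for job in jobs]
        pool.join()
    return results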

multiprocessing.Pool spawning more processes than requested only on Google Cloud

I am using Python's multiprocessing.Pool class to distribute tasks among processes.
The simple case works as expected:
from multiprocessing import Pool

def evaluate(data):
    do_something(data)

pool = Pool(processes=N)
for task in tasks:
    pool.apply_async(evaluate, (data,))
N processes are spawned, and they continually work through the tasks that I pass into apply_async. Now, I have another case where I have many different very complex objects which each need to do computationally heavy activity. I initially let each object create its own multiprocessing.Pool on demand at the time it was completing work, but I eventually ran into OSError for having too many files open, even though I would have assumed that the pools would get garbage collected after use.
At any rate, I decided it would be preferable anyway for each of these complex objects to share the same Pool for computations:
from multiprocessing import Pool

def evaluate(data):
    do_something(data)

pool = Pool(processes=N)

class ComplexClass:
    def work(self):
        for task in tasks:
            self.pool.apply_async(evaluate, (data,))

objects = [ComplexClass() for i in range(50)]
for complex in objects:
    complex.pool = pool

while True:
    for complex in objects:
        complex.work()
Now, when I run this on one of my computers (OS X, Python=3.4), it works just as expected. N processes are spawned, and each complex object distributes their tasks among each of them. However, when I ran it on another machine (Google Cloud instance running Ubuntu, Python=3.5), it spawns an enormous number of processes (>> N) and the entire program grinds to a halt due to contention.
If I check the pool for more information:
import random

random_object = random.choice(objects)
print(random_object.pool._processes)  # _processes is the pool's (private) worker count
>>> N
Everything looks correct. But it's clearly not. Any ideas what may be going on?
UPDATE
I added some additional logging. I set the pool size to 1 for simplicity. Within the pool, as a task is being completed, I print the current_process() from the multiprocessing module, as well as the pid of the task using os.getpid(). It results in something like this:
<ForkProcess(ForkPoolWorker-1, started daemon)>, PID: 5122
<ForkProcess(ForkPoolWorker-1, started daemon)>, PID: 5122
<ForkProcess(ForkPoolWorker-1, started daemon)>, PID: 5122
<ForkProcess(ForkPoolWorker-1, started daemon)>, PID: 5122
...
Again, looking at actual activity using htop, I'm seeing many processes (one per object sharing the multiprocessing pool), all consuming CPU cycles as this happens, resulting in so much OS contention that progress is very slow. 5122 appears to be the parent process.
1. Infinite Loop implemented
If you implement an infinite loop, then it will run like an infinite loop.
Your example (which does not work at all due to other reasons) ...
while True:
    for complex in objects:
        complex.work()
2. Spawn or Fork Processes?
Even though your code above shows only some snippets, you cannot expect the same results on Windows on the one hand and Linux on the other: Windows spawns processes, while Linux forks them (and macOS only switched its default from fork to spawn in Python 3.8). If you use global variables which can hold state, you will run into trouble when developing in one environment and running in the other.
Make sure not to use global, stateful variables in your processes. Pass them explicitly, as sketched below, or get rid of them in another way.
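A minimal sketch of passing state explicitly instead of reading it from globals (evaluate and config are hypothetical stand-ins):

from multiprocessing import Pool

def evaluate(task, config):
    # Everything the worker needs arrives through its arguments,
    # so fork vs. spawn makes no difference to the result.
    return task * config["scale"]

if __name__ == '__main__':
    config = {"scale": 10}  # explicit state, not a module-level global
    with Pool(processes=4) as pool:
        jobs = [pool.apply_async(evaluate, (t, config)) for t in range(8)]
        print([j.get() for j in jobs])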
3. Use a Program, not a Script
Write a program with, at a minimum, an if __name__ == '__main__' guard. Especially when you use multiprocessing you need this. Instantiate your Pool in that namespace.
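A minimal sketch of that structure (the worker body is a placeholder):

from multiprocessing import Pool

def evaluate(data):
    return data * 2  # placeholder worker

if __name__ == '__main__':
    # The guard keeps spawn-based platforms from re-running the
    # pool creation when workers import this module.
    pool = Pool(processes=4)
    print(pool.map(evaluate, range(10)))
    pool.close()
    pool.join()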
1) Your question contains code which is different from what you actually run (the code in the question has incorrect syntax and cannot be run at all).
2) The multiprocessing module is extremely bad at handling/reporting errors that happen in workers.
The problem is very likely in code that you don't show. The code you do show (once fixed) will just work forever and eat CPU, but it will not cause errors about too many open files or processes.

python multiprocessing Pool vs Process?

Just a noob in this context:
I am trying to run one function in multiple processes so I can process a huge file in less time.
I tried:
from multiprocessing import Process

for file_chunk in file_chunks:
    p = Process(target=my_func, args=(file_chunk, my_arg2))
    p.start()
    # without .join(), otherwise the main process has to wait
    # for proc1 to finish before it can start proc2
but it did not seem fast enough.
Now I'm asking myself whether the jobs really run in parallel. I thought about Pool as well, but I am using Python 2 and it is ugly to make it map two arguments to the function.
Am I missing something in my code above, or do the processes created this way really run in parallel?
The speedup is proportional to the number of CPU cores your PC has, not the number of chunks.
Ideally, if you have 4 CPU cores, you should see a 4x speedup. Yet other factors such as IPC overhead must be taken into account when considering the performance improvement.
Spawning too many processes will also negatively affect your performance as they will compete against each other for the CPU.
I'd recommend using a multiprocessing.Pool to deal with most of the logic. If you have multiple arguments, just use the apply_async method.
from multiprocessing import Pool

pool = Pool()
for file_chunk in file_chunks:
    pool.apply_async(my_func, args=(file_chunk, arg1, arg2))
pool.close()
pool.join()
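If you would rather keep Pool.map on Python 2 despite the two arguments, a small module-level wrapper works around the ugliness (my_func, my_arg2, and file_chunks are from the question; the wrapper name is made up):

from multiprocessing import Pool

def my_func_wrapper(file_chunk):
    # Module-level, so it is picklable on Python 2;
    # my_arg2 is assumed visible at module level here.
    return my_func(file_chunk, my_arg2)

pool = Pool()
results = pool.map(my_func_wrapper, file_chunks)
pool.close()
pool.join()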
I am not an expert either, but what you could try is joblib's Parallel:
from joblib import Parallel, delayed
import multiprocessing as mp

def random_function(args):
    pass

proc = mp.cpu_count()
Parallel(n_jobs=proc)(delayed(random_function)(args) for args in args_list)
This runs the given function (random_function) on n_jobs workers, here set to the number of available CPUs.
Feel free to read the docs!

How do I access all computer cores for computation in python script?

I have a python script that has to take many permutations of a large dataset, score each permutation, and retain only the highest scoring permutations. The dataset is so large that this script takes almost 3 days to run.
When I check my system resources in Windows, only 12% of my CPU is being used and only 4 out of 8 cores are doing any work at all. Even if I put the python.exe process at highest priority, this doesn't change.
My assumption is that dedicating more CPU usage to running the script could make it run faster, but my ultimate goal is to reduce the runtime by at least half. Is there a python module or some code that could help me do this? As an aside, does this sound like a problem that could benefit from a smarter algorithm?
Thank you in advance!
There are a few ways to go about this, but check out the multiprocessing module. This is a standard-library module for creating multiple processes, similar to threads but without the limitations of the GIL.
You can also look into the excellent Celery library. It is a distributed task queue with a lot of great features, it's a pretty easy install, and it's easy to get started with.
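For a sense of scale, a minimal Celery task could look like the following sketch (the Redis broker URL and the task body are assumptions for illustration):

# tasks.py -- assumes a Redis broker is running locally
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def score_permutation(permutation):
    # Hypothetical scoring work, distributed across worker processes.
    return sum(permutation)

Workers are then started with a command like celery -A tasks worker, and work is queued by calling score_permutation.delay(...) from your script.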
I can answer the HOW-TO with a simple code sample. While this is running, run /bin/top and watch your processes. Note that I've even included how to clean up after a keyboard interrupt; without that, your subprocesses would keep running and you'd have to kill them manually.
from multiprocessing import Process
import traceback
import logging
import time

class AllDoneException(Exception):
    pass

class Dum(object):
    def __init__(self):
        self.numProcesses = 10
        self.logger = logging.getLogger()
        self.logger.setLevel(logging.INFO)
        self.logger.addHandler(logging.StreamHandler())

    def myRoutineHere(self, processNumber):
        # The per-process workload goes here.
        print("I'm in process number %d" % processNumber)
        time.sleep(10)
        # optional: raise AllDoneException

    def myRoutine(self):
        plist = []
        try:
            for pnum in range(0, self.numProcesses):
                p = Process(target=self.myRoutineHere, args=(pnum, ))
                p.start()
                plist.append(p)
            # Poll until every child process has finished.
            while True:
                isAliveList = [p.is_alive() for p in plist]
                if True not in isAliveList:
                    break
                time.sleep(1)
        except KeyboardInterrupt:
            self.logger.warning("Caught keyboard interrupt, exiting.")
        except AllDoneException:
            self.logger.warning("Caught AllDoneException, exiting normally.")
        except:
            self.logger.warning("Caught exception, exiting: %s" % (traceback.format_exc()))
        # Clean up any children still running.
        for p in plist:
            p.terminate()

if __name__ == '__main__':
    d = Dum()
    d.myRoutine()
You should spawn new processes instead of threads to utilize the cores of your CPU. My general rule is one process per core: split your problem's input space into as many parts as you have cores, each process getting its part of the problem space, as sketched below. Multiprocessing is best for this; you could also use Parallel Python.
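A minimal sketch of that splitting, with a hypothetical scoring function and stand-in data:

from multiprocessing import Pool, cpu_count

def score_slice(chunk):
    # Hypothetical: score every permutation in this slice, keep the best.
    return max(chunk)

if __name__ == '__main__':
    data = list(range(1000000))  # stand-in for the real input space
    n = cpu_count()
    chunks = [data[i::n] for i in range(n)]  # one slice per core
    with Pool(n) as pool:
        best = max(pool.map(score_slice, chunks))
    print(best)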
Very late to the party, but in addition to using the multiprocessing module as reptilicus said, also make sure to check the process's CPU "affinity".
Some Python modules fiddle with it, effectively lowering the number of cores available to Python:
https://stackoverflow.com/a/15641148/4195846
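On Linux you can inspect and restore the affinity directly; a minimal sketch (these os calls are Linux-only):

import os

# Some libraries (certain BLAS builds, for example) shrink the
# affinity mask on import; this widens it back to all cores.
print("Affinity before:", os.sched_getaffinity(0))
os.sched_setaffinity(0, range(os.cpu_count()))
print("Affinity after:", os.sched_getaffinity(0))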
Due to the Global Interpreter Lock, one Python process cannot take advantage of multiple cores. But if you can somehow parallelize your problem (which you should do anyway), then you can use multiprocessing to spawn as many Python processes as you have cores and process the data in each subprocess.

Persistent Processes Post Python Pool

I have a Python program that takes around 10 minutes to execute. So I use Pool from multiprocessing to speed things up:
from multiprocessing import Pool

p = Pool(processes=6)  # I have an 8-thread processor
results = p.map(function, argument_list)  # distributes work over 6 processes!
It runs much quicker, just from that. God bless Python! And so I thought that would be it.
However, I've noticed that each time I do this, the processes and their considerably sized state remain, even after p has gone out of scope; effectively, I've created a memory leak. The processes show up in my System Monitor application as Python processes that use no CPU at this point, but considerable memory to maintain their state.
Pool has functions close, terminate, and join, and I'd assume one of these will kill the processes. Does anyone know which is the best way to tell my pool p that I am finished with it?
Thanks a lot for your help!
From the Python docs, it looks like you need to do:
p.close()
p.join()
after the map() to indicate that the workers should terminate and then wait for them to do so.
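Since Python 3.3, Pool is also a context manager, which makes it harder to leak workers; note that exiting the with block calls terminate() rather than close(), which is fine once map() has returned. A sketch with a placeholder worker:

from multiprocessing import Pool

def function(x):
    return x * x  # placeholder worker

if __name__ == '__main__':
    with Pool(processes=6) as p:
        results = p.map(function, range(100))
    # On exit the pool is terminated, so no idle workers linger.
    print(results[:5])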
