subprocess.Popen() performance degrades as process count rises - python

I have an application that runs and manages many services using subprocess.Popen(). Each of these services runs until it is explicitly told to come down. I'm noticing that the time to return from the subprocess.Popen() call increases at a fairly linear rate as more processes are spawned by the arbiter.
My basic code looks like this:
process_list = []
for command in command_list:
    start_tm = time.time()
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    end_tm = time.time()
    print end_tm - start_tm  # time spent inside Popen() for this service
    process_list.append(process)
I'm seeing that the printed end_tm - start_tm value increases as I spawn more and more processes. The services can be started in any order and I see the same behavior. The time increase isn't completely linear, but I keep seeing a pattern: the first process takes ~0.005 seconds to spawn, the 10th takes ~0.125 seconds, the 20th takes ~0.35 seconds, and so on.
My arbiter process runs upwards of 100 subprocesses. I could split it up so that multiple arbiters run with a smaller number of subprocesses each but I want to understand what the issue is first. Is the overhead of one process owning many subprocesses so great that each additional subprocess adds to the return time of subprocess.Popen()? Is there anything I could do to potentially mitigate that?
EDIT: I split my single arbiter process into two. In my previous test, my arbiter was running 64 processes. I created two separate configurations of my arbiter that each ran 32 processes. I ran the first arbiter, let it completely start all 32 processes, then kicked off the second arbiter.
In both cases, the first process again took ~0.005 seconds to start up and the 32nd and final process took ~0.45 seconds to start up. In my previous test of a single arbiter with 64 processes, the first process took ~0.005 seconds to start while the 64th took roughly 0.85 seconds.

Not a direct answer to the "why" of what you are seeing, but I would highly recommend changing strategy and managing your subprocesses with a ThreadPoolExecutor to keep system resources in check.
Since your system can't effectively manage more processes than it has hardware threads, I would try:
>>> import subprocess
>>> from concurrent.futures import ThreadPoolExecutor
>>> from multiprocessing import cpu_count
>>> with ThreadPoolExecutor(max_workers=cpu_count()) as pool:
...     results = pool.map(lambda cmd: subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE), command_list)
I find the API easy to use and the resource management quite effective, and there are fewer "gotchas".
https://docs.python.org/3.6/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor.shutdown
Although this documentation is for 3.6, the API is largely the same once you pip install the futures backport for 2.7.
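One subtlety: pool.map here only calls Popen and returns the Popen handles; it does not wait on the services, which is probably what you want since they run until told to stop. A minimal sketch that also keeps the handles around for a later shutdown (start_service is a hypothetical helper; command_list is the list from the question) might be:

import subprocess
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import cpu_count

def start_service(cmd):
    # Only starts the child; does not wait for it, since the services
    # are long-running and are stopped explicitly later.
    return subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

with ThreadPoolExecutor(max_workers=cpu_count()) as pool:
    process_list = list(pool.map(start_service, command_list))

# ... later, when the arbiter shuts the services down:
for proc in process_list:
    proc.terminate()
    proc.wait()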

Related

Python multiprocessing: dealing with 2000 processes

Following is my multiprocessing code. regressTuple has around 2000 items, so the following code creates around 2000 parallel processes. My Dell XPS 15 laptop crashes when this is run.
Can't Python's multiprocessing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
Is there an API call in Python to get the possible hardware process count?
How can I refactor the code to use an input variable to get the parallel thread count (hard coded) and loop through threading several times till completion? In this way, after a few experiments, I will be able to get the optimal thread count.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation.)
Here is my code:
from multiprocessing import Process

regressTuple = [(x,) for x in regressList]
processes = []
for i in range(len(regressList)):
    processes.append(Process(target=runRegressWriteStatus, args=regressTuple[i]))
for process in processes:
    process.start()
for process in processes:
    process.join()
There are multiple things that we need to keep in mind:
The number of processes you can spawn is not limited by the number of cores on your system, but by the ulimit for your user ID, which controls the total number of processes that can be launched by your user ID.
The number of cores determines how many of those launched processes can actually run in parallel at one time.
Your system may crash because the target function these processes run is doing something heavy and resource intensive, which the system cannot handle when multiple processes run simultaneously, or because the nproc limit on the system has been exhausted and the kernel is no longer able to spin up new processes.
That being said, it is not a good idea to spawn as many as 2000 processes, even if you have a 16-core Intel Skylake machine, because creating a new process on the system is not a lightweight task: generating the PID, allocating memory, setting up the address space, scheduling the process, context switching and managing its entire life cycle all happen in the background. So it is a heavy operation for the kernel to create a new process.
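As an aside, if you want to see the process limit the first point refers to, a minimal sketch (Linux-only; resource.RLIMIT_NPROC is not available on every platform) would be:

import resource

# Soft and hard limits on the number of processes this user may run;
# hitting the soft limit is what makes further process creation fail.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("process limit (soft/hard): %d / %d" % (soft, hard))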
Unfortunately, I guess what you are trying to do is a CPU-bound task and hence limited by the hardware you have on the machine. Spawning more processes than the number of cores on your system is not going to help at all, but creating a process pool might. So basically you want to create a pool with as many processes as you have cores on the system and then pass the input to the pool. Something like this:
import multiprocessing

def target_func(data):
    # process the input data here
    pass

with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as po:
    res = po.map(target_func, regressTuple)
Can't Python's multiprocessing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
I don't think it's Python's responsibility to manage the queue length. When people reach for multiprocessing they tend to want efficiency; adding system performance tests to the run queue would be an overhead.
Is there an API call in Python to get the possible hardware process count?
If there were, would it know ahead of time how much memory your task will need?
How can I refactor the code to use an input variable to get the parallel thread count (hard coded) and loop through threading several times till completion? In this way, after a few experiments, I will be able to get the optimal thread count.
As balderman pointed out, a pool is a good way forward with this.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation.)
Use a pool, or take the available system memory, divide by ~3MB and see how many tasks you can run at once.
This is probably more of a sysadmin task to balance the bottlenecks against the queue length, but generally, if your tasks are IO bound, then there isn't much point in having a long task queue if all the tasks are waiting at the same T-junction to turn into the road. The tasks will then fight with each other for the next block of IO.
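To put a rough number on that memory heuristic, a minimal sketch (Linux-only, via os.sysconf; the ~3 MB per task is just the ballpark from the answer above, not a measured figure) could be:

import os

# Estimate currently available physical memory (Linux-only sysconf names).
page_size = os.sysconf("SC_PAGE_SIZE")
free_pages = os.sysconf("SC_AVPHYS_PAGES")
free_bytes = page_size * free_pages

# Assume ~3 MB per task and derive a crude upper bound on concurrency.
per_task_bytes = 3 * 1024 * 1024
max_tasks = max(1, free_bytes // per_task_bytes)
print("rough max concurrent tasks:", max_tasks)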

Unexpected behavior with multiprocessing pool imap_unordered method

I have a process that I'm trying to parallelize with the multiprocessing library. Each process does a few things, but one of the things each process does is to call a function that runs an optimization. Now, 90% of the time each optimization will complete in less than 1 minute, but if it doesn't, it might never converge. There is an internal mechanism that terminates the optimization function if it doesn't converge after say, 20,000 iterations, but that can take a long time (~1hr).
Right now my code looks something like this:
import multiprocessing

pool = multiprocessing.Pool()
imap_it = pool.imap_unordered(synth_worker, arg_tuples)
while 1:
    try:
        result = imap_it.next(timeout=120)
        process_result(result)
    except StopIteration:
        break
    except multiprocessing.TimeoutError:
        pass
pool.close()
pool.join()
This seems to work fairly well, as I am able to process the first 90% of results quickly, but towards the end the longer-running jobs start to bog things down and only one result completes every ~10 minutes. The weird thing is that once it gets to this point and I run top at the terminal, it seems like only 5-6 parallel processes are busy even though I have 24 CPUs in my Pool; I see all 24 CPUs engaged at the beginning of the run. Meanwhile there are still over 100 remaining tasks that haven't finished (or even started). I'm not using any of the built-in chunking options of imap_unordered, so my understanding is that as soon as one task finishes, the worker that ran it should pick up a new one. Why would there be fewer parallel processes running than there are remaining tasks, unless all remaining jobs were already allocated to the 5-6 jammed workers that are waiting for their non-convergent optimizations to hit the 20,000-iteration cutoff? Any ideas? Should I try another method to iterate over the results of my Pool as they come in?
(Screenshots: output of top at the beginning of the run, output of top once the jobs seem to get stuck, and what stdout looks like when the jobs start to get stuck.)

RuntimeError: can't start new thread, threading library

So, in this code I am testing some multithreading to speed up my code. If I have a large number of tasks in the queue I get a "RuntimeError: can't start new thread" error. For example, range(0,100) works, but range(0,1000) won't. I am using threading.Semaphore(4) and it is working correctly, only processing 4 threads at a time; I have tested this. I know why I am getting this error: even though I am using threading.Semaphore, it still technically starts all the threads at the start and just pauses them until it is a thread's turn to run, and starting 1000 threads at the same time is too much for the PC to handle. Is there any way to fix this problem? (Also, yes, I know about the GIL.)
import threading

def thread_test():
    threads = []
    for t in self.tasks:  # self.tasks holds the argument tuples built below
        t = threading.Thread(target=utils.compareV2.run_compare_temp, args=t)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

for x in range(0, 100):
    self.tasks.append(("arg1", "arg2"))
thread_test()
Instead of starting 1000 threads and then only letting 4 do any work at a time, start 4 threads and let them all do work.
What are the extra 996 threads buying you?
They're using memory and putting pressure on the system's scheduler.
The reason you get the RuntimeError is probably that you've run out of memory for the call stacks for all of those threads. The default limit varies by platform but it's probably something close to 8MiB. A thousand of those and you're up around 8GiB... just for stack space.
You can reduce the amount of memory available for the call stack of each thread with threading.stack_size(...). This will let you fit more threads on your system (but be sure you don't set it below the amount of stack space you actually need or you'll crash your process).
Most likely, what you actually want is a number of processes close to the number of physical cores on your host or a threading system that's more light-weight than what the threading module gives you.
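A minimal sketch of the "start 4 threads and let them all do work" approach, using a shared queue (run_compare_temp here is just a stand-in for the utils.compareV2.run_compare_temp target from the question):

import queue
import threading

def run_compare_temp(arg1, arg2):
    # Stand-in for utils.compareV2.run_compare_temp from the question.
    pass

def worker(task_queue):
    # Each of the 4 worker threads pulls tasks until the queue is drained.
    while True:
        try:
            args = task_queue.get_nowait()
        except queue.Empty:
            return
        run_compare_temp(*args)

task_queue = queue.Queue()
for _ in range(1000):
    task_queue.put(("arg1", "arg2"))

threads = [threading.Thread(target=worker, args=(task_queue,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()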

How to make Python's multiprocessing use all the available CPUs

I have an AWS instance that has 32 CPUs:
ubuntu#ip-122-00-18-114:~$ cat /proc/cpuinfo | grep processor | wc -l
32
My question is how I can make use of Python's multiprocessing so that the commands are spread across all the available CPUs.
For example, with the following code, will the commands run on every single CPU available?
import multiprocessing
import os

POOL_SIZE = 32

cmdlist = []
for param in items:  # items is the list of parameters to substitute into the command
    cmd = """./cool_command %s""" % (param)
    cmdlist.append(cmd)

p = multiprocessing.Pool(POOL_SIZE)
p.map(os.system, cmdlist)
If not, what's the right way to do it?
And what happens if I set POOL_SIZE > # processors (CPUs)?
First, a small correction to your wording. A CPU has multiple cores, and each core can have hyperthreads; each hyperthread is a logical processing unit. On Amazon you have 32 vCPUs, which correspond to hyperthreads, not CPUs or cores. This is not important for this question, but in case you do any further research it is important to have the wording right. I'll refer to this lowest logical processing unit, the hyperthread, as a vCPU below.
If you do not specify the pool size:
p = multiprocessing.Pool()
p.map(os.system, cmdlist)
then Python will find out the number of available logical processors (in your case 32 vCPUs) itself (via os.cpu_count()).
In normal circumstances, all 32 processes run on separate vCPUs because Linux tries to balance the load evenly between them.
If, however there are other heavy processes running at the same time, then two processes might run on the same vCPU.
The key to understand here is how the Linux scheduler works: It periodically reschedules processes so all processing units are utilized about the same. That means if you start only 16 processes then they will spread out to all 32 vCPUs and utilize them about the same (use htop to see how the load spreads).
And what happens if I set POOL_SIZE > # processors (CPUs)?
If you start more processes than there are available vCPUs, then several processes have to share a vCPU. That means a process is periodically switched out by the scheduler (a context switch). If your processes are CPU bound (utilizing 100% CPU, e.g. when you do number crunching), then having more processes than vCPUs will slow down the overall job: you pay for the context switches, and if you have communication between the processes (not in your example, but something you'd normally do when doing multiprocessing), that slows things down as well.
However, if your processes are not CPU bound but, e.g., disk bound (they need to wait for the disk for reads/writes) or network bound (e.g. waiting for another server to answer), then they are switched out by the scheduler anyway to make room for another process, since they have to wait regardless.
The short answer is "not exactly". You can get the CPU count with the os.cpu_count() function and run that number of processes, but only the operating system assigns a process to a CPU. And more than that, it might switch it to another CPU after some time. I won't explain how preemptive multitasking works here.
If you have other "heavy" processes running on this server - for example a database or even a web server - they might need some CPU time for their execution as well.
The good news is that there is a thing called processor affinity that could be of use for your needs. But it is a kind of fine-tuning of the OS.
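For illustration, on Linux the standard library exposes affinity through os.sched_getaffinity and os.sched_setaffinity (a minimal sketch; these calls are not available on every platform):

import os

# Which vCPUs is the current process allowed to run on? (pid 0 = this process)
print(os.sched_getaffinity(0))    # e.g. {0, 1, 2, ..., 31}

# Pin the current process to vCPUs 0 and 1 only, then check again.
os.sched_setaffinity(0, {0, 1})
print(os.sched_getaffinity(0))    # {0, 1}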

Why are threads spread between CPUs?

I am trying to get my head around threading vs. CPU usage. There are plenty of discussions about threading vs. multiprocessing (a good overview being this answer) so I decided to test this out by launching a maximum number of threads on my 8 CPU laptop running Windows 10, Python 3.4.
My assumption was that all the threads would be bound to a single CPU.
EDIT: it turns out that it was not a good assumption. I now understand that for multithreaded code, only one piece of python code can run at once (no matter where/on which core). This is different for multiprocessing code (where processes are independent and run indeed independently).
While I read about these differences, it is one answer which actually clarified this point.
I think it also explains the CPU view below: that it is an average view of many threads spread out on many CPUs, but only one of them running at one given time (which "averages" to all of them running all the time).
It is not a duplicate of the linked question (which addresses the opposite problem, i.e. all threads on one core) and I will leave it hanging in case someone has a similar question one day and is hopefully helped by my enlightenment.
The code
import threading
import time

def calc():
    time.sleep(5)
    while True:
        a = 2356 ^ 36  # note: ^ is bitwise XOR in Python, not exponentiation

n = 0
while True:
    try:
        n += 1
        t = threading.Thread(target=calc)
        t.start()
    except RuntimeError:
        print("max threads: {n}".format(n=n))
        break
    else:
        print('.')

time.sleep(100000)
This led to 889 threads being started.
The load on the CPUs was, however, distributed (and surprisingly low for a pure CPU calculation; the laptop is otherwise idle with an empty load when not running my script):
Why is it so? Are the threads constantly moved as a pack between CPUs and what I see is just an average (the reality being that at a given moment all threads are on one CPU)? Or are they indeed distributed?
As of today it is still the case that 'one thread holds the GIL'. So one thread is running at a time.
The threads are managed at the operating system level. What happens is that at a regular interval the running thread releases the GIL so another thread can acquire it (in CPython 2 this happened every 100 "ticks", i.e. interpreter instructions; since CPython 3.2 it is a time-based switch interval, 5 ms by default).
Because the threads in this example do continuous calculations, that interval is reached very quickly, leading to an almost immediate release of the GIL and a "battle" between the threads to acquire it.
So my assumption is that your operating system has a higher than expected load because of (too) fast thread switching plus the almost continuous releasing and acquiring of the GIL. The OS spends more time on switching than actually doing any useful calculation.
As you mention yourself, for using more than one core at a time, it's better to look at multiprocessing modules (joblib/Parallel).
Interesting read:
http://www.dabeaz.com/python/UnderstandingGIL.pdf
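To see the switch interval that the note above refers to, a small sketch (CPython 3.2+ only; on Python 2 the equivalent knob was sys.setcheckinterval):

import sys

# Current GIL switch interval in seconds (default 0.005, i.e. 5 ms).
print(sys.getswitchinterval())

# A larger interval means threads trade the GIL less often,
# at the cost of slower hand-offs between threads.
sys.setswitchinterval(0.05)
print(sys.getswitchinterval())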
Um. The point of multithreading is to make sure the work gets spread out. A really easy cheat is to use as many threads as you have CPU cores. The point is that they are all independent, so they can actually run at the same time. If they were on the same core, only one thread at a time could really run; they'd pass that core back and forth for processing at the OS level.
Your assumption is wrong and bizarre. What would ever lead you to think they should run on the same CPU and consequently go at 1/8th speed? The only reason to thread them is typically to get the whole batch to go faster than a single core alone.
In fact, what the hell do you think writing parallel code is for if not to run independently on several cores at the same time? Like this would be pointless and hard to do, let's make complex fetching, branching, and forking routines to accomplish things slower than one core just plugging away at the data?
