I have a Python script which works according to the following scheme: read a large file (e.g., a movie), compose selected information from it into a number of small temporary files, spawn a C++ application in subprocesses to perform the processing/calculations (separately for each file), and read the application's output. To speed up the script I used multiprocessing. However, it has a major drawback: each process has to keep a copy of the whole large input file in RAM, so I can run only a few processes before I run out of memory. Thus I decided to try multithreading instead (or some combination of multiprocessing and multithreading), since threads share the address space. As the Python part spends most of its time on file I/O or waiting for the C++ application to complete, I thought the GIL should not be an issue here. Nevertheless, instead of a performance gain I observe a drastic slowdown, mainly owing to the I/O part.
I illustrate the problem with the following code (saved as test.py):
import sys, threading, tempfile, time

nthreads = int(sys.argv[1])

class IOThread(threading.Thread):
    def __init__(self, thread_id, obj):
        threading.Thread.__init__(self)
        self.thread_id = thread_id
        self.obj = obj
    def run(self):
        run_io(self.thread_id, self.obj)

def gen_object(nlines):
    obj = []
    for i in range(nlines):
        obj.append(str(i) + '\n')
    return obj

def run_io(thread_id, obj):
    ntasks = 100 // nthreads + (1 if thread_id < 100 % nthreads else 0)
    for i in range(ntasks):
        tmpfile = tempfile.NamedTemporaryFile('w+')
        with open(tmpfile.name, 'w') as ofile:
            for elem in obj:
                ofile.write(elem)
        with open(tmpfile.name, 'r') as ifile:
            content = ifile.readlines()
        tmpfile.close()

obj = gen_object(100000)
starttime = time.time()
threads = []
for thread_id in range(nthreads):
    threads.append(IOThread(thread_id, obj))
    threads[thread_id].start()
for thread in threads:
    thread.join()
runtime = time.time() - starttime
print('Runtime: {:.2f} s'.format(runtime))
When I run it with different number of threads, I get this:
$ python3 test.py 1
Runtime: 2.84 s
$ python3 test.py 1
Runtime: 2.77 s
$ python3 test.py 1
Runtime: 3.34 s
$ python3 test.py 2
Runtime: 6.54 s
$ python3 test.py 2
Runtime: 6.76 s
$ python3 test.py 2
Runtime: 6.33 s
Can someone explain this result, and give some advice on how to effectively parallelize I/O using multithreading?
EDIT:
The slowdown is not due to HDD performance, because:
1) the files are getting cached to RAM anyway
2) the same operations with multiprocessing (not multithreading) are indeed getting faster (almost by factor of CPUs number)
As I delved deeper into the problem, I ran comparison benchmarks for 4 different parallelisation methods, 3 of which use Python and 1 of which uses Java (the purpose of the test was not to compare the I/O machinery of different languages, but to see whether multithreading can boost I/O operations). The test was performed on Ubuntu 14.04.3, and all files were placed on a RAM disk.
Although the data are quite noisy, a clear trend is evident (see the chart; n=5 for each bar, error bars represent SD): Python multithreading fails to boost I/O performance. The most probable reason is the GIL, and therefore there is no way around it.
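For comparison, here is a minimal multiprocessing counterpart to test.py. This is only a sketch of the kind of comparison I mean, not the exact benchmark behind the chart: each worker process handles its share of the 100 temporary files, so no GIL is shared between workers (though each worker receives its own pickled copy of obj, which is exactly the memory cost described above).
import sys, multiprocessing, tempfile, time

def gen_object(nlines):
    return [str(i) + '\n' for i in range(nlines)]

def run_io(args):
    ntasks, obj = args
    for _ in range(ntasks):
        tmpfile = tempfile.NamedTemporaryFile('w+')
        with open(tmpfile.name, 'w') as ofile:
            for elem in obj:
                ofile.write(elem)
        with open(tmpfile.name, 'r') as ifile:
            content = ifile.readlines()
        tmpfile.close()

if __name__ == '__main__':
    nprocs = int(sys.argv[1])
    obj = gen_object(100000)
    # split the 100 tasks between the workers, as in the threaded version
    shares = [100 // nprocs + (1 if i < 100 % nprocs else 0) for i in range(nprocs)]
    starttime = time.time()
    pool = multiprocessing.Pool(nprocs)
    pool.map(run_io, [(n, obj) for n in shares])
    pool.close()
    pool.join()
    print('Runtime: {:.2f} s'.format(time.time() - starttime))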
I think your performance measures don't lie: you're asking your hard disk to do many things at the same time. Reads, writes, fsync when closing the files, ... and on several files at the same time. It triggers a lot of hardware physical operations. And the more files you write at the same time, the more contention you get.
So the CPU is waiting for the disk operation to finish...
Moreover, maybe you don't have an SSD, so the syncs actually mean physical head movements.
EDIT: it could be a GIL problem. When you iterate over elem in obj in run_io, you execute Python code between each write. ofile.write probably releases the GIL, so that the I/O doesn't block the other threads, but the lock is released/acquired with each iteration. So maybe your writes don't really run "concurrently".
EDIT2: to test the hypothesis you can try to replace:
for elem in obj:
    ofile.write(elem)
with:
ofile.write("".join(obj))
and see if performance gets better.
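For reference, a sketch of run_io() with that change applied; the only difference from the original is that the per-element write loop becomes a single write of the pre-joined string:
def run_io(thread_id, obj):
    ntasks = 100 // nthreads + (1 if thread_id < 100 % nthreads else 0)
    data = "".join(obj)  # join once, outside the write loop
    for i in range(ntasks):
        tmpfile = tempfile.NamedTemporaryFile('w+')
        with open(tmpfile.name, 'w') as ofile:
            ofile.write(data)
        with open(tmpfile.name, 'r') as ifile:
            content = ifile.readlines()
        tmpfile.close()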
I'm trying to run two functions in Python 3 in parallel. They both take about 30 ms, and unfortunately, after writing a testing script, I've found that the startup time to get the processes running in the background takes over 100 ms, which is a pretty high overhead that I would like to avoid. Is anybody aware of a faster way to run functions concurrently in Python 3 (with lower overhead, ideally in the ones or tens of milliseconds) where I can still get the results of the functions back in the main process? Any guidance on this would be appreciated, and if there is any information that I can provide, please let me know.
For hardware information, I'm running this on a 2019 MacBook Pro with Python 3.10.9 and a 2 GHz quad-core Intel Core i5.
I've provided the script that I've written below as well as the output that I typically get from it.
import multiprocessing as mp
import time
import numpy as np

def t(s):
    return (time.perf_counter() - s) * 1000

def run0(s):
    print(f"Time to reach run0: {t(s):.2f}ms")
    time.sleep(0.03)
    return np.ones((1,4))

def run1(s):
    print(f"Time to reach run1: {t(s):.2f}ms")
    time.sleep(0.03)
    return np.zeros((1,5))

def main():
    s = time.perf_counter()
    with mp.Pool(processes=2) as p:
        print(f"Time to init pool: {t(s):.2f}ms")
        f0 = p.apply_async(run0, args=(time.perf_counter(),))
        f1 = p.apply_async(run1, args=(time.perf_counter(),))
        r0 = f0.get()
        r1 = f1.get()
        print(r0, r1)
        print(f"Time to run end-to-end: {t(s):.2f}ms")

if __name__ == "__main__":
    main()
Below is the output that I typically get from running the above script
Time to init pool: 33.14ms
Time to reach run0: 198.50ms
Time to reach run1: 212.06ms
[[1. 1. 1. 1.]] [[0. 0. 0. 0. 0.]]
Time to run end-to-end: 287.68ms
Note: I'm looking to decrease the quantities on the 2nd and 3rd lines by a factor of 10-20x. I know that is a lot, and if it is not possible, that is perfectly fine, but I was just wondering if anybody more knowledgeable would know any methods. Thanks!
Several points to consider:
"Time to init pool" is wrong. The child processes haven't finished starting; the main process has only initiated their startup. Once the workers have actually started, "Time to reach run" should drop, because it no longer includes process startup. If you have a long-lived pool of workers, you only pay the startup cost once.
Interpreter startup cost is often dominated by imports. In this case you really only have numpy, and it is used by the target function, so you can't exactly get rid of it. Another thing that can be slow is the automatic import of site, but skipping it makes other imports difficult.
You're on macOS, so you can switch to using "fork" instead of "spawn", which should be much faster, but it fundamentally changes how multiprocessing works in a few ways (and is incompatible with certain OS libraries); see the sketch at the end of this answer.
example:
import multiprocessing as mp
import time
# import numpy as np

def run():
    time.sleep(0.03)
    return "whatever"

def main():
    s = time.perf_counter()
    with mp.Pool(processes=1) as p:
        p.apply_async(run).get()
        print(f"first job time: {(time.perf_counter() -s)*1000:.2f}ms")
        # first job 166ms with numpy ; 85ms without ; 45ms on linux (wsl2 ubuntu 20.04) with fork
        s = time.perf_counter()
        p.apply_async(run).get()
        print(f"after startup job time: {(time.perf_counter() -s)*1000:.2f}ms")
        # second job about 30ms every time

if __name__ == "__main__":
    main()
You can switch to Python 3.11+, as it has a faster startup time (and faster everything), but as your application grows you will get even slower startup times compared to your toy example.
One option is to run your application inside a Linux Docker image so you can use fork to avoid the spawn overhead (though the copy-on-write overhead will still be visible).
The ultimate solution? Don't write your application in Python (or any other language with a VM or a garbage collector). Python multiprocessing isn't made for small fast tasks but for long-running tasks; if you need that low a startup time, write it in C or C++.
If you have to use Python, then you should reuse your workers to "absorb" this startup time into a much larger task time.
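As a rough sketch of the "fork" suggestion from the points above (the only change relative to the example is the set_start_method call; fork is the default on Linux and an opt-in, with the caveats already mentioned, on macOS):
import multiprocessing as mp
import time

def run():
    time.sleep(0.03)
    return "whatever"

def main():
    s = time.perf_counter()
    with mp.Pool(processes=2) as p:
        results = [p.apply_async(run) for _ in range(2)]
        print([r.get() for r in results])
    print(f"end-to-end with fork: {(time.perf_counter() - s) * 1000:.2f}ms")

if __name__ == "__main__":
    # must be called once, before any pool or process is created
    mp.set_start_method("fork")
    main()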
I am using the concurrent.futures module to do multiprocessing and multithreading. I am running it on an 8-core machine with 16 GB RAM and an 8th Gen Intel i7 processor. I tried this on Python 3.7.2 and even on Python 3.8.2.
import concurrent.futures
import time
# takes a list and multiplies each elem by 2
def double_value(x):
    y = []
    for elem in x:
        y.append(2 * elem)
    return y

# multiplies an elem by 2
def double_single_value(x):
    return 2 * x

# define a
import numpy as np
a = np.arange(100000000).reshape(100, 1000000)

# function to run multiple threads and multiply each elem by 2
def get_double_value(x):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        results = executor.map(double_single_value, x)
        return list(results)
The code shown below ran in 115 seconds, using only multiprocessing. CPU utilization for this piece of code is 100%.
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    my_results = executor.map(double_value, a)
print(time.time()-t)
The function below took more than 9 minutes, consumed all the RAM of the system, and then the system killed all the processes. CPU utilization during this piece of code also did not reach 100% (around 85%).
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    my_results = executor.map(get_double_value, a)
print(time.time()-t)
I really want to understand:
1) Why is the code that first does multiprocessing and then runs multithreading inside each process not faster than the code that uses only multiprocessing?
(I have gone through many posts that describe multiprocessing and multithreading, and one crux that I got is that multithreading is for I/O processes and multiprocessing for CPU processes.)
2) Is there any better way of doing multithreading inside multiprocessing for maximum utilization of the allotted cores (or CPUs)?
3) Why did that last piece of code consume all the RAM? Was it due to multithreading?
You can mix concurrency with parallelism.
Why? You can have your valid reasons. Imagine a bunch of requests you have to make while processing their responses (e.g., converting XML to JSON) as fast as possible.
I did some tests and here are the results.
In each test, I mix different approaches to print 16000 times (I have 8 cores and 16 threads).
Parallelism with multiprocessing, concurrency with asyncio
The fastest, 1.1152372360229492 sec.
import asyncio
import multiprocessing
import os
import psutil
import threading
import time

async def print_info(value):
    await asyncio.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

async def await_async_logic(values):
    await asyncio.gather(
        *(
            print_info(value)
            for value in values
        )
    )

def run_async_logic(values):
    asyncio.run(await_async_logic(values))

def multiprocessing_executor():
    start = time.time()
    with multiprocessing.Pool() as multiprocessing_pool:
        multiprocessing_pool.map(
            run_async_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with asyncio I can spam tasks as much as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (I tested it and it took me 2.0210490226745605 sec).
Parallelism with multiprocessing, concurrency with threading
An alternative option, 1.6983509063720703 sec.
import multiprocessing
import os
import psutil
import threading
import time

def print_info(value):
    time.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

def multithreading_logic(values):
    threads = []
    for value in values:
        threads.append(threading.Thread(target=print_info, args=(value,)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

def multiprocessing_executor():
    start = time.time()
    with multiprocessing.Pool() as multiprocessing_pool:
        multiprocessing_pool.map(
            multithreading_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with this method I can NOT spam as many tasks as I want. If I change the value from 1000 to 10000 I get RuntimeError: can't start new thread.
I also want to say that I am impressed because I thought that this method would be better in every aspect compared to asyncio, but quite the opposite.
Parallelism and concurrency with concurrent.futures
Extremely slow, 50.08251595497131 sec.
import os
import psutil
import threading
import time
from concurrent.futures import thread, process

def print_info(value):
    time.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

def multithreading_logic(values):
    with thread.ThreadPoolExecutor() as multithreading_executor:
        multithreading_executor.map(
            print_info,
            values,
        )

def multiprocessing_executor():
    start = time.time()
    with process.ProcessPoolExecutor() as multiprocessing_executor:
        multiprocessing_executor.map(
            multithreading_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with this method, as with asyncio, I can spam as many tasks as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (except for the time).
Extra notes
For these notes, I modified the test so that it only makes 1600 prints (replacing the 1000 value with 100 in each test).
When I remove the parallelism from asyncio, the execution takes me 16.090194702148438 sec.
In addition, if I replace the await asyncio.sleep(1) with time.sleep(1), it takes 160.1889989376068 sec.
Removing the parallelism from the multithreading option, the execution takes me 16.24941658973694 sec.
Right now I am impressed. Multithreading without multiprocessing gives me good performance, very similar to asyncio.
Removing parallelism from the third option, execution takes me 80.15227723121643 sec.
As you say: "I have gone through many post that describe multiprocessing and multi-threading and one of the crux that I got is multi-threading is for I/O process and multiprocessing for CPU processes".
You need to figure out, if your program is IO-bound or CPU-bound, then apply the correct method to solve your problem. Applying various methods at random or all together at the same time usually makes things only worse.
Use of threading in clean Python for CPU-bound problems is a bad approach regardless of using multiprocessing or not. Try to redesign your app to use only multiprocessing or use third-party libs such as Dask and so on
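For instance, a rough sketch of what the Dask route could look like for the double_value work from the question. This assumes Dask is installed and is illustrative only, not a tuned solution; the smaller array size is just to keep the example quick to run:
import numpy as np
import dask
from dask import delayed

def double_value(x):
    return [2 * elem for elem in x]

a = np.arange(1000000).reshape(100, 10000)

# one delayed task per row, executed in worker processes so the GIL is not shared
tasks = [delayed(double_value)(row) for row in a]
results = dask.compute(*tasks, scheduler="processes")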
I believe you figured it out, but I wanted to answer anyway. Obviously, your function double_single_value is CPU-bound; it has nothing to do with I/O. For CPU-bound tasks, using multiple threads makes things worse than using a single thread, because the GIL does not let you actually run on multiple threads, so you eventually run on a single thread anyway. Also, a thread may be preempted before it finishes a task, and when it is scheduled again its state has to be loaded back onto the CPU, which makes this even slower.
Based on your code, I see that most of it is dealing with computations (calculations), so it is most encouraged to use multiprocessing to solve your problem, since it is CPU-bound and NOT I/O-bound (I/O-bound being things like sending requests to websites and then waiting for a response from the server, writing to disk, or even reading from disk). This is true for Python programming as far as I know. The Python GIL (Global Interpreter Lock) will make your code run slowly, as it is a mutex (or a lock) that allows only one thread to take control of the Python interpreter, meaning it won't achieve parallelism but will give you concurrency instead. It is fine to use threading for I/O-bound tasks, because threads will outcompete multiprocessing in execution times there, but for your case I would encourage you to use multiprocessing, because each Python process gets its own Python interpreter and memory space, so the GIL won't be a problem for you.
I am not so sure about integrating multithreading with multiprocessing, but from what I know it can cause inconsistency in the processed results, since you will need more boilerplate code for data synchronization if you want the processes to communicate (IPC; a small sketch of that boilerplate follows below). Threads are also somewhat unpredictable (thus inconsistent at times), since they are controlled by the OS and can be scheduled out at any time (pre-emptive scheduling of kernel-level threads, due to time sharing). I don't stop you from writing that code, but be really sure of what you are doing. You never know, you might propose a solution to it one day.
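To make the IPC point concrete, here is a minimal sketch of that kind of boilerplate, using multiprocessing.Queue so worker processes can receive work and hand results back to the parent safely. The worker function and the sentinel scheme are just illustrative choices, not code from the question:
import multiprocessing

def worker(in_q, out_q):
    for item in iter(in_q.get, None):   # None is the shutdown sentinel
        out_q.put(2 * item)

if __name__ == '__main__':
    in_q, out_q = multiprocessing.Queue(), multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(in_q, out_q)) for _ in range(4)]
    for p in procs:
        p.start()
    for i in range(100):
        in_q.put(i)
    for _ in procs:
        in_q.put(None)                  # one sentinel per worker
    results = [out_q.get() for _ in range(100)]
    for p in procs:
        p.join()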
In our system we have started to experience problems with a task that was working ok, but now it seems to be hanging and uses a high amount of memory (so other tasks fail and raise MemoryError).
Context: the example code below had no problem, but for a new database huge_dataframe became a lot bigger. The thing is that if I run both parts separately, it works; but running them together in process_data_task leads to the problem. This is running under Python 2.7 on Linux.
I suspect it is something to do with fork(), but how could each sub-process take so much memory? huge_dataframe is deleted before starting multiprocessing. It is also strange that do_recalculations hangs only when it is called inside process_data_task (child not joining?), but it doesn't throw any exception. Any explanation or ideas for troubleshooting?
def process_data_task():
    # part 1, high memory usage
    huge_dataframe = retrieve_data()
    process_table(huge_dataframe)
    # added this trying to fix the problem but it didn't help
    del huge_dataframe
    gc.collect()
    # part 2, heavy CPU usage, multiprocessing
    do_recalculations()  # recalculate items in parallel

# Multiprocessing done here
def do_recalculations():
    processes = cpu_count()
    items_to_update = [...]  # query database
    work_total_list = chunkify(items_to_update, processes)
    p = Pool(processes)
    result = p.map(sub_process_func, work_total_list)
    p.close()
    p.join()
Original Question
I am trying to use multiprocessing Pool in Python. This is my code:
def f(x):
    return x

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    for x in xrange(1, 11):
        res = list(mapper(f, bar(x)))
This code makes use of all CPUs (I have 8 CPUs) when the xrange is small, like xrange(1, 6). However, when I increase the range to xrange(1, 10), I observe that only 1 CPU is running at 100% while the rest are just idling. What could be the reason? Is it because, when I increase the range, the OS shuts down the CPUs due to overheating?
How can I resolve this problem?
minimal, complete, verifiable example
To replicate my problem, I have created this example: it's a simple n-gram generation from a string problem.
#!/usr/bin/python
import time
import itertools
import threading
import multiprocessing
import random

def f(x):
    return x

def ngrams(input_tmp, n):
    input = input_tmp.split()
    if n > len(input):
        n = len(input)
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    num = 100000000  # 100
    rand_list = random.sample(xrange(100000000), num)
    rand_str = ' '.join(str(i) for i in rand_list)
    for n in xrange(1, 100):
        res = list(mapper(f, ngrams(rand_str, n)))

if __name__ == '__main__':
    start = time.time()
    foo()
    print 'Total time taken: '+str(time.time() - start)
When num is small (e.g., num = 10000), I find that all 8 CPUs are utilised. However, when num is substantially large (e.g., num = 100000000), only 2 CPUs are used and the rest are idling. This is my problem.
Caution: When num is too large it may crash your system/VM.
First, ngrams itself takes a lot of time. While that's happening, it's obviously only on one core. But even when that finishes (which is very easy to test by just moving the ngrams call outside the mapper and throwing a print in before and after it), you're still only using one core. I get 1 core at 100% and the other cores all around 2%.
If you try the same thing in Python 3.4, things are a little different—I still get 1 core at 100%, but the others are at 15-25%.
So, what's happening? Well, in multiprocessing, there's always some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just return x.
Here's how the overhead works: The main process has to pickle the values, then put them on a queue, then wait for values on another queue and unpickle them. Each child process waits on the first queue, unpickles values, does your do-nothing work, pickles the values, and puts them on the other queue. Access to the queues has to be synchronized (by a POSIX semaphore on most non-Windows platforms, I think an NT kernel mutex on Windows).
From what I can tell, your processes are spending over 99% of their time waiting on the queue or reading or writing it.
This isn't too unexpected, given that you have a large amount of data to process, and no computation at all beyond pickling and unpickling that data.
If you look at the source for SimpleQueue in CPython 2.7, the pickling and unpickling happens with the lock held. So, pretty much all the work any of your background processes do happens with the lock held, meaning they all end up serialized on a single core.
But in CPython 3.4, the pickling and unpickling happens outside the lock. And apparently that's enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)
Still, even on 3.4, you're spending far more time waiting for access to the queue than doing anything, even the multiprocessing overhead. Which is why the cores only get up to 25%.
And of course you're spending orders of magnitude more time on the overhead than the actual work, which makes this not a great test, unless you're trying to test the maximum throughput you can get out of a particular multiprocessing implementation on your machine or something.
A few observations:
In your real code, if you can find a way to batch up larger tasks (explicitly—just relying on chunksize=1000 or the like here won't help), that would probably solve most of your problem.
If your giant array (or whatever) never actually changes, you may be able to pass it in the pool initializer, instead of in each task, which would pretty much eliminate the problem (see the sketch after this list).
If it does change, but only from the main process side, it may be worth sharing rather than passing the data.
If you need to mutate it from the child processes, see if there's a way to partition the data so each task can own a slice without contention.
Even if you need fully-contended shared memory with explicit locking, it may still be better than passing something this huge around.
It may be worth getting a backport of the 3.2+ version of multiprocessing or one of the third-party multiprocessing libraries off PyPI (or upgrading to Python 3.x), just to move the pickling out of the lock.
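As a rough illustration of the initializer idea above (names like init_worker and make_ngrams are mine, not from the question, and the data size is shrunk to keep the sketch quick): the large string is installed once per worker instead of being pickled into every task.
import multiprocessing

_shared_str = None

def init_worker(big_str):
    # runs once in each worker process
    global _shared_str
    _shared_str = big_str

def make_ngrams(n):
    words = _shared_str.split()
    n = min(n, len(words))
    return [words[i:i + n] for i in range(len(words) - n + 1)]

if __name__ == '__main__':
    rand_str = ' '.join(str(i) for i in range(10000))
    pool = multiprocessing.Pool(initializer=init_worker, initargs=(rand_str,))
    results = pool.map(make_ngrams, range(1, 10))
    pool.close()
    pool.join()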
The problem is that your f() function (which is the one running on separate processes) is doing nothing special, hence it is not putting load on the CPU.
ngrams(), on the other hand, is doing some "heavy" computation, but you are calling this function on the main process, not in the pool.
To make things clearer, consider that this piece of code...
for n in xrange(1, 100):
    res = list(mapper(f, ngrams(rand_str, n)))
...is equivalent to this:
for n in xrange(1, 100):
    arg = ngrams(rand_str, n)
    res = list(mapper(f, arg))
Also the following is a CPU-intensive operation that is being performed on your main process:
num = 100000000
rand_list = random.sample(xrange(100000000), num)
You should either change your code so that sample() and ngrams() are called inside the pool (see the sketch below), or change f() so that it does something CPU-intensive, and you'll see a high load on all of your CPUs.
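A sketch of the first option (Python 3 flavored, reusing ngrams() from the question): the n-gram generation itself becomes the task submitted to the pool, so the heavy splitting runs in the workers. Note that rand_str is still pickled into every task, which is exactly the overhead the other answer describes, so the pool-initializer trick sketched there can be combined with this.
import multiprocessing
import random

def ngram_task(args):
    # ngrams() is the function defined in the question
    rand_str, n = args
    return ngrams(rand_str, n)

def foo():
    p = multiprocessing.Pool()
    num = 10000  # keep the test small enough to run
    rand_list = random.sample(range(100000000), num)
    rand_str = ' '.join(str(i) for i in rand_list)
    tasks = ((rand_str, n) for n in range(1, 100))
    res = list(p.imap_unordered(ngram_task, tasks))
    p.close()
    p.join()
    return res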
I have a program written in Python 2.6 that creates a large number of short-lived instances (it is a classic producer-consumer problem). I noticed that the memory usage, as reported by top and pmap, seems to increase when these instances are created and never goes back down. I was concerned that some Python module I was using might be leaking memory, so I carefully isolated the problem in my code. I then proceeded to reproduce it in as short an example as possible. I came up with this:
class LeaksMemory(list):
    timesDelCalled = 0
    def __del__(self):
        LeaksMemory.timesDelCalled += 1

def leakSomeMemory():
    l = []
    for i in range(0, 500000):
        ml = LeaksMemory()
        ml.append(float(i))
        ml.append(float(i*2))
        ml.append(float(i*3))
        l.append(ml)

import gc
import os

leakSomeMemory()

print("__del__ was called " + str(LeaksMemory.timesDelCalled) + " times")
print(str(gc.collect()) + " objects collected")
print("__del__ was called " + str(LeaksMemory.timesDelCalled) + " times")
print(str(os.getpid()) + " : check memory usage with pmap or top")
If you run this with something like 'python2.6 -i memoryleak.py' it will halt, and you can use pmap -x PID to check the memory usage. I added the __del__ method so I could verify that GC was occurring. It is not there in my actual program and does not appear to make any functional difference. Each call to leakSomeMemory() increases the amount of memory consumed by this program. I fear I am making some simple error and that references are getting kept by accident, but I cannot identify it.
Python will release the objects, but it will not release the memory back to the operating system immediately. Instead, it will re-use the same segments for future allocations within the same interpreter.
Here's a blog post about the issue: http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
UPDATE: I tested this myself with Python 2.6.4 and didn't notice persistent increases in memory usage. Some invocations of leakSomeMemory() caused the memory footprint of the Python process to increase, and some made it decrease again. So it all depends on how the allocator is re-using the memory.
According to Alex Martelli:
"The only really reliable way to
ensure that a large but temporary use
of memory DOES return all resources to
the system when it's done, is to have
that use happen in a subprocess, which
does the memory-hungry work then
terminates."
So, in your situation it sounds like it would make sense to use the multiprocessing module to run the short-lived functions in separate processes to ensure the return of resources when the process finishes.
import multiprocessing as mp

def NOT_leakSomeMemory(i):
    # do stuff
    return result

if __name__ == '__main__':
    pool = mp.Pool()
    results = pool.map(NOT_leakSomeMemory, range(500000))
For more ideas on how to set things up using multiprocessing, see Doug Hellmann's tutorial: