I have the following task that I would like to make faster via multithreading (Python 3).
import threading, time

q = []

def fill_list():
    global q
    while True:
        q.append(1)
        if len(q) >= 1000000000:
            return
The first main does not utilize multithreading:
t1 = time.clock()
fill_list()
tend = time.clock() - t1
print(tend)
And results in 145 seconds of run time.
The second invokes two threads:
t1 = time.clock()
thread1 = threading.Thread(target=fill_list, args=())
thread2 = threading.Thread(target=fill_list, args=())
thread1.start()
thread2.start()
thread1.join()
thread2.join()
tend = time.clock() - t1
print(tend)
This takes 152 seconds to complete.
Finally, I added a third thread.
t1 = time.clock()
thread1 = threading.Thread(target=fill_list, args=())
thread2 = threading.Thread(target=fill_list, args=())
thread3 = threading.Thread(target=fill_list, args=())
thread1.start()
thread2.start()
thread3.start()
thread1.join()
thread2.join()
thread3.join()
tend = time.clock() - t1
print(tend)
And this took 233 seconds to complete.
Obviously the more threads I add, the longer the process takes, though I am not sure why. Is this a fundamental misunderstanding of multithreading, or is there a bug in my code that is simply repeating the task multiple times instead of contributing to the same task?
The short answer: both 1 and 2.
First of all, your task is CPU-bound, and in a Python process only one thread may be running CPU-bound Python code at any given time (this is due to the Global Interpreter Lock: https://wiki.python.org/moin/GlobalInterpreterLock). Since it costs quite a bit of CPU to switch threads (and the more threads you have, the more often you have to pay that cost), your program doesn't speed up: it slows down.
Second, no matter what language you're using, you're modifying one object (a list) from multiple threads. To guarantee that this does not corrupt the object, access must be synchronized; in other words, only one thread may modify it at any given time. Python does this automatically (thanks in part to the aforementioned GIL), but in a lower-level language like C++ you'd have to use a lock or risk memory corruption.
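For illustration, here is a minimal sketch of what that synchronization looks like when written out explicitly with threading.Lock. The names are mine, not from the question, and in CPython the lock around list.append is redundant thanks to the GIL, which is exactly the point:

import threading

items = []
items_lock = threading.Lock()

def append_many(n):
    # Serialize access to the shared list explicitly. In CPython the GIL
    # already makes list.append atomic, so this lock is illustrative; in
    # a lower-level language the equivalent lock would be mandatory.
    for _ in range(n):
        with items_lock:
            items.append(1)

threads = [threading.Thread(target=append_many, args=(100000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(items))  # 200000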
The optimal way to parallelize tasks across threads is to ensure that the threads are as isolated as possible. If they access shared objects, those should be read-only, and cross-thread writes should happen as infrequently as possible, through thread-aware data structures such as message queues. (This is why highly concurrent systems like Erlang and Clojure place such an emphasis on immutable data structures and message passing.)
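A rough sketch of that style in Python, with names of my own choosing: each thread builds its result in private, and the only cross-thread write is a single put onto a queue.Queue.

import threading
import queue

results = queue.Queue()

def worker(n):
    # Each thread fills its own private list; nothing shared is mutated.
    local = [1] * n
    # The only cross-thread communication is one message at the end.
    results.put(len(local))

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(results.get() for _ in range(3)))  # 3000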
Related
I have realized that my multithreading program isn't doing what I think it's doing. The following is a minimal working example (MWE) of my strategy. In essence, I'm creating nThreads threads but only actually using one of them. Could somebody help me understand my mistake and how to fix it?
import threading
import queue

NPerThread = 100
nThreads = 4

def worker(q: queue.Queue, oq: queue.Queue):
    while True:
        l = []
        threadIData = q.get(block=True)
        for i in range(threadIData["N"]):
            l.append(f"hello {i} from thread {threading.current_thread().name}")
        oq.put(l)
        q.task_done()

threadData = [{} for i in range(nThreads)]
inputQ = queue.Queue()
outputQ = queue.Queue()

for threadI in range(nThreads):
    threadData[threadI]["thread"] = threading.Thread(
        target=worker, args=(inputQ, outputQ),
        name=f"WorkerThread{threadI}"
    )
    threadData[threadI]["N"] = NPerThread
    threadData[threadI]["thread"].setDaemon(True)
    threadData[threadI]["thread"].start()

for threadI in range(nThreads):
    inputQ.put(threadData[threadI])

inputQ.join()

outData = [None] * nThreads
count = 0
while not outputQ.empty():
    outData[count] = outputQ.get()
    count += 1

for i in outData:
    assert len(i) == NPerThread
    print(len(i))

print(outData)
Edit: I only actually realised that I had made this mistake after profiling.
In your sample program, the worker function is just executing so fast that the same thread is able to dequeue every item. If you add a time.sleep(1) call to it, you'll see other threads pick up some of the work.
However, it is important to understand whether threads are the right choice for your real application, which presumably is doing actual work in the worker threads. As @jrbergen pointed out, because of the GIL, only one thread can execute Python bytecode at a time, so if your worker functions are executing CPU-bound Python code (i.e., not doing blocking I/O or calling into a library that releases the GIL), you're not going to get a performance benefit from threads. In that case you'd need to use processes instead.
I'll also note that you may want to use concurrent.futures.ThreadPoolExecutor or multiprocessing.dummy.ThreadPool for an out-of-the-box thread pool implementation, rather than creating your own.
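For example, here is a sketch of the same fan-out/fan-in using concurrent.futures, with the worker body simplified from the question's:

from concurrent.futures import ThreadPoolExecutor
import threading

NPerThread = 100
nThreads = 4

def make_greetings(n):
    return [f"hello {i} from thread {threading.current_thread().name}" for i in range(n)]

with ThreadPoolExecutor(max_workers=nThreads) as pool:
    # map fans the four jobs out across the pool and gathers results in order.
    outData = list(pool.map(make_greetings, [NPerThread] * nThreads))

for chunk in outData:
    assert len(chunk) == NPerThread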
I have a similar and simple computation task with three different parameters, so I took the chance to test how much time I could save by using multithreading.
Here is my code:
import threading
import time
from Crypto.Hash import MD2

def calc_func(text):
    t1 = time.time()
    h = MD2.new()
    total = 10000000
    old_text = text
    for n in range(total):
        h.update(text)
        text = h.hexdigest()
    print(f"thread done: old_text={old_text} new_text={text}, time={time.time()-t1}sec")

def do_3threads():
    t0 = time.time()
    texts = ["abcd", "abcde", "abcdef"]
    ths = []
    for text in texts:
        th = threading.Thread(target=calc_func, args=(text,))
        th.start()
        ths.append(th)
    for th in ths:
        th.join()
    print(f"main done: {time.time()-t0}sec")

def do_single():
    texts = ["abcd", "abcde", "abcdef"]
    for text in texts:
        calc_func(text)

if __name__ == "__main__":
    print("=== 3 threads ===")
    do_3threads()
    print("=== 1 thread ===")
    do_single()
The result is astonishing: each thread takes roughly 4x the time it takes when single-threaded:
=== 3 threads ===
thread done: old_text=abcdef new_text=e8f636b1893f12abe956dc019294e923, time=25.460321187973022sec
thread done: old_text=abcd new_text=0d6cae713809c923475ea50dbfbb2c13, time=25.47859835624695sec
thread done: old_text=abcde new_text=cd028131bc5e161671a1c91c62e80f6a, time=25.4807870388031sec
main done: 25.481309175491333sec
=== 1 thread ===
thread done: old_text=abcd new_text=0d6cae713809c923475ea50dbfbb2c13, time=6.393985033035278sec
thread done: old_text=abcde new_text=cd028131bc5e161671a1c91c62e80f6a, time=6.5472939014434814sec
thread done: old_text=abcdef new_text=e8f636b1893f12abe956dc019294e923, time=6.483690977096558sec
This is totally not what I expected. This task is obviously CPU-intensive, so I expected that, with multithreading, each thread would take around 6.5 seconds and the whole process would finish slightly over that; instead it actually took ~25.5 seconds, even worse than single-threaded mode, which takes ~20 seconds.
The environment is Python 3.7.7 on macOS 10.15.5, with an 8-core Intel i9 CPU and 16 GB of memory.
Can someone explain that to me? Any input is appreciated.
This task is obviously a CPU intensive task
Multithreading is not the proper tool for CPU-bound tasks; it suits workloads like network requests. The reason is the Global Interpreter Lock (GIL): within one Python process, only one thread can execute Python bytecode at any moment, so a CPU-bound program effectively uses a single core no matter how many threads it spawns.
Multiprocessing is what you are looking for, as it allows you to spawn multiple processes on, potentially, multiple cores.
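As a sketch of that approach (using hashlib.md5 as a stand-in for MD2 so it runs without the third-party Crypto package, and a shorter chain for politeness), the three hash chains dispatched to worker processes might look like this; each process has its own interpreter and its own GIL:

import hashlib
import multiprocessing
import time

def calc_func(text):
    # Same hash-chain shape as the question's worker, with md5
    # standing in for MD2.
    t1 = time.time()
    data = text.encode()
    for _ in range(1000000):
        data = hashlib.md5(data).hexdigest().encode()
    return f"{text}: {time.time() - t1:.1f}sec"

if __name__ == "__main__":
    # Each worker process runs on its own core, unconstrained by the
    # parent interpreter's GIL.
    with multiprocessing.Pool(processes=3) as pool:
        for line in pool.map(calc_func, ["abcd", "abcde", "abcdef"]):
            print(line)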
Recently, I tried to use asyncio to execute multiple blocking operations asynchronously. I used the function loop.run_in_executor, which appears to put tasks into a thread pool. As far as I know about thread pools, they reduce the overhead of creating and destroying threads, because a pooled thread can take on a new task when it finishes one instead of being destroyed. I wrote the following code for deeper understanding.
import asyncio
import time

def blocking_funa():
    print('starta')
    print('starta')
    time.sleep(4)
    print('enda')

def blocking_funb():
    print('startb')
    print('startb')
    time.sleep(4)
    print('endb')

loop = asyncio.get_event_loop()
tasks = [loop.run_in_executor(None, blocking_funa), loop.run_in_executor(None, blocking_funb)]
loop.run_until_complete(asyncio.wait(tasks))
and the output:
starta
startbstarta
startb
(wait for about 4s)
enda
endb
We can see that these two tasks run almost simultaneously. Now I use the threading module:
threads = [threading.Thread(target=blocking_funa), threading.Thread(target=blocking_funb)]
for thread in threads:
    thread.start()
    thread.join()
and the output:
starta
starta
enda
startb
startb
endb
Due to the GIL limitation, only one thread executes at a time, so I understand this output. But how does the thread pool executor make the two tasks almost simultaneous? What is the difference between a thread pool and a thread? And why does the thread pool seem not to be limited by the GIL?
You're not making a fair comparison, since you're joining the first thread before starting the second.
Instead, consider:
import time
import threading

def blocking_funa():
    print('a 1')
    time.sleep(1)
    print('a 2')
    time.sleep(1)
    print('enda (quick)')

def blocking_funb():
    print('b 1')
    time.sleep(1)
    print('b 2')
    time.sleep(4)
    print('endb (a few seconds after enda)')

threads = [threading.Thread(target=blocking_funa), threading.Thread(target=blocking_funb)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
The output:
a 1
b 1
b 2
a 2
enda (quick)
endb (a few seconds after enda)
Considering it hardly takes any time to run a print statement, you shouldn't read too much into the prints in the first example getting mixed up.
If you run the code repeatedly, you may find that b 2 and a 2 will change order more or less randomly. Note how in my posted result, b 2 occurred before a 2.
Also, regarding your remark "Due to the GIL limitation, only one thread is executing at the same time" - you're right that the "execution of any Python bytecode requires acquiring the interpreter lock. This prevents deadlocks (as there is only one lock) and doesn’t introduce much performance overhead. But it effectively makes any CPU-bound Python program single-threaded." https://realpython.com/python-gil/#the-impact-on-multi-threaded-python-programs
The important part there is "CPU-bound" - of course you would still benefit from making I/O-bound code multi-threaded.
Python releases and reacquires the GIL often. This means that runnable GIL-controlled threads all get little sprints; it's not parallel, just interleaved. More importantly for your example, Python tends to release the GIL when doing a blocking operation: the GIL is released before sleep, and also when print enters the C libraries.
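A quick way to see the sleep case: two threads that each sleep for one second finish in about one second total, not two, because each drops the GIL while sleeping. A minimal sketch:

import threading
import time

def napper():
    time.sleep(1)  # the GIL is released for the whole duration of the sleep

t0 = time.perf_counter()
threads = [threading.Thread(target=napper) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{time.perf_counter() - t0:.2f}s")  # ~1.00s, not ~2.00s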
I have the following script, which uses the threading module in an attempt to save time when running cycle.
import threading, time, sys

def cycle(start, end):
    for i in range(start, end):
        pass

#########################################################

thread1 = threading.Thread(target=cycle, args=(1, 1000000))
thread2 = threading.Thread(target=cycle, args=(1000001, 2000000))
thread1.start()
thread2.start()
print 'start join'
thread1.join()
thread2.join()
print 'end join'
However, I found that the script costs even more time than the one without multithreading (cycle(1, 2000000)).
What might be the reason and how can I save time?
Threads are often not useful in Python because of the global interpreter lock: only one thread can run Python code at a time.
There are cases where the GIL doesn't cause much of a bottleneck, e.g. if your threads are spending most of their time calling thread-safe native (non-Python) functions, but your program doesn't appear to be one of those cases. So even with two threads, you're basically running just one thread at a time, plus there's the overhead of two threads contending for a lock.
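If you want the two halves of the range to actually run in parallel, processes sidestep the GIL. A sketch using concurrent.futures (in Python 3 syntax, unlike the question's Python 2):

from concurrent.futures import ProcessPoolExecutor

def cycle(start, end):
    for i in range(start, end):
        pass

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        # Each half of the range runs in its own process, on its own core.
        futures = [pool.submit(cycle, 1, 1000000),
                   pool.submit(cycle, 1000001, 2000000)]
        for f in futures:
            f.result()
    print('done')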
Good day!
I'm trying to learn the multithreading features in Python, and I wrote the following code:
import time, argparse, threading, sys, subprocess, os

def item_fun(items, indices, lock):
    for index in indices:
        items[index] = items[index]*items[index]*items[index]

def map(items, cores):
    count = len(items)
    cpi = count/cores
    threads = []
    lock = threading.Lock()
    for core in range(cores):
        thread = threading.Thread(target=item_fun, args=(items, range(core*cpi, core*cpi + cpi), lock))
        threads.append(thread)
        thread.start()
    item_fun(items, range((core+1)*cpi, count), lock)
    for thread in threads:
        thread.join()

parser = argparse.ArgumentParser(description='cube', usage='%(prog)s [options] -n')
parser.add_argument('-n', action='store', help='number', dest='n', default='1000000', metavar='')
parser.add_argument('-mp', action='store_true', help='multi thread', dest='mp', default='True')
args = parser.parse_args()

NUMBER_OF_ITEMS = int(args.n)
items = range(NUMBER_OF_ITEMS)
# print 'items before:'
# print items

mp = args.mp
if mp is True:
    NUMBER_OF_PROCESSORS = int(os.getenv("NUMBER_OF_PROCESSORS"))
    start = time.time()
    map(items, NUMBER_OF_PROCESSORS)
    end = time.time()
else:
    start = time.time()
    item_fun(items, range(NUMBER_OF_ITEMS), None)
    end = time.time()

# print 'items after:'
# print items
print 'time elapsed: ', (end - start)
When I use the mp argument it works slower: on my machine with 4 CPUs it takes about 0.5 seconds to compute the result, while a single thread takes about 0.3 seconds.
Am I doing something wrong?
I know there's Pool.map() etc., but it spawns subprocesses rather than threads, and as far as I know it works faster; still, I'd like to write my own thread pool.
Python has no true multithreading for CPU-bound code, due to an implementation detail called the GIL. Only one thread actually runs at a time, and Python switches between the threads. (Third-party implementations of Python, such as Jython, can actually run threads in parallel.)
Exactly why your program is slower in the multithreaded version depends on the specifics, but when coding for Python one needs to be aware of the GIL, so as not to assume that CPU-bound loads are processed more efficiently by adding threads to the program.
Other things to be aware of are for instance multiprocessing and numpy for solving CPU bound loads, and PyEv (minimal) and Tornado (huge kitchen sink) for solving I/O bound loads.
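For this particular load, for example, numpy can cube the whole array in one C-level operation with no threads at all; a sketch, assuming numpy is installed:

import time
import numpy as np

n = 1000000
items = np.arange(n, dtype=np.int64)

start = time.time()
cubed = items ** 3  # one vectorized C-level loop instead of a Python-level loop
end = time.time()
print('time elapsed:', end - start)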
You'll only see an increase in throughput with threads in Python if your threads are I/O-bound. If what you're doing is CPU-bound, you won't see any throughput increase.
Turning on thread support in Python (by starting another thread) also seems to make some things slower, so you may find that overall performance still suffers.
This is all CPython, of course; other Python implementations have different behaviour.
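To make the I/O-bound case concrete, here is a sketch where each "request" is simulated with time.sleep, which releases the GIL much as a blocking socket call would: ten one-second waits across ten threads complete in about one second, not ten.

import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i):
    # Stands in for a blocking network call; sleep releases the GIL
    # just as socket reads do.
    time.sleep(1)
    return i

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fake_request, range(10)))
print(f"{len(results)} requests in {time.perf_counter() - t0:.1f}s")  # ~1s, not ~10s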