Multithreading works slower

Multithreading works slower - python

Good day!
I'm trying to learn multithreading features in python and I wrote the following code:
import time, argparse, threading, sys, subprocess, os
def item_fun(items, indices, lock):
for index in indices:
items[index] = items[index]*items[index]*items[index]
def map(items, cores):
count = len(items)
cpi = count/cores
threads = []
lock = threading.Lock()
for core in range(cores):
thread = threading.Thread(target=item_fun, args=(items, range(core*cpi, core*cpi + cpi), lock))
threads.append(thread)
thread.start()
item_fun(items, range((core+1)*cpi, count), lock)
for thread in threads:
thread.join()
parser = argparse.ArgumentParser(description='cube', usage='%(prog)s [options] -n')
parser.add_argument('-n', action='store', help='number', dest='n', default='1000000', metavar = '')
parser.add_argument('-mp', action='store_true', help='multi thread', dest='mp', default='True')
args = parser.parse_args()
items = range(NUMBER_OF_ITEMS)
# print 'items before:'
# print items
mp = args.mp
if mp is True:
NUMBER_OF_PROCESSORS = int(os.getenv("NUMBER_OF_PROCESSORS"))
NUMBER_OF_ITEMS = int(args.n)
start = time.time()
map(items, NUMBER_OF_PROCESSORS)
end = time.time()
else:
NUMBER_OF_ITEMS = int(args.n)
start = time.time()
item_fun(items, range(NUMBER_OF_ITEMS), None)
end = time.time()
#print 'items after:'
#print items
print 'time elapsed: ', (end - start)
When I use mp argument, it works slower, on my machine with 4 cpus, it takes about 0.5 secs to compute result, while if I use a single thread it takes about 0.3 secs.
Am I doing something wrong?
I know there's Pool.map() and e.t.c but it spawns subprocess not threads and it works faster as far as I know, but I'd like to write my own thread pool.

Python has no true multithreading, due to an implementation detail called the "GIL". Only one thread actually runs at a time, and Python switches between the threads. (Third party implementations of Python, such as Jython, can actually run parallel threads.)
As to why actually your program is slower in the multithreaded version depends, but when coding for Python, one needs to be aware of the GIL, so one does not believe that CPU bound loads are more efficiently processed by adding threads to the program.
Other things to be aware of are for instance multiprocessing and numpy for solving CPU bound loads, and PyEv (minimal) and Tornado (huge kitchen sink) for solving I/O bound loads.

You'll only see an increase in throughput with threads in Python if you have threads which are IO bound. If what you're doing is CPU bound then you won't see any throughput increase.
Turning on the thread support in Python (by starting another thread) also seems to make some things slower so you may find that overall performance still suffers.
This is all cpython of course, other Python implementations have different behaviour.

Related

Why is my multithreading program only actually using a single thread?

I have realized that my multithreading program isn't doing what I think its doing. The following is a MWE of my strategy. In essence I'm creating nThreads threads but only actually using one of them. Could somebody help me understand my mistake and how to fix it?
import threading
import queue
NPerThread = 100
nThreads = 4
def worker(q: queue.Queue, oq: queue.Queue):
while True:
l = []
threadIData = q.get(block=True)
for i in range(threadIData["N"]):
l.append(f"hello {i} from thread {threading.current_thread().name}")
oq.put(l)
q.task_done()
threadData = [{} for i in range(nThreads)]
inputQ = queue.Queue()
outputQ = queue.Queue()
for threadI in range(nThreads):
threadData[threadI]["thread"] = threading.Thread(
target=worker, args=(inputQ, outputQ),
name=f"WorkerThread{threadI}"
)
threadData[threadI]["N"] = NPerThread
threadData[threadI]["thread"].setDaemon(True)
threadData[threadI]["thread"].start()
for threadI in range(nThreads):
# start and end are in units of 8 bytes.
inputQ.put(threadData[threadI])
inputQ.join()
outData = [None] * nThreads
count = 0
while not outputQ.empty():
outData[count] = outputQ.get()
count += 1
for i in outData:
assert len(i) == NPerThread
print(len(i))
print(outData)
edit
I only actually realised that I had made this mistake after profiling. Here's the output, for information:

In your sample program, the worker function is just executing so fast that the same thread is able to dequeue every item. If you add a time.sleep(1) call to it, you'll see other threads pick up some of the work.
However, it is important to understand if threads are the right choice for your real application, which presumably is doing actual work in the worker threads. As #jrbergen pointed out, because of the GIL, only one thread can execute Python bytecode at a time, so if your worker functions are executing CPU-bound Python code (meaning not doing blocking I/O or calling a library that releases the GIL), you're not going to get a performance benefit from threads. You'd need to use processes instead in that case.
I'll also note that you may want to use concurrent.futures.ThreadPoolExecutor or multiprocessing.dummy.ThreadPool for an out-of-the-box thread pool implementation, rather than creating your own.

why python multithreading runs like a single thread on macos?

I have a similiar and simple computation task with three different parameters. So I take this chance to test how much time I can save by using multithreading.
Here is my code:
import threading
import time
from Crypto.Hash import MD2
def calc_func(text):
t1 = time.time()
h = MD2.new()
total = 10000000
old_text =text
for n in range(total):
h.update(text)
text = h.hexdigest()
print(f"thread done: old_text={old_text} new_text={text}, time={time.time()-t1}sec")
def do_3threads():
t0 = time.time()
texts = ["abcd", "abcde", "abcdef"]
ths = []
for text in texts:
th = threading.Thread(target=calc_func, args=(text,))
th.start()
ths.append(th)
for th in ths:
th.join()
print(f"main done: {time.time()-t0}sec")
def do_single():
texts = ["abcd", "abcde", "abcdef"]
for text in texts:
calc_func(text)
if __name__ == "__main__":
print("=== 3 threads ===")
do_3threads()
print("=== 1 thread ===")
do_single()
The result is astonishing, each thread is taking roughly 4x time it takes if single threaded:
=== 3 threads ===
thread done: old_text=abcdef new_text=e8f636b1893f12abe956dc019294e923, time=25.460321187973022sec
thread done: old_text=abcd new_text=0d6cae713809c923475ea50dbfbb2c13, time=25.47859835624695sec
thread done: old_text=abcde new_text=cd028131bc5e161671a1c91c62e80f6a, time=25.4807870388031sec
main done: 25.481309175491333sec
=== 1 thread ===
thread done: old_text=abcd new_text=0d6cae713809c923475ea50dbfbb2c13, time=6.393985033035278sec
thread done: old_text=abcde new_text=cd028131bc5e161671a1c91c62e80f6a, time=6.5472939014434814sec
thread done: old_text=abcdef new_text=e8f636b1893f12abe956dc019294e923, time=6.483690977096558sec
This is totally not what I expected. This task is obviously a CPU intensive task, so I expect that, with multithreading, each thread could just take around 6.5 seconds and the whole process takes slightly over that, instead it took actually ~25.5 seconds, even worse than single threaded mode, which is ~20seconds.
The environment is python 3.7.7, macos 10.15.5, CPU is 8-core Intel i9, 16G memory.
Can someone explain that to me? Any input is appreciated.

This task is obviously a CPU intensive task
Multithreading is not the proper tool for CPU bound tasks, but rather for something like network requests. This is because each Python process is limited to a single core due to the Global Interpreter Lock (GIL). All threads spawned by a process will run on the same core as the parent process.
Multiprocessing is what you are looking for, as it allows you to spawn multiple processes on, potentially, multiple cores.

Multithreading inside Multiprocessing in Python

I am using concurrent.futures module to do multiprocessing and multithreading. I am running it on a 8 core machine with 16GB RAM, intel i7 8th Gen processor. I tried this on Python 3.7.2 and even on Python 3.8.2
import concurrent.futures
import time
takes list and multiply each elem by 2
def double_value(x):
y = []
for elem in x:
y.append(2 *elem)
return y
multiply an elem by 2
def double_single_value(x):
return 2* x
define a
import numpy as np
a = np.arange(100000000).reshape(100, 1000000)
function to run multiple thread and multiple each elem by 2
def get_double_value(x):
with concurrent.futures.ThreadPoolExecutor() as executor:
results = executor.map(double_single_value, x)
return list(results)
code shown below ran in 115 seconds. This is using only multiprocessing. CPU utilization for this piece of code is 100%
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
my_results = executor.map(double_value, a)
print(time.time()-t)
Below function took more than 9 min and consumed all the Ram of system and then system kill all the process. Also CPU utilization during this piece of code is not upto 100% (~85%)
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
my_results = executor.map(get_double_value, a)
print(time.time()-t)
I really want to understand:
1) why the code that first split do multiple processing and then run tried multi-threading is not running faster than the code that runs only multiprocessing ?
(I have gone through many post that describe multiprocessing and multi-threading and one of the crux that I got is multi-threading is for I/O process and multiprocessing for CPU processes ? )
2) Is there any better way of doing multi-threading inside multiprocessing for max utilization of allotted core(or CPU) ?
3) Why that last piece of code consumed all the RAM ? Was it due to multi-threading ?

You can mix concurrency with parallelism.
Why? You can have your valid reasons. Imagine a bunch of requests you have to make while processing their responses (e.g., converting XML to JSON) as fast as possible.
I did some tests and here are the results.
In each test, I mix different workarounds to make a print 16000 times (I have 8 cores and 16 threads).
Parallelism with multiprocessing, concurrency with asyncio
The fastest, 1.1152372360229492 sec.
import asyncio
import multiprocessing
import os
import psutil
import threading
import time
async def print_info(value):
await asyncio.sleep(1)
print(
f"THREAD: {threading.get_ident()}",
f"PROCESS: {os.getpid()}",
f"CORE_ID: {psutil.Process().cpu_num()}",
f"VALUE: {value}",
)
async def await_async_logic(values):
await asyncio.gather(
*(
print_info(value)
for value in values
)
)
def run_async_logic(values):
asyncio.run(await_async_logic(values))
def multiprocessing_executor():
start = time.time()
with multiprocessing.Pool() as multiprocessing_pool:
multiprocessing_pool.map(
run_async_logic,
(range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
)
end = time.time()
print(end - start)
multiprocessing_executor()
Very important note: with asyncio I can spam tasks as much as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (I tested it and it took me 2.0210490226745605 sec).
Parallelism with multiprocessing, concurrency with threading
An alternative option, 1.6983509063720703 sec.
import multiprocessing
import os
import psutil
import threading
import time
def print_info(value):
time.sleep(1)
print(
f"THREAD: {threading.get_ident()}",
f"PROCESS: {os.getpid()}",
f"CORE_ID: {psutil.Process().cpu_num()}",
f"VALUE: {value}",
)
def multithreading_logic(values):
threads = []
for value in values:
threads.append(threading.Thread(target=print_info, args=(value,)))
for thread in threads:
thread.start()
for thread in threads:
thread.join()
def multiprocessing_executor():
start = time.time()
with multiprocessing.Pool() as multiprocessing_pool:
multiprocessing_pool.map(
multithreading_logic,
(range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
)
end = time.time()
print(end - start)
multiprocessing_executor()
Very important note: with this method I can NOT spam as many tasks as I want. If I change the value from 1000 to 10000 I get RuntimeError: can't start new thread.
I also want to say that I am impressed because I thought that this method would be better in every aspect compared to asyncio, but quite the opposite.
Parallelism and concurrency with concurrent.futures
Extremely slow, 50.08251595497131 sec.
import os
import psutil
import threading
import time
from concurrent.futures import thread, process
def print_info(value):
time.sleep(1)
print(
f"THREAD: {threading.get_ident()}",
f"PROCESS: {os.getpid()}",
f"CORE_ID: {psutil.Process().cpu_num()}",
f"VALUE: {value}",
)
def multithreading_logic(values):
with thread.ThreadPoolExecutor() as multithreading_executor:
multithreading_executor.map(
print_info,
values,
)
def multiprocessing_executor():
start = time.time()
with process.ProcessPoolExecutor() as multiprocessing_executor:
multiprocessing_executor.map(
multithreading_logic,
(range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
)
end = time.time()
print(end - start)
multiprocessing_executor()
Very important note: with this method, as with asyncio, I can spam as many tasks as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (except for the time).
Extra notes
To make this comment, I modified the test so that it only makes 1600 prints (modifying the 1000 value with 100 in each test).
When I remove the parallelism from asyncio, the execution takes me 16.090194702148438 sec.
In addition, if I replace the await asyncio.sleep(1) with time.sleep(1), it takes 160.1889989376068 sec.
Removing the parallelism from the multithreading option, the execution takes me 16.24941658973694 sec.
Right now I am impressed. Multithreading without multiprocessing gives me good performance, very similar to asyncio.
Removing parallelism from the third option, execution takes me 80.15227723121643 sec.

As you say: "I have gone through many post that describe multiprocessing and multi-threading and one of the crux that I got is multi-threading is for I/O process and multiprocessing for CPU processes".
You need to figure out, if your program is IO-bound or CPU-bound, then apply the correct method to solve your problem. Applying various methods at random or all together at the same time usually makes things only worse.

Use of threading in clean Python for CPU-bound problems is a bad approach regardless of using multiprocessing or not. Try to redesign your app to use only multiprocessing or use third-party libs such as Dask and so on

I believe you figured it out, but I wanted to answer. Obviously, your function double_single_value is CPU bound. It has nothing to do with Io. In CPU bound tasks using multi-thread will make it worse than using a single thread, because GIL does not allow you actually run on multi-thread and you will eventually run on single thread. Also, you may not finish a task and go to another and when you get back you should load it to the CPU again, which will make this even slower.

Based off your code, I see most of your code is dealing with computations(calculations) so it's most encouraged to use multiprocessing to solve your problem since it's CPU-bound and NOT I/O bound(things like sending requests to websites and then waiting for some response from the server in exchange, writing to disk or even reading from disk). This is true for Python programming as far as I know. The python GIL(Global Interpreter Lock) will make your code run slowly as it is a mutex (or a lock) that allows only one thread to take the control of the Python interpreter meaning it won't achieve parallelism but will give you concurrency instead. But it's very fine to use threading for I/O bound tasks because they'll outcompete multiprocessing in execution times but for your case i would encourage you to use multiprocessing because each Python process will get its own Python interpreter and memory space so the GIL won’t be a problem to you.
I am not so sure about integrating multithreading with multiprocessing but what i know it can cause inconsistency in the processed results since you will need more bolierplate code for data synchronization if you want the processes to communicate(IPC) and also threads are kinda unpredictable(thus inconsistent at times) since they're controlled by the OS so anytime they can be scooped out(pre-emptive scheduling) for kernel level threads(due to time sharing). i don't stop you from writing that code but be really sure of what you are doing. You never know you would propose a solution to it one day.

Multiprocessing using maximum CPU power in Python-3.x

I'm working on human genome which consists of 3.2 billions of characters and i have a list of objects which need to be searched within this data. Something like this:
result_final=[]
objects=['obj1','obj2','obj3',...]
def function(obj):
result_1=search_in_genome(obj)
return(result_1)
for item in objects:
result_2=function(item)
result_final.append(result_2)
Each object's search within the data takes nearly 30 seconds and i have few thousands of objects. I noticed that while doing this serially just 7% of CPU and 5% of RAM is being used. As i searched, for reducing the computation time i should do parallel computation using queuing , threading or multiprocessing. but they seem complicated for non-experts. could anybody help me how i can code for python to run 10 simultaneous searches and is it possible to make python to use maximum available CPU and RAM for multiprocessing? (I'm using Python33 on windows 7 with 64Gb RAM,COREI7 and 3.5 GH CPU)

You can use the multiprocessing module for this:
from multiprocessing import Pool
objects=['obj1','obj2','obj3',...]
def function(obj):
result_1=search_in_genome(obj)
return(result)
if __name__ == "__main__":
pool = Pool()
result_final = pool.map(function, objects)
This will allow you to scale the work across all available CPUs on your machine, because processes aren't affected by the GIL. You wouldn't want to run too many more tasks than there are CPUs available. Once you do that, you actually start slowing things down, because then the CPUs have to constantly switch between processes, which has a performance penalty.

Ok I'm not sure of your question, but I would do this (Note that there may be a better solution because I'm not an expert with the Queue Object) :
If you want to multithread your searches :
class myThread (threading.Thread):
def __init__(self, obj):
threading.Thread.__init__(self)
self.result = None
self.obj = obj
#Function who is called when you start your Thread
def run(self)
#Execute your function here
self.result = search_in_genome(self.obj)
if __name__ == '__main__':
result_final=[]
objects=['obj1','obj2','obj3',...]
#List of Thread
listThread = []
#Count number of potential thread
allThread = objects.len()
allThreadDone = 0
for item in objects:
#Create one thread
thread = myThread(item)
#Launch that Thread
thread.start()
#Stock it into the list
listThread.append(thread)
while True:
for thread in listThread:
#Count number of Thread who are finished
if thread.result != None:
#If a Thread is finished, count it
allThreadDone += 1
#If all thread are finished, then stop program
if allThreadDone == allThread:
break
#Else initialyse flag to count again
else:
allThreadDone = 0
If someone can check and validate this code that would be better. (Sorry for my english btw)

Python: Why is threaded function slower than non thread

Hello I'm trying to calculate the first 10000 prime numbers.
I'm doing this first non threaded and then splitting the calculation in 1 to 5000 and 5001 to 10000. I expected that the use of threads makes it significant faster but the output is like this:
--------Results--------
Non threaded Duration: 0.012244000000000005 seconds
Threaded Duration: 0.012839000000000017 seconds
There is in fact no big difference except that the threaded function is even a bit slower.
What is wrong?
This is my code:
import math
from threading import Thread
def nonThreaded():
primeNtoM(1,10000)
def threaded():
t1 = Thread(target=primeNtoM, args=(1,5000))
t2 = Thread(target=primeNtoM, args=(5001,10000))
t1.start()
t2.start()
t1.join()
t2.join()
def is_prime(n):
if n % 2 == 0 and n > 2:
return False
for i in range(3, int(math.sqrt(n)) + 1, 2):
if n % i == 0:
return False
return True
def primeNtoM(n,m):
L = list()
if (n > m):
print("n should be smaller than m")
return
for i in range(n,m):
if(is_prime(i)):
L.append(i)
if __name__ == '__main__':
import time
print("--------Nonthreaded calculation--------")
nTstart_time = time.clock()
nonThreaded()
nonThreadedTime = time.clock() - nTstart_time
print("--------Threaded calculation--------")
Tstart_time = time.clock()
threaded()
threadedTime = time.clock() - Tstart_time
print("--------Results--------")
print ("Non threaded Duration: ",nonThreadedTime, "seconds")
print ("Threaded Duration: ",threadedTime, "seconds")

from: https://wiki.python.org/moin/GlobalInterpreterLock
In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple native threads from executing Python bytecodes at once. This lock is necessary mainly because CPython's memory management is not thread-safe. (However, since the GIL exists, other features have grown to depend on the guarantees that it enforces.)
This means: since this is CPU-intensive, and python is not threadsafe, it does not allow you to run multiple bytecodes at once in the same process. So, your threads alternate each other, and the switching overhead is what you get as extra time.

You can use the multiprocessing module, which gives results like below:
('Non threaded Duration: ', 0.016599999999999997, 'seconds')
('Threaded Duration: ', 0.007172000000000005, 'seconds')
...after making just these changes to your code (changing 'Thread' to 'Process'):
import math
#from threading import Thread
from multiprocessing import Process
def nonThreaded():
primeNtoM(1,10000)
def threaded():
#t1 = Thread(target=primeNtoM, args=(1,5000))
#t2 = Thread(target=primeNtoM, args=(5001,10000))
t1 = Process(target=primeNtoM, args=(1,5000))
t2 = Process(target=primeNtoM, args=(5001,10000))
t1.start()
t2.start()
t1.join()
t2.join()
By spawning actual OS processes instead of using in-process threading, you eliminate the GIL issues discussed in #Luis Masuelli's answer.
multiprocessing is a package that supports spawning processes using an
API similar to the threading module. The multiprocessing package
offers both local and remote concurrency, effectively side-stepping
the Global Interpreter Lock by using subprocesses instead of threads.
Due to this, the multiprocessing module allows the programmer to fully
leverage multiple processors on a given machine. It runs on both Unix
and Windows.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Multithreading works slower - python

Related

Why is my multithreading program only actually using a single thread?

why python multithreading runs like a single thread on macos?

Multithreading inside Multiprocessing in Python

Multiprocessing using maximum CPU power in Python-3.x

Python: Why is threaded function slower than non thread

Categories

Resources