Python: Why is a threaded function slower than a non-threaded one?

Hello, I'm trying to calculate the prime numbers up to 10000.
I do this first non-threaded and then split the calculation into the ranges 1 to 5000 and 5001 to 10000. I expected the use of threads to make it significantly faster, but the output looks like this:
--------Results--------
Non threaded Duration: 0.012244000000000005 seconds
Threaded Duration: 0.012839000000000017 seconds
There is in fact no big difference except that the threaded function is even a bit slower.
What is wrong?
This is my code:
import math
from threading import Thread

def nonThreaded():
    primeNtoM(1,10000)

def threaded():
    t1 = Thread(target=primeNtoM, args=(1,5000))
    t2 = Thread(target=primeNtoM, args=(5001,10000))
    t1.start()
    t2.start()
    t1.join()
    t2.join()

def is_prime(n):
    if n % 2 == 0 and n > 2:
        return False
    for i in range(3, int(math.sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True

def primeNtoM(n,m):
    L = list()
    if (n > m):
        print("n should be smaller than m")
        return
    for i in range(n,m):
        if(is_prime(i)):
            L.append(i)

if __name__ == '__main__':
    import time
    print("--------Nonthreaded calculation--------")
    nTstart_time = time.clock()
    nonThreaded()
    nonThreadedTime = time.clock() - nTstart_time
    print("--------Threaded calculation--------")
    Tstart_time = time.clock()
    threaded()
    threadedTime = time.clock() - Tstart_time
    print("--------Results--------")
    print("Non threaded Duration: ", nonThreadedTime, "seconds")
    print("Threaded Duration: ", threadedTime, "seconds")

from: https://wiki.python.org/moin/GlobalInterpreterLock
In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple native threads from executing Python bytecodes at once. This lock is necessary mainly because CPython's memory management is not thread-safe. (However, since the GIL exists, other features have grown to depend on the guarantees that it enforces.)
This means: since this is CPU-intensive work and CPython's GIL prevents multiple threads from executing bytecode at once in the same process, your threads simply take turns, and the switching overhead is the extra time you see.

You can use the multiprocessing module, which gives results like below:
('Non threaded Duration: ', 0.016599999999999997, 'seconds')
('Threaded Duration: ', 0.007172000000000005, 'seconds')
...after making just these changes to your code (changing 'Thread' to 'Process'):
import math
#from threading import Thread
from multiprocessing import Process

def nonThreaded():
    primeNtoM(1,10000)

def threaded():
    #t1 = Thread(target=primeNtoM, args=(1,5000))
    #t2 = Thread(target=primeNtoM, args=(5001,10000))
    t1 = Process(target=primeNtoM, args=(1,5000))
    t2 = Process(target=primeNtoM, args=(5001,10000))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
By spawning actual OS processes instead of using in-process threading, you eliminate the GIL issues discussed in Luis Masuelli's answer above.
multiprocessing is a package that supports spawning processes using an
API similar to the threading module. The multiprocessing package
offers both local and remote concurrency, effectively side-stepping
the Global Interpreter Lock by using subprocesses instead of threads.
Due to this, the multiprocessing module allows the programmer to fully
leverage multiple processors on a given machine. It runs on both Unix
and Windows.
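As an aside, primeNtoM in the question builds a list and throws it away; with separate processes the parent would never see that list anyway, since each process has its own memory. A minimal sketch using multiprocessing.Pool to get the results back to the parent (the helper name primes_in and the two ranges are my own choices, and the primality test adds an n < 2 guard the original code lacks):

```python
import math
from multiprocessing import Pool

def is_prime(n):
    # same trial-division test as in the question, plus a guard for n < 2
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, int(math.sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True

def primes_in(bounds):
    # each worker returns its own list of primes to the parent
    n, m = bounds
    return [i for i in range(n, m) if is_prime(i)]

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        # pool.map sends one range to each worker and collects the results
        chunks = pool.map(primes_in, [(1, 5000), (5001, 10000)])
    primes = [p for chunk in chunks for p in chunk]
    print(len(primes))  # -> 1229 (primes below 10000)
```

Unlike the Process version, this returns the computed primes instead of discarding them, at the cost of pickling each result list back to the parent.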


Threading module

I have a question; I'm quite new to threading. I wrote this code:
import threading
from colorama import *
import random
import os

listax = [Fore.GREEN, Fore.YELLOW, Fore.RED]
print(random.choice(listax))

def hola():
    import requests
    a = requests.get('https://google.com')
    print(a.status_code)

if __name__ == "__main__":
    t1 = threading.Thread(target=hola)
    t2 = threading.Thread(target=hola)
    t3 = threading.Thread(target=hola)
    t1.start()
    t2.start()
    t3.start()
    t1.join()
    t2.join()
    t3.join()
The output is printed three times because I start three threads. But my question is: suppose I have a big program that all starts in:

def main():
    code...

How can I add multiple threads to make it work faster? I see that if I add 3 threads the output is printed 3 times, but how can I add, for example, 10 threads to the same task without the output repeating 10 times, so it runs as fast as possible using the system's resources?
Multithreading does not magically speed up your code. It's up to you to break the code into chunks that can run concurrently. When you create 3 threads that run hola, you are not "running hola once using 3 threads", you are "running hola three times, each time in a different thread".
Although multithreading can in principle be used to perform computation in parallel, the most common Python interpreter (CPython) is implemented using a lock (the GIL) that lets only one thread execute Python bytecode at a time. There are libraries that release the GIL before doing CPU-intensive work (NumPy, for example), so threading can still help for CPU-intensive work done inside such libraries. Moreover, I/O operations release the GIL, so multithreading in Python is very well suited for I/O-bound work.
As an example, let's imagine that you need to access three different sites. You can access them sequentially, one after the other:
import requests

sites = ['https://google.com', 'https://yahoo.com', 'https://rae.es']

def hola(site):
    a = requests.get(site)
    print(site, " answered ", a.status_code)

for s in sites:
    hola(s)
Or concurrently (all at the same time) using threads:
import requests
import threading

sites = ['https://google.com', 'https://yahoo.com', 'https://rae.es']

def hola(site):
    a = requests.get(site)
    print(site, " answered ", a.status_code)

th = [threading.Thread(target=hola, args=(s, )) for s in sites]
for t in th:
    t.start()
for t in th:
    t.join()
Please note that this is a simple example: the output can get scrambled, you have no access to the return values, etc. For this kind of task I would use a thread pool.
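A thread-pool version of the same idea might look like this (a sketch using concurrent.futures from the standard library; the sites list is the one from the example above, and returning the status instead of printing it inside the worker is my own change to avoid scrambled output):

```python
import requests
from concurrent.futures import ThreadPoolExecutor

sites = ['https://google.com', 'https://yahoo.com', 'https://rae.es']

def hola(site):
    # return the result instead of printing, so the caller gets the values back
    return site, requests.get(site, timeout=10).status_code

with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map runs hola concurrently but yields results in the order of `sites`
    for site, status in pool.map(hola, sites):
        print(site, "answered", status)
```

The pool also reuses a fixed number of threads instead of creating one per task, which matters once the list of sites gets long.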
I tried to use a loop with the code you gave me:
# Python program to illustrate the concept
# of threading
# importing the threading module
import threading
from colorama import *
import random
import os

listax = [Fore.GREEN, Fore.YELLOW, Fore.RED]
print(random.choice(listax))

"""
def print_cube(num):
    function to print cube of given num
    print("Cube: {}".format(num * num * num))
"""

def print_square():
    num = 2
    """
    function to print square of given num
    """
    print("Square: {}".format(num * num))

def hola():
    import requests
    a = requests.get('https://google.com')
    print(a.status_code)

if __name__ == "__main__":
    for j in range(10):
        t1 = threading.Thread(target=hola)
        t1.start()
        t1.join()
But when I run the code it does one print at a time; in my case it gives me
200
then a second later 200 again,
and so on, 10 times (because I added 10 threads).
But I want to know how to make this run as fast as possible without showing the 10 outputs: I want the code to do just one print, but as fast as possible, using for example 10 threads.
You can simply use a for loop.
number_of_threads is the number of threads you want to run. Start all of them first, and join them afterwards; if you join inside the same loop iteration that starts the thread, each thread finishes before the next one starts and they run one at a time:

threads = []
for _ in range(number_of_threads):
    t = threading.Thread(target=hola)
    t.start()
    threads.append(t)
for t in threads:
    t.join()

Why using locks to read/write shared memory is faster than not using locks

I am experimenting with Python multiprocessing and I am surprised to see counter-intuitive results while using locks.
My assumption was that when using shared memory, if we use a lock to ensure only one process can access the shared object at a time, all the other processes will wait for the locking process to release the lock before reading/writing the shared object. So although this ensures data integrity and eliminates race conditions, it will also inevitably cause all but one process to wait on the locking process. As a result, the overall execution time will be longer than without locks. Without locks, no process waits on any other, so they finish faster, but they also corrupt the data.
With this understanding in mind, I wrote a quick program to find out exactly how much slower using a lock is, but surprisingly it seems using a lock is actually significantly faster.
Can you explain why using a lock is faster? Is something wrong with my understanding of how locks work? Or is there something wrong with the way I've set up the experiment (code below)?
If lock.acquire() and lock.release() are commented out (NOT using the lock), it takes 75.85 seconds.
If lock.acquire() and lock.release() are uncommented (using the lock), it takes 57.17 seconds.
I am using Python 3.7, running on an Intel i7-8550U, 4 physical cores (8 threads with hyperthreading).
import os
import multiprocessing as mp
from timeit import default_timer as timer

def func(n, val, lock, add):
    for i in range(n):
        lock.acquire()  # comment out to run without the lock
        if add:
            val.value += 1
        else:
            val.value -= 1
        lock.release()  # comment out to run without the lock
        if i % (n / 10) == 0:
            print(f"Process: {os.getpid()}, {(i / n * 100):.2f}%")

if __name__ == "__main__":
    start = timer()
    lock = mp.Lock()
    n = 2000000
    val = mp.Value('i', 100)
    arr = mp.Array('i', 10)
    p1 = mp.Process(target=func, args=(n, val, lock, True))
    p2 = mp.Process(target=func, args=(n, val, lock, False))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    elapsed = timer() - start
    print("Time elapsed = {}".format(elapsed))
    print(f"Val = {val.value}")

Python Multithreading vs Multiprocessing vs Sequential Execution

I have the below code:
import time
from threading import Thread
from multiprocessing import Process

def fun1():
    for _ in xrange(10000000):
        print 'in fun1'
        pass

def fun2():
    for _ in xrange(10000000):
        print 'in fun2'
        pass

def fun3():
    for _ in xrange(10000000):
        print 'in fun3'
        pass

def fun4():
    for _ in xrange(10000000):
        print 'in fun4'
        pass

if __name__ == '__main__':
    #t1 = Thread(target=fun1, args=())
    t1 = Process(target=fun1, args=())
    #t2 = Thread(target=fun2, args=())
    t2 = Process(target=fun2, args=())
    #t3 = Thread(target=fun3, args=())
    t3 = Process(target=fun3, args=())
    #t4 = Thread(target=fun4, args=())
    t4 = Process(target=fun4, args=())
    t1.start()
    t2.start()
    t3.start()
    t4.start()
    start = time.clock()
    t1.join()
    t2.join()
    t3.join()
    t4.join()
    end = time.clock()
    print("Time Taken = ", end - start)
    '''
    start = time.clock()
    fun1()
    fun2()
    fun3()
    fun4()
    end = time.clock()
    print("Time Taken = ", end - start)
    '''
I ran the above program in three ways:
First: sequential execution ALONE (look at the commented code and comment out the upper code)
Second: multithreaded execution ALONE
Third: multiprocessing execution ALONE
The observations for end_time-start time are as follows:
Overall Running times
('Time Taken = ', 342.5981313667716) --- Running time by threaded execution
('Time Taken = ', 232.94691744899296) --- Running time by sequential Execution
('Time Taken = ', 307.91093406618216) --- Running time by Multiprocessing execution
Question:
I see sequential execution takes the least time and multithreading takes the most. Why? I am unable to understand this and am surprised by the results. Please clarify.
Since this is a CPU-intensive task and the GIL is acquired, my understanding was that multiprocessing would take the least time while threaded execution would take the most. Please validate my understanding.
You use time.clock, which gives you CPU time and not wall-clock time: you can't use that in your case, as it gives you the execution time (how long your code kept the CPU busy, which will be almost the same in each of these cases).
Running your code with time.time() instead of time.clock gave me these times on my computer:
Process : ('Time Taken = ', 5.226783990859985)
seq : ('Time Taken = ', 6.3122560000000005)
Thread : ('Time Taken = ', 17.10062599182129)
The task given here (printing) is so fast that the speedup from using multiprocessing is almost balanced by the overhead.
For threading, since only one thread can run at a time because of the GIL, you end up running all your functions sequentially, BUT with the overhead of threading on top (switching threads every few iterations can cost up to several milliseconds each time). So you end up with something much slower.
Threading is useful when you have waiting times, so you can run other tasks in between.
Multiprocessing is useful for computationally expensive tasks, ideally ones that are completely independent (no shared variables). If you need to share variables, you have to face the GIL and it gets a little more complicated (but not impossible most of the time).
EDIT: Actually, using time.clock like you did tells you how much overhead Threading and Multiprocessing cost you.
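The difference between the two clocks is easy to see directly. A small sketch (note that time.clock was removed in Python 3.8; this uses its modern replacements, time.perf_counter for wall-clock time and time.process_time for CPU time):

```python
import time

start_wall = time.perf_counter()   # wall-clock time
start_cpu = time.process_time()    # CPU time of this process only

time.sleep(1)  # waiting consumes wall-clock time but almost no CPU

wall = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu
print("wall: %.2fs, cpu: %.2fs" % (wall, cpu))  # wall is ~1.00s, cpu is near 0
```

The same effect explains the question's numbers: joining on threads or processes is mostly waiting, so time.clock barely counts it.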
Basically you're right.
What platform do you use to run the code snippet? I guess Windows.
Note that "print" is not CPU-bound, so you should comment out the "print" and try running it on Linux to see the difference (it should then be what you expect).
Use code like this:

def fun1():
    for _ in xrange(10000000):
        # No print, and please run on linux
        pass

Multithreading works slower

Good day!
I'm trying to learn the multithreading features in Python, and I wrote the following code:
import time, argparse, threading, sys, subprocess, os

def item_fun(items, indices, lock):
    for index in indices:
        items[index] = items[index]*items[index]*items[index]

def map(items, cores):
    count = len(items)
    cpi = count/cores
    threads = []
    lock = threading.Lock()
    for core in range(cores):
        thread = threading.Thread(target=item_fun, args=(items, range(core*cpi, core*cpi + cpi), lock))
        threads.append(thread)
        thread.start()
    item_fun(items, range((core+1)*cpi, count), lock)
    for thread in threads:
        thread.join()

parser = argparse.ArgumentParser(description='cube', usage='%(prog)s [options] -n')
parser.add_argument('-n', action='store', help='number', dest='n', default='1000000', metavar = '')
parser.add_argument('-mp', action='store_true', help='multi thread', dest='mp', default='True')
args = parser.parse_args()
NUMBER_OF_ITEMS = int(args.n)
items = range(NUMBER_OF_ITEMS)
# print 'items before:'
# print items
mp = args.mp
if mp is True:
    NUMBER_OF_PROCESSORS = int(os.getenv("NUMBER_OF_PROCESSORS"))
    start = time.time()
    map(items, NUMBER_OF_PROCESSORS)
    end = time.time()
else:
    start = time.time()
    item_fun(items, range(NUMBER_OF_ITEMS), None)
    end = time.time()
#print 'items after:'
#print items
print 'time elapsed: ', (end - start)
When I use the mp argument, it works slower: on my machine with 4 CPUs it takes about 0.5 s to compute the result, while a single thread takes about 0.3 s.
Am I doing something wrong?
I know there's Pool.map() etc., but that spawns subprocesses, not threads, and it works faster as far as I know; still, I'd like to write my own thread pool.
Python has no true multithreading, due to an implementation detail called the "GIL". Only one thread actually runs at a time, and Python switches between the threads. (Third party implementations of Python, such as Jython, can actually run parallel threads.)
As to exactly why your program is slower in the multithreaded version, that depends, but when coding for Python one needs to be aware of the GIL, so one does not believe that CPU-bound loads are processed more efficiently by adding threads to the program.
Other things to be aware of are for instance multiprocessing and numpy for solving CPU bound loads, and PyEv (minimal) and Tornado (huge kitchen sink) for solving I/O bound loads.
You'll only see an increase in throughput with threads in Python if you have threads which are IO bound. If what you're doing is CPU bound then you won't see any throughput increase.
Turning on the thread support in Python (by starting another thread) also seems to make some things slower so you may find that overall performance still suffers.
This is all cpython of course, other Python implementations have different behaviour.
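For comparison, the same chunked cube computation can be spread over real processes instead of threads (a sketch using concurrent.futures from the standard library; the chunking mirrors the question's map function, but each worker returns its slice, since processes don't share memory the way threads do):

```python
from concurrent.futures import ProcessPoolExecutor

def cube_chunk(chunk):
    # each worker process cubes its own slice and returns it
    return [x * x * x for x in chunk]

def cube_all(items, workers):
    # split items into roughly equal chunks, one per worker
    size = len(items) // workers + 1
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(cube_chunk, chunks)
    return [x for part in results for x in part]

if __name__ == '__main__':
    out = cube_all(list(range(10)), workers=4)
    print(out[:4])  # -> [0, 1, 8, 27]
```

Because each chunk is pickled to a worker and the result pickled back, this only pays off when the per-item work outweighs the transfer cost; cubing small integers, as here, is usually too cheap.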

Python threading outperforms simple while loop OR threading Optimization

A few hours ago, I asked a question about Python multithreading. To understand how it works, I have performed some experiments, and here are my tests:
Python script which uses threads:
import threading
import Queue
import time

s = 0

class ThreadClass(threading.Thread):
    lck = threading.Lock()
    def __init__(self, inQ, outQ):
        threading.Thread.__init__(self)
        self.inQ = inQ
        self.outQ = outQ
    def run(self):
        while True:
            global s
            #print self.getName()+" is running..."
            self.item = self.inQ.get()
            #self.inQ.task_done()
            ThreadClass.lck.acquire()
            s += self.item
            ThreadClass.lck.release()
            #self.inQ.task_done()
            self.outQ.put(self.item)
            self.inQ.task_done()

inQ = Queue.Queue()
outQ = Queue.Queue()
i = 0
n = 1000000
print "putting items to input"
while i < n:
    inQ.put(i)
    i += 1
start_time = time.time()
print "starting threads..."
for i in xrange(10):
    t = ThreadClass(inQ, outQ)
    t.setDaemon(True)
    t.start()
inQ.join()
end_time = time.time()
print "Elapsed time is: %s" % (end_time - start_time)
print s
The following has the same functionality with a simple while loop:
import Queue
import time

inQ = Queue.Queue()
outQ = Queue.Queue()
i = 0
n = 1000000
sum = 0
print "putting items to input"
while i < n:
    inQ.put(i)
    i += 1
print "while loop starts..."
start_time = time.time()
while inQ.qsize() > 0:
    item = inQ.get()
    sum += item
    outQ.put(item)
end_time = time.time()
print "Elapsed time is: %s" % (end_time - start_time)
print sum
If you run these programs on your machine, you can see that the threads are much slower than a simple while loop. I am a bit confused about threads and want to know what is wrong with the threaded code. How can I optimize it (in this situation), and why is it slower than the while loop?
Threading is always tricky, but threading in Python is special.
To discuss optimization, you have to focus on special cases, otherwise there is no single answer.
The initial thread solution on my computer runs in 37.11 s. If you use a local variable to sum the elements in each thread and then lock only at the end, the time drops to 32.62 s.
OK. The no-thread solution runs in 7.47 s. Great. But if you want to sum a ton of numbers in Python, you should just use the built-in function sum. So, if we use a list with no threads and the sum built-in, the time drops to 0.09 s. Great!
Why?
Threads in Python are subject to the Global Interpreter Lock (GIL). They will never run Python code in parallel. They are real threads, but internally, they are only allowed to run X Python instructions before releasing the GIL to another thread. For very simple calculations, the cost of creating a thread, locking and context switching is much bigger than the cost of your simple computation. So in this case, the overhead is 5 times bigger than the computation itself. Threading in Python is interesting when you can't use async I/O or when you have blocking functions that should run at the same time.
But why is the sum built-in faster than the no-thread Python solution? The sum built-in is implemented in C, and Python loops suck performance-wise. So it is much faster to iterate over all elements of the list using the built-in sum.
Is it always the case? No, it depends on what you are doing. If you were writing these numbers to n different files, the threading solution could have a chance, as the GIL is released during I/O. But even then, we would need to check whether I/O buffering and disk sync time would be game changers. This kind of detail makes a final answer very difficult. So, if you want to optimize something, you must know exactly what you have to optimize. To sum a list of numbers in Python, just use the sum built-in.
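The gap between the interpreted loop and the C implementation is easy to measure for yourself (a sketch using timeit from the standard library; the exact ratio will vary by machine, but the built-in is reliably several times faster):

```python
import timeit

numbers = list(range(1000000))

def loop_sum(nums):
    # pure-Python loop: one bytecode dispatch per element
    total = 0
    for x in nums:
        total += x
    return total

loop_t = timeit.timeit(lambda: loop_sum(numbers), number=5)
builtin_t = timeit.timeit(lambda: sum(numbers), number=5)
print("loop: %.3fs, builtin: %.3fs" % (loop_t, builtin_t))
```

Both compute the same value; the entire difference is interpreter overhead, which is also the overhead the threaded version pays on every queue operation.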
