A few hours ago, I asked a question about Python multithreading. To understand how it works, I have performed some experiments, and here are my tests:
Python script which uses threads:
import threading
import Queue
import time

s = 0

class ThreadClass(threading.Thread):
    lck = threading.Lock()

    def __init__(self, inQ, outQ):
        threading.Thread.__init__(self)
        self.inQ = inQ
        self.outQ = outQ

    def run(self):
        while True:
            global s
            #print self.getName()+" is running..."
            self.item = self.inQ.get()
            #self.inQ.task_done()
            ThreadClass.lck.acquire()
            s += self.item
            ThreadClass.lck.release()
            #self.inQ.task_done()
            self.outQ.put(self.item)
            self.inQ.task_done()

inQ = Queue.Queue()
outQ = Queue.Queue()
i = 0
n = 1000000

print "putting items to input"
while i < n:
    inQ.put(i)
    i += 1

start_time = time.time()
print "starting threads..."
for i in xrange(10):
    t = ThreadClass(inQ, outQ)
    t.setDaemon(True)
    t.start()
inQ.join()
end_time = time.time()

print "Elapsed time is: %s" % (end_time - start_time)
print s
The following has the same functionality with a simple while loop:
import Queue
import time

inQ = Queue.Queue()
outQ = Queue.Queue()
i = 0
n = 1000000
sum = 0

print "putting items to input"
while i < n:
    inQ.put(i)
    i += 1

print "while loop starts..."
start_time = time.time()
while inQ.qsize() > 0:
    item = inQ.get()
    sum += item
    outQ.put(item)
end_time = time.time()

print "Elapsed time is: %s" % (end_time - start_time)
print sum
If you run these programs on your machine, you can see that the threaded version is much slower than a simple while loop. I am a bit confused about threads and want to know what is wrong with the threaded code. How can I optimize it (in this situation), and why is it slower than the while loop?
Threading is always tricky, but threading in Python is special.
To discuss optimization, you have to focus on specific cases; otherwise there is no single answer.
The initial threaded solution runs in 37.11 s on my computer. If you use a local variable to accumulate the sum in each thread and take the lock only once at the end, the time drops to 32.62 s.
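One possible way to implement that local-sum variant, as a sketch only (it drains the queue with get_nowait so each thread publishes its total exactly once instead of locking per item; names follow the question's code):

def run(self):
    global s
    local_sum = 0
    while True:
        try:
            item = self.inQ.get_nowait()
        except Queue.Empty:
            break                      # queue drained, time to publish the local total
        local_sum += item
        self.outQ.put(item)
        self.inQ.task_done()
    # one locked update per thread instead of one per item
    ThreadClass.lck.acquire()
    s += local_sum
    ThreadClass.lck.release()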
OK. The no-thread solution runs in 7.47 s. Great. But if you want to sum a ton of numbers in Python, you just use the built-in function sum. So if we use a list with no threads and the built-in sum, the time drops to 0.09 s. Great!
Why?
Threads in Python are subject to the Global Interpreter Lock (GIL). They will never run Python code in parallel. They are real threads, but internally, they are only allowed to run X Python instructions before releasing the GIL to another thread. For very simple calculations, the cost of creating a thread, locking and context switching is much bigger than the cost of your simple computation. So in this case, the overhead is 5 times bigger than the computation itself. Threading in Python is interesting when you can't use async I/O or when you have blocking functions that should run at the same time.
But why is the built-in sum faster than the no-thread Python solution? The built-in sum is implemented in C, and Python-level loops are terrible performance-wise. So it is much faster to iterate over all elements of the list using the built-in sum.
Is that always the case? No, it depends on what you are doing. If you were writing these numbers to n different files, the threaded solution could have a chance, as the GIL is released during I/O. But even then, we would need to check whether I/O buffering and disk sync time were the real game changers. This kind of detail makes a final answer very difficult. So, if you want to optimize something, you must know exactly what you have to optimize. To sum a list of numbers in Python, just use the built-in sum.
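For reference, a minimal sketch of the no-thread, built-in-sum variant measured above (timings will of course vary by machine):

import time

n = 1000000
numbers = list(range(n))   # a plain list instead of a Queue

start_time = time.time()
total = sum(numbers)       # the loop happens in C inside the built-in
end_time = time.time()

print("Elapsed time is: %s" % (end_time - start_time))
print(total)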
I have realized that my multithreading program isn't doing what I think it's doing. The following is an MWE of my strategy. In essence, I'm creating nThreads threads but only actually using one of them. Could somebody help me understand my mistake and how to fix it?
import threading
import queue

NPerThread = 100
nThreads = 4

def worker(q: queue.Queue, oq: queue.Queue):
    while True:
        l = []
        threadIData = q.get(block=True)
        for i in range(threadIData["N"]):
            l.append(f"hello {i} from thread {threading.current_thread().name}")
        oq.put(l)
        q.task_done()

threadData = [{} for i in range(nThreads)]
inputQ = queue.Queue()
outputQ = queue.Queue()

for threadI in range(nThreads):
    threadData[threadI]["thread"] = threading.Thread(
        target=worker, args=(inputQ, outputQ),
        name=f"WorkerThread{threadI}"
    )
    threadData[threadI]["N"] = NPerThread
    threadData[threadI]["thread"].setDaemon(True)
    threadData[threadI]["thread"].start()

for threadI in range(nThreads):
    inputQ.put(threadData[threadI])

inputQ.join()

outData = [None] * nThreads
count = 0
while not outputQ.empty():
    outData[count] = outputQ.get()
    count += 1

for i in outData:
    assert len(i) == NPerThread
    print(len(i))

print(outData)
Edit: I only actually realised that I had made this mistake after profiling. Here's the output, for information:
In your sample program, the worker function is just executing so fast that the same thread is able to dequeue every item. If you add a time.sleep(1) call to it, you'll see other threads pick up some of the work.
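For example, a sketch of the question's worker with an artificial delay added (only the time.sleep line is new; everything else matches the code above):

import queue
import threading
import time

def worker(q: queue.Queue, oq: queue.Queue):
    while True:
        l = []
        threadIData = q.get(block=True)
        time.sleep(1)  # artificial delay: now the other threads get a chance to dequeue items
        for i in range(threadIData["N"]):
            l.append(f"hello {i} from thread {threading.current_thread().name}")
        oq.put(l)
        q.task_done()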
However, it is important to understand if threads are the right choice for your real application, which presumably is doing actual work in the worker threads. As #jrbergen pointed out, because of the GIL, only one thread can execute Python bytecode at a time, so if your worker functions are executing CPU-bound Python code (meaning not doing blocking I/O or calling a library that releases the GIL), you're not going to get a performance benefit from threads. You'd need to use processes instead in that case.
I'll also note that you may want to use concurrent.futures.ThreadPoolExecutor or multiprocessing.dummy.ThreadPool for an out-of-the-box thread pool implementation, rather than creating your own.
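As a rough sketch of what that could look like with concurrent.futures (make_greetings is a made-up stand-in for the per-thread work in your MWE):

import threading
from concurrent.futures import ThreadPoolExecutor

NPerThread = 100
nThreads = 4

def make_greetings(n):
    # build one batch of messages; the pool decides which thread runs this call
    return [f"hello {i} from thread {threading.current_thread().name}" for i in range(n)]

with ThreadPoolExecutor(max_workers=nThreads) as pool:
    outData = list(pool.map(make_greetings, [NPerThread] * nThreads))

for batch in outData:
    assert len(batch) == NPerThread
print(outData)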
I'm trying to create a function that will generate a hash using the sha1 algorithm with 9 leading zeroes. The hash is based on some random data and, as in cryptocurrency mining, I just want to add 1 to the string that is used in the hash function.
To make this faster, I used map() from the Pool class so it runs on all my cores, but I have an issue if I pass a chunk larger than range(99999999):
import datetime
import hashlib
import multiprocessing
import sys
from multiprocessing import Pool

def computesha(counter):
    hash = 'somedata' + 'otherdata' + str(counter)
    newHash = hashlib.sha1(hash.encode()).hexdigest()
    if newHash[:9] == '000000000':
        print(str(newHash))
        print(str(counter))
        return str(newHash), str(counter)

if __name__ == '__main__':
    d1 = datetime.datetime.now()
    print("Start timestamp " + str(d1))
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    p = Pool()
    p.map(computesha, range(sys.maxsize))
    print(return_dict)
    p.close()
    p.join()
    d2 = datetime.datetime.now()
    print("End timestamp " + str(d2))
    print("Elapsed time: " + str((d2 - d1)))
I want to create something like a global counter to feed into the function while it runs in parallel, but if I try range(sys.maxsize) I get a MemoryError (I know, because I don't have enough RAM, and few do), so I want to split the sequence generated by range() into chunks.
Is this possible or should I try a different approach?
Hi Alin, and welcome to Stack Overflow.
Firstly, yes, a global counter is possible, e.g. with a multiprocessing.Queue or a multiprocessing.Value which is passed to the workers. However, fetching a new number from the global counter would result in locking (and possibly waiting for) the counter. This can and should be avoided, as you would need to make a LOT of counter queries. My proposed solution below avoids the global counter by installing several local counters which work together as if they were a single global counter.
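For completeness, a minimal sketch of such a shared counter with multiprocessing.Value (next_counter is a made-up helper; as explained, every call has to take the lock, which is exactly the bottleneck the solution below avoids):

import multiprocessing

# a shared counter every worker process can increment atomically
counter = multiprocessing.Value('L', 0)  # unsigned long, starts at 0

def next_counter(shared):
    # each call locks the shared value, so heavy use becomes a bottleneck
    with shared.get_lock():
        value = shared.value
        shared.value += 1
    return value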
Regarding the RAM consumption of your code, I see two problems:
computesha returns None most of the time. These values go into the list of results that map builds (even though you do not assign the return value of map). This means that the result is a lot bigger than necessary.
Generally speaking, the RAM of a process is freed after the process finishes. Your processes start a LOT of tasks which all reserve their own memory. A possible solution is the maxtasksperchild option (see the documentation of multiprocessing.pool.Pool). When you set this option to 1000, the pool closes each worker process after 1000 tasks and creates a new one, which frees the memory.
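A rough sketch of that option applied to the question's computesha (imap_unordered with a chunksize also avoids building one huge list of mostly-None results; the concrete numbers are placeholders):

from multiprocessing import Pool

if __name__ == '__main__':
    # recycle each worker after 1000 tasks so its memory is returned to the OS
    p = Pool(maxtasksperchild=1000)
    # iterate lazily instead of building a giant list of mostly-None results
    for result in p.imap_unordered(computesha, range(100000000), chunksize=10000):
        if result is not None:
            print(result)
    p.close()
    p.join()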
However, I'd like to propose a different solution which solves both problems, is very memory-friendly, and (as it seems to me after N<10 tests) runs faster than the solution with the maxtasksperchild option:
#!/usr/bin/env python3
import datetime
import multiprocessing
import hashlib
import sys

def computesha(process_number, number_of_processes, max_counter, results):
    counter = process_number  # every process starts with a different counter
    data = 'somedata' + 'otherdata'
    while counter < max_counter:  # stop after max_counter jobs have been started
        hash = "".join((data, str(counter)))
        newHash = hashlib.sha1(hash.encode()).hexdigest()
        if newHash[:9] == '000000000':
            print(str(newHash))
            print(str(counter))
            # return the results through a queue
            results.put((str(newHash), str(counter)))
        counter += number_of_processes  # 'jump' to the next chunk

if __name__ == '__main__':
    # execute this file with two command line arguments:
    number_of_processes = int(sys.argv[1])
    max_counter = int(sys.argv[2])

    # this queue will be used to collect the results after the jobs finished
    results = multiprocessing.Queue()

    processes = []
    # start a number of processes...
    for i in range(number_of_processes):
        p = multiprocessing.Process(target=computesha, args=(i,
                                                             number_of_processes,
                                                             max_counter,
                                                             results))
        p.start()
        processes.append(p)

    # ... then wait for all processes to end
    for p in processes:
        p.join()

    # collect results
    while not results.empty():
        print(results.get())

    results.close()
This code spawns the desired number_of_processes which then call the computesha function. If number_of_processes=8 then the first process calculates the hash for the counter values [0,8,16,24,...], the second process for [1,9,17,25] and so on.
The advantages of this approach: in each iteration of the while loop the memory for hash and newHash can be reused, loops are cheaper than function calls and only number_of_processes function calls have to be made, and the uninteresting results are simply forgotten.
A possible disadvantage is that the counters are completely independent and every process will do exactly 1/number_of_processes of the overall work, even if some are faster than others. Eventually, the program is only as fast as the slowest process. I didn't measure it, but I guess it is a rather theoretical problem here.
Hope that helps!
I have a problem in Python where I want to run two loops at the same time. I feel like I need to do this because the second loop needs to be rate limited, but the first loop really shouldn't be rate limited. Also, the second loop takes an input from the first.
I'm looking for something that works like this:
list = []
for line in file:
    do some stuff
    list.append("an_item")

Rate limited:
for x in list:
    do some stuff simultaneously
There are two basic approaches with different tradeoffs: synchronously switching between tasks, and running in threads or subprocesses. First, some common setup:
from queue import Queue  # Python 2: from Queue import Queue

work = Queue()

def fast_task():
    """ Do the fast thing """
    if done:
        return None
    else:
        return result

def slow_task(arg):
    """ Do the slow thing """

RATE_LIMIT = 30  # seconds
Now, the synchronous approach. It has the advantage of being much simpler and easier to debug, at the cost of being a bit slower. How much slower depends on the details of your tasks. How it works: we run a tight loop that calls the fast job every time, and the slow job only if enough time has passed. If the fast job is no longer producing work and the queue is empty, we quit.
import time

last_call = 0
while True:
    next_job = fast_task()
    if next_job:
        work.put(next_job)
    elif work.empty():
        # nothing left to do
        break
    else:
        # fast task has done all its work - short sleep to slow the spin
        time.sleep(.1)
    now = time.time()
    if now - last_call > RATE_LIMIT:
        last_call = now
        slow_task(work.get())
If you feel like this doesn't work fast enough, you can try the multiprocessing approach. You can use the same structure for working with threads or processes, depending on whether you import from multiprocessing.dummy or multiprocessing itself. We use a multiprocessing.Queue for communication instead of queue.Queue.
def do_the_fast_loop(work_queue):
    while True:
        next_job = fast_task()
        if next_job:
            work_queue.put(next_job)
        else:
            work_queue.put(None)  # sentinel - tells slow process to quit
            break

def do_the_slow_loop(work_queue):
    next_call = time.time()
    while True:
        job = work_queue.get()
        if job is None:  # sentinel seen - no more work to do
            break
        time.sleep(max(0, next_call - time.time()))
        next_call = time.time() + RATE_LIMIT
        slow_task(job)
if __name__ == '__main__':
    # from multiprocessing.dummy import Queue, Process  # for threads
    from multiprocessing import Queue, Process  # for processes

    work = Queue()

    fast = Process(target=do_the_fast_loop, args=(work,))
    slow = Process(target=do_the_slow_loop, args=(work,))

    fast.start()
    slow.start()

    fast.join()
    slow.join()
As you can see, there's quite a lot more machinery for you to implement, but it will be somewhat faster. Again, how much faster depends a lot on your tasks. I'd try all three approaches - synchronous, threaded, and multiprocess - and see which you like best.
You need to do two things:
Put the function that requires data from the other in its own process
Implement a way to communicate between the two processes (e.g. a Queue)
All of this is necessary because of the GIL.
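A minimal sketch of that shape, with a made-up producer/consumer pair standing in for your two loops (the real work and the real rate limit go where the comments indicate):

from multiprocessing import Process, Queue
import time

def producer(q):
    # first loop: not rate limited, pushes work items as fast as it can
    for line in ["a", "b", "c"]:
        q.put(line)
    q.put(None)  # sentinel: tell the consumer there is no more work

def consumer(q):
    # second loop: rate limited, consumes items produced by the first loop
    while True:
        item = q.get()
        if item is None:
            break
        print("processing", item)
        time.sleep(1)  # crude rate limit

if __name__ == "__main__":
    q = Queue()
    p1 = Process(target=producer, args=(q,))
    p2 = Process(target=consumer, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()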
I'm working on the human genome, which consists of 3.2 billion characters, and I have a list of objects which need to be searched for within this data. Something like this:
result_final = []
objects = ['obj1', 'obj2', 'obj3', ...]

def function(obj):
    result_1 = search_in_genome(obj)
    return result_1

for item in objects:
    result_2 = function(item)
    result_final.append(result_2)
Each object's search within the data takes nearly 30 seconds, and I have a few thousand objects. I noticed that while doing this serially, just 7% of the CPU and 5% of the RAM are being used. From what I've read, I should reduce the computation time with parallel computation using queuing, threading or multiprocessing, but these seem complicated for non-experts. Could anybody show me how to write Python code that runs 10 simultaneous searches, and is it possible to make Python use the maximum available CPU and RAM for multiprocessing? (I'm using Python 3.3 on Windows 7 with 64 GB RAM and a Core i7 3.5 GHz CPU.)
You can use the multiprocessing module for this:
from multiprocessing import Pool

objects = ['obj1', 'obj2', 'obj3', ...]

def function(obj):
    result_1 = search_in_genome(obj)
    return result_1

if __name__ == "__main__":
    pool = Pool()
    result_final = pool.map(function, objects)
This will allow you to scale the work across all available CPUs on your machine, because processes aren't affected by the GIL. You wouldn't want to run too many more tasks than there are CPUs available. Once you do that, you actually start slowing things down, because then the CPUs have to constantly switch between processes, which has a performance penalty.
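If you specifically want 10 searches in flight at a time, as asked, you can cap the pool size; a sketch reusing function and objects from above (10 only pays off if you have that many cores to spare):

from multiprocessing import Pool

if __name__ == "__main__":
    pool = Pool(processes=10)   # at most 10 searches run at the same time
    result_final = pool.map(function, objects)
    pool.close()
    pool.join()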
OK, I'm not sure about your question, but I would do this (note that there may be a better solution, because I'm not an expert with the Queue object):
If you want to multithread your searches:
import threading

class myThread(threading.Thread):
    def __init__(self, obj):
        threading.Thread.__init__(self)
        self.result = None
        self.obj = obj

    # Function that is called when you start your Thread
    def run(self):
        # Execute your function here
        self.result = search_in_genome(self.obj)

if __name__ == '__main__':
    result_final = []
    objects = ['obj1', 'obj2', 'obj3', ...]
    # List of Threads
    listThread = []
    # Count the number of potential threads
    allThread = len(objects)
    allThreadDone = 0

    for item in objects:
        # Create one thread
        thread = myThread(item)
        # Launch that Thread
        thread.start()
        # Stock it into the list
        listThread.append(thread)

    while True:
        for thread in listThread:
            # Count the threads that are finished
            if thread.result != None:
                allThreadDone += 1
        # If all threads are finished, then stop the program
        if allThreadDone == allThread:
            break
        # Else reset the flag to count again
        else:
            allThreadDone = 0
If someone could check and validate this code, that would be great. (Sorry for my English, btw.)
Good day!
I'm trying to learn the multithreading features in Python, and I wrote the following code:
import time, argparse, threading, sys, subprocess, os

def item_fun(items, indices, lock):
    for index in indices:
        items[index] = items[index] * items[index] * items[index]

def map(items, cores):
    count = len(items)
    cpi = count / cores
    threads = []
    lock = threading.Lock()
    for core in range(cores):
        thread = threading.Thread(target=item_fun,
                                  args=(items, range(core * cpi, core * cpi + cpi), lock))
        threads.append(thread)
        thread.start()
    # handle the leftover items in the main thread
    item_fun(items, range((core + 1) * cpi, count), lock)
    for thread in threads:
        thread.join()

parser = argparse.ArgumentParser(description='cube', usage='%(prog)s [options] -n')
parser.add_argument('-n', action='store', help='number', dest='n', default='1000000', metavar='')
parser.add_argument('-mp', action='store_true', help='multi thread', dest='mp', default='True')
args = parser.parse_args()

NUMBER_OF_ITEMS = int(args.n)
items = range(NUMBER_OF_ITEMS)
# print 'items before:'
# print items

mp = args.mp
if mp is True:
    NUMBER_OF_PROCESSORS = int(os.getenv("NUMBER_OF_PROCESSORS"))
    start = time.time()
    map(items, NUMBER_OF_PROCESSORS)
    end = time.time()
else:
    start = time.time()
    item_fun(items, range(NUMBER_OF_ITEMS), None)
    end = time.time()

# print 'items after:'
# print items
print 'time elapsed: ', (end - start)
When I use the mp argument, it runs slower: on my machine with 4 CPUs, it takes about 0.5 seconds to compute the result, while with a single thread it takes about 0.3 seconds.
Am I doing something wrong?
I know there's Pool.map() etc., but it spawns subprocesses, not threads, and it works faster as far as I know; still, I'd like to write my own thread pool.
Python has no true multithreading, due to an implementation detail called the "GIL". Only one thread actually runs at a time, and Python switches between the threads. (Third party implementations of Python, such as Jython, can actually run parallel threads.)
Exactly why your program is slower in the multithreaded version depends on the details, but when coding for Python one needs to be aware of the GIL, so one does not assume that CPU-bound loads are processed more efficiently by adding threads to the program.
Other things to be aware of are, for instance, multiprocessing and numpy for solving CPU-bound loads, and PyEv (minimal) and Tornado (huge kitchen sink) for solving I/O-bound loads.
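For instance, a rough sketch of the question's cube computation done with multiprocessing instead of threads (pool size is left at the default, and the chunk size is just a placeholder):

from multiprocessing import Pool

def cube(x):
    return x * x * x

if __name__ == '__main__':
    items = range(1000000)
    pool = Pool()                                    # one worker per CPU by default
    result = pool.map(cube, items, chunksize=10000)  # large chunks keep IPC overhead low
    pool.close()
    pool.join()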
You'll only see an increase in throughput with threads in Python if you have threads that are I/O-bound. If what you're doing is CPU-bound, then you won't see any throughput increase.
Turning on the thread support in Python (by starting another thread) also seems to make some things slower, so you may find that overall performance still suffers.
This is all CPython, of course; other Python implementations have different behaviour.