# test.py
import threading
import time
import random
from itertools import count

def fib(n):
    """fibonacci sequence"""
    if n < 2:
        return n
    else:
        return fib(n - 1) + fib(n - 2)

if __name__ == '__main__':
    counter = count(1)
    start_time = time.time()

    def thread_worker():
        while True:
            try:
                # To simulate downloading
                time.sleep(random.randint(5, 10))
                # To simulate doing some process, will take about 0.14 ~ 0.63 second
                fib(n=random.randint(28, 31))
            finally:
                finished_number = counter.next()
                print 'Has finished %d, the average speed is %f per second.' % (finished_number, finished_number / (time.time() - start_time))

    threads = [threading.Thread(target=thread_worker) for i in range(100)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
The above is my test script.
One pass of the thread_worker function takes at most 10.63 seconds.
I started 100 threads and expected roughly 10 completions per second.
But the actual results were frustrating:
...
Has finished 839, the average speed is 1.385970 per second.
Has finished 840, the average speed is 1.386356 per second.
Has finished 841, the average speed is 1.387525 per second.
...
And if I comment the line "fib(n=random.randint(28, 31))" out, the results are as expected:
...
Has finished 1026, the average speed is 12.982740 per second.
Has finished 1027, the average speed is 12.995230 per second.
Has finished 1028, the average speed is 13.007719 per second.
...
Has finished 1029, the average speed is 12.860571 per second.
My question is: why is it so slow? I expected ~10 per second.
How can I make it faster?
The fib() function is just there to simulate some processing, e.g. extracting data from a big HTML page.
If I ask you to bake a cake and this takes you an hour and a half, 30 minutes for the dough and 60 minutes in the oven, then by your logic I would expect two cakes to take exactly the same amount of time. However, there are some things you are missing. First, if I do not tell you to bake two cakes at the beginning, you have to make the dough twice, which is now 2 × 30 minutes. So it actually takes you two hours (you are free to work on the second cake once the first is in the oven).
Now let's assume I ask you to bake four cakes; again, I do not allow you to make the dough once and split it for four cakes, you have to make it every time. The time we would expect now is 4 × 30 minutes + one hour for the last cake to bake. Now, for the sake of the example, assume your wife helps you, meaning you can make the dough for two cakes in parallel. The expected time is now two hours, since each person has to bake two cakes. However, the oven you have can only fit two cakes at a time. The time now becomes 30 minutes to make the first batch of dough, one hour to bake it while you make the second batch, and after the first two cakes are done you put the next two cakes in the oven, which takes another hour. If you add up the times you will see that it took you two and a half hours.
If you take this further and I ask you for a thousand cakes, it will take you 500 and a half hours.
What does this have to do with threads?
Think of making the dough as an initial computation that creates 100% CPU load. Your wife is the second core in a dual core. The oven is a resource for which your program generates 50% load.
In real threading you have some overhead to start the threads (I told you to bake the cakes, you have to ask your wife for help, which takes time), and you compete for resources (e.g. memory access; you and your wife can't use the mixer at the same time). The speedup is sub-linear even if the number of threads is smaller than the number of cores.
Furthermore, smart programs do the shared work once (make the dough once) in the main thread and then hand the result to the threads; there is no need to duplicate a computation. It does not get faster just because you compute it twice.
While Manoj's answer is correct, I think it needs more explanation. The Python GIL is a mutex used in CPython that essentially disables any parallel execution of Python code. It does not make threaded code slower, nor does it prevent the OS from scheduling Python threads simultaneously on all your cores. It just makes sure only one thread can execute Python bytecode at a time.
What does this mean for you? You essentially do two things:
Sleep: While performing this function no Python code is being executed; you just do nothing for 5 to 10 seconds. In the meantime any other thread can do exactly the same thing. Given that the overhead of calling time.sleep is negligible, you could have thousands of threads and it would probably still scale linearly like you expected. This is why everything works as soon as you comment out the fib line. Your average sleep time is 7.5 s, so with 100 threads you'd expect roughly 13 iterations per second (100 / 7.5).
A calculation of the Fibonacci sequence: This one is the problem, as it is actually executing Python code. Let's say it takes about 0.5 s per calculation. We've just seen that you can only run one calculation at a time, no matter how many threads you have. Given that, you'd only get to about 2 calculations per second.
The actual numbers come in a little below those estimates of 13 and 2, mainly because there is some overhead involved. First of all you are printing data to the screen, which is almost always a surprisingly slow operation. Secondly, you're using 100 threads, which means the interpreter is constantly switching between 100 thread stacks (even if they're sleeping), which is not a lightweight operation.
Note that threading can still be very useful, though, for blocking calls where the execution is not done by Python itself but by some other resource. This could be waiting for the result of a socket, a sleep like in your example, or even a calculation that is done outside of Python itself (e.g. many numpy calculations).
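Since the question also asks how to make it faster: for CPU-bound work like fib() the usual route is processes rather than threads, because each process gets its own interpreter and its own GIL. Below is a minimal sketch of that idea (not part of the original post; Python 3, the workload size is illustrative), using concurrent.futures.ProcessPoolExecutor:

# Hedged sketch: CPU-bound work only scales across cores with processes.
import time
from concurrent.futures import ProcessPoolExecutor

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

if __name__ == '__main__':
    jobs = [30] * 16                         # illustrative workload

    start = time.time()
    results = [fib(n) for n in jobs]         # one core, one calculation at a time
    print('single process: %.2f s' % (time.time() - start))

    start = time.time()
    with ProcessPoolExecutor() as pool:      # one worker per CPU core by default
        results = list(pool.map(fib, jobs))
    print('process pool:   %.2f s' % (time.time() - start))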
Python threads use the Global Interpreter Lock (GIL) to synchronize access to the Python interpreter's state. Compared to other threading models, such as POSIX threads, the GIL can make Python threads significantly slower, especially when dealing with multiple cores. This is well known. Here is a really good presentation on the topic: www.dabeaz.com/python/UnderstandingGIL.pdf
You're looking for a faster solution. Memoizing results helps.
import collections.abc
import functools

class Memoized(object):
    """Decorator. Caches a function's return value each time it is called.
    If called later with the same arguments, the cached value is returned
    (not reevaluated).
    """
    def __init__(self, func):
        self.func = func
        self.cache = {}

    def __call__(self, *args):
        if not isinstance(args, collections.abc.Hashable):
            # uncacheable. a list, for instance.
            # better to not cache than blow up.
            return self.func(*args)
        if args in self.cache:
            return self.cache[args]
        else:
            value = self.func(*args)
            self.cache[args] = value
            return value

    def __repr__(self):
        """Return the function's docstring."""
        return self.func.__doc__

    def __get__(self, obj, objtype):
        """Support instance methods."""
        return functools.partial(self.__call__, obj)

if __name__ == '__main__':
    @Memoized
    def fibonacci(n):
        """Return the nth fibonacci number
        :param n: value
        """
        if n in (0, 1):
            return n
        return fibonacci(n - 1) + fibonacci(n - 2)

    print(fibonacci(35))
Try to run it with and without the @Memoized decorator.
Recipe taken from http://wiki.python.org/moin/PythonDecoratorLibrary#Memoize.
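For what it's worth, the standard library's functools.lru_cache (available since Python 3.2) gives the same effect with less code; a short sketch of the equivalent:

from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci(n):
    """Return the nth fibonacci number."""
    return n if n in (0, 1) else fibonacci(n - 1) + fibonacci(n - 2)

print(fibonacci(35))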
Related
I have a function that produces values quickly, and a second function that computes something from each value, which takes around 10-20 seconds. I need a thread pool so that around 40 instances of the second function can run concurrently, since the first function produces values so quickly. How can I go about doing this?
Eg:
def function1():  # gives values fast
    while (values not complete):
        newvalue = xyz
        UseAvailableThread(function2(newvalue))  # wait if a thread isn't available

def function2(value):
    # computes some data - takes 10-20 seconds
I hope the example helps. How can I go about doing this?
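One way this could look with just the standard library is a ThreadPoolExecutor capped at 40 workers. The sketch below only illustrates the pattern; the producer values, the 0.1 s sleep and the return value are placeholders, not the real functions:

import time
from concurrent.futures import ThreadPoolExecutor

def function2(value):
    # stand-in for the real 10-20 second computation
    time.sleep(0.1)
    return value * 2

def function1(values):
    # submit each value as soon as it is available; at most 40 run at once,
    # anything beyond that waits in the executor's internal queue
    with ThreadPoolExecutor(max_workers=40) as pool:
        futures = [pool.submit(function2, v) for v in values]
        return [f.result() for f in futures]   # block until everything is done

if __name__ == '__main__':
    print(function1(range(100)))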
I have written a program that sums a list by dividing it into sub-parts and using multiprocessing in Python. My code is the following:
from concurrent.futures import ProcessPoolExecutor, as_completed
import random
import time

def dummyFun(l):
    s = 0
    for i in range(0, len(l)):
        s = s + l[i]
    return s

def sumaSec(v):
    start = time.time()
    sT = 0
    for k in range(0, len(v), 10):
        vc = v[k:k + 10]
        print("vector ", vc)
        for item in vc:
            sT = sT + item
        print("sequential sum result ", sT)
        sT = 0
    start1 = time.time()
    print("sequential version time ", start1 - start)

def main():
    workers = 5
    vector = random.sample(range(1, 101), 100)
    print(vector)
    sumaSec(vector)
    dim = 10
    sT = 0
    for k in range(0, len(vector), dim):
        vc = vector[k:k + dim]
        print(vc)
        for item in vc:
            sT = sT + item
        print("sub list result ", sT)
        sT = 0
    chunks = (vector[k:k + dim] for k in range(0, len(vector), 10))
    start = time.time()
    with ProcessPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(dummyFun, chunk) for chunk in chunks]
        for future in as_completed(futures):
            print(future.result())
    start1 = time.time()
    print(start1 - start)

if __name__ == "__main__":
    main()
The problem is that for the sequential version I got a time of:
0.0009753704071044922
while for the concurrent version my time is:
0.10629010200500488
And when I reduce the number of workers to 2 my time is:
0.08622884750366211
Why is this happening?
The length of your vector is only 100. That is a very small amount of work, so the fixed cost of starting the process pool is the most significant part of the runtime. For this reason parallelism is most beneficial when there is a lot of work to do. Try a larger vector, like a length of 1 million.
The second problem is that you have each worker do a tiny amount of work: a chunk of size 10. Again, that means the cost of starting a task cannot be amortized over so little work. Use larger chunks. For example, instead of 10 use int(len(vector)/(workers*10)).
Also note that you're creating 5 processes. For a CPU-bound task like this one you ideally want to use the same number of processes as you have physical CPU cores. Either use whatever number of cores your system has, or, if you leave max_workers=None (the default value), ProcessPoolExecutor will default to that number for your system. If you use too few processes you're leaving performance on the table; if you use too many, the CPU will have to switch between them and your performance may suffer.
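As a rough sketch of those suggestions combined (a much larger vector, larger chunks, and one process per core; the numbers are illustrative and not benchmarked):

import os
import random
import time
from concurrent.futures import ProcessPoolExecutor, as_completed

def dummyFun(chunk):
    return sum(chunk)

if __name__ == '__main__':
    vector = [random.randint(1, 100) for _ in range(1_000_000)]   # much more work
    workers = os.cpu_count() or 1                                 # one process per core
    dim = max(1, len(vector) // (workers * 10))                   # much larger chunks
    chunks = (vector[k:k + dim] for k in range(0, len(vector), dim))
    start = time.time()
    with ProcessPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(dummyFun, chunk) for chunk in chunks]
        total = sum(future.result() for future in as_completed(futures))
    print(total, time.time() - start)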
Your chunking is pretty awful for creating multiple tasks.
Creating too many tasks still incurs a time penalty even when your workers are already created.
Maybe this post can help you in your search:
How to parallel sum a loop using multiprocessing in Python
I need to measure the time certain parts of my code take. While executing my code on a powerful server, I get 10 different results.
I tried comparing time measured with time.time(), time.perf_counter(), time.perf_counter_ns(), time.process_time() and time.process_time_ns().
import time

for _ in range(10):
    start = time.perf_counter()
    i = 0
    while i < 100000:
        i = i + 1
    time.sleep(1)
    end = time.perf_counter()
    print(end - start)
I'm expecting that when I execute the same code 10 times, the results will be the same to within about 1 ms, e.g. 1.041XX every time, and not spread between 1.030 s and 1.046 s.
When executing my code on a 16-CPU, 32 GB memory server, I'm receiving this result:
1.045549364
1.030857833
1.0466020120000001
1.0309665050000003
1.0464690349999994
1.046397238
1.0309525370000001
1.0312070380000007
1.0307592159999999
1.046095523
I'm expecting the result to be:
1.041549364
1.041857833
1.0416020120000001
1.0419665050000003
1.0414690349999994
1.041397238
1.0419525370000001
1.0412070380000007
1.0417592159999999
1.041095523
Your expectations are wrong. If you want to measure the average time your code takes, use the timeit module. It executes your code multiple times and averages over the runs.
The reason your code has different runtimes lies in your code:
time.sleep(1) # ensures (3.5+) _at least_ 1000ms are waited, won't be less, might be more
You are calling time.sleep inside the timed loop, and its variability accounts for the differences you see:
Quote from time.sleep(..) documentation:
Suspend execution of the calling thread for the given number of seconds. The argument may be a floating point number to indicate a more precise sleep time. The actual suspension time may be less than that requested because any caught signal will terminate the sleep() following execution of that signal’s catching routine. Also, the suspension time may be longer than requested by an arbitrary amount because of the scheduling of other activity in the system.
Changed in version 3.5: The function now sleeps at least secs even if the sleep is interrupted by a signal, except if the signal handler raises an exception (see PEP 475 for the rationale).
Emphasis mine.
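For completeness, here is a minimal sketch of the timeit suggestion, timing just the counting loop from the question (the repeat and number values are arbitrary):

import timeit

def busy_loop():
    i = 0
    while i < 100000:
        i = i + 1

# run the loop 100 times per measurement, repeat the measurement 10 times,
# and report the best (least disturbed) timing
print(min(timeit.repeat(busy_loop, repeat=10, number=100)))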
Running the same code does not take the same time on each loop iteration because of the system's scheduling (the system puts your process on hold to run another process, then comes back to it...).
I'm using this library, Tomorrow, which in turn uses the ThreadPoolExecutor from the standard library, to allow for async function calls.
Calling the decorator @tomorrow.threads(1) spins up a ThreadPoolExecutor with 1 worker.
Question
Why is it faster to execute a function using 1 thread worker over just calling it as is (e.g. normally)?
Why is it slower to execute the same code with 10 thread workers in place of just 1, or even None?
Demo code
imports excluded
def openSync(path: str):
    for row in open(path):
        for _ in row:
            pass

@tomorrow.threads(1)
def openAsync1(path: str):
    openSync(path)

@tomorrow.threads(10)
def openAsync10(path: str):
    openSync(path)

def openAll(paths: list):
    def do(func: callable) -> float:
        t = time.time()
        [func(p) for p in paths]
        t = time.time() - t
        return t
    print(do(openSync))
    print(do(openAsync1))
    print(do(openAsync10))

openAll(glob.glob("data/*"))
Note: The data folder contains 18 files, each 700 lines of random text.
Output
0 workers: 0.0120 seconds
1 worker: 0.0009 seconds
10 workers: 0.0535 seconds
What I've tested
I've run the code more than a couple dozen times, with different programs running in the background (ran a bunch yesterday, and a couple today). The numbers change, of course, but the order is always the same (i.e. 1 is fastest, then 0, then 10).
I've also tried changing the order of execution (e.g. moving the do calls around) in order to eliminate caching as a factor, but the result is still the same.
It turns out that executing in the order 10, 1, None produces a different ranking (1 is fastest, then 10, then 0) compared to every other permutation. The result shows that whichever do call is executed last is considerably slower than it would have been had it been executed first or in the middle.
Results (after receiving the solution from @Dunes)
0 workers: 0.0122 seconds
1 worker: 0.0214 seconds
10 workers: 0.0296 seconds
When you call one of your async functions it returns a "futures" object (an instance of tomorrow.Tomorrow in this case). This allows you to submit all your jobs without having to wait for them to finish. However, you never actually wait for the jobs to finish. So all do(openAsync1) does is time how long it takes to set up all the jobs (which should be very fast). For a more accurate test you need to do something like:
def openAll(paths: list):
    def do(func: callable) -> float:
        t = time.time()
        # do all jobs if openSync, else start all jobs if openAsync
        results = [func(p) for p in paths]
        # if openAsync, the following waits until all jobs are finished
        if func is not openSync:
            for r in results:
                r._wait()
        t = time.time() - t
        return t
    print(do(openSync))
    print(do(openAsync1))
    print(do(openAsync10))

openAll(glob.glob("data/*"))
Using additional threads in Python generally slows things down. This is because of the global interpreter lock, which means only one thread can ever be executing Python bytecode, regardless of the number of cores the CPU has.
However, things are complicated by the fact that your job is IO bound. More worker threads might speed things up, because a single thread might spend more time waiting for the hard drive to respond than is lost to context switching between the various threads in the multi-threaded variant.
Side note: even though neither openAsync1 nor openAsync10 waits for jobs to complete, do(openAsync10) is probably slower because it requires more synchronisation between threads when submitting a new job.
I'm currently studying physics at university, and I'm learning Python as a little hobby.
To practise both at the same time, I figured I'd write a little "physics engine" that calculates the movement of an object based on x, y and z coordinates. I'm only going to output the movement as text (at least for now!), but I want the position updates to be real-time.
To do that I need to update the position of an object, let's say a hundred times a second, and print it to the screen. So every 10 ms the program prints the current position.
So if the calculations take 2 ms, the loop must wait 8 ms before it prints and recalculates the next position.
What's the best way of constructing a loop like that, and is 100 times a second a reasonable frequency, or would you go slower, like 25 times/sec?
The basic way to wait in Python is to import time and use time.sleep. Then the question is: how long to sleep? This depends on how you want to handle cases where your loop misses the desired timing. The following implementation tries to catch up to the target interval if it misses.
import time
import random

def doTimeConsumingStep(N):
    """
    This represents the computational part of your simulation.
    For the sake of illustration, I've set it up so that it takes a random
    amount of time which is occasionally longer than the interval you want.
    """
    r = random.random()
    computationTime = N * (r + 0.2)
    print("...computing for %f seconds..." % (computationTime,))
    time.sleep(computationTime)

def timerTest(N=1):
    repsCompleted = 0
    # time.perf_counter() replaces the original time.clock(), which was removed in Python 3.8
    beginningOfTime = time.perf_counter()
    start = time.perf_counter()
    goAgainAt = start + N
    while 1:
        print("Loop #%d at time %f" % (repsCompleted, time.perf_counter() - beginningOfTime))
        repsCompleted += 1
        doTimeConsumingStep(N)
        # If we missed our interval, iterate immediately and increment the target time
        if time.perf_counter() > goAgainAt:
            print("Oops, missed an iteration")
            goAgainAt += N
            continue
        # Otherwise, wait for the next interval
        timeToSleep = goAgainAt - time.perf_counter()
        goAgainAt += N
        time.sleep(timeToSleep)

if __name__ == "__main__":
    timerTest()
Note that you will miss your desired timing on a normal OS, so handling like this is necessary. Even with asynchronous frameworks like tulip and twisted you can't guarantee timing on a normal operating system.
Since you cannot know in advance how long each iteration will take, you need some sort of event-driven loop. A possible solution would be using the twisted module, which is based on the reactor pattern.
from twisted.internet import task
from twisted.internet import reactor

delay = 0.1

def work():
    print("called")

l = task.LoopingCall(work)
l.start(delay)

reactor.run()
However, as has been noted, don't expect true real-time responsiveness.
A word of warning: you cannot expect real-time behaviour on a non-real-time system. The sleep family of calls guarantees at least the given delay, but may well delay you for longer.
Therefore, once you return from sleep, query the current time and schedule the next step into the "future", accounting for the calculation time.
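A small sketch of that idea using absolute deadlines (the 10 ms interval and the iteration count are placeholders, not from the answer):

import time

interval = 0.01                               # 10 ms target period
next_tick = time.perf_counter() + interval
for _ in range(100):
    # ... do the physics update and print the position here ...
    remaining = next_tick - time.perf_counter()
    if remaining > 0:
        time.sleep(remaining)                 # sleeps at least this long, possibly a bit more
    next_tick += interval                     # schedule the next tick in absolute time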