Why doesn't this program run faster when using multiple threads?

Why doesn't this program run faster when using multiple threads? - python

start = time.time()
for i in range (260):
if i==259:
print ("a done\n")
break
for o in range (260):
if o==259:
print ("b done\n")
break
end = time.time()
print (end-start)
This code is taking time almost " 0.007 " and its reading first the (I) loop then when it finish it start reading the (O) loop
My problem is this code:
start = time.time()
def a():
for i in range (260):
if i==259:
print (" a done \n")
break
def b():
for o in range (260):
if o==259:
print (" b done \n")
break
th1=Thread(target=a)
th2=Thread(target=b)
th1.start()
th2.start()
th1.join()
th2.join()
end = time.time()
print (end-start)
Even that this code is reading the (I) loop and the (O) loop in the same time both of them but its still taking almost "0.008"
And I am thinking that the second code should take half of the time of the first code, but I don't understand why its taking almost the same time as the first code?

This is due to the global interpreter lock (GIL). I/O and C extensions are outside of the GIL, but threads of pure Python work do not actually work in parallel.
If you do want to perform Python code in parallel, see multiprocessing. It has an API similar to threading (as well as new functionality too).
from multiprocessing import Process
p1 = Process(target=a)
p2 = Process(target=b)
p1.start()
p2.start()
p1.join()
p2.join()

Related

Running Python Threads Simotaneously

I'm looking at running some code to auto-save a game every X minutes but it also has a thread accepting keyboard input. Here's some sample code that I'm trying to get running simultaneously but it appears they run one after the other. How can I get them to run at the same time?
import time
import threading
def countdown(length,delay):
length += 1
while length > 0:
time.sleep(delay)
length -= 1
print(length, end=" ")
countdown_thread = threading.Thread(target=countdown(3,2)).start()
countdown_thread2 = threading.Thread(target=countdown(3,1)).start()
Update: Not sure really what the difference python has between a process and a thread (would process show as a second process in Windows?) But here's my updated code. It still runs sequentially and not at the same time.
import time
from threading import Thread
from multiprocessing import Process
def countdown(length,delay):
length += 1
while length > 0:
time.sleep(delay)
length -= 1
print(length, end=" ")
p1 = Process(target=countdown(5,0.3))
print ("")
p2 = Process(target=countdown(10,0.1))
print ("")
Thread(target=countdown(5,0.3))
print ("")
Thread(target=countdown(10,0.1))

When you create the threads, they should be created as
Thread(target=countdown, args=(3,2))
As-is, it runs countdown(3,2), and passes the result as the Thread target!

AFAIK Threads cant run simultainously.
I suggest you take a look at multiprocessing instead:
https://docs.python.org/3/library/multiprocessing.html

Multiprocessing in Python2.7

I'm trying to understand "multiprocessing" module more through examples before i start applying it to my main code,and i get little confused from the execution sequence in this code .
The Code :
import multiprocessing as mp
import time
import os
def square( nums , r , t1 ) :
print ("square started at :")
print ("%.6f" % (time.clock()-t1))
for n in nums :
r.append(n*n)
print ("square endeded at :")
print ("%.6f" % (time.clock()-t1))
def cube ( nums , r , t1 ) :
#time.sleep(2)
print ("cube started at :")
print ("%.6f" % (time.clock()-t1))
for n in nums :
r.append(n*n*n)
print ("cube endeded at :")
print ("%.6f" % (time.clock()-t1))
if __name__ == "__main__" :
numbers = range(1,1000000)
results1 = []
results2 = []
t1 = time.clock()
# With multiprocessing :
p1 = mp.Process(target = square , args = (numbers , results1 , t1))
p2 = mp.Process(target = cube , args = (numbers , results2 , t1))
p1.start()
#time.sleep(2)
p2.start()
p1.join()
print ("After p1.join() :")
print ("%.6f" % (time.clock()-t1))
p2.join()
'''
# Without multiprocessing :
square(numbers , results1 ,t1)
cube(numbers , results2 , t1)
'''
print ("square + cube :")
print ("%.6f" % (time.clock()-t1))
The code output was :
square started at :
0.000000
square endeded at :
0.637105
After p1.join() :
12.310289
cube started at :
0.000000
cube endeded at :
0.730428
square + cube :
13.057885
And i have few questions :
according to the code and timing above should it be in this order ?
square started at :
cube started at :
square endeded at :
cube endeded at :
After p1.join() :
square + cube :
why it takes so long from the program to reach (p1.join()) despite of it finished the "square" several seconds earlier ?
in another word why square & cube take around 13 seconds to run while there real time execution is 0.7s!
in my main code i would like to start the second function (cube in this example) after a one second delay from the first function,so i tried to put a delay (time.sleep(1)) between "p1.start()" and "p2.start()"but it did not work and both functions still starting at (0.000000s),then i placed the delay at the the beggining of the "cube " function and it also did not work , so my question is how to achive a delay between this two functions ?

When dealing with multithreading, all sorts of other factors can impact what you are seeing. Since you are literally adding subprocesses to the process manager of your OS, they will operate entirely separately from your running program, including having their own resources, scheduling priorities, pipes, etc.
1.) No. The reason is that each child process gets its own output buffer that it is writing into that gets written back into the parent process. Since you start both child processes and then tell the parent process to block the thread until subprocess p1 completes, the p2 child process cannot write its buffer into the parent until the p1 process completes. This is why, despite waiting 12 seconds, the output of the p2 process still says 0.7 seconds.
2.) It is difficult to know for certain why it took 12 seconds for the subprocess to run its course. It may be something in your code or it may be a dozen other reasons, such as a complete different process hijacking your CPU for a time. First off, time.clock is probably not what you are looking for if you are trying to measure actual time versus how much time the process has spent on the CPU. Other commenters have correctly recommended using a high performance counter to accurately track timings to make sure there is not any weirdness in the way you are measuring time. Furthermore, there is always some level of overhead, though certainly not 12 seconds worth, when starting, running and terminating a new process. The best way to determine if this 12 seconds is something you could have controlled for or not is to run the application multiple times and see if there is a wide variance of total resulting times. If there is, it may be other conditions related to the computer running it.
3.) I am guessing that the issue is the time.clock measurement. The time.clock call calculates how much time a process has spent on the CPU. Since you are using multiple processes, time.clock is reset to 0 when a process is started. It is a relative time, not an absolute time, and relative to the lifespan of the process. If you are jumping between processes or sleeping threads, time.clock won't necessarily increment the way you would think with an absolute time measurement. You should use something like time.time() or better yet, a high performance counter to properly track real time.

Python: Very strange behavior with multiprocessing; later code causes "retroactive" slowdown of earlier code

I'm trying to learn how to implement multiprocessing for computing Monte Carlo simulations. I reproduced the code from this simple tutorial where the aim is to compute an integral. I also compare it to the answer from WolframAlpha and compute the error. The first part of my code has no problems and is just there to define the integral function and declare some constants:
import numpy as np
import multiprocessing as mp
import time
def integrate(iterations):
np.random.seed()
mc_sum = 0
chunks = 10000
chunk_size = int(iterations/chunks)
for i in range(chunks):
u = np.random.uniform(size=chunk_size)
mc_sum += np.sum(np.exp(-u * u))
normed = mc_sum / iterations
return normed
wolfram_answer = 0.746824132812427
mc_iterations = 1000000000
But there's some very spooky stuff that happens in the next two parts (I've labelled them because it's important). First (labelled "BLOCK 1"), I do the simulation without any multiprocessing at all, just to get a benchmark. After this (labelled "BLOCK 2"), I do the same thing but with a multiprocessing step. If you're reproducing this, you may want to adjust the num_procs variable depending on how many cores your machines has:
#### BLOCK 1
single_before = time.time()
single = integrate(mc_iterations)
single_after = time.time()
single_duration = np.round(single_after - single_before, 3)
error_single = (wolfram_answer - single)/wolfram_answer
print(mc_iterations, "iterations on single-thread:",
single_duration, "seconds.")
print("Estimation error:", error_single)
print("")
#### BLOCK 2
if __name__ == "__main__":
num_procs = 8
multi_iterations = int(mc_iterations / num_procs)
multi_before = time.time()
pool = mp.Pool(processes = num_procs)
multi_result = pool.map(integrate, [multi_iterations]*num_procs)
multi_result = np.array(multi_result).mean()
multi_after = time.time()
multi_duration = np.round(multi_after - multi_before, 3)
error_multi = (wolfram_answer - multi_result)/wolfram_answer
print(num_procs, "threads with", multi_iterations, "iterations each:",
multi_duration, "seconds.")
print("Estimation error:", error_multi)
The output is:
1000000000 iterations on single-thread: 37.448 seconds.
Estimation error: 1.17978774235e-05
8 threads with 125000000 iterations each: 54.697 seconds.
Estimation error: -5.88380936901e-06
So, the multiprocessing is slower. That's not at all unheard of; maybe the overhead from the multiprocessing is just more than the gains from the parallelization?
But, that is not what is happening. Watch what happens when I merely comment out the first block:
#### BLOCK 1
##single_before = time.time()
##single = integrate(mc_iterations)
##single_after = time.time()
##single_duration = np.round(single_after - single_before, 3)
##error_single = (wolfram_answer - single)/wolfram_answer
##
##print(mc_iterations, "iterations on single-thread:",
## single_duration, "seconds.")
##print("Estimation error:", error_single)
##print("")
#### BLOCK 2
if __name__ == "__main__":
num_procs = 8
multi_iterations = int(mc_iterations / num_procs)
multi_before = time.time()
pool = mp.Pool(processes = num_procs)
multi_result = pool.map(integrate, [multi_iterations]*num_procs)
multi_result = np.array(multi_result).mean()
multi_after = time.time()
multi_duration = np.round(multi_after - multi_before, 3)
error_multi = (wolfram_answer - multi_result)/wolfram_answer
print(num_procs, "threads with", multi_iterations, "iterations each:",
multi_duration, "seconds.")
print("Estimation error:", error_multi)
The output is:
8 threads with 125000000 iterations each: 6.662 seconds.
Estimation error: 3.86063069069e-06
That's right -- the time to complete the multiprocessing goes down from 55 seconds to less than 7 seconds! And that's not even the weirdest part. Watch what happens when I move Block 1 to be after Block 2:
#### BLOCK 2
if __name__ == "__main__":
num_procs = 8
multi_iterations = int(mc_iterations / num_procs)
multi_before = time.time()
pool = mp.Pool(processes = num_procs)
multi_result = pool.map(integrate, [multi_iterations]*num_procs)
multi_result = np.array(multi_result).mean()
multi_after = time.time()
multi_duration = np.round(multi_after - multi_before, 3)
error_multi = (wolfram_answer - multi_result)/wolfram_answer
print(num_procs, "threads with", multi_iterations, "iterations each:",
multi_duration, "seconds.")
print("Estimation error:", error_multi)
#### BLOCK 1
single_before = time.time()
single = integrate(mc_iterations)
single_after = time.time()
single_duration = np.round(single_after - single_before, 3)
error_single = (wolfram_answer - single)/wolfram_answer
print(mc_iterations, "iterations on single-thread:",
single_duration, "seconds.")
print("Estimation error:", error_single)
print("")
The output is:
8 threads with 125000000 iterations each: 54.938 seconds.
Estimation error: 7.42415402896e-06
1000000000 iterations on single-thread: 37.396 seconds.
Estimation error: 9.79800494235e-06
We're back to the slow output again, which is completely crazy! Isn't Python supposed to be interpreted? I know that statement comes with a hundred caveats, but I took for granted that the code gets executed line-by-line, so stuff that comes afterwards (outside of functions, classes, etc) can't affect the stuff from before, because it hasn't been "looked at" yet.
So, how can the stuff that gets executed after the multiprocessing step has concluded, retroactively slow down the multiprocessing code?
Finally, the fast behavior is restored merely by indenting Block 1 to be inside the if __name__ == "__main__" block, because of course it does:
#### BLOCK 2
if __name__ == "__main__":
num_procs = 8
multi_iterations = int(mc_iterations / num_procs)
multi_before = time.time()
pool = mp.Pool(processes = num_procs)
multi_result = pool.map(integrate, [multi_iterations]*num_procs)
multi_result = np.array(multi_result).mean()
multi_after = time.time()
multi_duration = np.round(multi_after - multi_before, 3)
error_multi = (wolfram_answer - multi_result)/wolfram_answer
print(num_procs, "threads with", multi_iterations, "iterations each:",
multi_duration, "seconds.")
print("Estimation error:", error_multi)
#### BLOCK 1
single_before = time.time()
single = integrate(mc_iterations)
single_after = time.time()
single_duration = np.round(single_after - single_before, 3)
error_single = (wolfram_answer - single)/wolfram_answer
print(mc_iterations, "iterations on single-thread:",
single_duration, "seconds.")
print("Estimation error:", error_single)
print("")
The output is:
8 threads with 125000000 iterations each: 7.293 seconds.
Estimation error: 1.10350027622e-05
1000000000 iterations on single-thread: 31.035 seconds.
Estimation error: 2.53582945763e-05
And the fast behavior is also restored if you keep Block 1 inside the if block, but move it to above where num_procs is defined (not shown here because this question is already getting long).
So, what on Earth is causing this behavior? I'm guessing it's some kind of race-condition to do with threading and process branching, but from my level of expertise it might as well be that my Python interpreter is haunted.

This is because you are using Windows. On Windows, each subprocess is generated using the 'spawn' method which essentially starts a new python interpreter and imports your module instead of forking the process.
This is a problem, because all the code outside if __name__ == '__main__' is executed again. This can lead to a multiprocessing bomb if you put the multiprocessing code at the top-level, because it will start spawning processes until you run out of memory.
This is actually warned about in the docs
Safe importing of main module
Make sure that the main module can be safely imported by a new Python
interpreter without causing unintended side effects (such a starting a
new process).
...
Instead one should protect the “entry point” of the program by using
if __name__ == '__main__'
...
This allows the newly spawned Python interpreter to safely import the
module...
That section used to be called "Windows" in the older docs on Python 2.

Adding some detail, on Windows the module is imported "from scratch" in each worker process. That means everything in the module is executed by each worker. So, in your first example, each worker process first executes "BLOCK 1".
But your output doesn't reflect that. You should have gotten a line of output like
1000000000 iterations on single-thread: 37.448 seconds.
from each of your 8 worker processes. But your output doesn't show that. Perhaps you're using an IDE that suppresses output from spawned processes? If you run it in a "DOS box" (cmd.exe window) instead, that won't suppress output, and can make what's going on clearer.

Printing time in Python multiprocessing script return negative time elapsed

Running it on Ubuntu 14 with Python 2.7.6
I simplified script to show my problem:
import time
import multiprocessing
data = range(1, 3)
start_time = time.clock()
def lol():
for i in data:
print time.clock() - start_time, "lol seconds"
def worker(n):
print time.clock() - start_time, "multiprocesor seconds"
def mp_handler():
p = multiprocessing.Pool(1)
p.map(worker, data)
if __name__ == '__main__':
lol()
mp_handler()
And the output:
8e-06 lol seconds
6.9e-05 lol seconds
-0.030019 multiprocesor seconds
-0.029907 multiprocesor seconds
Process finished with exit code 0
Using time.time() gives non-negative values (as marked here Timer shows negative time elapsed)
but I'm curious what is the problem with time.clock() in python multiprocessing and reading time from CPU.

multiprocessing spawns new processes and time.clock() on linux has the same meaning of the C's clock():
The value returned is the CPU time used so far as a clock_t;
So the values returned by clock restart from 0 when a process start. However your code uses the parent's process start_time to determine the time spent in the child process, which is obviously incorrect if the child processes CPU time resets.
The clock() function makes sense only when handling one process, because its return value is the CPU time spent by that process. Child processes are not taken into account.
The time() function on the other hand uses a system-wide clock, and thus can be used even between different processes (although it is not monotonic, so it might return wrong results if somebody changes the system time during the events).
Forking a running python instance is probably faster then starting a new one from scratch, hence start_time is almost always bigger then the value returned by time.clock().
Take into account that the parent process also had to read your file on disk, perform the imports which may require reading other .py files, searching directories etc.
The forked child processes don't have to do all that.
Example code that shows that the return value of time.clock() resets to 0:
from __future__ import print_function
import time
import multiprocessing
data = range(1, 3)
start_time = time.clock()
def lol():
for i in data:
t = time.clock()
print('t: ', t, end='\t')
print(t - start_time, "lol seconds")
def worker(n):
t = time.clock()
print('t: ', t, end='\t')
print(t - start_time, "multiprocesor seconds")
def mp_handler():
p = multiprocessing.Pool(1)
p.map(worker, data)
if __name__ == '__main__':
print('start_time', start_time)
lol()
mp_handler()
Result:
$python ./testing.py
start_time 0.020721
t: 0.020779 5.8e-05 lol seconds
t: 0.020804 8.3e-05 lol seconds
t: 0.001036 -0.019685 multiprocesor seconds
t: 0.001166 -0.019555 multiprocesor seconds
Note how t is monotonic for the lol case while goes back to 0.001 in the other case.

To add a concise Python 3 example to Bakuriu's excellent answer above you can use the following method to get a global timer independent of the subprocesses:
import multiprocessing as mp
import time
# create iterable
iterable = range(4)
# adds three to the given element
def add_3(num):
a = num + 3
return a
# multiprocessing attempt
def main():
pool = mp.Pool(2)
results = pool.map(add_3, iterable)
return results
if __name__ == "__main__": #Required not to spawn deviant children
start=time.time()
results = main()
print(list(results))
elapsed = (time.time() - start)
print("\n","time elapsed is :", elapsed)
Note that if we had instead used time.process_time() instead of time.time() we will get an undesired result.

Multiprocessing takes longer?

I am trying to understand how to get children to write to a parent's variables. Maybe I'm doing something wrong here, but I would have imagined that multiprocessing would have taken a fraction of the time that it is actually taking:
import multiprocessing, time
def h(x):
h.q.put('Doing: ' + str(x))
return x
def f_init(q):
h.q = q
def main():
q = multiprocessing.Queue()
p = multiprocessing.Pool(None, f_init, [q])
results = p.imap(h, range(1,5))
p.close()
-----Results-----:
1
2
3
4
Multiprocessed: 0.0695610046387 seconds
1
2
3
4
Normal: 2.78949737549e-05 seconds # much shorter
for i in range(len(range(1,5))):
print results.next() # prints 1, 4, 9, 16
if __name__ == '__main__':
start = time.time()
main()
print "Multiprocessed: %s seconds" % (time.time()-start)
start = time.time()
for i in range(1,5):
print i
print "Normal: %s seconds" % (time.time()-start)

#Blender basically already answered your question, but as a comment. There is some overhead associated with the multiprocessing machinery, so if you incur the overhead without doing any significant work, it will be slower.
Try actually doing some work that parallelizes well. For example, write Python code to open a file, scan it using a regular expression, and pull out matching lines; then make a list of ten big files and time how long it takes to do all ten with multiprocessing vs. plain Python. Or write code to compute an expensive function and try that.
I have used multiprocessing.Pool() just to run a bunch of instances of an external program. I used subprocess to run an audio encoder, and it ran four instances of the encoder at once for a noticeable speedup.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.