Multiprocessing in Python 2.7 - python

I'm trying to understand the multiprocessing module better through examples before I start applying it to my main code, and I got a little confused by the execution sequence in this code.
The code:
import multiprocessing as mp
import time
import os

def square(nums, r, t1):
    print("square started at :")
    print("%.6f" % (time.clock() - t1))
    for n in nums:
        r.append(n * n)
    print("square endeded at :")
    print("%.6f" % (time.clock() - t1))

def cube(nums, r, t1):
    #time.sleep(2)
    print("cube started at :")
    print("%.6f" % (time.clock() - t1))
    for n in nums:
        r.append(n * n * n)
    print("cube endeded at :")
    print("%.6f" % (time.clock() - t1))

if __name__ == "__main__":
    numbers = range(1, 1000000)
    results1 = []
    results2 = []
    t1 = time.clock()

    # With multiprocessing :
    p1 = mp.Process(target=square, args=(numbers, results1, t1))
    p2 = mp.Process(target=cube, args=(numbers, results2, t1))
    p1.start()
    #time.sleep(2)
    p2.start()
    p1.join()
    print("After p1.join() :")
    print("%.6f" % (time.clock() - t1))
    p2.join()

    '''
    # Without multiprocessing :
    square(numbers, results1, t1)
    cube(numbers, results2, t1)
    '''

    print("square + cube :")
    print("%.6f" % (time.clock() - t1))
The code output was :
square started at :
0.000000
square endeded at :
0.637105
After p1.join() :
12.310289
cube started at :
0.000000
cube endeded at :
0.730428
square + cube :
13.057885
And I have a few questions:
1.) According to the code and the timing above, shouldn't the output appear in this order?
square started at :
cube started at :
square endeded at :
cube endeded at :
After p1.join() :
square + cube :
2.) Why does it take the program so long to reach p1.join(), even though it finished "square" several seconds earlier? In other words, why do square and cube take around 13 seconds to run when their real execution time is about 0.7 s?
3.) In my main code I would like to start the second function (cube in this example) with a one-second delay after the first function, so I tried putting a delay (time.sleep(1)) between p1.start() and p2.start(), but it did not work and both functions still started at 0.000000 s. Then I placed the delay at the beginning of the cube function and it also did not work. So my question is: how do I achieve a delay between these two functions?

When dealing with multiprocessing, all sorts of other factors can impact what you are seeing. Since you are literally adding subprocesses to the process manager of your OS, they operate entirely separately from your running program, with their own resources, scheduling priorities, pipes, and so on.
1.) No. The reason is that each child process gets its own output buffer that it is writing into that gets written back into the parent process. Since you start both child processes and then tell the parent process to block the thread until subprocess p1 completes, the p2 child process cannot write its buffer into the parent until the p1 process completes. This is why, despite waiting 12 seconds, the output of the p2 process still says 0.7 seconds.
2.) It is difficult to know for certain why it took 12 seconds for the subprocess to run its course. It may be something in your code, or it may be one of a dozen other reasons, such as a completely different process hijacking your CPU for a time. First off, time.clock is probably not what you are looking for if you are trying to measure actual time rather than how much time the process has spent on the CPU. Other commenters have correctly recommended using a high-performance counter to accurately track timings and make sure there is no weirdness in the way you are measuring time. Furthermore, there is always some level of overhead, though certainly not 12 seconds' worth, when starting, running and terminating a new process. The best way to determine whether this 12 seconds is something you could have controlled for is to run the application multiple times and see if there is a wide variance in the total resulting times. If there is, it may be down to other conditions on the computer running it.
3.) I am guessing that the issue is the time.clock measurement. The time.clock call calculates how much time a process has spent on the CPU. Since you are using multiple processes, time.clock is reset to 0 when a process is started. It is a relative time, not an absolute time, and relative to the lifespan of the process. If you are jumping between processes or sleeping threads, time.clock won't necessarily increment the way you would think with an absolute time measurement. You should use something like time.time() or better yet, a high performance counter to properly track real time.
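To make both points concrete, here is a minimal sketch (not the original poster's code) that passes one absolute time.time() reference into both workers and sleeps for a second between the two start() calls, so the stagger actually shows up in the printed timings. The result lists are dropped because ordinary Python lists are not shared between processes anyway.

import multiprocessing as mp
import time

def square(nums, start):
    # time.time() is wall-clock time, so every process measures against
    # the same absolute reference instead of its own per-process CPU clock
    print("square started at %.6f" % (time.time() - start))
    results = [n * n for n in nums]
    print("square ended at %.6f" % (time.time() - start))

def cube(nums, start):
    print("cube started at %.6f" % (time.time() - start))
    results = [n * n * n for n in nums]
    print("cube ended at %.6f" % (time.time() - start))

if __name__ == "__main__":
    numbers = range(1, 1000000)
    start = time.time()

    p1 = mp.Process(target=square, args=(numbers, start))
    p2 = mp.Process(target=cube, args=(numbers, start))

    p1.start()
    time.sleep(1)  # stagger the second worker by one second of wall-clock time
    p2.start()

    p1.join()
    p2.join()
    print("total wall-clock time: %.6f" % (time.time() - start))

With an absolute reference like this, cube should report a start time of roughly one second plus whatever it costs the OS to launch the second process.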

Related

multiprocessing is always worse than single process no matter how many

I am playing around with multiprocessing in Python 3 to try and understand how it works and when it's good to use it.
I am basing my examples on this question, which is really old (2012).
My computer is a Windows machine with 4 physical cores (8 logical cores).
First: not segmented data
First I try to brute-force compute numpy.sin for a million values. The million values are a single chunk, not segmented.
import time
import numpy
from multiprocessing import Pool

# so that iPython works
__spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"

def numpy_sin(value):
    return numpy.sin(value)

a = numpy.arange(1000000)

if __name__ == '__main__':
    pool = Pool(processes=8)

    start = time.time()
    result = numpy.sin(a)
    end = time.time()
    print('Singled threaded {}'.format(end - start))

    start = time.time()
    result = pool.map(numpy_sin, a)
    pool.close()
    pool.join()
    end = time.time()
    print('Multithreaded {}'.format(end - start))
And I find that, no matter the number of processes, the multiprocessing version always takes roughly 10 times as long as the single-threaded one. In the Task Manager I see that not all the CPUs are maxed out, and the total CPU usage stays between 18% and 31%.
So I try something else.
Second: segmented data
I try to split up the original 1 million computations in 10 batches of 100,000 each. Then I try again for 10 million computations in 10 batches of 1 million each.
import time
import numpy
from multiprocessing import Pool

# so that iPython works
__spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"

def numpy_sin(value):
    return numpy.sin(value)

p = 3
s = 1000000
a = [numpy.arange(s) for _ in range(10)]

if __name__ == '__main__':
    print('processes = {}'.format(p))
    print('size = {}'.format(s))

    start = time.time()
    result = numpy.sin(a)
    end = time.time()
    print('Singled threaded {}'.format(end - start))

    pool = Pool(processes=p)
    start = time.time()
    result = pool.map(numpy_sin, a)
    pool.close()
    pool.join()
    end = time.time()
    print('Multithreaded {}'.format(end - start))
I ran this last piece of code for different numbers of processes p and different list lengths s, 100000 and 1000000.
At least now the Task Manager shows the CPU maxed out at 100% usage.
I get the following results for the elapsed times (ORANGE: multiprocess, BLUE: single):
So multiprocessing never wins over the single process.
Why??
Numpy changes how the parent process runs so that it only runs on one core. You can call os.system("taskset -p 0xff %d" % os.getpid()) after you import numpy to reset the CPU affinity so that all cores are used.
See this question for more details
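For reference, that suggestion as a runnable snippet (a sketch; taskset is a Linux utility, so this only applies on Linux systems where the numpy import actually narrows the affinity mask, not on the Windows machine from the question):

import os
import numpy  # on the affected Linux setups, this import is what pins the process to one core

# reset the affinity mask (0xff = the first 8 cores) so pool workers can spread out again
os.system("taskset -p 0xff %d" % os.getpid())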
A single CPU core can really only do one thing at a time. When multi-threading or multi-processing, the computer is really just switching back and forth between tasks quickly. With the provided problem, the computer could either perform the calculation 1,000,000 times, or split up the work between several "workers" and perform 100,000 for each of 10 "workers".
Multi-processing shines not when computing something straight out, since the computer has to take time to create multiple processes, but while waiting for something. The main example I've heard is web scraping. If a program requested data from a list of websites and waited for each server to respond before requesting data from the next, it would have to sit idle for a couple of seconds. If instead the computer used multiprocessing/threading to send all the requests first and wait for them concurrently, the total running time would be much shorter.
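Relating the splitting idea back to the question's first example, one option (a sketch, not code from the question) is to hand the pool a few large chunks instead of a million individual numbers, so each pickled task carries enough vectorised work to be worth the round trip:

import time
import numpy
from multiprocessing import Pool

def numpy_sin_chunk(chunk):
    # each worker receives one large array and processes it in a single vectorised call
    return numpy.sin(chunk)

if __name__ == '__main__':
    a = numpy.arange(10000000)
    chunks = numpy.array_split(a, 8)  # 8 big pieces instead of 10 million tiny tasks

    start = time.time()
    with Pool(processes=8) as pool:
        parts = pool.map(numpy_sin_chunk, chunks)
    result = numpy.concatenate(parts)
    print('Chunked pool {}'.format(time.time() - start))

With chunks of this size, the cost of pickling arguments and results is amortised over millions of sin evaluations per task instead of being paid once per number.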

Python: Very strange behavior with multiprocessing; later code causes "retroactive" slowdown of earlier code

I'm trying to learn how to implement multiprocessing for computing Monte Carlo simulations. I reproduced the code from this simple tutorial where the aim is to compute an integral. I also compare it to the answer from WolframAlpha and compute the error. The first part of my code has no problems and is just there to define the integral function and declare some constants:
import numpy as np
import multiprocessing as mp
import time

def integrate(iterations):
    np.random.seed()
    mc_sum = 0
    chunks = 10000
    chunk_size = int(iterations/chunks)
    for i in range(chunks):
        u = np.random.uniform(size=chunk_size)
        mc_sum += np.sum(np.exp(-u * u))
    normed = mc_sum / iterations
    return normed

wolfram_answer = 0.746824132812427
mc_iterations = 1000000000
But there's some very spooky stuff that happens in the next two parts (I've labelled them because it's important). First (labelled "BLOCK 1"), I do the simulation without any multiprocessing at all, just to get a benchmark. After this (labelled "BLOCK 2"), I do the same thing but with a multiprocessing step. If you're reproducing this, you may want to adjust the num_procs variable depending on how many cores your machine has:
#### BLOCK 1
single_before = time.time()
single = integrate(mc_iterations)
single_after = time.time()
single_duration = np.round(single_after - single_before, 3)
error_single = (wolfram_answer - single)/wolfram_answer

print(mc_iterations, "iterations on single-thread:",
      single_duration, "seconds.")
print("Estimation error:", error_single)
print("")

#### BLOCK 2
if __name__ == "__main__":
    num_procs = 8
    multi_iterations = int(mc_iterations / num_procs)

    multi_before = time.time()
    pool = mp.Pool(processes=num_procs)
    multi_result = pool.map(integrate, [multi_iterations]*num_procs)
    multi_result = np.array(multi_result).mean()
    multi_after = time.time()
    multi_duration = np.round(multi_after - multi_before, 3)

    error_multi = (wolfram_answer - multi_result)/wolfram_answer
    print(num_procs, "threads with", multi_iterations, "iterations each:",
          multi_duration, "seconds.")
    print("Estimation error:", error_multi)
The output is:
1000000000 iterations on single-thread: 37.448 seconds.
Estimation error: 1.17978774235e-05
8 threads with 125000000 iterations each: 54.697 seconds.
Estimation error: -5.88380936901e-06
So, the multiprocessing is slower. That's not at all unheard of; maybe the overhead from the multiprocessing is just more than the gains from the parallelization?
But, that is not what is happening. Watch what happens when I merely comment out the first block:
#### BLOCK 1
##single_before = time.time()
##single = integrate(mc_iterations)
##single_after = time.time()
##single_duration = np.round(single_after - single_before, 3)
##error_single = (wolfram_answer - single)/wolfram_answer
##
##print(mc_iterations, "iterations on single-thread:",
##      single_duration, "seconds.")
##print("Estimation error:", error_single)
##print("")

#### BLOCK 2
if __name__ == "__main__":
    num_procs = 8
    multi_iterations = int(mc_iterations / num_procs)

    multi_before = time.time()
    pool = mp.Pool(processes=num_procs)
    multi_result = pool.map(integrate, [multi_iterations]*num_procs)
    multi_result = np.array(multi_result).mean()
    multi_after = time.time()
    multi_duration = np.round(multi_after - multi_before, 3)

    error_multi = (wolfram_answer - multi_result)/wolfram_answer
    print(num_procs, "threads with", multi_iterations, "iterations each:",
          multi_duration, "seconds.")
    print("Estimation error:", error_multi)
The output is:
8 threads with 125000000 iterations each: 6.662 seconds.
Estimation error: 3.86063069069e-06
That's right -- the time to complete the multiprocessing goes down from 55 seconds to less than 7 seconds! And that's not even the weirdest part. Watch what happens when I move Block 1 to be after Block 2:
#### BLOCK 2
if __name__ == "__main__":
    num_procs = 8
    multi_iterations = int(mc_iterations / num_procs)

    multi_before = time.time()
    pool = mp.Pool(processes=num_procs)
    multi_result = pool.map(integrate, [multi_iterations]*num_procs)
    multi_result = np.array(multi_result).mean()
    multi_after = time.time()
    multi_duration = np.round(multi_after - multi_before, 3)

    error_multi = (wolfram_answer - multi_result)/wolfram_answer
    print(num_procs, "threads with", multi_iterations, "iterations each:",
          multi_duration, "seconds.")
    print("Estimation error:", error_multi)

#### BLOCK 1
single_before = time.time()
single = integrate(mc_iterations)
single_after = time.time()
single_duration = np.round(single_after - single_before, 3)
error_single = (wolfram_answer - single)/wolfram_answer

print(mc_iterations, "iterations on single-thread:",
      single_duration, "seconds.")
print("Estimation error:", error_single)
print("")
The output is:
8 threads with 125000000 iterations each: 54.938 seconds.
Estimation error: 7.42415402896e-06
1000000000 iterations on single-thread: 37.396 seconds.
Estimation error: 9.79800494235e-06
We're back to the slow output again, which is completely crazy! Isn't Python supposed to be interpreted? I know that statement comes with a hundred caveats, but I took for granted that the code gets executed line-by-line, so stuff that comes afterwards (outside of functions, classes, etc) can't affect the stuff from before, because it hasn't been "looked at" yet.
So, how can the stuff that gets executed after the multiprocessing step has concluded, retroactively slow down the multiprocessing code?
Finally, the fast behavior is restored merely by indenting Block 1 to be inside the if __name__ == "__main__" block, because of course it does:
#### BLOCK 2
if __name__ == "__main__":
    num_procs = 8
    multi_iterations = int(mc_iterations / num_procs)

    multi_before = time.time()
    pool = mp.Pool(processes=num_procs)
    multi_result = pool.map(integrate, [multi_iterations]*num_procs)
    multi_result = np.array(multi_result).mean()
    multi_after = time.time()
    multi_duration = np.round(multi_after - multi_before, 3)

    error_multi = (wolfram_answer - multi_result)/wolfram_answer
    print(num_procs, "threads with", multi_iterations, "iterations each:",
          multi_duration, "seconds.")
    print("Estimation error:", error_multi)

    #### BLOCK 1
    single_before = time.time()
    single = integrate(mc_iterations)
    single_after = time.time()
    single_duration = np.round(single_after - single_before, 3)
    error_single = (wolfram_answer - single)/wolfram_answer

    print(mc_iterations, "iterations on single-thread:",
          single_duration, "seconds.")
    print("Estimation error:", error_single)
    print("")
The output is:
8 threads with 125000000 iterations each: 7.293 seconds.
Estimation error: 1.10350027622e-05
1000000000 iterations on single-thread: 31.035 seconds.
Estimation error: 2.53582945763e-05
And the fast behavior is also restored if you keep Block 1 inside the if block, but move it to above where num_procs is defined (not shown here because this question is already getting long).
So, what on Earth is causing this behavior? I'm guessing it's some kind of race-condition to do with threading and process branching, but from my level of expertise it might as well be that my Python interpreter is haunted.
This is because you are using Windows. On Windows, each subprocess is generated using the 'spawn' method which essentially starts a new python interpreter and imports your module instead of forking the process.
This is a problem, because all the code outside if __name__ == '__main__' is executed again. This can lead to a multiprocessing bomb if you put the multiprocessing code at the top-level, because it will start spawning processes until you run out of memory.
This is actually warned about in the docs
Safe importing of main module
Make sure that the main module can be safely imported by a new Python
interpreter without causing unintended side effects (such as starting a
new process).
...
Instead one should protect the “entry point” of the program by using
if __name__ == '__main__'
...
This allows the newly spawned Python interpreter to safely import the
module...
That section used to be called "Windows" in the older docs on Python 2.
Adding some detail, on Windows the module is imported "from scratch" in each worker process. That means everything in the module is executed by each worker. So, in your first example, each worker process first executes "BLOCK 1".
But your output doesn't reflect that. You should have gotten a line of output like
1000000000 iterations on single-thread: 37.448 seconds.
from each of your 8 worker processes. But your output doesn't show that. Perhaps you're using an IDE that suppresses output from spawned processes? If you run it in a "DOS box" (cmd.exe window) instead, that won't suppress output, and can make what's going on clearer.
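Putting the advice together, here is a minimal sketch of the spawn-safe layout (reusing the question's names, with the bookkeeping simplified): only definitions live at module level, and every benchmark, including the single-process one, sits behind the guard, so spawned workers can import the module without re-running the work.

import time
import multiprocessing as mp
import numpy as np

def integrate(iterations):
    # pure function: harmless for workers to import, and only *called* below the guard
    np.random.seed()
    chunks = 10000
    chunk_size = iterations // chunks
    mc_sum = 0.0
    for _ in range(chunks):
        u = np.random.uniform(size=chunk_size)
        mc_sum += np.sum(np.exp(-u * u))
    return mc_sum / iterations

if __name__ == "__main__":
    mc_iterations = 1000000000
    num_procs = 8

    # single-process benchmark: runs once, in the parent only
    t0 = time.time()
    single = integrate(mc_iterations)
    print("single-process:", round(time.time() - t0, 3), "seconds")

    # multi-process benchmark: spawned workers import this file but skip this block
    t0 = time.time()
    with mp.Pool(processes=num_procs) as pool:
        parts = pool.map(integrate, [mc_iterations // num_procs] * num_procs)
    multi = np.mean(parts)
    print("pool of %d:" % num_procs, round(time.time() - t0, 3), "seconds")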

Why doesn't this program run faster when using multiple threads?

import time

start = time.time()

for i in range(260):
    if i == 259:
        print("a done\n")
        break

for o in range(260):
    if o == 259:
        print("b done\n")
        break

end = time.time()
print(end - start)
This code takes almost 0.007 seconds; it runs the i loop first, and when that finishes it starts the o loop.
My problem is with this code:
import time
from threading import Thread

start = time.time()

def a():
    for i in range(260):
        if i == 259:
            print(" a done \n")
            break

def b():
    for o in range(260):
        if o == 259:
            print(" b done \n")
            break

th1 = Thread(target=a)
th2 = Thread(target=b)
th1.start()
th2.start()
th1.join()
th2.join()
end = time.time()
print(end - start)
Even though this code runs the i loop and the o loop at the same time, it still takes almost 0.008 seconds.
I was thinking the second version should take half the time of the first one, so I don't understand why it takes almost the same time.
This is due to the global interpreter lock (GIL). I/O and C extensions can release the GIL, but threads running pure Python code do not actually execute in parallel.
If you do want to run Python code in parallel, see multiprocessing. It has an API similar to threading (as well as some new functionality).
from multiprocessing import Process
p1 = Process(target=a)
p2 = Process(target=b)
p1.start()
p2.start()
p1.join()
p2.join()
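As a usage sketch (assuming the a and b functions from the question), the Process version also needs the __main__ guard on spawn-based platforms such as Windows, and for loops this small the process start-up cost will dominate whatever gets measured:

import time
from multiprocessing import Process

def a():
    for i in range(260):
        if i == 259:
            print("a done\n")
            break

def b():
    for o in range(260):
        if o == 259:
            print("b done\n")
            break

if __name__ == "__main__":
    start = time.time()
    p1 = Process(target=a)
    p2 = Process(target=b)
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    # expect this to be slower than the threaded version: launching two
    # processes costs far more than running 260 loop iterations
    print(time.time() - start)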

threading/multiprocessing/Queue?

I need to add two big vectors (A and B) of random size (10^7 to 10^12) using two threads or multiprocessing in Python, and then store the result in C. I also have to time my code, and at the end I need to find the minimum and the average of the final vector. I have tried many things and am currently working in an Anaconda Jupyter notebook. It accepts the code but does not give me any output.
This is my code:
import time
import multiprocessing
import numpy as np
import threading

add_result = []
a = np.random.rand(10000000)
b = np.random.rand(10000000)

def calc_add(numbers):
    global add_results
    for n in numbers:
        print('add' + str(a+b))
        add_result.append(a+b)
        print('within a process result' + str(add_result))
    time.Time = start_time

if __name__ == "__main__":
    arr = a+b
    p1 = multiprocessing.Process(target=calc_add, args=(arr))
    p2 = multiprocessing.Process(target=calc_add, args=(arr))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    print("result" + str(add_result))
    print("done!")
You can't do this kind of operation with multiprocessing, because (in Python) processes are separate and don't share anything between themselves. That means your global variable is only global inside the child process that modifies it, which is why your add_result variable is still equal to [] in the parent.
Please add your code to your question so we can help you rewrite it.
You should also take a look at Python's GIL to better understand why threads can't help you with this kind of task.
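If the goal is still to combine partial results from two processes, one common pattern (a sketch, not the original code; the index tags and the half-split are invented for illustration) is to pass results back through a multiprocessing.Queue instead of a global list:

import time
import multiprocessing
import numpy as np

def calc_add(idx, a_part, b_part, q):
    # compute the partial sum and send it back tagged with its position
    q.put((idx, a_part + b_part))

if __name__ == "__main__":
    n = 10000000
    a = np.random.rand(n)
    b = np.random.rand(n)

    start = time.time()
    q = multiprocessing.Queue()
    half = n // 2
    p1 = multiprocessing.Process(target=calc_add, args=(0, a[:half], b[:half], q))
    p2 = multiprocessing.Process(target=calc_add, args=(1, a[half:], b[half:], q))
    p1.start()
    p2.start()
    parts = [q.get(), q.get()]  # drain the queue before joining, or large results can block the children
    p1.join()
    p2.join()
    parts.sort(key=lambda item: item[0])  # the halves may arrive in either order
    c = np.concatenate([arr for _, arr in parts])

    print("time:", time.time() - start)
    print("min:", c.min(), "average:", c.mean())

For vectors this size the single numpy expression a + b will almost certainly be faster than splitting the work, but the queue pattern shows how data actually gets back to the parent.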

Multiprocessing takes longer?

I am trying to understand how to get children to write to a parent's variables. Maybe I'm doing something wrong here, but I would have imagined that multiprocessing would have taken a fraction of the time that it is actually taking:
import multiprocessing, time

def h(x):
    h.q.put('Doing: ' + str(x))
    return x

def f_init(q):
    h.q = q

def main():
    q = multiprocessing.Queue()
    p = multiprocessing.Pool(None, f_init, [q])
    results = p.imap(h, range(1, 5))
    p.close()

    for i in range(len(range(1, 5))):
        print results.next()  # prints 1, 2, 3, 4

if __name__ == '__main__':
    start = time.time()
    main()
    print "Multiprocessed: %s seconds" % (time.time() - start)

    start = time.time()
    for i in range(1, 5):
        print i
    print "Normal: %s seconds" % (time.time() - start)

-----Results-----:
1
2
3
4
Multiprocessed: 0.0695610046387 seconds
1
2
3
4
Normal: 2.78949737549e-05 seconds # much shorter
Blender basically already answered your question, but as a comment. There is some overhead associated with the multiprocessing machinery, so if you incur that overhead without doing any significant work, it will be slower.
Try actually doing some work that parallelizes well. For example, write Python code to open a file, scan it using a regular expression, and pull out matching lines; then make a list of ten big files and time how long it takes to do all ten with multiprocessing vs. plain Python. Or write code to compute an expensive function and try that.
I have used multiprocessing.Pool() just to run a bunch of instances of an external program. I used subprocess to run an audio encoder, and it ran four instances of the encoder at once for a noticeable speedup.
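For example, a sketch along those lines (Python 3 syntax; busy_work and its job sizes are invented for illustration), where each task is expensive enough for the pool to pay off:

import time
from multiprocessing import Pool

def busy_work(n):
    # deliberately CPU-heavy: sum of squares over a large range
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    jobs = [5000000] * 8

    start = time.time()
    serial = [busy_work(n) for n in jobs]
    print("serial: %.3f seconds" % (time.time() - start))

    start = time.time()
    with Pool() as pool:
        parallel = pool.map(busy_work, jobs)
    print("pool:   %.3f seconds" % (time.time() - start))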
