What's the difference between ThreadPool and Pool in the multiprocessing module?

What's the difference between ThreadPool and Pool in the multiprocessing module? When I try my code out, this is the main difference I see:
from multiprocessing import Pool
import os, time

print("hi outside of main()")

def hello(x):
    print("inside hello()")
    print("Process id: ", os.getpid())
    time.sleep(3)
    return x*x

if __name__ == "__main__":
    p = Pool(5)
    pool_output = p.map(hello, range(3))
    print(pool_output)
I see the following output:
hi outside of main()
hi outside of main()
hi outside of main()
hi outside of main()
hi outside of main()
hi outside of main()
inside hello()
Process id: 13268
inside hello()
Process id: 11104
inside hello()
Process id: 13064
[0, 1, 4]
With "ThreadPool":
from multiprocessing.pool import ThreadPool
import os, time

print("hi outside of main()")

def hello(x):
    print("inside hello()")
    print("Process id: ", os.getpid())
    time.sleep(3)
    return x*x

if __name__ == "__main__":
    p = ThreadPool(5)
    pool_output = p.map(hello, range(3))
    print(pool_output)
I see the following output:
hi outside of main()
inside hello()
inside hello()
Process id: 15204
Process id: 15204
inside hello()
Process id: 15204
[0, 1, 4]
My questions are:
Why is "hi outside of main()" printed each time with the Pool?
Does multiprocessing.pool.ThreadPool not spawn new processes? Does it just create new threads?
If so, what's the difference between using multiprocessing.pool.ThreadPool and just using the threading module?
I don't see any official documentation for ThreadPool anywhere; can someone point me to where I can find it?

The multiprocessing.pool.ThreadPool behaves the same as multiprocessing.Pool, with the only difference that it uses threads instead of processes to run the worker logic.
The reason you see
hi outside of main()
being printed multiple times with the multiprocessing.Pool is that the pool spawns 5 independent processes. Each process initializes its own Python interpreter and imports the module, resulting in the top-level print being executed again.
Note that this happens only if the spawn process creation method is used (the only method available on Windows). If you use the fork one (Unix), you will see the message printed only once, as with the threads.
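For illustration, a minimal sketch to observe this yourself (assuming Python 3.4+, where multiprocessing.set_start_method is available; the function names are illustrative):

import multiprocessing as mp
import os

print("module level, pid:", os.getpid())

def square(x):
    return x * x

if __name__ == "__main__":
    # "spawn" starts fresh interpreters, so the module-level print runs
    # once per worker; "fork" (Unix only) clones the parent, so it does not.
    mp.set_start_method("spawn")
    with mp.Pool(2) as pool:
        print(pool.map(square, range(3)))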
The multiprocessing.pool.ThreadPool is not documented because its implementation was never completed: it lacks tests and documentation. You can see its implementation in the source code.
I believe the next natural question is: when to use a thread-based pool and when to use a process-based one?
The rule of thumb is:
IO bound jobs -> multiprocessing.pool.ThreadPool
CPU bound jobs -> multiprocessing.Pool
Hybrid jobs -> depends on the workload; I usually prefer multiprocessing.Pool due to the advantage process isolation brings
On Python 3 you might want to take a look at the concurrent.futures.Executor pool implementations.
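For example, a minimal sketch of the concurrent.futures equivalents (the same rule of thumb applies to the two executor types):

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def square(x):
    return x * x

if __name__ == "__main__":
    # Thread-based pool: suited to IO bound jobs
    with ThreadPoolExecutor(max_workers=5) as executor:
        print(list(executor.map(square, range(3))))
    # Process-based pool: suited to CPU bound jobs
    with ProcessPoolExecutor(max_workers=5) as executor:
        print(list(executor.map(square, range(3))))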

Related

Python's parallel processing

I am in the following setting: I have a method that takes an objective function f as input. As a subroutine of that method I want to evaluate f on a small set of points. Since f has high complexity, I considered doing that in parallel.
All online examples hang even for trivial functions like squaring on sets with 5 points. They use the multiprocessing library, and I don't know what I am doing wrong. I am not sure how to encapsulate that __name__ == "__main__" statement in my method. (Since it is part of a module, I guess instead of "__main__" I should use the module name?)
The code I have been using looks like:
from multiprocessing.pool import Pool
from multiprocessing import cpu_count

x = [1, 2, 3, 4, 5]
num_cores = cpu_count()

def f(x):
    return x**2

if __name__ == "__main__":
    pool = Pool(num_cores)
    y = list(pool.map(f, x))
    pool.join()
    print(y)
When executing this code in my Spyder it takes a bloody long time to finish.
So my main questions are: What am I doing wrong in this code? How can I encapsulate the __name__ statement when this code is part of a bigger method?
Is it even worth parallelizing this? (One function evaluation can take multiple minutes, and in serial this adds up to a total runtime of hours...)
According to the documentation:
close()
Prevents any more tasks from being submitted to the pool. Once all the tasks have been completed the worker processes will exit.
terminate()
Stops the worker processes immediately without completing outstanding work. When the pool object is garbage collected, terminate() will be called immediately.
join()
Wait for the worker processes to exit. One must call close() or terminate() before using join().
So you should add:
from multiprocessing.pool import Pool
from multiprocessing import cpu_count

x = [1, 2, 3, 4, 5]

def f(x):
    return x**2

if __name__ == "__main__":
    pool = Pool()
    y = list(pool.map(f, x))
    pool.close()
    pool.join()
    print(y)
You can call Pool without any argument, and it will use cpu_count by default:
If processes is None then the number returned by cpu_count() is used
About the if __name__ == "__main__" guard, read more information here.
So you need to think a bit about which code you want executed only in the main program. The most obvious example is that you want code that creates child processes to run only in the main program, so that should be protected by __name__ == '__main__'.
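A minimal sketch of such an encapsulation, assuming the pool lives inside a helper function (evaluate_parallel is an illustrative name, not from the original post):

from multiprocessing import Pool

def f(x):
    return x**2

def evaluate_parallel(points):
    # The pool is created inside a normal function; only the actual
    # entry point of the program needs the __main__ guard.
    with Pool() as pool:
        return pool.map(f, points)

if __name__ == "__main__":
    print(evaluate_parallel([1, 2, 3, 4, 5]))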
You might want to look into the chunksize argument of the map function that you are using.
On a large enough input list, a lot of your time is spent simply communicating the arguments to and from the separate parallel processes.
One symptom of this problem is that when you use something like htop all cores are firing but at < 100%.
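For illustration, a hedged sketch of passing chunksize (the value 100 is arbitrary; tune it for your workload):

from multiprocessing import Pool

def f(x):
    return x**2

if __name__ == "__main__":
    with Pool() as pool:
        # Each worker receives batches of 100 items, reducing the
        # number of round trips between parent and workers.
        y = pool.map(f, range(100000), chunksize=100)
        print(y[:5])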

Spawning multiple processes with Python

Earlier I tried to use the threading module in Python to create multiple threads. Then I learned about the GIL and how it does not allow taking advantage of multiple CPU cores on a single machine. So now I'm trying multiprocessing (I don't strictly need separate threads).
Here is a sample code I wrote to see if distinct processes are being created. But as can be seen in the output below, I'm getting the same process ID every time. So multiple processes are not being created. What am I missing?
import multiprocessing as mp
import os

def pri():
    print(os.getpid())

if __name__ == '__main__':
    # Checking number of CPU cores
    print(mp.cpu_count())
    processes = [mp.Process(target=pri()) for x in range(1, 4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
Output:
4
12554
12554
12554
The Process class requires a callable as its target.
Instead of running the function in a separate process, you are calling it and passing its result (None in this case) to the Process class.
Just change the following:
mp.Process(target=pri())
with:
mp.Process(target=pri)
Since the subprocesses run as separate processes, you won't see their print statements. They also don't share the same memory space. You pass pri() to target, where it needs to be pri. You need to pass a callable object, not execute it.
The prints you see are part of your main thread's execution. Because you pass pri(), the code is actually executed. You need to change your code so that the pri function returns a value rather than printing it.
Then you need to implement a queue: all your workers write to it, and when they're done, your main process reads the queue, as in the sketch below.
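A minimal sketch of that queue approach (assuming pri is changed to put the PID on the queue; the names are illustrative):

import multiprocessing as mp
import os

def pri(q):
    # Put the result on the queue instead of printing it.
    q.put(os.getpid())

if __name__ == '__main__':
    q = mp.Queue()
    processes = [mp.Process(target=pri, args=(q,)) for _ in range(3)]
    for p in processes:
        p.start()
    # Read one result per worker before joining, then wait for them to exit.
    results = [q.get() for _ in processes]
    for p in processes:
        p.join()
    print(results)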
A nice feature of the multiprocessing module is the Pool object. It allows you to create a pool of worker processes and then just use it. It's more convenient.
I have tried your code; the thing is that the commands execute too quickly, so the OS reuses the PIDs. If you add a time.sleep(1) in your pri function, it works as you expect.
That is true only for Windows. The example below was run on Windows. On Unix-like machines, you won't need the sleep.
The more convenient solution looks like this:
from multiprocessing import Pool
from time import sleep
import os

def pri(x):
    sleep(1)
    return os.getpid()

def use_procs():
    p_pool = Pool(4)
    p_results = p_pool.map(pri, [_ for _ in range(1, 4)])
    p_pool.close()
    p_pool.join()
    return p_results

if __name__ == '__main__':
    res = use_procs()
    for r in res:
        print r
Without the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
6576
6576
6576
>>>
with the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
10396
10944
9000

Can't access global variable in python

I'm using the multiprocessing library in Python in the code below:
from multiprocessing import Process
import os
from time import sleep as delay

test = "First"

def f():
    global test
    print('hello')
    print("before: " + test)
    test = "Second"

if __name__ == '__main__':
    p = Process(target=f, args=())
    p.start()
    p.join()
    delay(1)
    print("after: " + test)
It's supposed to change the value of test, so in the end the value of test should be Second, but the value doesn't change and remains First.
Here is the output:
hello
before: First
after: First
The behavior you're seeing is because p is a new process, not a new thread. When you spawn a new process, it copies your initial process's state completely and then starts executing in parallel. When you spawn a thread, it shares memory with your initial thread.
Since processes have memory isolation, they won't create race-condition errors caused by reading and writing to shared memory. However, to get data from your child process back into the parent, you'll need to use some form of inter-process communication like a pipe, and because they fork memory, they are more expensive to spawn. As always in computer science, you have to make a tradeoff.
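A minimal sketch of the pipe approach, applied to the example above (the connection names are illustrative):

from multiprocessing import Process, Pipe

def f(conn):
    # Send the new value back to the parent instead of mutating
    # a global that only exists in the child's memory space.
    conn.send("Second")
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print("after: " + parent_conn.recv())  # prints "after: Second"
    p.join()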
For more information, see:
https://en.wikipedia.org/wiki/Process_(computing)
https://en.wikipedia.org/wiki/Thread_(computing)
https://en.wikipedia.org/wiki/Inter-process_communication
Based on what you're actually trying to accomplish, consider using threads instead.
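For comparison, a minimal sketch of the same code with a thread, where the global really is shared:

from threading import Thread

test = "First"

def f():
    global test
    test = "Second"

if __name__ == '__main__':
    t = Thread(target=f)
    t.start()
    t.join()
    print("after: " + test)  # prints "after: Second"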
Global state is not shared, so the changes made by child processes have no effect.
Here is why: it actually does change the global variable, but only for the spawned process. If you accessed it within that process, you would see it. Since it is a separate process, its global variable environment is initialized from yours, but the modifications you make are limited to the process itself, not the whole program.
Try this; it explains what's happening:
from multiprocessing import Process
import os
from time import sleep as delay

test = "First"

def f2():
    print("f2:" + test)

def f():
    global test
    print('hello')
    print("before: " + test)
    test = "Second"
    f2()

if __name__ == '__main__':
    p = Process(target=f, args=())
    p.start()
    p.join()
    delay(1)
    print("after: " + test)
If you really need to modify state from the child process, there's another way of doing it: read this doc or post, it might help you.
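For illustration, a hedged sketch using a Manager dict, one of the shared-state tools those docs describe (the key name is illustrative):

from multiprocessing import Process, Manager

def f(shared):
    # Writes to the managed dict are visible to the parent.
    shared["test"] = "Second"

if __name__ == '__main__':
    with Manager() as manager:
        shared = manager.dict(test="First")
        p = Process(target=f, args=(shared,))
        p.start()
        p.join()
        print("after: " + shared["test"])  # prints "after: Second"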

Python multiprocessing restarting the application

I am a newbie at Python multiprocessing, and came across some behaviour which looks strange to me but I guess is normal. Here is a minimal working example:
import multiprocessing

print("Thread name: " + __name__)

def printfunc(text):
    print(text)

if __name__ == '__main__':
    multiprocessing.freeze_support()
    texts = ["aaa", "bbb"]
    pool = multiprocessing.Pool(2)
    result = pool.map(printfunc, texts)
    pool.close()
The output I get from running this is:
Thread name: __main__
Thread name: __parents_main__
Thread name: __parents_main__
aaa
bbb
Evidently the multiprocessing worker executes not only the printfunc function but the whole module from the start. And apparently the way to distinguish between the main process and the child processes is the if __name__ == '__main__': condition; the main process's name is __main__ and the children's names are __parents_main__. However, for my work I need to freeze my code and create a Windows executable, and when I run it, all the processes have the name __main__ and this creates problems.
Are there ways to:
a) Make it so that the application is not "restarted" after calling Pool.map ?
b) If it's impossible, how to properly freeze the application so threads have different names (I use cx_Freeze) ?
c) If this is also impossible, how this behaviour could be prevented in any other way ?
I use Python 2.7.
Thanks

How do I make functions run as subprocesses in Python?

I want my Python script to be able to run one of its functions as subprocesses. How should I do that?
Here is a mock-up script of my intention:
#!/usr/bin/env python

def print_mynumber(foo):
    """This function is obviously more complicated in my script.
    It should be run as a subprocess."""
    print(foo)

for foo in [1, 2, 3]:
    print_mynumber(foo)  # Each call of this function should spawn a new process.
    # subprocess(print_mynumber(foo))
Thank you for your suggestions. It is a little hard for me to formulate the problem correctly, and thus to make the appropriate web search.
Use the multiprocessing module:
import multiprocessing as mp

def print_mynumber(foo):
    """This function is obviously more complicated in my script.
    It should be run as a subprocess."""
    print(foo)

if __name__ == '__main__':
    for foo in [1, 2, 3]:
        proc = mp.Process(target=print_mynumber, args=(foo,))
        proc.start()
You might not want to be creating one process for each call to print_mynumber, especially if the list foo iterates over is long. A better way in that case would be to use a multiprocessing pool:
import multiprocessing as mp

def print_mynumber(foo):
    """This function is obviously more complicated in my script.
    It should be run as a subprocess."""
    print(foo)

if __name__ == '__main__':
    pool = mp.Pool()
    pool.map(print_mynumber, [1, 2, 3])
The pool, by default, will create N worker processes, where N is the number of CPUs (or cores) the machine possesses. pool.map behaves much like the Python built-in map function, except that it farms out tasks to the pool of workers.
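For completeness, a small sketch showing that return values come back through pool.map (the return value here is illustrative, since the original function only prints):

import multiprocessing as mp

def work(foo):
    return foo * 10  # stand-in for a more complicated computation

if __name__ == '__main__':
    with mp.Pool() as pool:
        results = pool.map(work, [1, 2, 3])
    print(results)  # [10, 20, 30]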
