How do I make functions run as subprocesses in Python? - python

I want my Python script to be able to run one of its functions as subprocesses. How should I do that?
Here is a mock-up script of my intention:
#!/urs/bin/env python
def print_mynumber(foo):
"""This function is obviously more complicated in my script.
It should be run as a subprocess."""
print(foo)
for foo in [1,2,3]:
print_mynumber(foo) # Each call of this function should span a new process.
# subprocess(print_mynumber(foo))
Thank you for your suggestions. It is a little hard for me to formulate the problem correctly, and thus to make the appropriate web search.

Use the multiprocessing module:
import multiprocessing as mp
def print_mynumber(foo):
"""This function is obviously more complicated in my script.
It should be run as a subprocess."""
print(foo)
if __name__ == '__main__':
for foo in [1,2,3]:
proc = mp.Process(target = print_mynumber, args = (foo, ))
proc.start()
You might not want to be creating one process for each call to print_mynumber, especially if the list foo iterates over is long. A better way in that case would be to use a multiprocessing pool:
import multiprocessing as mp
def print_mynumber(foo):
"""This function is obviously more complicated in my script.
It should be run as a subprocess."""
print(foo)
if __name__ == '__main__':
pool = mp.Pool()
pool.map(print_mynumber, [1,2,3])
The pool, be default, will create N worker processes, where N is the number of cpus (or cores) the machine possesses. pool.map behaves much like the Python builtin map command, except that it farms out tasks to the pool of workers.

Related

How to call a linux command line program in parallel with python

I have a command-line program which runs on single core. It takes an input file, does some calculations, and returns several files which I need to parse to store the produced output.
I have to call the program several times changing the input file. To speed up the things I was thinking parallelization would be useful.
Until now I have performed this task calling every run separately within a loop with the subprocess module.
I wrote a script which creates a new working folder on every run and than calls the execution of the program whose output is directed to that folder and returns some data which I need to store. My question is, how can I adapt the following code, found here, to execute my script always using the indicated amount of CPUs, and storing the output.
Note that each run has a unique running time.
Here the mentioned code:
import subprocess
import multiprocessing as mp
from tqdm import tqdm
NUMBER_OF_TASKS = 4
progress_bar = tqdm(total=NUMBER_OF_TASKS)
def work(sec_sleep):
command = ['python', 'worker.py', sec_sleep]
subprocess.call(command)
def update_progress_bar(_):
progress_bar.update()
if __name__ == '__main__':
pool = mp.Pool(NUMBER_OF_TASKS)
for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]:
pool.apply_async(work, (seconds,), callback=update_progress_bar)
pool.close()
pool.join()
I am not entirely clear what your issue is. I have some recommendations for improvement below, but you seem to claim on the page that you link to that everything works as expected and I don't see anything very wrong with the code as long as you are running on Linux.
Since the subprocess.call method is already creating a new process, you should just be using multithreading to invoke your worker function, work. But had you been using multiprocessing and your platform was one that used the spawn method to create new processes (such as Windows), then having the creation of the progress bar outside of the if __name__ = '__main__': block would have resulted in the creation of 4 additional progress bars that did nothing. Not good! So for portability it would have been best to move its creation to inside the if __name__ = '__main__': block.
import subprocess
from multiprocessing.pool import ThreadPool
from tqdm import tqdm
def work(sec_sleep):
command = ['python', 'worker.py', sec_sleep]
subprocess.call(command)
def update_progress_bar(_):
progress_bar.update()
if __name__ == '__main__':
NUMBER_OF_TASKS = 4
progress_bar = tqdm(total=NUMBER_OF_TASKS)
pool = ThreadPool(NUMBER_OF_TASKS)
for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]:
pool.apply_async(work, (seconds,), callback=update_progress_bar)
pool.close()
pool.join()
Note: If your worker.py program prints to the console, it will mess up the progress bar (the progress bar will be re-written repeatedly on multiple lines).
Have you considered instead importing worker.py (some refactoring of that code might be necessary) instead of invoking a new Python interpreter to execute it (in this case you would want to be explicitly using multiprocessing). On Windows this might not save you anything since a new Python interpreter would be executed for each new process anyway, but this could save you on Linux:
import subprocess
from multiprocessing.pool import Pool
from worker import do_work
from tqdm import tqdm
def update_progress_bar(_):
progress_bar.update()
if __name__ == '__main__':
NUMBER_OF_TASKS = 4
progress_bar = tqdm(total=NUMBER_OF_TASKS)
pool = Pool(NUMBER_OF_TASKS)
for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]:
pool.apply_async(do_work, (seconds,), callback=update_progress_bar)
pool.close()
pool.join()

Use multiprocessing to run functions inside a while loop in a class method

I have a method which calculates a final result using multiple other methods. It has a while loop inside which continuously checks for new data, and if new data is received, it runs the other methods and calculates the results. This main method is the only one which is called by the user, and it stays active until the program is closed. the basic structure is as follows:
class sample:
def __init__(self):
results = []
def main_calculation(self):
while True:
#code to get data
if newdata != olddata:
#insert code to prepare data for analysis
res1 = self.calc1(prepped_data)
res2 = self.calc2(prepped_data)
final = res1 + res2
self.results.append(final)
I want to run calc1 and calc2 in parallel, so that I can get the final result faster. However, I am unsure of how to implement multiprocessing in this way, since I'm not using a __main__ guard. Is there any way to run these processes in parallel?
This is likely not the best organization for this code, but it is what is easiest for the actual calculations I am running, since it is necessary that this code be imported and run from a different file. However, I can restructure the code if this is not a salvageable structure.
According to the documentation, the reason you need to use a __main__ guard is that when your program creates a multiprocessing.Process object, it starts up a whole new copy of the Python interpreter which will import a new copy of your program's modules. If importing your module calls multiprocessing.Process() itself, that will create yet another copy of the Python interpreter which interprets yet another copy of your code, and so on until your system crashes (or actually, until Python hits a non-reentrant piece of the multiprocessing code).
In the main module of your program, which usually calls some code at the top level, checking __name__ == '__main__' is the way you can tell whether the program is being run for the first time or is being run as a subprocess. But in a different module, there might not be any code at the top level (other than definitions), and in that case there's no need to use a guard because the module can be safely imported without starting a new process.
In other words, this is dangerous:
import multiprocessing as mp
def f():
...
p = mp.Process(target=f)
p.start()
p.join()
but this is safe:
import multiprocessing as mp
def f():
...
def g():
p = mp.Process(target=f)
p.start()
p.join()
and this is also safe:
import multiprocessing as mp
def f():
...
class H:
def g(self):
p = mp.Process(target=f)
p.start()
p.join()
So in your example, you should be able to directly create Process objects in your function.
However, I'd suggest making it clear in the documentation for the class that that method creates a Process, because whoever uses it (maybe you) needs to know that it's not safe to call that method at the top level of a module. It would be like doing this, which also falls in the "dangerous" category:
import multiprocessing as mp
def f():
...
class H:
def g(self):
p = mp.Process(target=f)
p.start()
p.join()
H().g() # this creates a Process at the top level
You could also consider an alternative approach where you make the caller do all the process creation. In this approach, either your sample class constructor or the main_calculation() method could accept, say, a Pool object, and it can use the processes from that pool to do its calculations. For example:
class sample:
def main_calculation(self, pool):
while True:
if newdata != olddata:
res1_async = pool.apply_async(self.calc1, [prepped_data])
res2_async = pool.apply_async(self.calc2, [prepped_data])
res1 = res1_async.get()
res2 = res2_async.get()
# and so on
This pattern may also allow your program to be more efficient in its use of resources, if there are many different calculations happening, because they can all use the same pool of processes.

Pythons parallel processing

I am in the following setting: I have a method that takes an objective function f as input. As a subrouting of that method i want to evaluate f on a small set of points. Since f has high complexity i considered doing that in parallel.
All online examples hang up even for trivial functions like squaring on sets with 5 points. They are using the multiprocessing library - and i don't know what i am doing wrong. I am not sure how to encapsulate that __name__ == "__main__" statement in my method. (since it is part of a module - i guess instead of "__main__" i should use the module name?)
Code i have been using looks like
from multiprocessing.pool import Pool
from multiprocessing import cpu_count
x = [1,2,3,4,5]
num_cores = cpu_count()
def f(x):
return x**2
if __name__ == "__main__":
pool = Pool(num_cores)
y = list(pool.map(f, x))
pool.join()
print(y)
When executing this code in my spyder it takes a bloody long time to finish.
So my main questions are: What am i doing wrong in this code? How can i encapsulate the __name__-statement, when this code is part of a bigger method?
Is it even worth it parallelizing this? (one function evaluation can take multiple minutes and in serial this adds up to a total runtime of hours...)
According to documentation :
close()
Prevents any more tasks from being submitted to the pool. Once all the tasks have been completed the worker processes will exit.
terminate()
Stops the worker processes immediately without completing outstanding work. When the pool object is garbage collected
terminate() will be called immediately.
join()
Wait for the worker processes to exit. One must call close() or terminate() before using join().
So you should add :
from multiprocessing.pool import Pool
from multiprocessing import cpu_count
x = [1,2,3,4,5]
def f(x):
return x**2
if __name__ == "__main__":
pool = Pool()
y = list(pool.map(f, x))
pool.close()
pool.join()
print(y)
You can call Pool without any argument and it will use cpu_count by default
If processes is None then the number returned by cpu_count() is used
About the if name == "main", read more informations here.
So you need to think a bit about which code you want executed only in the main program. The most obvious example is that you want code that creates child processes to run only in the main program - so that should be protected by name == 'main'
You might want to look into the chunksize argument of the map function that you are using.
On a large enough input list, a lot of your time is spent simply communicating the arguments to and from the separate parallel processes.
One symptom of this problem is that when you use something like htop all cores are firing but at < 100%.

Spawning multiple processes with Python

Earlier I tried to use the threading module in python to create multiple threads. Then I learned about the GIL and how it does not allow taking advantage of multiple CPU cores on a single machine. So now I'm trying to do multiprocessing (I don't strictly need seperate threads).
Here is a sample code I wrote to see if distinct processes are being created. But as can be seen in the output below, I'm getting the same process ID everytime. So multiple processes are not being created. What am I missing?
import multiprocessing as mp
import os
def pri():
print(os.getpid())
if __name__=='__main__':
# Checking number of CPU cores
print(mp.cpu_count())
processes=[mp.Process(target=pri()) for x in range(1,4)]
for p in processes:
p.start()
for p in processes:
p.join()
Output:
4
12554
12554
12554
The Process class requires a callable as its target.
Instead of running the function in the separate process, you are calling it and passing its result (None in this case) to the Process class.
Just change the following:
mp.Process(target=pri())
with:
mp.Process(target=pri)
Since the subprocesses runs on a different process, you won't see their print statements. They also don't share the same memory space. You pass pri() to target, where it needs to be pri. You need to pass a callable object, not execute it.
The prints you see are part of your main thread executions. Because you pass pri(), the code is actually executed. You need to change your code so the pri function returns value, rather than prints it.
Then you need to implement a queue, where all your threads write to it and when they're done, your main thread reads the queue.
A nice feature of the multiprocessing module is the Pool object. It allows you to create a thread pool, and then just use it. It's more convenient.
I have tried your code, the thing is the command executes too quick, so the OS reuses the PIDs. If you add a time.sleep(1) in your pri function, it would work as you expect.
That is True only for Windows. The example below is made on Windows platform. On Unix like machines, you won't need the sleep.
The more convenience solution is like this:
from multiprocessing import Pool
from time import sleep
import os
def pri(x):
sleep(1)
return os.getpid()
def use_procs():
p_pool = Pool(4)
p_results = p_pool.map(pri, [_ for _ in range(1,4)])
p_pool.close()
p_pool.join()
return p_results
if __name__ == '__main__':
res = use_procs()
for r in res:
print r
Without the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
6576
6576
6576
>>>
with the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
10396
10944
9000

python multiprocessing can't find error

I'm trying to run a function with multiprocessing. This is the code:
import multiprocessing as mu
output = []
def f(x):
output.append(x*x)
jobs = []
np = mu.cpu_count()
for n in range(np*500):
p = mu.Process(target=f, args=(n,))
jobs.append(p)
running = []
for i in range(np):
p = jobs.pop()
running.append(p)
p.start()
while jobs != []:
for r in running:
if r.exitcode == 0:
try:
running.remove(r)
p = jobs.pop()
p.start()
running.append(p)
except IndexError:
break
print "Done:"
print output
The output is [], while it should be [1,4,9,...]. Someone sees where i'm making a mistake?
You are using multiprocessing, not threading. So your output list is not shared between the processes.
There are several possible solutions;
Retain most of your program but use a multiprocessing.Queue instead of a list. Let the workers put their results in the queue, and read it from the main program. It will copy data from process to process, so for big chunks of data this will have significant overhead.
You could use shared memory in the form of multiprocessing.Array. This might be the best solution if the processed data is large.
Use a Pool. This takes care of all the process management for you. Just like with a queue, it copies data from process to process. It is probably the easiest to use. IMO this is the best option if the data sent to/from each worker is small.
Use threading so that the output list is shared between threads. Threading in CPython has the restriction that only one thread at a time can be executing Python bytecode, so you might not get as much performance benefit as you'd expect. And unlike the multiprocessing solutions it will not take advantage of multiple cores.
Edit:
Thanks to #Roland Smith to point out.
The main problem is the function f(x). When child process call this, it's unable for them to fine the output variable (since it's not shared).
Edit:
Just as #cdarke said, in multiprocessing you have to carefully control the shared object that child process could access(maybe a lock), and it's pretty complicated and hard to debug.
Personally I suggest to use the Pool.map method for this.
For instance, I assume that you run this code directly, not as a module, then your code would be:
import multiprocessing as mu
def f(x):
return x*x
if __name__ == '__main__':
np = mu.cpu_count()
args = [n for n in range(np*500)]
pool = mu.Pool(processes=np)
result = pool.map(f, args)
pool.close()
pool.join()
print result
but there's something you must know
if you just run this file but not import with module, the if __name__ == '__main__': is important, since python will load this file as a module for other process, if you don't place the function 'f' outside if __name__ == '__main__':, the child process would not be able to find your function 'f'
**Edit:**thanks #Roland Smith point out that we could use tuple
if you have more then one args for the function f, then you might need a tuple to do so, for instance
def f((x,y))
return x*y
args = [(n,1) for n in range(np*500)]
result = pool.map(f, args)
or check here for more detailed discussion

Categories