I am currently scraping data from various sites.
The code for the scrapers is stored in modules (x, y, z, a, b),
where x.dump is a function that uses files to store the scraped data.
Each dump function takes a single argument, 'input'.
Note: the dump functions are not all the same.
I am trying to run each of these dump functions in parallel.
The following code runs fine,
but I have noticed that it still executes in serial order: x, then y, and so on.
Is this the correct way of going about the problem?
Are multithreading and multiprocessing the only native ways for parallel programming?
from multiprocessing import Process
import x.x as x
import y.y as y
import z.z as z
import a.a as a
import b.b as b

input = ""
f_list = [x.dump, y.dump, z.dump, a.dump, b.dump]

processes = []
for function in f_list:
    processes.append(Process(target=function, args=(input,)))

for process in processes:
    process.run()

for process in processes:
    process.join()
That's because run() is the method that implements the task itself; you're not meant to call it directly like that. You are supposed to call start(), which spawns a new process, calls run() inside that process, and returns control to you so you can do more work (and later join()).
You should be calling process.start(), not process.run().
The start method does the work of starting the extra process and then running the run method in that process.
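Applied to the question's code, only the middle loop changes; roughly:

for function in f_list:
    processes.append(Process(target=function, args=(input,)))

for process in processes:
    process.start()   # spawns a new process, which then calls run() internally

for process in processes:
    process.join()    # wait for all of them to finish

Because every process is started before any join() is called, the dump functions can run concurrently.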
Python docs
I am learning Python and its multiprocessing.
I created a project with a main() in main.py and a function a_simulation inside the module simulation.py under the package simulator/.
The symptom is that a test statement print("hello\n") placed in main.py before the definition of main() is executed multiple times when the program is run with python main.py, which indicates that everything before that print, including the creation of the lists, is also executed multiple times.
I do not think I understand the related Python issues very well. May I know what the reason for this symptom is, and what the best practice is in Python when creating projects like this? I have included the code and the terminal output. Thank you!
Edit: Forgot to mention that I am running it with Anaconda Python on macOS, although I would like my project to work just fine on any platform.
main.py:
from multiprocessing import Pool
from simulator.simulation import a_simulation
import random

num_trials = 10
iter_trials = list(range(num_trials))
arg_list = [random.random() for _ in range(num_trials)]
input = list(zip(iter_trials, arg_list))
print("hello\n")

def main():
    with Pool(processes=4) as pool:
        result = pool.starmap(a_simulation, input)
        print(result)

if __name__ == "__main__":
    main()
simulator/simulation.py:
import os
from time import sleep

def a_simulation(x, seed_):
    print(f"Process {os.getpid()}: trial {x} received {seed_}\n")
    sleep(1)
    return seed_
Results from the terminal:
hello
hello
hello
hello
hello
Process 71539: trial 0 received 0.4512600158461971
Process 71538: trial 1 received 0.8772526554425158
Process 71541: trial 2 received 0.6893833978242683
Process 71540: trial 3 received 0.29249994820563296
Process 71538: trial 4 received 0.5759647958461107
Process 71541: trial 5 received 0.08799525261308505
Process 71539: trial 6 received 0.3057644321667139
Process 71540: trial 7 received 0.5402091856171599
Process 71538: trial 8 received 0.1373456223147438
Process 71541: trial 9 received 0.24000943476017
[0.4512600158461971, 0.8772526554425158, 0.6893833978242683, 0.29249994820563296, 0.5759647958461107, 0.08799525261308505, 0.3057644321667139, 0.5402091856171599, 0.1373456223147438, 0.24000943476017]
The reason this happens is that multiprocessing uses the spawn start method by default on Windows and macOS to start new processes. What this means is that whenever you want to start a new process, the child process is initially created without sharing any of the memory of the parent. However, this makes things messy when you want to start a function in the child process from the parent, because not only will the child not know the definition of the function itself, you might also run into some unexpected obstacles (what if the function depends on a variable defined in the parent process's module?). To stop these sorts of things from happening, multiprocessing automatically imports the parent process's module from the child process, which essentially recreates almost the entire state of the parent at the time the child process was started.
This is where if __name__ == "__main__" comes in. This statement basically translates to "if the current file is being run directly, then ..."; the code under this block will not run if the module is being imported. Therefore, the child processes will not run anything under this block when they are spawned. You can hence use this block to create, for example, variables which use up a lot of memory and are not required for the child processes to function but are used by the parent. Basically, anything that the child processes won't need should go under here.
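For instance, one way to restructure the question's main.py so that the setup only runs in the parent (a sketch that keeps the question's names):

from multiprocessing import Pool
from simulator.simulation import a_simulation
import random

def main():
    # Setup that only the parent needs lives inside main(), which is only
    # called under the __main__ guard, so spawned children never re-run it.
    num_trials = 10
    arg_list = [random.random() for _ in range(num_trials)]
    inputs = list(zip(range(num_trials), arg_list))
    print("hello\n")
    with Pool(processes=4) as pool:
        print(pool.starmap(a_simulation, inputs))

if __name__ == "__main__":
    main()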
Now coming to your comment about imports:
This must be a silly question, but should I leave the import statements as they are, or move them inside if __name__ == "__main__":, or somewhere else? Thanks
Like I said, anything that the child doesn't need can be put under this if block. The reason you don't often see imports under this block is perhaps due to sticking to convention ("imports should be done at the top") and because the modules being imported don't really affect performance much (even after being needlessly imported multiple times). Keep in mind, however, that if a child process requires a particular module to start its work, it will always be imported again within the child process, even if you have imported it under the if __name__... block. This is because when you attempt to spawn child processes to start a function in parallel, multiprocessing automatically serializes and sends the names of the function and of the module that defines it (the actual code is not serialized, only the names) to the child processes, where they are imported once more (relevant question).
This only applies when the start method is spawn; you can read more about the differences here.
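If you want to see the difference yourself, a small sketch that selects the start method explicitly (assuming you are on a platform where the requested method is available):

import multiprocessing as mp

def worker(n):
    return n * n

if __name__ == "__main__":
    # "spawn" is the default on Windows and macOS; "fork" is the default on Linux.
    ctx = mp.get_context("spawn")   # try "fork" where supported, to compare
    with ctx.Pool(2) as pool:
        print(pool.map(worker, range(4)))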
Objective
A process (.exe) with multiple input arguments.
Multiple files; for each of them, the above-mentioned process shall be executed.
I want to use Python to parallelize the process.
I am using subprocess.Popen to create the processes and afterwards keep a maximum of N parallel processes.
For testing purposes, I want to parallelize a simple script like "cmd timeout 5".
State of work
import subprocess

count = 10
parallel = 2
processes = []
for i in range(0, count):
    while len(processes) >= parallel:
        for process in processes:
            if process.poll() is None:
                processes.remove(process)
                break
    process = subprocess.Popen(["cmd", "/c timeout 5"])
    processes.append(process)
[...]
I read somewhere that a good approach for checking whether a process is still running is to compare the result of poll() against None, as shown in the code.
Question
I am somehow struggling to set it up correctly, especially the Popen([...]) part. In some cases, all processes are executed without considering the maximum parallel count, and in other cases it doesn't work at all.
I guess there has to be a part where the process is closed once it is finished.
Thanks!
You will probably have a better time using the built-in multiprocessing module to manage the subprocesses running your tasks.
The reason I've wrapped the command in a dict is that imap_unordered (which is faster than imap but doesn't guarantee ordered execution, since any worker process can grab any job; whether that's okay for you is your business problem) doesn't have a starmap alternative, so it's easier to unpack a single "job" within the callable.
import multiprocessing
import subprocess

def run_command(job):
    # TODO: add other things here?
    subprocess.check_call(job["command"])

def main():
    with multiprocessing.Pool(2) as p:
        jobs = [{"command": ["cmd", "/c timeout 5"]} for x in range(10)]
        for result in p.imap_unordered(run_command, jobs):
            pass

if __name__ == "__main__":
    main()
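That said, if you prefer to stay with subprocess.Popen, a minimal polling sketch might look like the following (only the question's Windows test command is assumed; note that poll() returns None while a process is still running, so the finished ones are those where poll() is not None):

import subprocess
import time

count = 10
parallel = 2
processes = []

for i in range(count):
    # Block until a slot is free, dropping processes that have finished.
    while len(processes) >= parallel:
        processes = [p for p in processes if p.poll() is None]
        time.sleep(0.1)
    processes.append(subprocess.Popen(["cmd", "/c timeout 5"]))

# Wait for the remaining processes to finish.
for p in processes:
    p.wait()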
I'm trying to understand multiprocessing. My actual application is to display log messages in real time on a pyqt5 GUI, but I ran into some problems using queues so I made a simple program to test it out.
The issue I'm seeing is that I am unable to add elements to a Queue across python modules and across processes. Here is my code and my output, along with the expected output.
Config file for globals:
# cfg.py
# Using a config file to import my globals across modules
#import queue
import multiprocessing
# q = queue.Queue()
q = multiprocessing.Queue()
Main module:
# mod1.py
import cfg
import mod2
import multiprocessing

def testq():
    global q
    print("q has {} elements".format(cfg.q.qsize()))

if __name__ == '__main__':
    testq()
    p = multiprocessing.Process(target=mod2.add_to_q)
    p.start()
    p.join()
    testq()
    mod2.pullfromq()
    testq()
Secondary module:
# mod2.py
import cfg

def add_to_q():
    cfg.q.put("Hello")
    cfg.q.put("World!")
    print("qsize in add_to_q is {}".format(cfg.q.qsize()))

def pullfromq():
    if not cfg.q.empty():
        msg = cfg.q.get()
        print(msg)
Here is the output that I actually get from this:
q has 0 elements
qsize in add_to_q is 2
q has 0 elements
q has 0 elements
vs the output that I would expect to get:
q has 0 elements
qsize in add_to_q is 2
q has 2 elements
Hello
q has 1 elements
So far I have tried using both multiprocessing.Queue and queue.Queue. I have also tested this with and without Process.join().
If I run the same program without using multiprocessing, I get the expected output shown above.
What am I doing wrong here?
EDIT:
Process.run() gives me the expected output, but it also blocks the main process while it is running, which is not what I want to do.
My understanding is that Process.run() runs the created process in the context of the calling process (in my case the main process), meaning that it is no different from the main process calling the same function.
I still don't understand why my queue behavior isn't working as expected.
I've discovered the root of the issue and I'll document it here for future searches, but I'd still like to know if there's a standard solution to creating a global queue between modules so I'll accept any other answers/comments.
I found the problem when I added the following to my cfg.py file.
print("cfg.py is running in process {}".format(multiprocessing.current_process()))
This gave me the following output:
cfg.py is running in process <_MainProcess(MainProcess, started)>
cfg.py is running in process <_MainProcess(Process-1, started)>
cfg.py is running in process <_MainProcess(Process-2, started)>
It would appear that I'm creating separate Queue objects for each process that I create, which would certainly explain why they aren't interacting as expected.
This question has a comment stating that
a shared queue needs to originate from the master process, which is then passed to all of its subprocesses.
All this being said, I'd still like to know if there is an effective way to share a global queue between modules without having to pass it between methods.
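For reference, a minimal sketch of that pass-it-down approach (the worker here takes the queue as an argument, which is a change from the question's module-global design):

import multiprocessing

def add_to_q(q):
    q.put("Hello")
    q.put("World!")

if __name__ == '__main__':
    q = multiprocessing.Queue()   # created once, in the parent process
    p = multiprocessing.Process(target=add_to_q, args=(q,))
    p.start()
    print(q.get())   # "Hello"
    print(q.get())   # "World!"
    p.join()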
Earlier I tried to use the threading module in Python to create multiple threads. Then I learned about the GIL and how it does not allow taking advantage of multiple CPU cores on a single machine. So now I'm trying multiprocessing (I don't strictly need separate threads).
Here is some sample code I wrote to see if distinct processes are being created. But as can be seen in the output below, I'm getting the same process ID every time, so multiple processes are not being created. What am I missing?
import multiprocessing as mp
import os

def pri():
    print(os.getpid())

if __name__ == '__main__':
    # Checking number of CPU cores
    print(mp.cpu_count())
    processes = [mp.Process(target=pri()) for x in range(1, 4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
Output:
4
12554
12554
12554
The Process class requires a callable as its target.
Instead of running the function in the separate process, you are calling it and passing its result (None in this case) to the Process class.
Just change the following:
mp.Process(target=pri())
to:
mp.Process(target=pri)
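With that change, the question's script looks roughly like this:

import multiprocessing as mp
import os

def pri():
    print(os.getpid())

if __name__ == '__main__':
    print(mp.cpu_count())
    # Pass the function itself; Process will call it in the child.
    processes = [mp.Process(target=pri) for x in range(1, 4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()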
Since the subprocesses run in separate processes, you won't see their print statements. They also don't share the same memory space. You pass pri() to target, where it needs to be pri: you need to pass a callable object, not execute it.
The prints you see are part of your main process's execution: because you pass pri(), the code is actually executed there. You need to change your code so the pri function returns the value rather than printing it.
Then you need to implement a queue that all your workers write to, and when they're done, your main process reads from the queue.
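A small sketch of that idea, assuming a multiprocessing.Queue rather than any particular code from the question:

import multiprocessing as mp
import os

def pri(q):
    # Return the PID through the queue instead of printing it.
    q.put(os.getpid())

if __name__ == '__main__':
    q = mp.Queue()
    procs = [mp.Process(target=pri, args=(q,)) for _ in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    while not q.empty():
        print(q.get())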
A nice feature of the multiprocessing module is the Pool object. It allows you to create a process pool and then just use it. It's more convenient.
I have tried your code; the thing is that the command executes too quickly, so the OS reuses the PIDs. If you add a time.sleep(1) to your pri function, it works as you expect.
That is true only for Windows. The example below was made on a Windows platform. On Unix-like machines, you won't need the sleep.
The more convenient solution looks like this:
from multiprocessing import Pool
from time import sleep
import os

def pri(x):
    sleep(1)
    return os.getpid()

def use_procs():
    p_pool = Pool(4)
    p_results = p_pool.map(pri, [_ for _ in range(1, 4)])
    p_pool.close()
    p_pool.join()
    return p_results

if __name__ == '__main__':
    res = use_procs()
    for r in res:
        print r
Without the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
6576
6576
6576
>>>
with the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
10396
10944
9000
I would like to run a series of commands (which take a long time). But I do not want to wait for the completion of each command. How can I go about this in Python?
I looked at
os.fork()
and
subprocess.Popen()
I don't think that is what I need.
Code
def command1():
    wait(10)

def command2():
    wait(10)

def command3():
    wait(10)
I would like to call
command1()
command2()
command3()
Without having to wait.
Use python's multiprocessing module.
def func(arg1):
    ...  # do something

from multiprocessing import Process

p = Process(target=func, args=(arg1,), name='func')
p.start()
Complete documentation is over here: https://docs.python.org/2/library/multiprocessing.html
EDIT:
You can also use Python's threading module if you are using an implementation without a GIL (Global Interpreter Lock), such as Jython.
https://docs.python.org/2/library/threading.html
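For comparison, the threading version has the same shape (just a sketch; under CPython the GIL still limits CPU-bound work, but for I/O-bound or blocking commands it behaves much the same):

from threading import Thread
import time

def func(arg1):
    time.sleep(1)        # stand-in for a long-running command
    print("done", arg1)

t = Thread(target=func, args=('job-1',), name='func')
t.start()                # returns immediately; the thread runs in the background
# ... do other work here ...
t.join()                 # optional: wait for the thread to finish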
This example may be suitable for you:
#!/usr/bin/env python3
import os
import time

def forked(fork_func):
    def do_fork():
        pid = os.fork()
        if pid == 0:
            # Child process: run the function, then exit so the child does not
            # keep executing the rest of the main program.
            fork_func()
            os._exit(0)
        else:
            # Parent process: return immediately without waiting.
            return pid
    return do_fork

@forked
def command1():
    time.sleep(2)

@forked
def command2():
    time.sleep(1)

command1()
command2()
print("Hello")
You just use the @forked decorator for your functions.
There is only one problem: when the main program is over, it waits for the child processes to end.
The most straightforward way is to use Python's own multiprocessing:
from multiprocessing import Process

def command1():
    wait(10)

...

call1 = Process(target=command1, args=(...))
call1.start()

...
This module was introduced precisely to ease the burden of controlling external process execution of functions accessible in the same code base. Of course, that could already be done by using os.fork or subprocess. multiprocessing emulates, as far as possible, Python's own threading module interface. The one immediate advantage of using multiprocessing over threading is that it enables the various worker processes to make use of different CPU cores and actually work in parallel, while threading, due to language design limitations, is effectively limited to a single execution worker at a time and thus makes use of a single core even when several are available.
Now, note that there are still peculiarities, especially if you are, for example, calling these from inside a web request. Check this question and its answers from a few days ago:
Stop a background process in flask without creating zombie processes