I have this code:
import os

pid = os.fork()
if pid == 0:
    os.environ['HOME'] = "rep1"
    external_function()
else:
    os.environ['HOME'] = "rep2"
    external_function()
and this code:
import os
from multiprocessing import Process, Pipe

def f(conn):
    os.environ['HOME'] = "rep1"
    external_function()
    conn.send(some_data)
    conn.close()

if __name__ == '__main__':
    os.environ['HOME'] = "rep2"
    external_function()
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print parent_conn.recv()
    p.join()
The external_function initializes an external program by creating the necessary sub-directories in the directory found in the environment variable HOME. This function does this work only once in each process.
With the first example, which uses os.fork(), the directories are created as expected. But with the second example, which uses multiprocessing, only the directories in rep2 get created.
Why isn't the second example creating directories in both rep1 and rep2?
The answer you are looking for is addressed in detail here, along with an explanation of the differences between operating systems.
One big issue is that the fork system call does not exist on Windows, so on Windows you cannot use this method at all. multiprocessing is a higher-level interface for executing part of the currently running program; like forking, it creates a copy of your process's current state. In other words, it takes care of the forking of your program for you.
Where it is available, you can therefore consider fork() a lower-level interface to forking a program, and the multiprocessing library a higher-level interface to the same thing.
To answer your question directly: there must be some side effect of external_function that produces different results when the calls run in series rather than at the same time. That comes down to how you set up your code, because there is essentially no difference between os.fork and multiprocessing.Process on systems where os.fork is supported.
The only real differences between os.fork and multiprocessing.Process are portability and library overhead: os.fork is not supported on Windows, and the multiprocessing framework is included to make multiprocessing.Process work there. On Unix, multiprocessing.Process calls os.fork under the hood, as this answer backs up.
The important distinction, then, is that os.fork copies everything in the current process using Unix's fork, so at the moment of forking both processes are identical apart from their PIDs. On Windows, this is emulated by rerunning all of the setup code before the if __name__ == '__main__': guard, which is roughly the same as creating a subprocess with the subprocess library.
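To make that concrete, here is a small sketch of my own (assuming Python 3, where the start method can be chosen explicitly): with the 'fork' start method the module-level line runs only in the parent, while with 'spawn', which matches the Windows behaviour, it runs again in every child.

import multiprocessing as mp

print('module-level setup')        # runs once with fork, once more per child with spawn

def work():
    pass

if __name__ == '__main__':
    mp.set_start_method('spawn')   # switch to 'fork' on Unix to compare
    p = mp.Process(target=work)
    p.start()
    p.join()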
In your case, the two snippets you provided are doing fairly different things, because in the second one you call external_function in the main process before you start the new process, so the two calls run in series, just in different processes. The pipe is also unnecessary, as it does not correspond to anything in the first snippet.
In Unix, the code snippets:
import os

pid = os.fork()
if pid == 0:
    os.environ['HOME'] = "rep1"
    external_function()
else:
    os.environ['HOME'] = "rep2"
    external_function()
and:
import os
from multiprocessing import Process

def f():
    os.environ['HOME'] = "rep1"
    external_function()

if __name__ == '__main__':
    p = Process(target=f)
    p.start()
    os.environ['HOME'] = "rep2"
    external_function()
    p.join()
should do exactly the same thing, but with a little extra overhead from the included multiprocessing library.
Without further information, we can't figure out what the issue is. If you can provide code that demonstrates the issue, that would help us help you.
Related
I'm using multiprocessing in a larger code base where some of the import statements have side effects. How can I run a function in a background process without having it inherit global imports?
# helper.py:
print('This message should only print once!')

# main.py:
import multiprocessing as mp

import helper  # This prints the message.

def worker():
    pass  # Unfortunately this also prints the message again.

if __name__ == '__main__':
    mp.set_start_method('spawn')
    process = mp.Process(target=worker)
    process.start()
    process.join()
Background: Importing TensorFlow initializes CUDA, which reserves some amount of GPU memory. As a result, spawning too many processes leads to a CUDA OOM error, even though the processes don't use TensorFlow.
Similar question without an answer:
How to avoid double imports with the Python multiprocessing module?
Is there a resource that explains exactly what the multiprocessing module does when starting an mp.Process?
Super quick version (using the spawn context, not fork)
Some stuff (a pair of pipes for communication, cleanup callbacks, etc.) is prepared, then a new process is created with fork() followed by exec(). On Windows it's CreateProcessW(). The new Python interpreter is started with a startup script spawn_main() and is passed the communication pipe file descriptors via a crafted command string and the -c switch. The startup script cleans up the environment a little bit, then unpickles the Process object from its communication pipe. Finally it calls the run method of the process object.
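If you are curious what that crafted command string looks like, multiprocessing will show it to you; a quick sketch of mine, where the exact output depends on platform and Python version (the real call also passes pipe and tracker file descriptors as keyword arguments):

import multiprocessing.spawn

print(multiprocessing.spawn.get_command_line())
# e.g. ['/usr/bin/python3', '-c',
#       'from multiprocessing.spawn import spawn_main; spawn_main()',
#       '--multiprocessing-fork']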
So what about importing of modules?
Pickle semantics handle some of it, but __main__ and sys.modules need some TLC, which is handled here (during the "cleans up the environment" bit).
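One practical consequence of the pickling step (my addition, not part of the original answer): whatever you pass as target must be importable by the child, so a module-level function works, but a lambda or a nested function fails under spawn. A tiny sketch:

import multiprocessing as mp

def importable_target():  # defined at module level, so it pickles fine
    print('ok')

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    p = ctx.Process(target=importable_target)
    p.start()
    p.join()
    # ctx.Process(target=lambda: None).start()  # would fail to pickle

With that in mind, here is the example from the question again with the helper import moved inside main(), so the spawned worker never imports it: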
# helper.py:
print('This message should only print once!')

# main.py:
import multiprocessing as mp

def worker():
    pass

def main():
    # Importing the module only locally so that the background
    # worker won't import it again.
    import helper
    mp.set_start_method('spawn')
    process = mp.Process(target=worker)
    process.start()
    process.join()

if __name__ == '__main__':
    main()
I'm trying to port a shell script to the much more readable Python version. The original shell script starts several processes (utilities, monitors, etc.) in the background with "&". How can I achieve the same effect in Python? I'd like these processes not to die when the Python script completes. I am sure it's related to the concept of a daemon somehow, but I couldn't find an easy way to do it.
While jkp's solution works, the newer way of doing things (and the way the documentation recommends) is to use the subprocess module. For simple commands it's equivalent, but it offers more options if you want to do something complicated.
Example for your case:
import subprocess
subprocess.Popen(["rm", "-r", "some.file"])
This will run rm -r some.file in the background. Note that calling .communicate() on the object returned from Popen will block until it completes, so don't do that if you want it to run in the background:
import subprocess
ls_output = subprocess.Popen(["sleep", "30"])
ls_output.communicate()  # Will block for 30 seconds
See the documentation here.
Also, a point of clarification: "Background" as you use it here is purely a shell concept; technically, what you mean is that you want to spawn a process without blocking while you wait for it to complete. However, I've used "background" here to refer to shell-background-like behavior.
Note: This answer is less current than it was when posted in 2009. Using the subprocess module shown in other answers is now recommended in the docs:
(Note that the subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using these functions.)
If you want your process to start in the background you can either use system() and call it in the same way your shell script did, or you can spawn it:
import os
os.spawnl(os.P_DETACH, 'some_long_running_command')
(or, alternatively, you may try the less portable os.P_NOWAIT flag).
See the documentation here.
You probably want the answer to "How to call an external command in Python".
The simplest approach is to use the os.system function, e.g.:
import os
os.system("some_command &")
Basically, whatever you pass to the system function will be executed the same as if you'd passed it to the shell in a script.
I found this here:
On Windows (Win XP), the parent process will not finish until longtask.py has finished its work. It is not what you want in a CGI script. The problem is not specific to Python; the PHP community has the same problem.
The solution is to pass the DETACHED_PROCESS process creation flag to the underlying CreateProcess function in the Windows API. If you happen to have pywin32 installed, you can import the flag from the win32process module; otherwise you should define it yourself:
import subprocess
import sys

DETACHED_PROCESS = 0x00000008

pid = subprocess.Popen([sys.executable, "longtask.py"],
                       creationflags=DETACHED_PROCESS).pid
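On Python 3.7 and later the flag also ships with subprocess itself (Windows only), so it no longer has to be defined by hand; a brief sketch, keeping longtask.py from the snippet above:

import subprocess
import sys

# subprocess.DETACHED_PROCESS is available in Python 3.7+ on Windows.
pid = subprocess.Popen([sys.executable, "longtask.py"],
                       creationflags=subprocess.DETACHED_PROCESS).pid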
Use subprocess.Popen() with the close_fds=True parameter, which will allow the spawned subprocess to be detached from the Python process itself and continue running even after Python exits.
https://gist.github.com/yinjimmy/d6ad0742d03d54518e9f
import os, time, sys, subprocess

if len(sys.argv) == 2:
    time.sleep(5)
    print 'track end'
    if sys.platform == 'darwin':
        subprocess.Popen(['say', 'hello'])
else:
    print 'main begin'
    subprocess.Popen(['python', os.path.realpath(__file__), '0'], close_fds=True)
    print 'main end'
Capture output and run in the background at the same time, using threading
As mentioned in this answer, if you capture the output with stdout= and then try to read() it, the process blocks.
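A minimal sketch of the blocking behaviour being described (the sleep command is just a stand-in for a long-running child):

import subprocess

proc = subprocess.Popen(['sleep', '30'], stdout=subprocess.PIPE)
data = proc.stdout.read()  # blocks until the child closes stdout, i.e. exits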
However, there are cases where you need this. For example, I wanted to launch two processes that talk over a port between them, and save their stdout to a log file while also echoing it to my own stdout.
The threading module allows us to do that.
First, have a look at how to do the output redirection part alone in this question: Python Popen: Write to stdout AND log file simultaneously
Then:
main.py
#!/usr/bin/env python3
import os
import subprocess
import sys
import threading

def output_reader(proc, file):
    while True:
        byte = proc.stdout.read(1)
        if byte:
            sys.stdout.buffer.write(byte)
            sys.stdout.flush()
            file.buffer.write(byte)
        else:
            break

with subprocess.Popen(['./sleep.py', '0'], stdout=subprocess.PIPE, stderr=subprocess.PIPE) as proc1, \
     subprocess.Popen(['./sleep.py', '10'], stdout=subprocess.PIPE, stderr=subprocess.PIPE) as proc2, \
     open('log1.log', 'w') as file1, \
     open('log2.log', 'w') as file2:
    t1 = threading.Thread(target=output_reader, args=(proc1, file1))
    t2 = threading.Thread(target=output_reader, args=(proc2, file2))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
sleep.py
#!/usr/bin/env python3
import sys
import time

for i in range(4):
    print(i + int(sys.argv[1]))
    sys.stdout.flush()
    time.sleep(0.5)
After running:
./main.py
stdout gets updated every 0.5 seconds, two lines at a time, to contain:
0
10
1
11
2
12
3
13
and each log file contains the respective log for a given process.
Inspired by: https://eli.thegreenplace.net/2017/interacting-with-a-long-running-child-process-in-python/
Tested on Ubuntu 18.04, Python 3.6.7.
You probably want to start investigating the os module for forking separate processes (for example by opening an interactive session and issuing help(os)). The relevant functions are fork and any of the exec ones. To give you an idea of how to start, put something like this in a function that performs the fork (the function needs to take a list or tuple 'args' as an argument that contains the program's name and its parameters; you may also want to define stdin, stdout and stderr for the new process):
try:
    pid = os.fork()
except OSError as e:
    ## some debug output
    sys.exit(1)
if pid == 0:
    ## eventually use os.putenv(..) to set environment variables
    ## os.execv strips off args[0] for the arguments
    os.execv(args[0], args)
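For completeness, a sketch of how the surrounding function might look, including the parent side; the function name and the optional wait are my assumptions rather than part of the original answer, and it is written in Python 3 syntax:

import os
import sys

def spawn(args, wait=False):
    """Fork, then exec args[0] with args as its argument list."""
    try:
        pid = os.fork()
    except OSError as e:
        sys.stderr.write("fork failed: %s\n" % e)  # some debug output
        sys.exit(1)
    if pid == 0:
        # child: replace this process image with the target program
        os.execv(args[0], args)
    # parent: either wait for the child or just hand back its pid
    if wait:
        os.waitpid(pid, 0)
    return pid

# hypothetical usage:
# spawn(['/bin/ls', 'ls', '-l'])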
You can use
import os
pid = os.fork()
if pid == 0:
    # continue with the other code ...
This will make the Python process run in the background.
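Note that forking by itself only gives you a second process; for shell-style backgrounding you normally let the parent return while the child carries on. A minimal sketch of that idea (my addition, not from the answer above):

import os
import sys

pid = os.fork()
if pid > 0:
    sys.exit(0)  # parent exits immediately...
# ...while the child keeps running here in the background
print('still running with pid %d' % os.getpid())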
I haven't tried this yet, but using .pyw files instead of .py files should help. .pyw files don't have a console, so in theory the script should not show a console window and should behave like a background process.
I'm using a commercial application that uses Python as part of its scripting API. One of the functions provided is something called App.run(). When this function is called, it starts a new Java process that does the rest of the execution. (Unfortunately, I don't really know what it's doing under the hood as the supplied Python modules are .pyc files, and many of the Python functions are SWIG generated).
The trouble I'm having is that I'm building the App.run() call into a larger Python application that needs to do some guaranteed cleanup code (closing a database, etc.). Unfortunately, if the subprocess is interrupted with Ctrl+C, it aborts and returns to the command line without returning control to the main Python program. Thus, my cleanup code never executes.
So far I've tried:
Registering a function with atexit... doesn't work
Putting cleanup in a class __del__ destructor... doesn't work. (App.run() is inside the class)
Creating a signal handler for Ctrl+C in the main Python app... doesn't work
Putting App.run() in a Thread... results in a Memory Fault after the Ctrl+C
Putting App.run() in a Process (from multiprocessing)... doesn't work
Any ideas what could be happening?
This is just an outline, but something like this?
import os

cpid = os.fork()
if not cpid:
    # change stdio handles etc
    os.setsid()  # Probably not needed
    App.run()
    os._exit(0)

os.waitpid(cpid, 0)
# clean up here
(os.fork is *nix only)
The same idea could be implemented with subprocess in an OS-agnostic way. The idea is to run App.run() in a child process and then wait for the child process to exit, regardless of how the child died. On POSIX, you could also trap SIGCHLD (child process death). I'm not a Windows guru, so if that applies and subprocess doesn't work, someone else will have to chime in here.
After App.run() is called, I'd be curious what the process tree looks like. It's possible it's running an exec and taking over the Python process space. If that's happening, creating a child process first is the only way I can think of to trap it.
If try: App.run() followed by finally: cleanup() doesn't work, you could try running it in a subprocess:
import sys
from subprocess import call
rc = call([sys.executable, 'path/to/run_app.py'])
cleanup()
Or, if you have the code in a string, you could use the -c option, e.g.:
rc = call([sys.executable, '-c', '''import sys
print(sys.argv)
'''])
You could implement #tMC's suggestion using subprocess by adding the preexec_fn=os.setsid argument (note: no ()), though I don't see how creating a process group might help here. Or you could try the shell=True argument to run it in a separate shell.
You might give another try to multiprocessing:
import multiprocessing as mp

if __name__ == "__main__":
    p = mp.Process(target=App.run)
    p.start()
    p.join()
    cleanup()
Are you able to wrap the App.run() call in a try/except?
Something like:
try:
    App.run()
except (KeyboardInterrupt, SystemExit):
    print "User requested an exit..."
    cleanup()
I don't understand why this simple code
# file: mp.py
from multiprocessing import Process
import sys

def func(x):
    print 'works ', x + 2
    sys.stdout.flush()

p = Process(target=func, args=(2,))
p.start()
p.join()
p.terminate()

print 'done'
sys.stdout.flush()
creates "pythonw.exe" processes continuously and it doesn't print anything, even though I run it from the command line:
python mp.py
I am running the latest Python 2.6 on Windows 7, both 32 and 64 bit.
You need to protect the entry point of the program by using if __name__ == '__main__':.
This is a Windows-specific problem. On Windows your module has to be imported into a new Python interpreter in order for it to access your target code. If you don't stop this new interpreter from running the start-up code, it will spawn another child, which will then spawn another child, until there are pythonw.exe processes as far as the eye can see.
Other platforms use os.fork() to launch the subprocesses, so they don't have the problem of re-importing the module.
So your code will need to look like this:
from multiprocessing import Process
import sys

def func(x):
    print 'works ', x + 2
    sys.stdout.flush()

if __name__ == '__main__':
    p = Process(target=func, args=(2,))
    p.start()
    p.join()
    p.terminate()
    print 'done'
    sys.stdout.flush()
According to the programming guidelines for multiprocessing, on Windows you need to use an if __name__ == '__main__': guard.
Funny, works on my Linux machine:
$ python mp.py
works 4
done
$
Is the multiprocessing thing supposed to work on Windows? A lot of programs that originated in the Unix world don't handle Windows so well, because Unix uses fork(2) to clone processes quite cheaply, whereas (it is my understanding) Windows does not support fork(2) gracefully, if at all.
I am working with a cluster system on Linux (www.mosix.org) that allows me to run jobs and have the system run them on different computers. Jobs are run like so:
mosrun ls &
This will naturally create the process and run it in the background, returning the process id, like so:
[1] 29199
Later it will return. I am writing a Python infrastructure that would run jobs and control them. For that I want to run jobs using the mosrun program as above, and save the process ID of the spawned process (29199 in this case). This naturally cannot be done with os.system or commands.getoutput, as the printed ID is not something the process itself writes to its output... Any clues?
Edit:
Since the Python script is only meant to launch the jobs initially, the spawned scripts need to run longer than the Python shell itself. I guess that means the mosrun process cannot be the script's child process. Any suggestions?
Thanks
Use the subprocess module. Popen instances have a pid attribute.
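A minimal sketch of mine for the mosrun case from the question (assuming mosrun is on the PATH): Popen starts the job without waiting for it, and .pid gives you the id of the process you spawned:

import subprocess

proc = subprocess.Popen(["mosrun", "ls"])  # does not block
print(proc.pid)                            # pid of the spawned mosrun process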
Looks like you want to ensure the child process is daemonized. PEP 3143, which I'm pointing to, documents that and points to a reference implementation, as well as to others.
Once your process (still running Python code) is daemonized, be it by the means offered in PEP 3143 or others, you can os.execl (or use another os.exec... function) your target code. This runs said target code in exactly the same process which we just said is daemonized, so it keeps being daemonized, as desired.
The last step cannot use subprocess because it needs to run in the same (daemonized) process, overlaying its executable code, which is exactly what os.execl and friends are for.
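As a tiny illustration of that overlay step (the program and its arguments here are placeholders of mine):

import os

# Replace the current (already daemonized) process image with the target
# program; the first argument is the path, the rest become its argv.
os.execl('/bin/sleep', 'sleep', '60')
# nothing after os.execl runs; the process is now /bin/sleep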
The first step, before daemonization, might conceivably be done via subprocess, but that's somewhat inconvenient (you need to put the daemonize-then-os.exec code in a separate .py): most commonly you'd just want to os.fork and immediately daemonize the child process.
subprocess is quite convenient as a mostly cross-platform way to run other processes, but it can't really replace Unix's good old "fork and exec" approach for advanced uses (such as daemonization, in this case), which is why it's a good thing that the Python standard library also lets you do the latter via those functions in the os module!-)
Thanks all for the help. Here's what I did in the end, and it seems to work OK. The code uses python-daemon. Maybe something smarter should be done about transferring the process id from the child to the parent, but that's the easier part.
import os
import time
import warnings
import daemon

def run_in_background(command, tmp_dir="/tmp"):
    # Decide on a temp file beforehand
    warnings.filterwarnings("ignore", "tempnam is a potential security")
    tmp_filename = os.tempnam(tmp_dir)
    # Duplicate the process
    pid = os.fork()
    # If we're the child, daemonize and run
    if pid == 0:
        with daemon.DaemonContext():
            child_id = os.getpid()
            file(tmp_filename, 'w').write(str(child_id))
            sp = command.split(' ')
            os.execl(*([sp[0]] + sp))
    else:
        # If we're the parent, poll for the new file
        n_iter = 0
        while True:
            if os.path.exists(tmp_filename):
                child_id = int(file(tmp_filename, 'r').read().strip())
                break
            if n_iter == 100:
                raise Exception("Cannot read process id from temp file %s" % tmp_filename)
            n_iter += 1
            time.sleep(0.1)
        return child_id