Python multiprocessing copy-on-write behaving differently between OSX and Ubuntu - python

I'm trying to share objects between the parent and child process in Python. To play around with the idea, I've created a simple Python script:
from multiprocessing import Process
from os import getpid
import psutil
shared = list(range(20000000))
def shared_printer():
    mem = psutil.Process(getpid()).memory_info().rss / (1024 ** 2)
    print(getpid(), len(shared), '{}MB'.format(mem))

if __name__ == '__main__':
    p = Process(target=shared_printer)
    p.start()
    shared_printer()
    p.join()
The code snippet uses the excellent psutil library to print the RSS (Resident Set Size). When I run this on OSX with Python 2.7.15, I get the following output:
(33101, 20000000, '1MB')
(33100, 20000000, '626MB')
When I run the exact same snippet on Ubuntu (Linux 4.15.0-1029-aws #30-Ubuntu SMP x86_64 GNU/Linux), I get the following output:
(4077, 20000000, '632MB')
(4078, 20000000, '629MB')
Notice that the child process's RSS is basically 0MB on OSX, while on Linux it is about the same size as the parent process's RSS. I had assumed that copy-on-write would work the same way on Linux and let the child process refer to the parent process's memory for most pages (except perhaps the ones storing the head of the object).
So I'm guessing that there's some difference in the copy-on-write behavior in the 2 systems. My question is: is there anything I can do in Linux to get that OSX-like copy-on-write behavior?

> So I'm guessing that there's some difference in the copy-on-write behavior in the 2 systems. My question is: is there anything I can do in Linux to get that OSX-like copy-on-write behavior?
The answer is NO. The value behind psutil.Process(getpid()).memory_info().rss / (1024 ** 2) is the same RES field you see in top [PID]: the non-swapped physical memory a task has used, in kB, i.e. RES = CODE + DATA.
IMHO, this means the two OSes use different memory managers, so it is almost impossible to constrain how much memory a process uses or needs. This is an internal matter of the OS.
On Linux the child process has the same size as the parent process: it copies the same stack, code and data, just with a different PCB (Process Control Block). Therefore it cannot get close to 0 the way OSX does. It looks like OSX does not literally copy the code and data; since the code is the same, it keeps pointers to the parent process's data.
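If you want to see how much of that RSS is actually shared with the parent on Linux, a rough check (just a sketch, using psutil's memory_full_info(), which reports USS and PSS on Linux) is to compare RSS with USS:
import psutil
from os import getpid

def print_memory_breakdown():
    # RSS counts every resident page; USS counts only pages unique to this
    # process; PSS splits shared pages proportionally (Linux-only fields).
    info = psutil.Process(getpid()).memory_full_info()
    mb = 1024 ** 2
    print(getpid(), 'RSS={}MB USS={}MB PSS={}MB'.format(
        info.rss // mb, info.uss // mb, info.pss // mb))
A child that really shares the parent's pages will show a large RSS but a small USS.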
P.S.: I hope that helps!

Related

Process getting stuck after being launched from another process

I was working on a specific type of application in Dash which required the action triggered by pressing the button to be performed in a separate process. This process, in turn, was parallelizable and in some cases spawned child processes for efficient computation. The configuration given in this case makes the child processes get stuck. The code below reproduces the situation:
import multiprocessing
import time
import dash
from dash import html
from dash.dependencies import Input, Output
app = dash.Dash(__name__)
app.layout = html.Div([
    html.Button(id='refresh-button', children='Button'),
    html.Div(id='dynamic-container1')
])

def run_function(i):
    print('hello')
    time.sleep(15)
    print(f'hello world {i}')

def run_process():
    num = 1
    print('hello world 00000')
    process = multiprocessing.Process(target=run_function, args=(num,))
    process.start()
    process.join()
    print('hello world')

@app.callback(Output('dynamic-container1', 'children'), Input('refresh-button', 'n_clicks'))
def refresh_state(click):
    if click == 0 or click is None:
        return None
    p = multiprocessing.Process(target=run_process)
    p.start()
    p.join()
    return None

if __name__ == '__main__':
    app.run_server(debug=True)
The output of this application when pressing the button is always the following:
Connected to pydev debugger (build 172.3968.37)
Dash is running on http://127.0.0.1:8050/
* Serving Flask app 'main' (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: on
pydev debugger: process 6884 is connecting
hello world 00000
which means that the first process running run_process() was launched, but the child process running run_function(i) never even started. I tried to find an explanation in popular books on multiprocessing in Python and any guidance on this kind of "chained" process, but to no avail. From my understanding, the new child process running run_function(i) should occupy a separate core (if there is a free one) and not depend on the resources consumed by the parent process running run_process(). Could you please explain the mechanics of this to me? I suspect that in this code run_function(i) might be forced to consume the same resources as run_process(), so the system is essentially preventing the new process from starting on those resources, but I would like to confirm this with more expert Python users.
I used Python 3.7 and PyCharm Community 2017.2.3 on Windows 7 to reproduce this example.
I cannot reproduce your error, and I think that is because you are using Windows (I am on Linux with Python 3.9), so I cannot pin down the error for you, but maybe I can give you some hints:
First: to find the error, try to reduce the code to the core of the problem (you can remove the whole Dash stuff to check whether it is the cause). In my tests the results were the same with and without Dash.
Second: Windows and Linux handle multiprocessing a bit differently:
Windows spawns the process:
The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process object’s run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver.
Unix systems fork the process:
The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
On Unix systems you can also spawn (multiprocessing.set_start_method("spawn")); even with that I can't reproduce your error, but with this fork/spawn example I wanted to make clear that things are sometimes a bit different between Windows and Linux, even when the same packages are used. I think your understanding of multiprocessing is correct. (Maybe this site helps too.)
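To make both platforms behave the same way, you can force the spawn start method explicitly. Here is a minimal sketch outside of Dash (the function names are made up for illustration) that mirrors the nesting in your callback:
import multiprocessing

def work(i):
    print(f'child {i} running')

def launcher():
    # This process starts its own child, mirroring the nesting in the callback.
    p = multiprocessing.Process(target=work, args=(1,))
    p.start()
    p.join()

if __name__ == '__main__':
    # Use the same start method on Linux that Windows uses by default.
    multiprocessing.set_start_method('spawn')
    outer = multiprocessing.Process(target=launcher)
    outer.start()
    outer.join()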
Third: the docs contain some programming guidelines you should be aware of. Maybe they will help, too. And in general, the multiprocessing package does not work well with many interactive Python shells (like IDLE or PyCharm); this may be better in newer versions. Maybe you should try it from a terminal to check whether that changes anything.
I hope this helps a little bit.

Process, memory and network resource tracer

I would like to try to make a process, memory and network resource tracer similar to the one that comes by default in Ubuntu, but for any operating system. Being new to Python, I don't know how to get these values displayed (in the console at first; later I'll turn them into graphics). Which library would make this easiest?
On Linux, you could leverage the /proc filesystem to read the information you need for such a task.
The /proc filesystem is a window into the kernel with a lot of data on each running process. It is exposed as a virtual filesystem, meaning you can access all that information simply by reading and parsing files.
For instance,
from pathlib import Path

proc_root = Path('/proc')
for entry in proc_root.iterdir():
    if not entry.name.isnumeric():
        continue  # ignore directories that aren't processes
    pid = entry.name
    # /proc/<pid>/cmdline separates arguments with NUL bytes
    cmdline = (entry / 'cmdline').read_text().replace('\0', ' ')
    print(f'PROCESS : {pid} : {cmdline}')
This will list all processes running, along with their commandline.
You have a lot of information you can gather in there.
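For example, per-process memory usage can be read the same way. Here is a small sketch (assuming the usual Linux /proc layout; VmRSS is a standard field in /proc/<pid>/status) that returns a process's resident set size:
from pathlib import Path

def rss_kb(pid):
    # Return the resident set size of a process in kB, or None if not found.
    for line in Path(f'/proc/{pid}/status').read_text().splitlines():
        if line.startswith('VmRSS:'):
            # The line looks like: "VmRSS:     12345 kB"
            return int(line.split()[1])
    return None

print(rss_kb(1))  # e.g. the RSS of PID 1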
more info on /proc here

How to kill subprocess of xdg?

I spent a lot of time searching for the answer to my question, but I could not find it.
I run xdg-open, using subprocess, to play a video (I do not want to know which applications are available)
I'm waiting for a while
I want to kill the video
import os
import subprocess
import psutil
from time import sleep
from signal import SIGTERM
file = "test.mkv"
out = subprocess.Popen(['xdg-open', file])
pid = out.pid
print('sleeping...')
sleep(20.0)
print('end of sleep...')
os.kill(pid, SIGTERM) #alternatively: out.terminate()
Unfortunately the last line kills only the xdg-open process. The mplayer process (which was started by xdg-open) is still running.
I tried to get the sub-processes of the xdg by using the following code:
main_process = psutil.Process(pid)
children_processes = main_process.children(recursive=True)
for child in children_processes:
print("child process: ", child.pid, child.name())
but it did not help either. The list was empty.
Has anybody an idea how to kill the player process?
Programs like xdg-open typically look for a suitable program to open a file with, start that program with the file as argument and then exit.
By the time you get around to checking for child processes, xdg-open has probably already exited.
What happens then is OS-dependent. In what follows, I'll be talking about UNIX-like operating systems.
The processes launched by xdg-open will usually get PID 1 as their parent process id (PPID) after xdg-open exits, so it will be practically impossible to find out for certain who started them by looking at the PPID.
But, there will probably be a relatively small number of processes running under your user-ID with PPID 1, so if you list those before and after calling xdg-open and remove all the programs that were in the before-list from the after-list, the program you seek will be in the after-list. Unless your machine is very busy, chances are that there will be only one item in the after-list; the one started by xdg-open.
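A rough sketch of that before/after diff using psutil (the psutil calls are real, but the timing and the approach as a whole are only illustrative) could look like this:
import getpass
import subprocess
import time

import psutil

def orphaned_pids():
    # PIDs of processes owned by the current user whose parent is PID 1.
    user = getpass.getuser()
    return {
        p.pid
        for p in psutil.process_iter(['ppid', 'username'])
        if p.info['ppid'] == 1 and p.info['username'] == user
    }

before = orphaned_pids()
subprocess.Popen(['xdg-open', 'test.mkv'])
time.sleep(2)  # give xdg-open time to hand off to the player and exit
for pid in orphaned_pids() - before:
    print(pid, psutil.Process(pid).name())
    # psutil.Process(pid).terminate() would then stop the player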
Edit 1:
You commented:
I want to make the app OS independent.
All operating systems that support xdg-open are basically UNIX-like operating systems. If you use the psutil Python module to get process information, you can run your "app" on all the systems that psutil supports:
Linux
macOS
FreeBSD, OpenBSD, NetBSD
Sun Solaris
AIX
(psutil even works on ms-windows, but I kind of doubt you will find xdg-open there...)

Multiprocessing efficiency confusion

I'm running a Python job utilizing the Multiprocessing package and here's the issue. When I run with 3 processors on my dual-core hyper-threaded laptop I hit 100% CPU usage in each core no problem. I also have a workstation with 6 cores, hyper-threaded, and when I run the same script on that machine each core barely breaks 30%. Can someone explain why this is? I was thinking it was I/O but if that's the case then my laptop shouldn't be utilized 100%, right?
Code below with short explanation:
MultiprocessingPoolWithState is a custom class that fires up N_Workers workers and gives each of them a copy of a dataframe (so that the df isn't shipped over the wire to each worker for every operation), and tups is a list of tuples used as the slicing criteria for each operation that process_data() performs.
Here's an example of the code:
import multiprocessing as mp

config = dict()
N_Workers = mp.cpu_count() - 1

def process_data(tup):
    global config
    df = config['df']
    id1 = tup[0]
    id2 = tup[1]
    df_want = df.loc[(df.col1 == id1) & (df.col2 == id2)]
    """ DO STUFF """
    return series_i_want

pool = MultiprocessingPoolWithState(n=N_Workers, state=df)
results = pool.map(process_data, tups)
I'm not sure what other details anyone would need so I'll add what I can (I can't give the custom class as it's not mine but a co-worker's). The main thing is that my laptop maxes out cpu usage but my workstation doesn't.
For those who might be curious, I think I've figured it out (although this answer won't be highly technical). Inside """ DO STUFF """ I call statsmodels.x13.x13_arima_analysis(), a Python wrapper for X13-ARIMA-SEATS, the seasonal adjustment program the US Census Bureau publishes for seasonally adjusting time series (like sales records). I only had one copy of the x13.exe that the wrapper calls, so on my laptop (Windows 10) I think the OS effectively created copies of the .exe for each process, whereas on our server (Windows Server 2012 R2) each process waited in line for its own chance at using the .exe. In other words, on the server I had an I/O bottleneck that my laptop handled automatically. The solution was simple: give each process a .exe with a unique path to use, and the program became roughly 300 times faster than before.
I do not understand exactly how the two OSes handled multiple processes hitting the same .exe differently, but that's my theory, and it seemed to be confirmed when I added the process-dependent paths.
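A sketch of that kind of workaround (not my exact code; it assumes statsmodels' x13_arima_analysis accepts an x12path argument pointing at the binary, and the source path is a placeholder) could look like:
import os
import shutil
import tempfile

from statsmodels.tsa import x13

X13_SOURCE = r'C:\x13\x13as.exe'  # placeholder path to the single shared binary

def seasonally_adjust(series):
    # Copy the binary to a per-process location so the workers don't all
    # contend for the same file on disk.
    workdir = tempfile.mkdtemp(prefix='x13_{}_'.format(os.getpid()))
    private_exe = shutil.copy(X13_SOURCE, workdir)
    result = x13.x13_arima_analysis(series, x12path=private_exe)
    return result.seasadj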

How to start daemon process from python on windows?

Can my python script spawn a process that will run indefinitely?
I'm not too familiar with Python, nor with spawning daemons, so I came up with this:
si = subprocess.STARTUPINFO()
si.dwFlags = subprocess.CREATE_NEW_PROCESS_GROUP | subprocess.CREATE_NEW_CONSOLE
subprocess.Popen(executable, close_fds = True, startupinfo = si)
The process continues to run past python.exe, but is closed as soon as I close the cmd window.
Using the answer Janne Karila pointed out, this is how you can run a process that doesn't die when its parent dies; there is no need to use the win32process module.
DETACHED_PROCESS = 8
subprocess.Popen(executable, creationflags=DETACHED_PROCESS, close_fds=True)
DETACHED_PROCESS is a Process Creation Flag that is passed to the underlying CreateProcess function.
This question was asked 3 years ago, and though the fundamental details of the answer haven't changed, given its prevalence in "Windows Python daemon" searches, I thought it might be helpful to add some discussion for the benefit of future Google arrivees.
There are really two parts to the question:
Can a Python script spawn an independent process that will run indefinitely?
Can a Python script act like a Unix daemon on a Windows system?
The answer to the first is an unambiguous yes: as already pointed out, using subprocess.Popen with the creationflags=subprocess.CREATE_NEW_PROCESS_GROUP keyword will suffice:
import subprocess

independent_process = subprocess.Popen(
    'python /path/to/file.py',
    creationflags=subprocess.CREATE_NEW_PROCESS_GROUP
)
Note that, at least in my experience, CREATE_NEW_CONSOLE is not necessary here.
That being said, the behavior of this strategy isn't quite the same as what you'd expect from a Unix daemon. What constitutes a well-behaved Unix daemon is better explained elsewhere, but to summarize:
Close open file descriptors (typically all of them, but some applications may need to protect some descriptors from closure)
Change the working directory for the process to a suitable location to prevent "Directory Busy" errors
Change the file access creation mask (os.umask in the Python world)
Move the application into the background and make it dissociate itself from the initiating process
Completely divorce from the terminal, including redirecting STDIN, STDOUT, and STDERR to different streams (often DEVNULL), and prevent reacquisition of a controlling terminal
Handle signals, in particular, SIGTERM.
The reality of the situation is that Windows, as an operating system, really doesn't support the notion of a daemon: applications that start from a terminal (or in any other interactive context, including launching from Explorer, etc) will continue to run with a visible window, unless the controlling application (in this example, Python) has included a windowless GUI. Furthermore, Windows signal handling is woefully inadequate, and attempts to send signals to an independent Python process (as opposed to a subprocess, which would not survive terminal closure) will almost always result in the immediate exit of that Python process without any cleanup (no finally:, no atexit, no __del__, etc).
Rolling your application into a Windows service, though a viable alternative in many cases, also doesn't quite fit. The same is true of using pythonw.exe (a windowless version of Python that ships with all recent Windows Python binaries). In particular, they fail to improve the situation for signal handling, and they cannot easily launch an application from a terminal and interact with it during startup (for example, to deliver dynamic startup arguments to your script, say, perhaps, a password, file path, etc), before "daemonizing". Additionally, Windows services require installation, which -- though perfectly possible to do quickly at runtime when you first call up your "daemon" -- modifies the user's system (registry, etc), which would be highly unexpected if you're coming from a Unix world.
In light of that, I would argue that launching a pythonw.exe subprocess using subprocess.CREATE_NEW_PROCESS_GROUP is probably the closest Windows equivalent for a Python process to emulate a traditional Unix daemon. However, that still leaves you with the added challenge of signal handling and startup communications (not to mention making your code platform-dependent, which is always frustrating).
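As a sketch of that suggestion (the interpreter and script paths are placeholders; subprocess.DETACHED_PROCESS is available as a named constant on Python 3.7+, otherwise use its value 0x00000008):
import subprocess

DETACHED_PROCESS = 0x00000008  # same value as subprocess.DETACHED_PROCESS on 3.7+

proc = subprocess.Popen(
    [r'C:\Python39\pythonw.exe', r'C:\path\to\daemon_script.py'],  # placeholders
    creationflags=DETACHED_PROCESS | subprocess.CREATE_NEW_PROCESS_GROUP,
    stdin=subprocess.DEVNULL,
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
    close_fds=True,
)
print('started detached process with PID', proc.pid)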
That all being said, for anyone encountering this problem in the future, I've rolled a library called daemoniker that wraps both proper Unix daemonization and the above strategy. It also implements signal handling (for both Unix and Windows systems), and allows you to pass objects to the "daemon" process using pickle. Best of all, it has a cross-platform API:
from daemoniker import Daemonizer

with Daemonizer() as (is_setup, daemonizer):
    if is_setup:
        # This code is run before daemonization.
        do_things_here()

    # We need to explicitly pass resources to the daemon; other variables
    # may not be correct
    is_parent, my_arg1, my_arg2 = daemonizer(
        path_to_pid_file,
        my_arg1,
        my_arg2
    )

    if is_parent:
        # Run code in the parent after daemonization
        parent_only_code()

# We are now daemonized, and the parent just exited.
code_continues_here()
For that purpose you could daemonize your Python process or, since you are in a Windows environment, run it as a Windows service.
I don't like posting only web links, but for more information relevant to your requirement:
A simple way to implement a Windows Service. Read all the comments; they will resolve any doubts.
If you really want to learn more,
first read this:
what is daemon process or creating-a-daemon-the-python-way
Update:
subprocess is not the right way to achieve this kind of thing.
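If you go the Windows-service route, a minimal pywin32 skeleton (the class and service names here are made up for illustration) typically looks something like this:
import win32event
import win32service
import win32serviceutil

class MyDaemonService(win32serviceutil.ServiceFramework):
    # Hypothetical names; pick your own service name and display name.
    _svc_name_ = 'MyDaemonService'
    _svc_display_name_ = 'My Python Daemon Service'

    def __init__(self, args):
        win32serviceutil.ServiceFramework.__init__(self, args)
        # Event used to signal the service to stop.
        self.stop_event = win32event.CreateEvent(None, 0, 0, None)

    def SvcStop(self):
        self.ReportServiceStatus(win32service.SERVICE_STOP_PENDING)
        win32event.SetEvent(self.stop_event)

    def SvcDoRun(self):
        # Main loop: do one unit of work, then wait up to 5 seconds for a stop request.
        while win32event.WaitForSingleObject(self.stop_event, 5000) != win32event.WAIT_OBJECT_0:
            pass  # replace with the actual background work

if __name__ == '__main__':
    # Supports install, start, stop, remove and debug from the command line.
    win32serviceutil.HandleCommandLine(MyDaemonService)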
