I want to write a Python script that checks every minute whether some pre-defined process is still running on a Linux machine, and if it isn't, prints a timestamp of when it crashed. I have written a script that does exactly that, but unfortunately it works correctly with only one process.
This is my code:
import subprocess
import shlex
import time
from datetime import datetime
proc_def = "top"
grep_cmd = "pgrep -a " + proc_def
try:
    proc_run = subprocess.check_output(shlex.split(grep_cmd)).decode('utf-8')
    proc_run = proc_run.strip().split('\n')
    '''
    Creating a dictionary with key the PID of the process and value
    the command line
    '''
    proc_dict = dict(zip([i.split(' ', 1)[0] for i in proc_run],
                         [i.split(' ', 1)[1] for i in proc_run]))
    check_run = "ps -o pid= -p "
    for key, value in proc_dict.items():
        check_run_cmd = check_run + key
        try:
            # While the output of check_run_cmd isn't an empty line, do
            while subprocess.check_output(
                    shlex.split(check_run_cmd)
                    ).decode('utf-8').strip():
                # This print statement is for debugging purposes only
                print("Running")
                time.sleep(3)
            '''
            If check_run_cmd returns an error, show the time and date
            of the crash as well as the PID and the command line
            '''
        except subprocess.CalledProcessError as e:
            print(f"PID: {key} of command: \"{value}\" stopped "
                  f"at {datetime.now().strftime('%d-%m-%Y %T')}")
            exit(1)
# Check if the proc_def is actually running on the machine
except subprocess.CalledProcessError as e:
    print(f"The \"{proc_def}\" command isn't running on this machine")
For example, if there are two top processes, it shows me information about the crash time of only one of them and then exits. I want it to stay active as long as another process is still running, and exit only when both processes have been killed, reporting when each of them crashed.
It should also not be limited to two processes only, but support any number of processes started with the same proc_def command.
You have to change the logic a bit: basically you want an infinite loop that alternates a check on all the processes, rather than checking the same one over and over:
import subprocess
import shlex
import time
from datetime import datetime
proc_def = "top"
grep_cmd = "pgrep -a " + proc_def
try:
    proc_run = subprocess.check_output(shlex.split(grep_cmd)).decode('utf-8')
    proc_run = proc_run.strip().split('\n')
    '''
    Creating a dictionary with key the PID of the process and value
    the command line
    '''
    proc_dict = dict(zip([i.split(' ', 1)[0] for i in proc_run],
                         [i.split(' ', 1)[1] for i in proc_run]))
    check_run = "ps -o pid= -p "
    while proc_dict:
        for key, value in proc_dict.items():
            check_run_cmd = check_run + key
            try:
                # Raises CalledProcessError if the PID no longer exists
                subprocess.check_output(shlex.split(check_run_cmd)).decode('utf-8').strip()
                # This print statement is for debugging purposes only
                print("Running")
                time.sleep(3)
            except subprocess.CalledProcessError as e:
                print(f"PID: {key} of command: \"{value}\" stopped at {datetime.now().strftime('%d-%m-%Y %T')}")
                del proc_dict[key]
                break
# Check if the proc_def is actually running on the machine
except subprocess.CalledProcessError as e:
    print(f"The \"{proc_def}\" command isn't running on this machine")
This suffers from the same problems as the original code: the time resolution is 3 seconds, and a process started while this script is running won't be picked up (though this may be desired).
The first problem can be fixed by sleeping for less time, depending on what you need; the second by re-running the initial lines that create proc_dict inside the while loop.
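For illustration, a minimal sketch of that second fix, re-running pgrep on every pass so newly started processes are also tracked while each crash is still reported (the 3-second sleep is kept from the code above):

import subprocess
import shlex
import time
from datetime import datetime

proc_def = "top"
known = {}  # PID -> command line

while True:
    try:
        out = subprocess.check_output(shlex.split("pgrep -a " + proc_def)).decode('utf-8')
        current = dict(line.split(' ', 1) for line in out.strip().split('\n'))
    except subprocess.CalledProcessError:
        current = {}  # pgrep exits non-zero when nothing matches
    # Any PID we knew about that is no longer listed has stopped
    for pid, cmdline in known.items():
        if pid not in current:
            print(f"PID: {pid} of command: \"{cmdline}\" stopped "
                  f"at {datetime.now().strftime('%d-%m-%Y %T')}")
    known = current
    if not known:
        break
    time.sleep(3)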
What is the most pythonic syntax for getting subprocess to successfully manage the running of the following CLI command, which can take a long time to complete?
CLI Command:
The CLI command that subprocess must run is:
az resource invoke-action --resource-group someRG --resource-type Microsoft.VirtualMachineImages/imageTemplates -n somename78686786976 --action Run
The CLI command runs for a long time, for example 11 minutes in this case, but possibly longer at other times.
When run manually from the terminal, the terminal prints the following while the command is waiting to hear back that it has succeeded:
\ Running
The \ spins around while the command runs.
The response that is eventually given back when the command finally succeeds is the following JSON:
{
    "endTime": "2022-06-23T02:54:02.6811671Z",
    "name": "long-alpha-numerica-string-id",
    "startTime": "2022-06-23T02:43:39.2933333Z",
    "status": "Succeeded"
}
CURRENT PYTHON CODE:
The current Python code we are using to run the above command from within a Python program is as follows:
def getJsonResponse(self, cmd, counter=0):
    process = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, text=True)
    data = process.stdout
    err = process.stderr
    logString = "data string is: " + data
    print(logString)
    logString = "err is: " + str(err)
    print(logString)
    logString = "process.returncode is: " + str(process.returncode)
    print(logString)
    if process.returncode == 0:
        print(str(data))
        return data
    else:
        if counter < 11:
            counter += 1
            logString = "Attempt " + str(counter) + " out of 10. "
            print(logString)
            import time
            time.sleep(30)
            data = self.getJsonResponse(cmd, counter)
            return data
        else:
            logString = "Error: " + str(err)
            print(logString)
            logString = "Error: Return Code is: " + str(process.returncode)
            print(logString)
            logString = "ERROR: Failed to return Json response. Halting the program so that you can debug the cause of the problem."
            quit(logString)
            sys.exit(1)
CURRENT PROBLEM:
The problem is that our current Python code reports a process.returncode of 1 and then recursively calls the function again and again while the CLI command is still running, instead of simply reporting that the CLI command is still running.
Our current recursive approach also does not take into account what has actually happened since the CLI command was first called; it just blindly repeats up to 10 times over up to 5 minutes, when the actual process might take 10 to 20 minutes to complete.
What is the most pythonic way to rewrite the above code so that it gracefully reports that the CLI command is running for however long it takes to complete, and then returns the JSON given above when the command finally completes?
I'm not sure if my code is pythonic, but I think it's better to run it with Popen.
I can't test the CLI command you need to execute, so I replaced it with the netstat command, which takes a long time to respond.
import subprocess
import time
def getJsonResponse(cmd):
    process = subprocess.Popen(
        cmd,
        encoding='utf-8',
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    while True:
        returncode = process.poll()
        if returncode is None:
            # The process is still running.
            # You can report on it here every time some interval elapses, as needed.
            # print("running process")
            time.sleep(0.01)
            data = process.stdout
            if data:
                # If there is any response, handle it here.
                # Use readline() or readlines() as appropriate, depending
                # on how the process responds.
                msg_line = data.readline()
                print(msg_line)
            err = process.stderr
            if err:
                # If there is any error response, handle it here.
                # (Note: process.stderr is None unless stderr=subprocess.PIPE
                # is also passed to Popen above.)
                msg_line = err.readline()
                print(msg_line)
        else:
            print(returncode)
            break
    # Handle whatever should happen after the process ends.
    print("terminate process")

getJsonResponse(cmd=['netstat', '-a'])
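As a follow-up, since the goal is ultimately to get the JSON back, here is a minimal sketch of the blocking alternative: subprocess.run() simply waits for the CLI command to finish, however long it takes, bounded by an explicit timeout (the one-hour value is an assumption; the az command line is the one from the question; requires Python 3.7+ for capture_output):

import json
import subprocess

def invoke_action(cmd, timeout_seconds=3600):
    # Blocks until the CLI command exits (or the assumed timeout expires),
    # so no polling/retry loop is needed.
    process = subprocess.run(
        cmd, shell=True, capture_output=True, text=True,
        timeout=timeout_seconds,
    )
    if process.returncode != 0:
        raise RuntimeError("Command failed: " + process.stderr)
    return json.loads(process.stdout)  # e.g. {"status": "Succeeded", ...}

result = invoke_action(
    "az resource invoke-action --resource-group someRG "
    "--resource-type Microsoft.VirtualMachineImages/imageTemplates "
    "-n somename78686786976 --action Run"
)
print(result["status"])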
Why subprocess.PIPE prevents a called executable from closing.
I use the following script to call an executable file with a number of inputs:
import subprocess, time
CREATE_NO_WINDOW = 0x08000000
my_proc = subprocess.Popen("myApp.exe " + ' '.join([str(input1), str(input2), str(input3)]),
startupinfo=subprocess.STARTUPINFO(), stdout=subprocess.PIPE,
creationflags = CREATE_NO_WINDOW)
Then I monitor whether the application has finished within a given time (300 seconds) and, if not, I just kill it. I also read the output of the application to know whether it failed in doing the required tasks.
proc_wait_time = 300
start_time = time.time()
sol_status = 'Fail'
while time.time() - start_time < proc_wait_time:
    if my_proc.poll() is None:
        time.sleep(1)
    else:
        try:
            sol_status = my_proc.stdout.read().replace('\r\n \r\n', '')
            break
        except:
            sol_status = 'Fail'
            break
else:
    try:
        my_proc.kill()
    except:
        pass
    sol_status = 'Frozen'
if sol_status in ['Fail', 'Frozen']:
    print('Failed running my_proc')
As you can note from the code, I need to wait for myApp.exe to finish; however, sometimes myApp.exe freezes. Since the script above is part of a loop, I need to identify such a situation (by a timer), keep track of it, and kill myApp.exe so that the whole script doesn't get stuck!
Now, the issue is that if I use subprocess.PIPE (which I suppose I have to if I want to read the output of the application), then myApp.exe doesn't close after finishing, and consequently my_proc.poll() is None is always True.
I am using Python 2.7.
There is an OS pipe buffer limit when large amounts of data are written to subprocess.PIPE: once the buffer fills up, the child process blocks on its next write and never exits. The easiest way to fix it is to pipe the data directly into a file:
_stdoutHandler = open('C:/somePath/stdout.log', 'w')
_stderrHandler = open('C:/somePath/stderr.log', 'w')
my_proc = subprocess.Popen(
    "myApp.exe " + ' '.join([str(input1), str(input2), str(input3)]),
    stdout=_stdoutHandler,
    stderr=_stderrHandler,
    startupinfo=subprocess.STARTUPINFO(),
    creationflags=CREATE_NO_WINDOW
)
...
_stdoutHandler.close()
_stderrHandler.close()
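Alternatively, if you do want the output in memory rather than in a file, a minimal sketch using communicate(), which drains both pipes while waiting and therefore avoids the full-buffer deadlock (assumes Python 3.3+ for the timeout argument; the question mentions 2.7, where a watchdog thread or the file approach above would be needed; input1..input3 are the question's variables):

import subprocess

# same command line as in the question; input1..input3 defined elsewhere
cmd = "myApp.exe " + ' '.join([str(input1), str(input2), str(input3)])
my_proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
try:
    # communicate() reads both pipes while waiting, so the child can
    # never block on a full pipe buffer
    sol_status, err = my_proc.communicate(timeout=300)
except subprocess.TimeoutExpired:
    my_proc.kill()
    my_proc.communicate()  # reap the killed process
    sol_status = 'Frozen'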
It is explained in https://stackoverflow.com/a/18422264/7238575 how one can run a subprocess and read out the results live. However, it looks like it creates a file named test.log to do so. This makes me worry that if multiple scripts use this trick in the same directory, the test.log file might well be corrupted. Is there a way that does not require a file to be created outside Python? Or can we make sure that each process uses a unique log file? Or am I completely misunderstanding the situation, and is there no risk of simultaneous writes by different programs to the same test.log file?
You don't need to write the live output to a file. You can simply write it to STDOUT with sys.stdout.write("your message").
On the other hand you can generate unique log files for each process:
import os
import psutil
pid = psutil.Process(os.getpid())
process_name = pid.name()
path, extension = os.path.splitext(os.path.join(os.getcwd(), "my_basic_log_file.log"))
created_log_file_name = "{0}_{1}{2}".format(path, process_name, extension)
print(created_log_file_name)
Output:
>>> python3 test_1.py
/home/my_user/test_folder/my_basic_log_file_python3.log
As you can see in the above example, my process name was python3, so this process name was inserted into the "basic" log file name. With this solution you can create unique log files for your processes.
You can set your process name with setproctitle.setproctitle("my_process_name").
Here is an example.
import os
import psutil
import setproctitle
setproctitle.setproctitle("milan_balazs")
pid = psutil.Process(os.getpid())
process_name = pid.name()
path, extension = os.path.splitext(os.path.join(os.getcwd(), "my_basic_log_file.log"))
created_log_file_name = "{0}_{1}{2}".format(path, process_name, extension)
print(created_log_file_name)
Output:
>>> python3 test_1.py
/home/my_user/test_folder/my_basic_log_file_milan_balazs.log
Previously, I wrote a fairly complex and safe command caller that can produce live output (not to a file). You can check it:
import sys
import os
import subprocess
import select
import errno
def poll_command(process, realtime):
    """
    Watch for error or output from the process
    :param process: the process, running the command
    :param realtime: flag if realtime logging is needed
    :return: Return STDOUT and return code of the command processed
    """
    coutput = ""
    poller = select.poll()
    poller.register(process.stdout, select.POLLIN)
    fdhup = {process.stdout.fileno(): 0}
    while sum(fdhup.values()) < len(fdhup):
        try:
            r = poller.poll(1)
        except select.error as err:
            if not err.args[0] == errno.EINTR:
                raise
            r = []
        for fd, flags in r:
            if flags & (select.POLLIN | select.POLLPRI):
                c = version_conversion(fd, realtime)
                coutput += c
            else:
                fdhup[fd] = 1
    return coutput.strip(), process.poll()

def version_conversion(fd, realtime):
    """
    There are some differences between Python 2 and 3, so this conversion is needed.
    """
    c = os.read(fd, 4096)
    if sys.version_info >= (3, 0):
        c = c.decode("ISO-8859-1")
    if realtime:
        sys.stdout.write(c)
        sys.stdout.flush()
    return c

def exec_shell(command, real_time_out=False):
    """
    Call commands.
    :param command: Command line.
    :param real_time_out: If this variable is True, the output of the command is logged in real time
    :return: Return STDOUT and return code of the command processed.
    """
    if not command:
        print("Command is not available.")
        return None, None
    print("Executing '{}'".format(command))
    rtoutput = real_time_out
    p = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, return_code = poll_command(p, rtoutput)
    if p.poll():
        error_msg = "Return code: {ret_code} Error message: {err_msg}".format(
            ret_code=return_code, err_msg=out
        )
        print(error_msg)
    else:
        print("[OK] - The command calling was successful. CMD: '{}'".format(command))
    return out, return_code

exec_shell("echo test running", real_time_out=True)
Output:
>>> python3 test.py
Executing 'echo test running'
test running
[OK] - The command calling was successful. CMD: 'echo test running'
I hope my answer answers your question! :)
I am running a Python script that may or may not take a few hours to complete.
At the beginning of my Python script, I want to check whether this script is already running or not.
If it is already running, I want the instance I just started to exit.
For example:
python started at 1AM and keeps on running until 3AM
started another one at 2AM without knowing it is already running
I want my 2AM python to check and exit, since it is already running.
How can I write this in Python?
This is what I tried for locking:
try:
    l = lock.lock("/home/auto.py", timeout=600)  # wait at most 10 minutes
except error.LockHeld:
    e = sys.exc_info()[0]
    logging.error("Error: " + str(e) + " at main gathering Stats")
    smtpObj.sendmail(sender, receivers, message + "Error: " + str(e) + " at main gathering stats")
    exit("Fail: " + str(e) + " at main gathering Stats")
else:
    l.release()
So I thought this would wait for up to 10 minutes: if the other instance is still running, exit; if it is not running anymore, run the current script.
You can try using the lockfile-create command with the -r flag to retry a specified number of times, catching a CalledProcessError and exiting; the -p flag will store the pid of the process:
import os
import sys
from time import sleep
from subprocess import check_call, CalledProcessError
try:
    check_call(["lockfile-create", "-q", "-p", "-r", "0", "-l", "my.lock"])
except CalledProcessError as e:
    print("{} is already running".format(sys.argv[0]))
    print(e.returncode)
    exit(1)

# main body
for i in range(10):
    sleep(2)
    print(1)

check_call(["rm", "-f", "my.lock"])
Running the script above while another instance is already running outputs the following:
$ python lock.py
lock.py is already running
4
Options
-q, --quiet
Suppress any output. Success or failure will only be indicated by the exit status.
-v, --verbose
Enable diagnostic output.
-l, --lock-name
Do not append .lock to the filename. This option applies to lockfile-create, lockfile-remove, lockfile-touch, or lockfile-check.
-p, --use-pid
Write the current process id (PID) to the lockfile whenever a lockfile is created, and use that pid when checking a lock's validity. See the lockfile_create(3) manpage for more information. This option applies to lockfile-create, lockfile-remove, lockfile-touch, and lockfile-check.
-o, --oneshot
Touch the lock and exit immediately. This option applies to lockfile-touch and mail-touchlock. When not provided, these commands will run forever, touching the lock once every minute until killed.
-r retry-count, --retry retry-count
Try to lock filename retry-count times before giving up. Each attempt will be delayed a bit longer than the last (in 5 second increments) until reaching a maximum delay of one minute between retries. If retry-count is unspecified, the default is 9 which will give up after 180 seconds (3 minutes) if all 9 lock attempts fail.
Description
The lockfile_create function creates a lockfile in an NFS safe way.
If flags is set to L_PID then lockfile_create will not only check for an existing lockfile, but it will read the contents as well to see if it contains a process id in ASCII. If so, the lockfile is only valid if that process still exists.
If the lockfile is on a shared filesystem, it might have been created by a process on a remote host. Thus the process-id checking is useless and the L_PID flag should not be set. In this case, there is no good way to see if a lockfile is stale. Therefore, if the lockfile is older than 5 minutes, it will be removed. That is why the lockfile_touch function is provided: while holding the lock, it needs to be refreshed regularly (every minute or so) by calling lockfile_touch ().
The lockfile_check function checks if a valid lockfile is already present without trying to create a new lockfile.
Finally the lockfile_remove function removes the lockfile.
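Because a lock on a shared filesystem older than 5 minutes may be treated as stale, a long-running holder has to keep refreshing it. A minimal sketch of that pattern (the my.lock name matches the example above), keeping lockfile-touch running in the background for the duration of the main body:

from subprocess import Popen, check_call

check_call(["lockfile-create", "-q", "-p", "-r", "0", "-l", "my.lock"])
# lockfile-touch runs forever, touching the lock once a minute, until killed
toucher = Popen(["lockfile-touch", "-l", "my.lock"])
try:
    pass  # main body goes here
finally:
    toucher.kill()
    check_call(["lockfile-remove", "-l", "my.lock"])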
The Algorithm
The algorithm that is used to create a lockfile in an atomic way, even over NFS, is as follows:
1. A unique file is created. In printf format, the name of the file is .lk%05d%x%s. The first argument (%05d) is the current process id. The second argument (%x) consists of the 4 minor bits of the value returned by time(2). The last argument is the system hostname.
2. Then the lockfile is created using link(2). The return value of link is ignored.
3. Now the lockfile is stat()ed. If the stat fails, we go to step 6.
4. The stat value of the lockfile is compared with that of the temporary file. If they are the same, we have the lock. The temporary file is deleted and a value of 0 (success) is returned to the caller.
5. A check is made to see if the existing lockfile is a valid one. If it isn't valid, the stale lockfile is deleted.
6. Before retrying, we sleep for n seconds. n is initially 5 seconds, but after every retry 5 extra seconds is added up to a maximum of 60 seconds (an incremental backoff). Then we go to step 2 up to retries times.
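For illustration, a rough Python sketch of steps 1 through 4 (the happy path only; the staleness check and retry backoff of steps 5 and 6 are omitted), relying on the fact that link(2) is atomic even over NFS:

import os
import socket
import time

def try_lock(lockfile):
    # Step 1: create a unique temporary file
    tmp = ".lk%05d%x%s" % (os.getpid(), int(time.time()) & 0xf, socket.gethostname())
    with open(tmp, "w") as f:
        f.write(str(os.getpid()))
    try:
        # Step 2: attempt the link; its return value is ignored
        try:
            os.link(tmp, lockfile)
        except OSError:
            pass
        # Steps 3-4: we hold the lock iff the lockfile is our temp file,
        # i.e. the temp file's inode now has two links
        return os.stat(tmp).st_nlink == 2
    finally:
        os.remove(tmp)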
There seems to be an equivalent package called lockfile-progs on redhat.
On mac you could use lockfile and do something like:
import os
import sys
from time import sleep
from subprocess import Popen, CalledProcessError, check_call

p = Popen(["lockfile", "-r", "0", "my.lock"])
p.wait()
if p.returncode == 0:
    with open("my.pid", "w") as f:
        f.write(str(os.getpid()))
else:
    try:
        with open("my.pid") as f:
            # see if the process is still running or the lockfile
            # is left over from a previous run.
            r = f.read()
            check_call(["kill", "-0", "{}".format(r)])
    except CalledProcessError:
        # remove the old lock file and create a new one
        check_call(["rm", "-f", "my.lock"])
        check_call(["lockfile", "-r", "0", "my.lock"])
        # update pid
        with open("my.pid", "w") as out:
            out.write(str(os.getpid()))
        print("Deleted stale lockfile.")
    else:
        print("{} is already running".format(sys.argv[0]))
        print(p.returncode)
        exit(1)

# main body
for i in range(10):
    sleep(1)
    print(1)

check_call(["rm", "-f", "my.lock"])
In your case maybe using a socket would work:
from socket import socket, gethostname, error, SO_REUSEADDR, SOL_SOCKET
from sys import argv
import errno

# Create a socket object
sock = socket()
host = gethostname()
# /proc/sys/net/ipv4/ip_local_port_range is 32768 61000 on my Ubuntu machine
port = 60001
# allow connection in TIME_WAIT
sock.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
try:
    sock.bind((host, port))
    sock.connect((host, port))
except error as e:
    # [Errno 99] Cannot assign requested address
    if e.errno == errno.EADDRNOTAVAIL:
        print("{} is already running".format(argv[0]))
        exit(1)
    # else raise the error
    else:
        raise e

# main body
from time import sleep
while True:
    print(1)
    sleep(2)
sock.close()
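On Linux specifically, a variation of the same idea (not from the original answer, just a sketch) is to bind a Unix domain socket in the abstract namespace; no port number is needed and the lock vanishes automatically when the process dies:

import socket
import sys

# The leading "\0" puts the name in the abstract namespace (Linux only);
# the name "auto_py_lock" itself is an arbitrary choice.
lock_socket = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
try:
    lock_socket.bind("\0auto_py_lock")
except OSError:
    print("{} is already running".format(sys.argv[0]))
    sys.exit(1)
# keep lock_socket referenced for the lifetime of the script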
I'm writing some monitoring scripts in Python and I'm trying to find the cleanest way to get the process ID of any random running program given the name of that program
something like
ps -ef | grep MyProgram
I could parse the output of that; however, I thought there might be a better way in Python.
From the standard library (note that this gives the current process's own ID):
os.getpid()
If you are not limiting yourself to the standard library, I like psutil for this.
For instance to find all "python" processes:
>>> import psutil
>>> [p.info for p in psutil.process_iter(attrs=['pid', 'name']) if 'python' in p.info['name']]
[{'name': 'python3', 'pid': 21947},
{'name': 'python', 'pid': 23835}]
Try pgrep. Its output format is much simpler and therefore easier to parse.
Also:
Python: How to get PID by process name?
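If you do go the pgrep route from Python, a minimal sketch (requires Python 3.7+ for capture_output; "python" is just an example name):

import subprocess

def pgrep(name):
    # pgrep exits non-zero when nothing matches, so check stdout instead
    result = subprocess.run(["pgrep", name], capture_output=True, text=True)
    return [int(pid) for pid in result.stdout.split()]

print(pgrep("python"))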
An adaptation of the previously posted answers:

def getpid(process_name):
    import os
    return [item.split()[1] for item in os.popen('tasklist').read().splitlines()[4:] if process_name in item.split()]

>>> getpid('cmd.exe')
['6560', '3244', '9024', '4828']
With psutil:
(can be installed with [sudo] pip install psutil)
import psutil

# Get current process pid
current_process_pid = psutil.Process().pid
print(current_process_pid)  # e.g. 12971

# Get pids by program name
program_name = 'chrome'
process_pids = [process.pid for process in psutil.process_iter() if process.name() == program_name]
print(process_pids)  # e.g. [1059, 2343, ..., ..., 9645]
For Windows
A way to get all the pids of running programs on your computer without downloading any modules:
import os

pids = []
a = os.popen("tasklist").readlines()
for x in a:
    try:
        pids.append(int(x[29:34]))
    except:
        pass

for each in pids:
    print(each)
If you just wanted one program or all programs with the same name and you wanted to kill the process or something:
import os, sys, win32api

tasklistrl = os.popen("tasklist").readlines()
tasklistr = os.popen("tasklist").read()
print(tasklistr)

def kill(process):
    process_exists_forsure = False
    gotpid = False
    for examine in tasklistrl:
        if process == examine[0:len(process)]:
            process_exists_forsure = True
    if process_exists_forsure:
        print("That process exists.")
    else:
        print("That process does not exist.")
        raw_input()
        sys.exit()
    for getpid in tasklistrl:
        if process == getpid[0:len(process)]:
            pid = int(getpid[29:34])
            gotpid = True
            try:
                handle = win32api.OpenProcess(1, False, pid)
                win32api.TerminateProcess(handle, 0)
                win32api.CloseHandle(handle)
                print("Successfully killed process %s on pid %d." % (getpid[0:len(process)], pid))
            except win32api.error as err:
                print(err)
                raw_input()
                sys.exit()
    if not gotpid:
        print("Could not get process pid.")
        raw_input()
        sys.exit()
    raw_input()
    sys.exit()

prompt = raw_input("Which process would you like to kill? ")
kill(prompt)
That was just a paste of my process-kill program. I could make it a whole lot better, but it is okay.
For POSIX (Linux, BSD, etc.; you only need the /proc directory to be mounted) it's easier to work with the files in /proc.
It works on Python 2 and 3. (The only difference is the exception tree, hence the bare "except Exception", which I dislike but kept to maintain compatibility. A custom exception could also be created.)
#!/usr/bin/env python

import os
import sys

for dirname in os.listdir('/proc'):
    if dirname == 'curproc':
        continue
    try:
        with open('/proc/{}/cmdline'.format(dirname), mode='rb') as fd:
            content = fd.read().decode().split('\x00')
    except Exception:
        continue
    for i in sys.argv[1:]:
        if i in content[0]:
            # dirname is also the number of PID
            print('{0:<12} : {1}'.format(dirname, ' '.join(content)))
Sample Output (it works like pgrep):
phoemur ~/python $ ./pgrep.py bash
1487 : -bash
1779 : /bin/bash
This is a simplified variation of Fernando's answer. This is for Linux and either Python 2 or 3. No external library is needed, and no external process is run.
import glob

def get_command_pid(command):
    for path in glob.glob('/proc/*/comm'):
        if open(path).read().rstrip() == command:
            return path.split('/')[2]
Only the first matching process found will be returned, which works well for some purposes. To get the PIDs of all matching processes, you could just replace the return with yield, as sketched below, and then get a list with pids = list(get_command_pid(command)).
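For example, the generator version would look like this (renamed get_command_pids here for clarity):

import glob

def get_command_pids(command):
    # yields the PID of every process whose /proc/<pid>/comm matches exactly
    for path in glob.glob('/proc/*/comm'):
        if open(path).read().rstrip() == command:
            yield path.split('/')[2]

pids = list(get_command_pids('bash'))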
Alternatively, as a single expression:
For one process:
next(path.split('/')[2] for path in glob.glob('/proc/*/comm') if open(path).read().rstrip() == command)
For multiple processes:
[path.split('/')[2] for path in glob.glob('/proc/*/comm') if open(path).read().rstrip() == command]
The task can be solved using the following piece of code: [0:28] is the column interval where the name is held, while [29:34] contains the actual pid.
import os

program_pid = 0
program_name = "notepad.exe"

task_manager_lines = os.popen("tasklist").readlines()
for line in task_manager_lines:
    try:
        # pad the name with spaces so it fills the whole 28-character column
        if str(line[0:28]) == program_name + (28 - len(program_name)) * ' ':
            program_pid = int(line[29:34])
            break
    except:
        pass

print(program_pid)
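Fixed column offsets are fragile, so as an alternative sketch (still without third-party modules), tasklist can be asked for CSV output with /FO CSV and /NH (no header row), which parses more safely:

import csv
import os

program_name = "notepad.exe"
program_pid = 0

# /FO CSV gives quoted, comma-separated fields; /NH drops the header row
for row in csv.reader(os.popen('tasklist /FO CSV /NH')):
    if row and row[0] == program_name:
        program_pid = int(row[1])  # the second column is the PID
        break

print(program_pid)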