Multithreading to parse stream of data from multiple files - python

I have a Python program which parses a stream of data as shown below:
tail -F /path1/restapi.log -F /path2/restapi.log | parse.py
parse.py reads the data from sys.stdin.readline:
import json
import re
import sys
from functools import reduce  # needed on Python 3

def deep_get(dictionary, keys, default=None):
    return reduce(lambda d, key: d.get(key, default) if isinstance(d, dict) else default,
                  keys.split("."), dictionary)

regexp_date_status = re.compile(r'(\d+-\d+-\d+ \d+:\d+:\d+.\d+\+\d+) (\w+)')

while True:
    line = sys.stdin.readline()
    if not line:
        break
    if re.search(r'Request #\d+: {', line):
        date_status = regexp_date_status.match(line)
        json_str = '{\n'
        while True:
            json_str += sys.stdin.readline()
            try:
                d = json.loads(json_str)  # we have our dictionary, perhaps
            except ValueError:
                pass  # not a complete JSON object yet; keep reading
            else:
                username = deep_get(d, "context.authorization.authUserName", default="Username not found")
                hostname = deep_get(d, "context.headers.X-Forwarded-For")
                uri = deep_get(d, "context.uri")
                verb = deep_get(d, "context.verb")
                print("State->{} : Date->{} : User->{} : Host->{} : URI->{} : Verb->{}".format(
                    date_status.group(2), date_status.group(1), username, hostname, uri, verb))
                break
I would like to do multithreading, as the number of files can increase up to 30:
tail -F /path1/restapi.log -F /path2/restapi.log -F /path3/restapi.log -F /path4/restapi.log .... | parse.py
How do I divide the work among threads in this case, given that the data is streamed and parsed until I get a valid dictionary in the try block? Also, do I need to leverage queues here?

Let bash deal with the multiple instances of parse.py.
Something similar to:
echo -e 'file1.log\nfile2.log\nfile3.log'| xargs -n 1 --max-procs 10 -I % sh -c 'tail -f % |parse.py'
xargs will spawn and manage the multiple instances.
Note the '\n' in the files list.
Example to play around with:
echo -e "hello\nto\nyou" | xargs -n 1 --max-procs 2 -I % sh -c 'sleep 3; echo %'
This will use two parallel processes to do the sleep and echo.
The result will be 'hello' and 'to', with a delay before 'you' is seen.
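If you'd rather keep everything in one Python process instead, a minimal sketch (names are illustrative, not from the question) gives each file its own reader thread and funnels lines through a queue, so the stateful JSON-accumulation logic stays single-threaded:

```python
import queue
import threading

def reader(path, q):
    # Follow one file and push each line onto the shared queue.
    # (A production version would mimic tail -F and reopen on rotation.)
    with open(path) as f:
        for line in f:
            q.put(line)

def collect(paths):
    # One thread per file; the main thread drains the queue afterwards,
    # so only the readers run concurrently.
    q = queue.Queue()
    threads = [threading.Thread(target=reader, args=(p, q)) for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    lines = []
    while not q.empty():
        lines.append(q.get())
    return lines
```

In a real tail-following setup the main thread would consume `q.get()` in a loop instead of joining, but the queue-per-reader shape is the same.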


Is there any way I can get the PID by process name in Python?
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3110 meysam 20 0 971m 286m 63m S 14.0 7.9 14:24.50 chrome
For example I need to get 3110 by chrome.
You can get the pid of processes by name using pidof through subprocess.check_output:
from subprocess import check_output

def get_pid(name):
    return check_output(["pidof", name])
In [5]: get_pid("java")
Out[5]: '23366\n'
check_output(["pidof", name]) will run the command as "pidof process_name". If the return code is non-zero, it raises a CalledProcessError.
To handle multiple entries and cast to ints:
from subprocess import check_output

def get_pid(name):
    return map(int, check_output(["pidof", name]).split())
In [21]: get_pid("chrome")
Out[21]:
[27698, 27678, 27665, 27649, 27540, 27530, 27517, 14884, 14719, 13849, 13708, 7713, 7310, 7291, 7217, 7208, 7204, 7189, 7180, 7175, 7166, 7151, 7138, 7127, 7117, 7114, 7107, 7095, 7091, 7087, 7083, 7073, 7065, 7056, 7048, 7028, 7011, 6997]
Or pass the -s flag to get a single pid:
def get_pid(name):
    return int(check_output(["pidof", "-s", name]))
In [25]: get_pid("chrome")
Out[25]: 27698
You can use the psutil package:
Install
pip install psutil
Usage:
import psutil

process_name = "chrome"
pid = None

for proc in psutil.process_iter():
    if process_name in proc.name():
        pid = proc.pid
        break

print("Pid:", pid)
You can also use pgrep; with pgrep you can also give a pattern to match:
import subprocess

child = subprocess.Popen(['pgrep', 'program_name'], stdout=subprocess.PIPE)
result = child.communicate()[0]
(Note: shell=True was dropped; with shell=True and an argument list, only the first item is actually run.)
You can also use awk with ps like this:
ps aux | awk '/name/{print $2}'
For POSIX systems (Linux, BSD, etc.; only the /proc directory needs to be mounted), it's easier to work with the files in /proc directly.
It's pure Python, with no need to call shell programs.
It works on Python 2 and 3 (the only 2-to-3 difference is the exception hierarchy, hence the broad "except Exception", which I dislike but kept for compatibility; a custom exception would also work).
#!/usr/bin/env python
import os
import sys

for dirname in os.listdir('/proc'):
    if dirname == 'curproc':
        continue
    try:
        with open('/proc/{}/cmdline'.format(dirname), mode='rb') as fd:
            content = fd.read().decode().split('\x00')
    except Exception:
        continue
    for i in sys.argv[1:]:
        if i in content[0]:
            print('{0:<12} : {1}'.format(dirname, ' '.join(content)))
Sample Output (it works like pgrep):
phoemur ~/python $ ./pgrep.py bash
1487 : -bash
1779 : /bin/bash
Complete example based on the excellent @Hackaholic's answer:
import subprocess

def get_process_id(name):
    """Return process ids found by (partial) name or regex.

    >>> get_process_id('kthreadd')
    [2]
    >>> get_process_id('watchdog')
    [10, 11, 16, 21, 26, 31, 36, 41, 46, 51, 56, 61]  # ymmv
    >>> get_process_id('non-existent process')
    []
    """
    child = subprocess.Popen(['pgrep', '-f', name], stdout=subprocess.PIPE, shell=False)
    response = child.communicate()[0]
    return [int(pid) for pid in response.split()]
To improve on Padraic's answer: when check_output returns a non-zero code, it raises a CalledProcessError. This happens when the process does not exist or is not running.
What I would do to catch this exception is:
#!/usr/bin/python
from subprocess import check_output, CalledProcessError

def getPIDs(process):
    try:
        pidlist = map(int, check_output(["pidof", process]).split())
    except CalledProcessError:
        pidlist = []
    print 'list of PIDs = ' + ', '.join(str(e) for e in pidlist)

if __name__ == '__main__':
    getPIDs("chrome")
The output:
$ python pidproc.py
list of PIDS = 31840, 31841, 41942
If you're using Windows, you can get the PID of a process/app from its image name with this code:
from subprocess import Popen, PIPE

def get_pid_of_app(app_image_name):
    final_list = []
    command = Popen(['tasklist', '/FI', f'IMAGENAME eq {app_image_name}', '/fo', 'CSV'],
                    stdout=PIPE, shell=False)
    msg = command.communicate()
    output = str(msg[0])
    if 'INFO' not in output:
        output_list = output.split(app_image_name)
        for i in range(1, len(output_list)):
            j = int(output_list[i].replace("\"", '')[1:].split(',')[0])
            if j not in final_list:
                final_list.append(j)
    return final_list
It will return all PIDs of an app like Firefox or Chrome, e.g.
>>> get_pid_of_app("firefox.exe")
[10908, 4324, 1272, 6936, 1412, 2824, 6388, 1884]
let me know if it helped
If your OS is Unix-based, use this code:
import os

def check_process(name):
    output = []
    cmd = "ps -aef | grep -i '%s' | grep -v 'grep' | awk '{ print $2 }' > /tmp/out"
    os.system(cmd % name)
    with open('/tmp/out', 'r') as f:
        for line in f:
            if line.strip():
                output.append(line.strip())
    return output
Then call it and pass it a process name to get all PIDs.
>>> check_process('firefox')
['499', '621', '623', '630', '11733']
Since Python 3.5, subprocess.run() is recommended over subprocess.check_output():
>>> int(subprocess.run(["pidof", "-s", "your_process"], stdout=subprocess.PIPE).stdout)
Also, since Python 3.7, you can use the capture_output=True parameter to capture stdout and stderr:
>>> int(subprocess.run(["pidof", "-s", "your_process"], capture_output=True).stdout)
On Unix, you can use pyproc2 package.
Installation
pip install pyproc2
Usage
import pyproc2
chrome_pid=pyproc2.find("chrome").pid #Returns PID of first process with name "chrome"

My code has an if-else condition but I think the code is not checking the if condition and instead goes straight to the else branch

I have this code to change the filesystem size. The problem is that even though the if condition is satisfied, execution does not enter the if branch, and I do not know whether the if condition is being checked at all. It goes straight to the else branch.
The result after running the code
post-install-ray.py chpasswd
chpasswd root:Deepsky24
df -g | grep -w / | awk '{ print $3 }'
0.89
df -g | grep -w /usr | awk '{print $3 }'
5.81
chfs -a size=+4G / Filesystem size changed to 10354688
Code :
import subprocess
import sys
import pexpect
from pexpect import pxssh
import logging

password = "Deepsky24"
hostname = "host"
username = "root"

def change_filesystem():
    child = pxssh.pxssh()
    child.login(hostname, username, password)
    child.sendline("df -g | grep -w / | awk '{ print $3 }'")
    child.prompt()
    output1 = child.before
    print output1
    child.sendline("df -g | grep -w /usr | awk '{print $3 }'")
    child.prompt()
    output2 = child.before
    print output2
    if (output1 <= 1 and output2 >= 4):
        child.sendline('chfs -a size=-3G /usr')
        child.prompt()
        print(child.before)
        child.sendline('chfs -a size=+3G /')
        child.prompt()
        print(child.before)
    else:
        child.sendline('chfs -a size=+4G /')
        child.prompt()
        print(child.before)

if __name__ == "__main__":
    change_filesystem()
Your two outputs are still strings; you need to convert them to floats:
output1 = float(child.before)
output2 = float(child.before)
I guess you are on Python 2.
child.before returns a string.
You are comparing a string to an integer and getting unexpected results.
Cast your outputs with int() or float().
With Python 3 you would have got an exception.
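For example, a small helper (hypothetical, just to illustrate the cast) that turns what pexpect captures into a float, since child.before usually contains the echoed command as well as the real output:

```python
def parse_size(raw):
    # child.before typically holds the echoed command on the first line
    # and the df output on a later line; take the last non-empty line
    # and cast it to float so numeric comparison works.
    lines = [l.strip() for l in raw.splitlines() if l.strip()]
    return float(lines[-1])
```

With this, a comparison like `parse_size(output1) <= 1` compares numbers instead of strings.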

Extracting CPU use for specific process periodically

I have a running demo on a Linux server which consumes quite a bit of CPU and memory. These figures keep changing based on the load of the running demo. I want to extract the CPU and memory usage periodically, i.e. every 3-4 seconds, and create a plot of the extracted result.
Considering the process to be "Running Demo", on the terminal I typed:
ps aux |grep Running Demo | awk '{print $3 $4}'
This gives me the CPU and memory usage of Running Demo. But I want two more things:
1) Get this result output every 3-4 seconds.
2) Make a plot of the generated result.
Any help or suggestion will be highly appreciated. I am a starter in this community.
Thanks
What you are trying to do already exists as a project: see Munin.
NOTE
it's supported by the Open Source community, so it will be stronger
don't run odd commands like ps aux |grep Running Demo | awk '{print $3 $4}'; use ps auxw | awk '/Running Demo/{print $3 $4}' instead
many plugins exist and work for the basics: CPU, RAM, FW, Apache and many more
if you really need gnuplot, see a top-3 result on a Google search: http://blah.token.ro/post/249956031/using-gnuplot-to-graph-process-cpu-usage
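If you'd rather stay in Python, you can sample a process's CPU ticks from /proc periodically with pure stdlib code; a minimal sketch (function names are made up):

```python
import time

def cpu_ticks(pid):
    # utime + stime are fields 14 and 15 of /proc/<pid>/stat
    # (this simple split assumes the process name contains no spaces).
    with open('/proc/%d/stat' % pid) as f:
        fields = f.read().split()
    return int(fields[13]) + int(fields[14])

def sample(pid, interval=3.0, samples=5):
    """Return per-interval CPU tick deltas for the given pid."""
    deltas = []
    prev = cpu_ticks(pid)
    for _ in range(samples):
        time.sleep(interval)
        now = cpu_ticks(pid)
        deltas.append(now - prev)
        prev = now
    return deltas
```

The resulting list of deltas can then be written to a file and fed to gnuplot, as in the script below.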
The following Python script accepts an output filename (png) and one or more pids. When you press Ctrl-C it stops and uses gnuplot to generate a nice graph.
#!/usr/bin/env python
import os
import tempfile
import time
import sys

def total(pids):
    return [sum(map(int, file('/proc/%s/stat' % pid).read().split()[13:17])) for pid in pids]

def main():
    if len(sys.argv) == 1 or sys.argv[1] == '-h':
        print 'log.py output.png pid1 pid2..'
        return
    pids = sys.argv[2:]
    results = []
    prev = total(pids)
    try:
        while True:
            new = total(pids)
            result = [(new[i] - prev[i]) / 0.1 for i, pid in enumerate(pids)]
            results.append(result)
            time.sleep(0.1)
            prev = new
    except KeyboardInterrupt:
        pass
    t1, t2 = tempfile.mkstemp()[1], tempfile.mkstemp()[1]
    f1, f2 = file(t1, 'w'), file(t2, 'w')
    print
    print 'data: %s' % t1
    print 'plot: %s' % t2
    for result in results:
        print >>f1, ' '.join(map(str, result))
    print >>f2, 'set terminal png size %d,480' % (len(results) * 5)
    print >>f2, "set out '%s'" % sys.argv[1]
    print >>f2, 'plot ' + ', '.join([("'%s' using ($0/10):%d with linespoints title '%s'" % (t1, i + 1, pid)) for i, pid in enumerate(pids)])
    f1.close()
    f2.close()
    os.system('gnuplot %s' % t2)

if __name__ == '__main__':
    main()

How do I avoid processing an empty stdin with python?

The sys.stdin.readline() waits for an EOF (or new line) before returning, so if I have a console input, readline() waits for user input. Instead I want to print help and exit with an error if there is nothing to process, not wait for user input.
Reason:
I'm looking to write a python program with command line behaviour similar to grep.
Test cases:
No input and nothing piped, print help
$ argparse.py
argparse.py - prints arguments
echo $? # UNIX
echo %ERRORLEVEL% # WINDOWS
2
Command line args parsed
$ argparse.py a b c
0 a
1 b
2 c
Accept piped commands
$ ls | argparse.py
0 argparse.py
1 aFile.txt
parseargs.py listing:
# $Id: parseargs.py
import sys
import argparse
# Tried these too:
# import fileinput - blocks on no input
# import subprocess - requires calling program to be known

def usage():
    sys.stderr.write("{} - prints arguments".format(sys.argv[0]))
    sys.stderr.flush()
    sys.exit(2)

def print_me(count, msg):
    print '{}: {:>18} {}'.format(count, msg.strip(), map(ord, msg))

if __name__ == '__main__':
    USE_BUFFERED_INPUT = False
    # Case 1: Command line arguments
    if len(sys.argv) > 1:
        for i, arg in enumerate(sys.argv[1:]):
            print_me(i, arg)
    elif USE_BUFFERED_INPUT:  # Note: Do not use processing buffered inputs
        for i, arg in enumerate(sys.stdin):
            print_me(i, arg)
    else:
        i = 0
        ##### Need to determine if the sys.stdin is empty.
        ##### if READLINE_EMPTY:
        #####     usage()
        while True:
            arg = sys.stdin.readline()  # Blocks if no input
            if not arg:
                break
            print_me(i, arg)
            i += 1
    sys.exit(0)
grep can work the way it does because it has one non-optional argument: the pattern. For example
$ grep < /dev/zero
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
even though there was infinite input available on stdin, grep didn't get the required argument and therefore complained.
If you want to use only optional arguments and error out if stdin is a terminal, look at file.isatty().
import sys,os
print os.fstat(sys.stdin.fileno()).st_size > 0
Calling script
c:\py_exp>peek_stdin.py < peek_stdin.py
True
c:\py_exp>peek_stdin.py
False
You may want to check getopt module. Basic example:
import getopt
import sys

def main(argv):
    try:
        opts, args = getopt.getopt(argv, "has:f:")  # "has:f:" are the arguments
    except getopt.GetoptError:
        print "print usage()"
        sys.exit(1)
    if not opts and not args:
        print "print usage()"
        sys.exit(1)
    print "args passed", opts, args

if __name__ == "__main__":
    main(sys.argv[1:])
~> python blabla.py
print usage()
~> python blabla.py -a arg
args passed [('-a', '')] ['arg']
~> python blabla.py -b as    # this fails because -b is not defined in getopt's second parameter
print usage()
What about this one:
#!/usr/bin/env python
import getopt
import sys
import select

def main(argv):
    try:
        opts, args = getopt.getopt(argv, "has:f:")  # "has:f:" are the arguments
    except getopt.GetoptError:
        print "print usage()"
        sys.exit(1)
    if not opts and not args:
        a, b, c = select.select([sys.stdin], [], [], 0.2)
        if a:
            itera = iter(a[0].readline, "")
            for line in itera:
                data = line.strip()
                print data
        else:
            print "print usage()"
    print "args passed", opts, args

if __name__ == "__main__":
    main(sys.argv[1:])
select.select helps check whether there is data coming in.
:~> ./hebele.py
print usage()
args passed [] []
:~> ping www.google.com | ./hebele.py
PING www.google.com (173.194.67.105) 56(84) bytes of data.
64 bytes from blabla (173.194.67.105): icmp_seq=1 ttl=48 time=16.7 ms
64 bytes from blabla (173.194.67.105): icmp_seq=2 ttl=48 time=17.1 ms
64 bytes from blabla (173.194.67.105): icmp_seq=3 ttl=48 time=17.1 ms
^CTraceback (most recent call last):
File "./hebele.py", line 25, in <module>
main(sys.argv[1:])
File "./hebele.py", line 17, in main
for line in itera:
KeyboardInterrupt
:~> ls | ./hebele.py
Aptana_Studio_3
Desktop
...
workspace
args passed [] []
:~> ./hebele.py -a bla
args passed [('-a', '')] ['bla']
:~> ./hebele.py sdfsdf sadf sdf
args passed [] ['sdfsdf', 'sadf', 'sdf']

How to get output from external command combine with Pipe

I have a command like this:
wmctrl -lp | awk '/gedit/ { print $1 }'
And I want its output within a Python script, so I tried this code:
>>> import subprocess
>>> proc = subprocess.Popen(["wmctrl -lp", "|","awk '/gedit/ {print $1}"], shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
>>> proc.stdout.readline()
'0x0160001b -1 6504 beer-laptop x-nautilus-desktop\n'
>>> proc.stdout.readline()
'0x0352f117 0 6963 beer-laptop How to get output from external command combine with Pipe - Stack Overflow - Chromium\n'
>>> proc.stdout.readline()
'0x01400003 -1 6503 beer-laptop Bottom Expanded Edge Panel\n'
>>>
It seems my code is wrong: only wmctrl -lp was executed, and | awk '{print $1}' was omitted.
My expected output would be like 0x03800081:
$ wmctrl -lp | awk '/gedit/ {print $1}'
0x03800081
Can someone please help?
With shell=True, you should use a single command line instead of an array, otherwise your additional arguments are interpreted as shell arguments. From the subprocess documentation:
On Unix, with shell=True: If args is a string, it specifies the command string to execute through the shell. If args is a sequence, the first item specifies the command string, and any additional items will be treated as additional shell arguments.
So your call should be:
subprocess.Popen("wmctrl -lp | awk '/gedit/ {print $1}'", shell=True, ...
I think you may also have an unbalanced single quote in there.
Because you are passing a sequence in for the program, it thinks that the pipe is an argument to wmctrl, as if you did
wmctrl -lp "|"
and thus the actual pipe operation is lost.
Making it a single string should indeed give you the correct result:
>>> import subprocess as s
>>> proc = s.Popen("echo hello | grep e", shell=True, stdout=s.PIPE, stderr=s.PIPE)
>>> proc.stdout.readline()
'hello\n'
>>> proc.stdout.readline()
''
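Alternatively, you can skip shell=True entirely and wire the two processes together yourself; a generic sketch:

```python
import subprocess

def pipeline(cmd1, cmd2):
    """Run cmd1 | cmd2 without a shell and return cmd2's stdout."""
    p1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(cmd2, stdin=p1.stdout, stdout=subprocess.PIPE)
    p1.stdout.close()  # so p1 receives SIGPIPE if p2 exits first
    out, _ = p2.communicate()
    p1.wait()
    return out
```

Something like pipeline(["wmctrl", "-lp"], ["awk", "/gedit/ {print $1}"]) should reproduce the original command with no shell-quoting headaches.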
After some research, I have the following code which works very well for me. It basically prints both stdout and stderr in real time. Hope it helps someone else who needs it.
import subprocess
import sys
import threading

stdout_result = 1
stderr_result = 1

def stdout_thread(pipe):
    global stdout_result
    while True:
        out = pipe.stdout.read(1)
        stdout_result = pipe.poll()
        if out == '' and stdout_result is not None:
            break
        if out != '':
            sys.stdout.write(out)
            sys.stdout.flush()

def stderr_thread(pipe):
    global stderr_result
    while True:
        err = pipe.stderr.read(1)
        stderr_result = pipe.poll()
        if err == '' and stderr_result is not None:
            break
        if err != '':
            sys.stdout.write(err)
            sys.stdout.flush()

def exec_command(command, cwd=None):
    if cwd is not None:
        print '[' + ' '.join(command) + '] in ' + cwd
    else:
        print '[' + ' '.join(command) + ']'
    p = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, cwd=cwd
    )
    out_thread = threading.Thread(name='stdout_thread', target=stdout_thread, args=(p,))
    err_thread = threading.Thread(name='stderr_thread', target=stderr_thread, args=(p,))
    err_thread.start()
    out_thread.start()
    out_thread.join()
    err_thread.join()
    return stdout_result + stderr_result
When needed, I think it's easy to collect the output or error in a string and return.
