I'm trying to perform a sed/awk style regex substitution with python3's re module.
You can see it works fine here with a hardcoded test string:
#!/usr/bin/env python3
import re
regex = r"^(?P<time>\d+\:\d+\:\d+\.\d+)(?:\s+)(?P<func>\S+)(?:\s+)(?P<path>\S+(?: +\S+)*?)(?:\s+)(?P<span>\d+\.\d+)(?:\s+)(?P<npid>(?P<name>\S+(?: +\S+)*?)\.(?P<pid>\d+))\n"
subst = "'\\g<name>', "
line = ("21:21:54.165651 stat64 this/ 0.000012 THiNG1.12471\n"
"21:21:54.165652 stat64 /that 0.000012 2thIng.12472\n"
"21:21:54.165653 stat64 /and/the other thing.xml 0.000012 With S paces.12473\n"
"21:21:54.165654 stat64 /and/the_other_thing.xml 0.000012 without_em_.12474\n"
"21:59:57.774616 fstat64 F=90 0.000002 tmux.4129\n")
result = re.sub(regex, subst, line, 0, re.MULTILINE)
if result:
    print(result)
But I'm having some trouble getting it to work the same way with the stdin:
#!/usr/bin/env python3
import sys, re
regex = r"^(?P<time>\d+\:\d+\:\d+\.\d+)(?:\s+)(?P<func>\S+)(?:\s+)(?P<path>\S+(?: +\S+)*?)(?:\s+)(?P<span>\d+\.\d+)(?:\s+)(?P<npid>(?P<name>\S+(?: +\S+)*?)\.(?P<pid>\d+))\n"
subst = "'\\g<name>', "
for line in str(sys.stdin):
    #sys.stdout.write(line)
    result = re.sub(regex, subst, line, 0, re.MULTILINE)
    if result:
        print(result,end='')
I'd like to be able to pipe input straight into it from another utility, like is common with grep and similar CLI utilities.
Any idea what the issue is here?
Addendum
I tried to keep the question simple and generalized in the hope that answers might be more useful in similar but different situations, and useful to more people. However, the details might shed some more light on the problem, so here are the exact details of my current scenario:
The desired input to my script is actually the output stream from a utility called fs_usage, it's similar to utilities like ps, but provides a constant stream of system calls and filesystem operations. It tells you which files are being read from, written to, etc. in real time.
From the manual:
NAME
fs_usage -- report system calls and page faults related to filesystem activity in real-time
DESCRIPTION
The fs_usage utility presents an ongoing display of system call usage information pertaining to filesystem activity. It requires root privileges due to the kernel tracing facility it uses to operate.
By default, the activity monitored includes all system processes except for:
fs_usage, Terminal.app, telnetd, telnet, sshd, rlogind, tcsh, csh, sh, zsh. These defaults can be overridden such that output is limited to include or exclude (-e) a list of processes specified by the user.
The output presented by fs_usage is formatted according to the size of your window.
A narrow window will display fewer columns. Use a wide window for maximum data display.
You may override the formatting restrictions by forcing a wide display with the -w option.
In this case, the data displayed will wrap when the window is not wide enough.
I hacked together a crude little bash script to rip the process names from the stream and dump them to a temporary log file. You can think of it as a filter or an extractor. Here it is as a function that will dump straight to stdout (remove the comment on the last line to dump to file).
proc_enum ()
{
    while true; do
        sudo fs_usage -w -e 'grep' 'awk' |
        grep -E -o '(?:\d\.\d{6})\s{3}\S+\.\d+' |
        awk '{print $2}' |
        awk -F '.' '{print $1}' \
        #>/tmp/proc_names.logx
    done
}
Useful Links
Regular Expressions 101
Stack Overflow - How to pipe input to python line by line from linux program?
The problem is str(sys.stdin). What Python does in the for loop is this:
i = iter(str(sys.stdin))
# then in every iteration
next(i)
Here you are converting the file object to a str. On my computer the result is:
str(sys.stdin) == "<_io.TextIOWrapper name='<stdin>' mode='r' encoding='cp1256'>"
So you are not looping over the lines received on stdin; you are looping over the characters of the file object's string representation.
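The difference is easy to see in isolation. The sketch below uses io.StringIO as a stand-in for sys.stdin (an assumption made only so that the example runs without piped input); iterating over a real text stream behaves the same way.

```python
import io

# io.StringIO stands in for sys.stdin here so the example needs no piped input.
stream = io.StringIO("first line\nsecond line\n")

# Iterating over str(stream) walks the characters of the object's repr.
chars = list(str(stream))

# Iterating over the stream itself yields one line per iteration.
lines = list(stream)

print(chars[:5])
print(lines)
```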
There is another problem: in the first example you apply re.sub to the entire text, but here you apply it to each line separately. You should therefore either concatenate the results for each line, or join the lines into a single text before applying re.sub.
import sys, re
regex = r"^(?P<time>\d+\:\d+\:\d+\.\d+)(?:\s+)(?P<func>\S+)(?:\s+)(?P<path>\S+(?: +\S+)*?)(?:\s+)(?P<span>\d+\.\d+)(?:\s+)(?P<npid>(?P<name>\S+(?: +\S+)*?)\.(?P<pid>\d+))\n"
subst = "'\\g<name>', "
result = ''
for line in sys.stdin:
    # lines read from sys.stdin are already str, so no conversion is needed
    result += re.sub(regex, subst, line, flags=re.MULTILINE)
if result:
    print(result, end='')
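Since fs_usage produces an endless stream, accumulating everything before printing may never produce output. A minimal sketch of a per-line variant follows; the helper name filter_stream is hypothetical, the pattern and replacement are copied from the question, and an in-memory stream stands in for stdin so the example runs on its own.

```python
import io
import re
import sys

# Pattern and replacement copied from the question.
REGEX = re.compile(
    r"^(?P<time>\d+\:\d+\:\d+\.\d+)(?:\s+)(?P<func>\S+)(?:\s+)"
    r"(?P<path>\S+(?: +\S+)*?)(?:\s+)(?P<span>\d+\.\d+)(?:\s+)"
    r"(?P<npid>(?P<name>\S+(?: +\S+)*?)\.(?P<pid>\d+))\n",
    re.MULTILINE,
)
SUBST = "'\\g<name>', "


def filter_stream(source, sink):
    """Hypothetical helper: substitute each line of source and write it to sink."""
    for line in source:
        sink.write(REGEX.sub(SUBST, line))
        sink.flush()  # emit results immediately when the input never ends


# Demonstration with an in-memory stream; a real script would call
# filter_stream(sys.stdin, sys.stdout) instead.
sample = io.StringIO(
    "21:21:54.165651 stat64 this/ 0.000012 THiNG1.12471\n"
    "21:59:57.774616 fstat64 F=90 0.000002 tmux.4129\n"
)
filter_stream(sample, sys.stdout)
```

Lines that do not match the pattern pass through unchanged, because re.sub returns its input when nothing matches.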
I am storing the number of files in a directory in a variable and storing their names in an array. I'm unable to store file names in the array.
Here is the piece of code I have written.
import os
temp = os.system('ls -l /home/demo/ | wc -l')
no_of_files = temp - 1
command = "ls -l /home/demo/ | awk 'NR>1 {print $9}'"
file_list=[os.system(command)]
for i in range(len(file_list)):
    os.system('tail -1 file_list[i]')
Your shell scripting is orders of magnitude too complex.
output = subprocess.check_output('tail -qn1 *', shell=True)
or if you really prefer,
os.system('tail -qn1 *')
which however does not capture the output in a Python variable.
If you have a recent-enough Python, you'll want to use subprocess.run() instead. You can also easily let Python do the enumeration of the files to avoid the pesky shell=True:
output = subprocess.check_output(['tail', '-qn1'] + os.listdir('.'))
As noted above, if you genuinely just want the output to be printed to the screen and not be available to Python, you can of course use os.system() instead, though subprocess is recommended even in the os.system() documentation because it is much more versatile and more efficient to boot (if used correctly). If you really insist on running one tail process per file (perhaps because your tail doesn't support the -q option?) you can do that too, of course:
for filename in os.listdir('.'):
    os.system("tail -n 1 '%s'" % filename)
This will still work incorrectly if you have a file name which contains a single quote. There are workarounds, but avoiding a shell is vastly preferred (so back to subprocess without shell=True and the problem of correctly coping with escaping shell metacharacters disappears because there is no shell to escape metacharacters from).
for filename in os.listdir('.'):
    print(subprocess.check_output(['tail', '-n1', filename]))
Finally, tail doesn't particularly do anything which cannot easily be done by Python itself.
for filename in os.listdir('.'):
    line = ''  # stays empty if the file has no lines
    with open(filename, 'r') as handle:
        for line in handle:
            pass
    # print the last one only
    print(line.rstrip('\r\n'))
If you have knowledge of the expected line lengths and the files are big, maybe seek to somewhere near the end of the file, though obviously you need to know how far from the end to seek in order to be able to read all of the last line in each of the files.
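The seeking idea can be sketched without knowing the line length in advance by reading backwards in chunks until a full last line is in hand. The helper name last_line and the 4096-byte default are assumptions for illustration; the function works on bytes and ignores trailing newline characters at the end of the file.

```python
import os
import tempfile


def last_line(path, chunk_size=4096):
    """Return the last line of the file as bytes, without trailing newlines."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        pos = handle.tell()
        data = b""
        while pos > 0:
            step = min(chunk_size, pos)
            pos -= step
            handle.seek(pos)
            data = handle.read(step) + data
            stripped = data.rstrip(b"\r\n")
            if b"\n" in stripped:
                # A complete last line has been read.
                return stripped.rsplit(b"\n", 1)[1]
        # The whole file was read without finding an earlier newline.
        return data.rstrip(b"\r\n")


# Demonstration on a throwaway file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first\nsecond\nthird\n")
print(last_line(tmp.name))
os.unlink(tmp.name)
```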
os.system returns the exit code of the command, not its output. Try using subprocess.check_output with shell=True.
Example:
>>> a = subprocess.check_output("ls -l /home/demo/ | awk 'NR>1 {print $9}'", shell=True)
>>> a.decode("utf-8").split("\n")
Edit (as suggested by @tripleee): you probably don't want to do this, as it will get crazy. Python has great functions for things like this. For example:
>>> import glob
>>> names = glob.glob("/home/demo/*")
will directly give you a list of the files and folders inside that folder. Once you have this, you can just do len(names) to get the count that your first command was meant to produce.
Another option is:
>>> import os
>>> os.listdir("/home/demo")
Here, glob will give you the whole filepath /home/demo/file.txt and os.listdir will just give you the filename file.txt
The ls -l /home/demo/ | wc -l command also does not give the correct count, because ls -l prints a "total X" line at the top (the total number of blocks used by the listed files), and wc -l counts that line too.
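Putting the two options side by side, here is a small sketch that counts the regular files in a directory without any shell command. A temporary directory with made-up file names stands in for /home/demo so that the example runs anywhere.

```python
import glob
import os
import tempfile

# A throwaway directory stands in for /home/demo.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a.txt", "b.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    os.mkdir(os.path.join(demo_dir, "subdir"))

    # Bare names, directories included.
    names = sorted(os.listdir(demo_dir))
    # Full paths, directories included.
    paths = sorted(glob.glob(os.path.join(demo_dir, "*")))
    # Regular files only.
    files = [n for n in names if os.path.isfile(os.path.join(demo_dir, n))]

    print(len(files), files)
```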
You could likely use a loop without much issue:
files = [f for f in os.listdir('.') if os.path.isfile(f)]
for f in files:
    # the with block closes the file, so no explicit close() is needed
    with open(f, 'rb') as fh:
        last = fh.readlines()[-1].decode()
        print('file: {0}\n{1}\n'.format(f, last))
Output:
file.txt
Hello, World!
...
If your files are large then readlines() probably isn't the best option. Maybe go with tail instead:
for f in files:
    print('file: {0}'.format(f))
    subprocess.check_call(['tail', '-n', '1', f])
    print('\n')
The decode is optional. For text, "utf-8" usually works; if the file is a mix of binary and text, something such as "iso-8859-1" should usually work.
You are not able to store the file names because os.system does not return the command's output as you expect. For more information, see the documentation for os.system.
From the docs
On Unix, the return value is the exit status of the process encoded in the format specified for wait(). Note that POSIX does not specify the meaning of the return value of the C system() function, so the return value of the Python function is system-dependent.
On Windows, the return value is that returned by the system shell after running command, given by the Windows environment variable COMSPEC: on command.com systems (Windows 95, 98 and ME) this is always 0; on cmd.exe systems (Windows NT, 2000 and XP) this is the exit status of the command run; on systems using a non-native shell, consult your shell documentation.
os.system executes shell commands as they are. To get the output of these shell commands, you have to use Python's subprocess module.
Note : In your case you can get file names using either glob module or os.listdir(): see How to list all files of a directory
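As a rough sketch of capturing output with subprocess.run (the capture_output and text parameters need Python 3.7 or later): the child command below is a stand-in that runs the current Python interpreter and prints two made-up names, so the example does not depend on any particular shell utility.

```python
import subprocess
import sys

# The child command is a stand-in: it runs the current Python interpreter
# so that the example does not depend on any particular shell utility.
completed = subprocess.run(
    [sys.executable, "-c", "print('file_a'); print('file_b')"],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # decode the output to str
    check=True,           # raise CalledProcessError on a non-zero exit status
)
file_list = completed.stdout.splitlines()
print(completed.returncode, file_list)
```

Unlike os.system, the exit status and the output are both available afterwards, as completed.returncode and completed.stdout.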
I'm a complete Python novice, but I need to create a small script as part of a larger project.
I'm trying to use a small Python script to insert a variable into a line of a file on a Unix system. I've been trying to use subprocess.call(), which lets me execute the commands I need but won't let me place the variable in the command.
I have been trying the following:
import subprocess
var="some_variable"
subprocess.call(['sudo', 'sed', '-i', '-e' 's/line_to_replace.*/replacement_line' + var, /path/to/file.conf])
The most obvious failure here is that your code isn't putting a trailing / on the end of the replacement value to make it a valid sed command:
import subprocess
var="some_variable".replace('/', r'\/') # backslash-escape any sigil instances
subprocess.call(['sudo', 'sed', '-i',
'-e', 's/line_to_replace.*/replacement_line%s/' % (var,),
'/path/to/file.conf'])
That said, be aware that this won't work with POSIX-standard sed (which doesn't have -i), or BSD sed (for which the extension argument to sed -i is not optional). The POSIX-standard tools for editing files in-place are ed and ex; consider learning them.
Your immediate problem is a sed syntax error: You are missing the closing separator after the end of the replacement string (and the file name needs to be quoted). However, most simple Unix tools can easily be replaced with pure Python, often with immediate gains both in elegance and efficiency (avoiding an external process is frequently a significant performance improvement).
import fileinput
# inplace=True redirects print() into the file being rewritten
for line in fileinput.input('/path/to/file', inplace=True):
    if line.startswith('line_to_replace'):
        line = var + '\n'
    print(line, end='')
If you genuinely need regular expression support, the change should be obvious. The sudo is harder to replace; perhaps you can simply run this from the prompt with sudo if obtaining write privileges to the file you want to manipulate is otherwise not trivial.
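For the regular expression case, one possible shape is sketched below. The helper name replace_matching_lines is hypothetical, and a temporary file stands in for /path/to/file.conf; the pattern and replacement mirror the sed command from the question.

```python
import fileinput
import os
import re
import tempfile


def replace_matching_lines(path, pattern, replacement):
    """Hypothetical helper: rewrite path in place, substituting pattern on each line."""
    regex = re.compile(pattern)
    with fileinput.input(path, inplace=True) as lines:
        for line in lines:
            # With inplace=True, print() writes into the file being rewritten.
            # A function is used as the replacement so that backslashes in
            # the value are inserted literally.
            print(regex.sub(lambda match: replacement, line), end="")


# Demonstration on a throwaway file instead of /path/to/file.conf.
with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as tmp:
    tmp.write("keep_me=1\nline_to_replace=old\n")
var = "some_variable"
replace_matching_lines(tmp.name, r"^line_to_replace.*", "replacement_line" + var)
with open(tmp.name) as handle:
    result = handle.read()
print(result, end="")
os.unlink(tmp.name)
```

Because `.` does not match a newline by default, each line keeps its line ending and no extra newline has to be added.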
My Python file has these 2 variables:
week_date = "01/03/16-01/09/16"
cust_id = "12345"
How can I read these into a shell script that takes these 2 variables?
My current shell script requires manual editing of "dt" and "id". I want to read the Python variables into the shell script so I can just edit my Python parameter file and not so many files.
shell file:
#!/bin/sh
dt="01/03/16-01/09/16"
cust_id="12345"
In a new Python file I could just import the parameter Python file.
Consider something akin to the following:
#!/bin/bash
#      ^^^^ NOT /bin/sh, which doesn't have process substitution available.

python_script='
import sys
d = {}                                   # create a context for variables
exec(open(sys.argv[1], "r").read(), d)   # execute the Python code in that context
for k in sys.argv[2:]:
    # ...and extract your strings NUL-delimited
    sys.stdout.write("%s\0" % str(d[k]).split("\0")[0])
'

read_python_vars() {
  local python_file=$1; shift
  local varname
  for varname; do
    IFS= read -r -d '' "${varname#*:}"
  done < <(python -c "$python_script" "$python_file" "${@%%:*}")
}
You might then use this as:
read_python_vars config.py week_date:dt cust_id:id
echo "Customer id is $id; date range is $dt"
...or, if you didn't want to rename the variables as they were read, simply:
read_python_vars config.py week_date cust_id
echo "Customer id is $cust_id; date range is $week_date"
Advantages:
Unlike a naive regex-based solution (which would have trouble with some of the details of Python parsing -- try teaching sed to handle both raw and regular strings, and both single and triple quotes without making it into a hairball!) or a similar approach that used newline-delimited output from the Python subprocess, this will correctly handle any object for which str() gives a representation with no NUL characters that your shell script can use.
Running content through the Python interpreter also means you can determine values programmatically -- for instance, you could have some Python code that asks your version control system for the last-change-date of relevant content.
Think about scenarios such as this one:
start_date = '01/03/16'
end_date = '01/09/16'
week_date = '%s-%s' % (start_date, end_date)
...using a Python interpreter to parse Python means you aren't restricting how people can update/modify your Python config file in the future.
Now, let's talk caveats:
If your Python code has side effects, those side effects will obviously take effect (just as they would if you chose to import the file as a module in Python). Don't use this to extract configuration from a file whose contents you don't trust.
Python strings are Pascal-style: They can contain literal NULs. Strings in shell languages are C-style: They're terminated by the first NUL character. Thus, some variables can exist in Python that cannot be represented in shell without nonliteral escaping. To prevent an object whose str() representation contains NULs from spilling forward into other assignments, this code terminates strings at their first NUL.
Now, let's talk about implementation details.
${@%%:*} is an expansion of $@ which trims all content after and including the first : in each argument, thus passing only the Python variable names to the interpreter. Similarly, ${varname#*:} is an expansion which trims everything up to and including the first : from the variable name passed to read. See the bash-hackers page on parameter expansion.
Using <(python ...) is process substitution syntax: The <(...) expression evaluates to a filename which, when read, will provide output of that command. Using < <(...) redirects output from that file, and thus that command (the first < is a redirection, whereas the second is part of the <( token that starts a process substitution). Using this form to get output into a while read loop avoids the bug mentioned in BashFAQ #24 ("I set variables in a loop that's in a pipeline. Why do they disappear after the loop terminates? Or, why can't I pipe data to read?").
The IFS= read -r -d '' construct has a series of components, each of which makes the behavior of read more true to the original content:
Clearing IFS for the duration of the command prevents whitespace from being trimmed from the end of the variable's content.
Using -r prevents literal backslashes from being consumed by read itself rather than represented in the output.
Using -d '' sets the first character of the empty string '' to be the record delimiter. Since C strings are NUL-terminated and the shell uses C strings, that character is a NUL. This ensures that variables' content can contain any non-NUL value, including literal newlines.
See BashFAQ #001 ("How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?") for more on the process of reading record-oriented data from a string in bash.
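The Python half of this approach can also be exercised on its own. The sketch below wraps the same steps in a hypothetical function, dump_vars, and runs it against a temporary file holding the two variables from the question; the function name and the temporary file are assumptions made for illustration.

```python
import os
import tempfile


def dump_vars(config_path, names):
    """Hypothetical helper: return the named variables as one NUL-delimited string."""
    context = {}
    with open(config_path, "r") as handle:
        exec(handle.read(), context)  # runs the file: only use on trusted input
    # Cut each value at its first NUL so it cannot spill into the next field.
    return "".join("%s\0" % str(context[name]).split("\0")[0] for name in names)


# Demonstration with a throwaway parameter file.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write('week_date = "01/03/16-01/09/16"\ncust_id = "12345"\n')
output = dump_vars(tmp.name, ["week_date", "cust_id"])
print(repr(output))
os.unlink(tmp.name)
```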
Other answers give a way to do exactly what you ask for, but I think the idea is a bit crazy. There's a simpler way to satisfy both scripts - move those variables into a config file. You can even preserve the simple assignment format.
Create the config itself: (ini-style)
dt="01/03/16-01/09/16"
cust_id="12345"
In python:
config_vars = {}
with open('the/file/path', 'r') as f:
    for line in f:
        if '=' in line:
            k, v = line.split('=', 1)
            # strip the trailing newline and the surrounding double quotes
            config_vars[k.strip()] = v.strip().strip('"')
week_date = config_vars['dt']
cust_id = config_vars['cust_id']
In bash:
source "the/file/path"
And you don't need to do crazy source parsing anymore. Alternatively you can just use json for the config file and then use json module in python and jq in shell for parsing.
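The JSON variant might look roughly like this on the Python side. The keys mirror the variables above, and a temporary file is used in place of a real config path, so both are assumptions for illustration.

```python
import json
import os
import tempfile

# The keys are illustrative; they mirror the variables above.
config = {"dt": "01/03/16-01/09/16", "cust_id": "12345"}

# Write the config to a throwaway file (a real setup would keep one shared file).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(config, tmp)

# Read it back the way the Python script would.
with open(tmp.name) as handle:
    loaded = json.load(handle)

week_date = loaded["dt"]
cust_id = loaded["cust_id"]
print(week_date, cust_id)
os.unlink(tmp.name)
```

On the shell side, the same file could then be read with something like jq -r .dt applied to the config file.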
I would do something like this. You may want to modify it a little to include or exclude the quotes, as I haven't really tested it for your scenario:
#!/bin/sh
exec <$python_filename
while read line
do
    match=`echo $line|grep "week_date ="`
    if [ $? -eq 0 ]; then
        dt=`echo $line|cut -d '"' -f 2`
    fi
    match=`echo $line|grep "cust_id ="`
    if [ $? -eq 0 ]; then
        cust_id=`echo $line|cut -d '"' -f 2`
    fi
done
The objective is to extract an image from a binary file. How do I search a binary file for the file type's markers, SOI and EOI?
Regular find() functions don't seem to work as I cannot load the binary file as a string.
You want to search for a magic word in a stream (not a string).
Here's the idea:
read one char at a time (use file.read(1)) from this file
use a queue the length of your magic word, and check the queue after each read
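MAGIC_WORD = b'JPEG' # it's an example... just an example

def find_magic(f, magic=MAGIC_WORD):
    # f must be a file opened in binary mode ('rb')
    window = bytearray(f.read(len(magic)))
    offset = 0
    while len(window) == len(magic):
        if window == magic:
            return offset
        offset += 1
        del window[0]
        window += f.read(1)
    return -1 # reached the end of the file without a match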
If you feel the need... I mean, the need for speed, check this wiki article, use a smarter algorithm, and finally switch to c++.
Sorry, I don't know of any existing Python library that does this. Good luck.
Another thought:
if you can use unix shell (instead of Python), you can try to use unix pipes and chain some search tools (like grep and xxd)
like
cat yourbinfile | xxd -p | grep HEXMAGICWORD
where HEXMAGICWORD is
echo jpeg | xxd -p
I'm not very familiar with the shell, so this is not an exact answer.
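For files that fit in memory, a simpler route is to read the file in binary mode and use the find() method of bytes. The sketch below assumes the image is a JPEG, whose SOI and EOI markers are the byte pairs FF D8 and FF D9; the function name is hypothetical. It is deliberately naive: a JPEG with an embedded thumbnail contains nested markers, so the first EOI found is not always the end of the outer image.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def extract_first_jpeg(data):
    """Return the bytes from the first SOI through the next EOI, or None."""
    start = data.find(SOI)
    if start == -1:
        return None
    end = data.find(EOI, start + len(SOI))
    if end == -1:
        return None
    return data[start:end + len(EOI)]


# In practice data would come from open(path, "rb").read(); a small
# in-memory sample is used here so the example is self-contained.
sample = b"header" + SOI + b"image-bytes" + EOI + b"trailer"
print(extract_first_jpeg(sample))
```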
Given the > 4 GB file myfile.gz, I need to zcat it into a pipe for consumption by Teradata's fastload. I also need to count the number of lines in the file. Ideally, I only want to make a single pass through the file. I use awk to output the entire line ($0) to stdout and, in awk's END clause, to write the number of rows (awk's NR variable) to another file (outfile).
I've managed to do this using awk but I'd like to know if a more pythonic way exists.
#!/usr/bin/env python
from subprocess import Popen, PIPE
from os import path
the_file = "/path/to/file/myfile.gz"
outfile = "/tmp/%s.count" % path.basename(the_file)
cmd = ["-c",'zcat %s | awk \'{print $0} END {print NR > "%s"} \' ' % (the_file, outfile)]
zcat_proc = Popen(cmd, stdout = PIPE, shell=True)
The pipe is later consumed by a call to teradata's fastload, which reads from
"/dev/fd/" + str(zcat_proc.stdout.fileno())
This works, but I'd like to know if it's possible to skip awk and take better advantage of Python. I'm also open to other methods. I have multiple large files that I need to process in this manner.
There's no need for either of zcat or Awk. Counting the lines in a gzipped file can be done with
import gzip
nlines = sum(1 for ln in gzip.open("/path/to/file/myfile.gz"))
If you want to do something else with the lines, such as pass them to a different process, do
nlines = 0
for ln in gzip.open("/path/to/file/myfile.gz"):
    nlines += 1
    # pass the line to the other process
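One way to fill in the "pass the line to the other process" step is to write each line to the child's stdin through a pipe. The sketch below uses a hypothetical helper, stream_and_count, and a stand-in consumer that simply drains its input; whether fastload itself can read from a pipe is a separate matter and is not assumed here.

```python
import gzip
import os
import subprocess
import sys
import tempfile


def stream_and_count(gz_path, consumer_cmd):
    """Hypothetical helper: pipe the decompressed lines to a child process and return the line count."""
    consumer = subprocess.Popen(consumer_cmd, stdin=subprocess.PIPE)
    nlines = 0
    with gzip.open(gz_path, "rb") as lines:
        for line in lines:
            consumer.stdin.write(line)
            nlines += 1
    consumer.stdin.close()  # signal end of input to the child
    consumer.wait()
    return nlines


# Demonstration: a tiny gzip file and a child that just drains its stdin.
with tempfile.NamedTemporaryFile(suffix=".gz", delete=False) as tmp:
    pass
with gzip.open(tmp.name, "wb") as out:
    out.write(b"row 1\nrow 2\nrow 3\n")
drain = [sys.executable, "-c", "import sys; sys.stdin.buffer.read()"]
count = stream_and_count(tmp.name, drain)
print(count)
os.unlink(tmp.name)
```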
Counting lines and unzipping gzip-compressed files can be easily done with Python and its standard library. You can do everything in a single pass:
import gzip, subprocess, os
fifo_path = "path/to/fastload-fifo"
os.mkfifo(fifo_path)
fastload = subprocess.Popen(["fastload", "--read-from", fifo_path])
nlines = 0
# opening the FIFO for writing blocks until fastload opens it for reading
with gzip.open("/path/to/file/myfile.gz") as f, open(fifo_path, "wb") as fastload_fifo:
    for line in f:
        fastload_fifo.write(line)
        nlines += 1
fastload.wait()
print("Number of lines", nlines)
os.unlink(fifo_path)
I don't know how to invoke Fastload -- substitute the correct parameters in the invocation.
This can be done in one simple line of bash:
zcat myfile.gz | tee >(wc -l >&2) | fastload
This will print the line count on stderr. If you want it somewhere else you can redirect the wc output however you like.
Actually, it should not be possible to pipe the data to Fastload at all, so it would be great if somebody could post an exact example here.
From Teradata documentation on the Fastload configuration http://www.info.teradata.com/htmlpubs/DB_TTU_14_00/index.html#page/Load_and_Unload_Utilities/B035_2411_071A/2411Ch03.026.028.html#ww1938556
FILE=filename
Keyword phrase specifying the name of the data source that contains the input data. fileid must refer to a regular file. Specifically, pipes are not supported.