Using grep in python

There is a file (query.txt) containing keywords/phrases that should be matched against other files using grep. The last three lines of the following code work perfectly, but when the same command is used inside the while loop, the program hangs (it stops responding).
import os
f = open('query.txt', 'r')
b = f.readline()
while b:
    cmd = 'grep %s my2.txt' % b  # my2.txt is the file in which we are looking for b
    os.system(cmd)
    b = f.readline()
f.close()
a = 'He is'
cmd = 'grep %s my2.txt' % a
os.system(cmd)

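First of all, you are not iterating over the file idiomatically. You can simply use for b in f: without the .readline() calls.
More importantly, every line you read still ends with '\n', so the shell sees your command split in two; with a single-word query, grep is left waiting on standard input, which is the hang you see. Your code will also blow up in your face as soon as a query contains any characters which have a special meaning in the shell. Use subprocess.call instead of os.system() and pass an argument list.
Here's a fixed version: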
import subprocess
with open('query.txt', 'r') as f:
    for line in f:
        line = line.rstrip()  # remove trailing whitespace such as '\n'
        subprocess.call(['/bin/grep', line, 'my2.txt'])
However, you can improve your code even more by not calling grep at all.
Read my2.txt into a string instead and then use the re module to perform the search. If you do not need a regex at all, you can simply test if line in my2_content.
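A minimal sketch of that grep-free idea, assuming the whole text fits in memory. The helper name find_queries and its use_regex switch are made up for illustration, and the sample data stands in for the contents of my2.txt and query.txt:

```python
import re


def find_queries(queries, text, use_regex=False):
    """Return the queries that occur somewhere in text."""
    found = []
    for query in queries:
        query = query.rstrip("\n")
        if not query:
            continue  # skip blank lines; an empty query would match everything
        if use_regex:
            matched = re.search(query, text) is not None
        else:
            matched = query in text  # plain substring test, no regex involved
        if matched:
            found.append(query)
    return found


# Stand-ins for open('my2.txt').read() and the lines of query.txt.
my2_content = "One two\ntwo two\ntwo three\n"
queries = ["two two\n", "three\n", "four\n"]
print(find_queries(queries, my2_content))
```

With real files, my2_content would come from reading my2.txt once, and queries could simply be the open query.txt file object. Note that this reports which queries matched, not the matching lines the way grep does.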

Your code scans the whole my2.txt file once for each query in query.txt.
What you want to do instead:
read all queries into a list
iterate once over all lines of the text file and check each line against all queries.
Try this code:
with open('query.txt', 'r') as f:
    queries = [l.strip() for l in f]
with open('my2.txt', 'r') as f:
    for line in f:
        for query in queries:
            if query in line:
                print query, line

This isn't actually a good way to use Python, but if you have to do something like that, then do it correctly:
from __future__ import with_statement
import subprocess
def grep_lines(filename, query_filename):
    with open(query_filename, "rb") as myfile:
        for line in myfile:
            subprocess.call(["/bin/grep", line.strip(), filename])
grep_lines("my2.txt", "query.txt")
And hope that your file doesn't contain any characters which have special meanings in regular expressions =)
Also, you might be able to do this with grep alone:
grep -f query.txt my2.txt
It works like this:
~ $ cat my2.txt
One two
two two
two three
~ $ cat query.txt
two two
three
~ $ python bar.py
two two
two three

$ grep -wFf query.txt my2.txt > out.txt
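This matches every keyword in query.txt against my2.txt and saves the output in out.txt. Here -f reads the patterns from query.txt, -F treats them as fixed strings rather than regular expressions, and -w matches whole words only.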
Read man grep for a description of all the possible arguments.
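For comparison, here is a rough Python sketch of what grep -wF does. It is only an approximation of grep's whole-word rule (the lookarounds require that no word character touches the match), and the names compile_word_pattern and grep_wF are invented for this example:

```python
import re


def compile_word_pattern(queries):
    """Build one regex matching any query as a literal whole word or phrase."""
    escaped = [re.escape(q) for q in queries if q]
    if not escaped:
        return None
    # (?<!\w) and (?!\w) approximate grep -w; re.escape approximates grep -F.
    return re.compile(r"(?<!\w)(?:%s)(?!\w)" % "|".join(escaped))


def grep_wF(queries, lines):
    """Return the lines that contain at least one query as a whole word."""
    pattern = compile_word_pattern(queries)
    if pattern is None:
        return []
    return [line for line in lines if pattern.search(line)]


sample_lines = ["One two", "two two", "two three", "threefold"]
print(grep_wF(["two two", "three"], sample_lines))
```

The last sample line shows the effect of -w: "threefold" contains "three" but not as a whole word, so it is not reported.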

Related

How to run code with sys.stdin as input on multiple text files

I am using sys.stdin in my code, and I want to know how to test my code on multiple text files.
My code (test.py) is:
import sys
for line in sys.stdin:
    line = line.strip()
    words = line.split()
I am trying to test it on 2 text files, so I type in the terminal:
echo "test1.txt" "test2.txt" | test.py
but it does not work, so I just want to know how I can test the code on 2 text files.

echo "test1.txt" "test2.txt" | test.py
This won't actually run test.py; you need to invoke the interpreter instead:
echo "test1.txt" "test2.txt" | python test.py
However, another method for getting arguments into python would be:
import sys
for arg in sys.argv:
    print arg
Which when run like so:
python test.py "test1" "test2"
Produces the following output:
test.py
test1
test2
The first argument of argv is the name of the program. This can be skipped with:
import sys
for arg in sys.argv[1:]:
    print arg
A further problem you appear to be having is you're assuming that python is opening the text files you're handing it in the loop - this isn't true. If you print in the loop you'll see it's only printing the strings you gave it initially.
If you actually want to open and parse the files, do something like this in the loop:
import sys
args = sys.stdin.readlines()[0].replace("\"","").split()
for arg in args:
    arg = arg.strip()
    with open(arg, "r") as f:
        for line in f:
            line = line.strip()
            words = line.split()
The reason for that odd first line is that stdin is a stream, so we read it in via readlines().
The result is a list with a single element (because we only gave it one line), hence the [0].
Then we strip any literal quote characters. The shell already removes the quotes before echo sees them, so they aren't really required when piping; this would also work:
echo test1.txt test2.txt | python test.py
Finally, we have to split the string into the actual filenames.
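Another option worth considering is the standard library's fileinput module, which iterates over the lines of several files in turn. A minimal sketch, where split_words is a made-up helper name and the two throwaway files stand in for test1.txt and test2.txt:

```python
import fileinput
import os
import tempfile


def split_words(paths):
    """Yield (filename, words) for every line of every file in paths."""
    with fileinput.FileInput(files=paths) as stream:
        for line in stream:
            yield stream.filename(), line.split()


# Demo data: two small files created in a temporary directory.
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in [("test1.txt", "a b\n"), ("test2.txt", "c d e\n")]:
    path = os.path.join(tmpdir, name)
    with open(path, "w") as f:
        f.write(text)
    paths.append(path)

for filename, words in split_words(paths):
    print(os.path.basename(filename), words)
```

In a real script you would pass sys.argv[1:] as paths and run python test.py test1.txt test2.txt. Be aware that fileinput falls back to reading standard input when the list of paths is empty.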

Using Makefile bash to save the contents of a python file

For those who are curious as to why I'm doing this: I need specific files in a tarball - no more, no less. I have to write unit tests for make check, but since I'm constrained to having "no more" files, I have to write the check within the make check itself. That means I have to write bash (but I don't want to).
I dislike using bash for unit testing (sorry to all those who like bash; I just dislike it so much that I would rather go with an extremely hacky approach than write many lines of bash code), so I wrote a Python file. I later learned that I have to use bash because of some unknown strict rule. I figured there might be a way to store the entire content of the Python file as a single string in the bash file, so I could take the string literal in bash, write it to a Python file and then execute it.
I tried the following attempt (in the following script and result, I used another python file that's not unit_test.py, so don't worry if it doesn't actually look like a unit test):
toStr.py:
import re
with open("unit_test.py", 'r+') as f:
    s = f.read()
    s = s.replace("\n", "\\n")
    print(s)
And then I piped the results out using:
python toStr.py > temp.txt
It looked something like:
#!/usr/bin/env python\n\nimport os\nimport sys\n\n#create number of bytes as specified in the args:\nif len(sys.argv) != 3:\n print("We need a correct number of args : 2 [NUM_BYTES][FILE_NAME].")\n exit(1)\nn = -1\ntry:\n n = int(sys.argv[1])\nexcept:\n print("Error casting number : " + sys.argv[1])\n exit(1)\n\nrand_string = os.urandom(n)\n\nwith open(sys.argv[2], 'wb+') as f:\n f.write(rand_string)\n f.flush()\n f.close()\n\n
I tried taking this as a string literal and echoing it into a new file and see whether I could run it as a python file but it failed.
echo '{insert that giant string above here}' > new_unit_test.py
I wanted to take this statement above and copy it into my "bash unit test" file so I can just execute the python file within the bash script.
The resulting file looked exactly like {insert giant string here}. What am I doing wrong in my attempt? Are there other, much easier ways where I can hold a python file as a string literal in a bash script?

The easiest way is to use only double quotes in your Python code and then, in your bash script, wrap all of the Python code in one pair of single quotes, passing the script's arguments through with "$@", e.g.,
#!/bin/bash
python -c 'import os
import sys
# create number of bytes as specified in the args:
if len(sys.argv) != 3:
    print("We need a correct number of args : 2 [NUM_BYTES][FILE_NAME].")
    exit(1)
n = -1
try:
    n = int(sys.argv[1])
except:
    print("Error casting number : " + sys.argv[1])
    exit(1)
rand_string = os.urandom(n)
# the single quotes below were changed to double quotes -webb
with open(sys.argv[2], "wb+") as f:
    f.write(rand_string)
    f.flush()
    f.close()' "$@"

how to filter out lines between two timestamps in python

I have the following issue: I have a log file that I want to read line by line, but to reduce the number of lines I want to keep only the lines that fall between two timestamps.
Example in awk:
Find everything between two patterns: pattern1 = 2012-10-23 14, pattern2 = 2012-10-23 16
awk '/2012-10-23 14/{P=1;next}/2012-10-23 16/{exit} P' server.log
or with egrep and one pattern:
egrep "2012-10-23 (1[4-6]:[0-5][0-9])" server.log
The above awk line would give me only the lines between those two timestamps.
How can I do it in Python without executing any system command such as awk or grep, using only Python regular expressions?
Thanks in advance.

A one-to-one translation of your awk code:
with open('yourFile') as f:
    lines = f.read().splitlines()
p = 0  # becomes 1 once the start pattern has been seen
for l in lines:
    if l.startswith('2012-10-23 14'):
        p = 1
    elif l.startswith('2012-10-23 16'):
        p = 0
        break
    if p: print l
This starts the output at the first line starting with 2012-10-23 14 and stops printing at the first line starting with 2012-10-23 16 (much like your awk code).

I think that @Kent's post will work only if we assume that the timestamp is at the beginning of the line. Your awk/egrep code asks for something more generic.
The following code should work:
regardless of where the searched pattern is located within the line
regardless of whether the lines in the log are properly sorted (though that can safely be assumed ;-) )
as a non-blocking generator yielding the results as they are being processed, without unnecessary memory allocation
with a more generic code structure, in case you want to make further modifications.
import re
def log_lines(yourFile, regexp):
    rxp = re.compile(regexp)
    with open(yourFile) as f:
        for line in f:  # iterate lazily instead of loading every line with readlines()
            if rxp.search(line):
                yield line
for line in log_lines("yourFile", "2012-10-23 1[4-6]"):
    print line
Stay with Python, it is addictive ;-)
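If a regex over the hour digits gets awkward (for example a window from 14:30 to 16:10), one alternative is to parse the timestamps and compare them as datetime objects. This swaps the regex for datetime.strptime, so it is not the pure-regex solution asked for. It is a sketch that assumes every relevant line begins with a timestamp in the form YYYY-MM-DD HH:MM:SS, which the question does not actually confirm; lines_between and the sample log lines are made up:

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed log format, not taken from the question
TIMESTAMP_LENGTH = 19  # len("2012-10-23 14:00:00")


def lines_between(lines, start, end):
    """Yield lines whose leading timestamp satisfies start <= timestamp < end."""
    for line in lines:
        try:
            stamp = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # no parsable timestamp at the start of this line
        if start <= stamp < end:
            yield line


sample_log = [
    "2012-10-23 13:59:59 before the window",
    "2012-10-23 14:00:00 first line inside",
    "2012-10-23 15:30:12 still inside",
    "2012-10-23 16:00:00 after the window",
    "a continuation line without a timestamp",
]
start = datetime(2012, 10, 23, 14)
end = datetime(2012, 10, 23, 16)
for line in lines_between(sample_log, start, end):
    print(line)
```

Because lines_between accepts any iterable of lines, an open file object can be passed in directly, so the log is still read lazily.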

Pythonic way to send contents of a file to a pipe and count # lines in a single step

Given the > 4 GB file myfile.gz, I need to zcat it into a pipe for consumption by Teradata's fastload. I also need to count the number of lines in the file. Ideally, I only want to make a single pass through the file. I use awk to output the entire line ($0) to stdout and, in awk's END clause, write the number of rows (awk's NR variable) to another file (outfile).
I've managed to do this using awk but I'd like to know if a more pythonic way exists.
#!/usr/bin/env python
from subprocess import Popen, PIPE
from os import path
the_file = "/path/to/file/myfile.gz"
outfile = "/tmp/%s.count" % path.basename(the_file)
cmd = ["-c",'zcat %s | awk \'{print $0} END {print NR > "%s"} \' ' % (the_file, outfile)]
zcat_proc = Popen(cmd, stdout = PIPE, shell=True)
The pipe is later consumed by a call to teradata's fastload, which reads from
"/dev/fd/" + str(zcat_proc.stdout.fileno())
This works, but I'd like to know if it's possible to skip awk and take better advantage of Python. I'm also open to other methods. I have multiple large files that I need to process in this manner.

There's no need for either of zcat or Awk. Counting the lines in a gzipped file can be done with
import gzip
nlines = sum(1 for ln in gzip.open("/path/to/file/myfile.gz"))
If you want to do something else with the lines, such as pass them to a different process, do
nlines = 0
for ln in gzip.open("/path/to/file/myfile.gz"):
    nlines += 1
    # pass the line to the other process
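To make the "pass the line to the other process" step concrete, here is a sketch that feeds the decompressed lines to a child process over its stdin while counting them in the same pass. The function name stream_and_count is invented, and the demo consumer is just a Python one-liner that drains its input; whether fastload itself can read from a pipe is a separate question:

```python
import gzip
import os
import subprocess
import sys
import tempfile


def stream_and_count(gz_path, consumer_cmd):
    """Decompress gz_path into consumer_cmd's stdin and return the line count."""
    consumer = subprocess.Popen(consumer_cmd, stdin=subprocess.PIPE)
    nlines = 0
    with gzip.open(gz_path, "rb") as source:
        for line in source:
            consumer.stdin.write(line)
            nlines += 1
    consumer.stdin.close()  # signal end of input to the child
    consumer.wait()
    return nlines


# Demo: a tiny gzip file and a stand-in consumer that only reads its stdin.
demo_path = os.path.join(tempfile.mkdtemp(), "myfile.gz")
with gzip.open(demo_path, "wb") as f:
    f.write(b"row 1\nrow 2\nrow 3\n")
drain = [sys.executable, "-c", "import sys; sys.stdin.read()"]
print(stream_and_count(demo_path, drain))
```

Only one line is held in memory at a time, so the size of the file should not matter much. A consumer that exits early would make the write fail with a broken pipe, which this sketch does not handle.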

Counting lines and unzipping gzip-compressed files can be easily done with Python and its standard library. You can do everything in a single pass:
import gzip, subprocess, os
fifo_path = "path/to/fastload-fifo"
os.mkfifo(fifo_path)
# start the reader first: opening a FIFO for writing blocks until a reader opens it
fastload = subprocess.Popen(["fastload", "--read-from", fifo_path])
with open(fifo_path, "wb") as fastload_fifo:
    with gzip.open("/path/to/file/myfile.gz") as f:
        for i, line in enumerate(f):
            fastload_fifo.write(line)
fastload.wait()
print "Number of lines", i + 1
os.unlink(fifo_path)
I don't know how to invoke Fastload -- substitute the correct parameters in the invocation.

This can be done in one simple line of bash:
zcat myfile.gz | tee >(wc -l >&2) | fastload
This will print the line count on stderr. If you want it somewhere else you can redirect the wc output however you like.

Actually, it should not be possible to pipe the data to Fastload at all, so it would be great if somebody who has managed it could post an exact example here.
From the Teradata documentation on the Fastload configuration: http://www.info.teradata.com/htmlpubs/DB_TTU_14_00/index.html#page/Load_and_Unload_Utilities/B035_2411_071A/2411Ch03.026.028.html#ww1938556
FILE=filename
Keyword phrase specifying the name of the data source that contains the input data. fileid must refer to a regular file. Specifically, pipes are not supported.

How can I detect DOS line breaks in a file?

I have a bunch of files. Some have Unix line endings, many have DOS ones. I'd like to test each file to see if it is DOS formatted before I switch the line endings.
How would I do this? Is there a flag I can test for? Something similar?

Python can automatically detect what newline convention is used in a file, thanks to the "universal newline mode" (U), and you can access Python's guess through the newlines attribute of file objects:
f = open('myfile.txt', 'U')
f.readline() # Reads a line
# The following now contains the newline ending of the first line:
# It can be "\r\n" (Windows), "\n" (Unix), "\r" (Mac OS pre-OS X).
# If no newline is found, it contains None.
print repr(f.newlines)
This gives the newline ending of the first line (Unix, DOS, etc.), if any.
As John M. pointed out, if by any chance you have a pathological file that uses more than one newline coding, f.newlines is a tuple with all the newline codings found so far, after reading many lines.
Reference: http://docs.python.org/2/library/functions.html#open
If you just want to convert a file, you can simply do:
with open('myfile.txt', 'U') as infile:
    text = infile.read()  # Automatic ("Universal read") conversion of newlines to "\n"
with open('myfile.txt', 'w') as outfile:
    outfile.write(text)  # Writes newlines for the platform running the program
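In Python 3 the 'U' flag is no longer needed, because universal newlines is the default for files opened in text mode, and the newlines attribute is still available. A small sketch of the same check (detect_newlines is a made-up name, and the two demo files are created just for the example):

```python
import os
import tempfile


def detect_newlines(path):
    """Return the newline style(s) seen in path: None, a string, or a tuple."""
    with open(path, "r", newline=None) as f:
        for _ in f:  # read everything so mixed endings are noticed too
            pass
        return f.newlines


# Demo files written in binary mode so the endings are exactly as given.
demo_dir = tempfile.mkdtemp()
dos_path = os.path.join(demo_dir, "dos.txt")
unix_path = os.path.join(demo_dir, "unix.txt")
with open(dos_path, "wb") as f:
    f.write(b"one\r\ntwo\r\n")
with open(unix_path, "wb") as f:
    f.write(b"one\ntwo\n")

print(repr(detect_newlines(dos_path)))
print(repr(detect_newlines(unix_path)))
```

A file is DOS formatted when the result is "\r\n"; a tuple means the file mixes several conventions.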

You could search the string for \r\n. That's DOS style line ending.
EDIT: Take a look at this

(Python 2 only:) If you just want to read text files, either DOS or Unix-formatted, this works:
print open('myfile.txt', 'U').read()
That is, Python's "universal" file reader will automatically recognize all the different end-of-line markers, translating them to "\n".
http://docs.python.org/library/functions.html#open
(Thanks handle!)

As a complete Python newbie & just for fun, I tried to find some minimalistic way of checking this for one file. This seems to work:
if "\r\n" in open("/path/file.txt","rb").read():
    print "DOS line endings found"
Edit: simplified as per John Machin's comment (no need to use regular expressions).

DOS line breaks are \r\n, Unix ones only \n. So just search for \r\n.

Using grep & bash:
grep -c -m 1 $'\r$' file
echo $'\r\n\r\n' | grep -c $'\r$' # test
echo $'\r\n\r\n' | grep -c -m 1 $'\r$'

You can use the following function (which should work in Python 2 and Python 3) to get the newline representation used in an existing text file. All three possible kinds are recognized. The function reads the file only up to the first newline to decide. This is faster and less memory consuming when you have larger text files, but it does not detect mixed newline endings.
In Python 3, you can then pass the output of this function to the newline parameter of the open function when writing the file. This way you can alter the content of a text file without changing its newline representation.
def get_newline(filename):
    with open(filename, "rb") as f:
        while True:
            c = f.read(1)
            if not c or c == b'\n':
                break
            if c == b'\r':
                if f.read(1) == b'\n':
                    return '\r\n'
                return '\r'
    return '\n'
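As an illustration of passing the result to the newline parameter, the sketch below (Python 3 only) appends a line to a file while keeping its existing line endings. It repeats get_newline so that it runs on its own; append_line and the demo file are made up for the example:

```python
import os
import tempfile


def get_newline(filename):
    """Same detection as above: look only at the first line ending."""
    with open(filename, "rb") as f:
        while True:
            c = f.read(1)
            if not c or c == b'\n':
                break
            if c == b'\r':
                if f.read(1) == b'\n':
                    return '\r\n'
                return '\r'
    return '\n'


def append_line(filename, text):
    """Append text as a new line, reusing the file's existing newline style."""
    newline = get_newline(filename)
    with open(filename, "a", newline=newline) as f:
        f.write(text + "\n")  # "\n" is translated to `newline` on write


# Demo: a DOS-formatted file keeps its "\r\n" endings after the append.
demo_file = os.path.join(tempfile.mkdtemp(), "dos.txt")
with open(demo_file, "wb") as f:
    f.write(b"first\r\n")
append_line(demo_file, "second")
with open(demo_file, "rb") as f:
    print(repr(f.read()))
```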
