I am trying to call gawk (the GNU implementation of AWK) from Python in this manner.
import os
import string
import codecs
ligand_file=open( "2WTKA_ab.txt", "r" ) #Open the receptor.txt file
ligand_lines=ligand_file.readlines() # Read all the lines into the array
ligand_lines=map( string.strip, ligand_lines )
ligand_file.close()
for i in ligand_lines:
    os.system ( " gawk %s %s"%( "'{if ($2==""i"") print $0}'", 'unique_count_a_from_ac.txt' ) )
My problem is that "i" is not being replaced by the value it represents. The value "i" represents is an integer, not a string. How can I fix this problem?
That's a non-portable and messy way to check whether something is in a file. Imagine you have 1000 lines: you would be making a system call to gawk 1000 times, which is hugely inefficient. You are already using Python, so do it in Python.
....
ligand_file=open( "2WTKA_ab.txt", "r" ) #Open the receptor.txt file
ligand_lines=ligand_file.readlines() # Read all the lines into the array
ligand_lines=map( str.strip, ligand_lines )
ligand_file.close()
for line in open("unique_count_a_from_ac.txt"):
    sline = line.strip().split()
    if sline[1] in ligand_lines:
        print line.rstrip()
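As a side note, the "in" test against a list scans it linearly, so with many lines it pays to build a set first; membership tests on a set are constant-time. A minimal sketch of that variant:

ligand_set = set(ligand_lines)  # set membership is O(1) instead of O(n)
for line in open("unique_count_a_from_ac.txt"):
    sline = line.strip().split()
    if sline[1] in ligand_set:
        print line.rstrip()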
Or, if Python is not a must, you can use this awk one-liner:
gawk 'FNR==NR{a[$0]; next}($2 in a)' 2WTKA_ab.txt unique_count_a_from_ac.txt
Your problem is in the quoting: in Python, adjacent string literals are concatenated, so something like "some test "" with quotes" will not give you a quote character. Try this instead:
os.system('''gawk '{if ($2=="%s") print $0}' unique_count_a_from_ac.txt''' % i)
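If you would rather avoid shell quoting altogether, you can pass gawk an argument list through subprocess instead of building a command string. A minimal sketch (assuming gawk is on your PATH):

import subprocess

for i in ligand_lines:
    # no shell is involved, so there are no quoting headaches; the awk
    # program is passed as a single argument with the value of i substituted in
    subprocess.call(["gawk", '{if ($2=="%s") print $0}' % i, "unique_count_a_from_ac.txt"])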
For those who are curious as to why I'm doing this: I need specific files in a tarball - no more, no less. I have to write unit tests for make check, but since I'm constrained to having "no more" files, I have to write the check within the make check itself. That means I have to write bash (but I don't want to).
I dislike using bash for unit testing (sorry to all those who like bash; I just dislike it so much that I would rather go with an extremely hacky approach than write many lines of bash code), so I wrote a Python file. I later learned that I have to use bash because of some unknown strict rule. I figured there must be a way to embed the entire content of the Python file as a single string in the bash file, so I could take that string literal in bash, write it to a Python file, and then execute it.
I tried the following attempt (in the following script and result, I used another python file that's not unit_test.py, so don't worry if it doesn't actually look like a unit test):
toStr.py:
with open("unit_test.py", 'r') as f:
    s = f.read()
    s = s.replace("\n", "\\n")
    print(s)
And then I piped the results out using:
python toStr.py > temp.txt
It looked something like:
#!/usr/bin/env python\n\nimport os\nimport sys\n\n#create number of bytes as specified in the args:\nif len(sys.argv) != 3:\n print("We need a correct number of args : 2 [NUM_BYTES][FILE_NAME].")\n exit(1)\nn = -1\ntry:\n n = int(sys.argv[1])\nexcept:\n print("Error casting number : " + sys.argv[1])\n exit(1)\n\nrand_string = os.urandom(n)\n\nwith open(sys.argv[2], 'wb+') as f:\n f.write(rand_string)\n f.flush()\n f.close()\n\n
I tried taking this as a string literal and echoing it into a new file to see whether I could run it as a Python file, but it failed.
echo '{insert that giant string above here}' > new_unit_test.py
I wanted to take the statement above and copy it into my "bash unit test" file so I could just execute the Python file from within the bash script.
The resulting file looked exactly like {insert giant string here}, escapes and all. What am I doing wrong in my attempt? Are there other, much easier ways to hold a Python file as a string literal in a bash script?
The easiest way is to use only double quotes in your Python code; then, in your bash script, wrap all of the Python code in one pair of single quotes, e.g.:
#!/bin/bash
python -c 'import os
import sys

# create number of bytes as specified in the args:
if len(sys.argv) != 3:
    print("We need a correct number of args : 2 [NUM_BYTES][FILE_NAME].")
    exit(1)

n = -1
try:
    n = int(sys.argv[1])
except:
    print("Error casting number : " + sys.argv[1])
    exit(1)

rand_string = os.urandom(n)

# note: single quotes were changed to double quotes below -webb
with open(sys.argv[2], "wb+") as f:
    f.write(rand_string)
    f.flush()
    f.close()'
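Another common approach, sketched below, is to feed the script to the interpreter on standard input with a quoted heredoc. Quoting the EOF delimiter stops bash from expanding anything inside, so the Python code can freely use both quote styles (forwarding the script's arguments via "$@" is an assumption about how you invoke it):

#!/bin/bash
# the quoted 'EOF' delimiter keeps bash from touching the body,
# and "$@" forwards this script's arguments to the Python program
python - "$@" <<'EOF'
import sys
print("args: " + repr(sys.argv[1:]))
EOF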
I have the following lines
92520536843;Sof_voya_Faible_Email_am;EMAIL;28/01/2015;1;0;0;Sof_voya_Faible_Email_am;30/01/2015;Sof_voya_Faible_Email_Relance_am
92515196529;Sof_trav_Fort_Email_pm_%yyyy%mm%dd%;EMAIL;05/02/2015;1;0;0;Sof_trav_Fort_Email_pm_%yyyy%mm%dd%;09/02/2015;Export Trav_Fort Postal
I'm trying to replace strings like Sof_ or _%yyyy%mm%dd% after the 7th field.
I thought about using sed
sed -i 's/<string_to_look_for>/<string_to_replace>/7g' filename
But it is only changing the field delimiter.
I thought about using this
awk -F";" '{ for (i=7; i<=NF; i++) print $i }' filename
but I don't know how to insert a search and replace for the strings I want to replace.
Any help is welcome.
Edit: expected outcome after replacing strings like Sof_ or _%yyyy%mm%dd% after the 7th column:
92520536843;Sof_voya_Faible_Email_am;EMAIL;28/01/2015;1;0;0;voya_Faible_Email_am;30/01/2015;voya_Faible_Email_Relance_am
92515196529;Sof_trav_Fort_Email_pm_%yyyy%mm%dd%;EMAIL;05/02/2015;1;0;0;trav_Fort_Email_pm;09/02/2015;Export Trav_Fort Postal
To the Python and Perl gurus: as I'm trying to ramp up my knowledge in these languages, your help is welcome. :)
You can use this awk, which sets the input and output field separators to ";", strips both patterns from fields 7 through NF, and prints each rebuilt line (that is what the trailing 1 does):
awk 'BEGIN{FS=OFS=";"} {for (i=7;i<=NF;i++) gsub(/Sof_|_%yyyy%mm%dd%/, "", $i) } 1' file
92520536843;Sof_voya_Faible_Email_am;EMAIL;28/01/2015;1;0;0;voya_Faible_Email_am;30/01/2015;voya_Faible_Email_Relance_am
92515196529;Sof_trav_Fort_Email_pm_%yyyy%mm%dd%;EMAIL;05/02/2015;1;0;0;trav_Fort_Email_pm;09/02/2015;Export Trav_Fort Postal
Through Python 3:
#!/usr/bin/python3
import sys

fil = sys.argv[1]
with open(fil) as f:
    for line in f:
        part1 = ';'.join(line.split(';')[:7])
        part2 = ';'.join(line.split(';')[7:]).replace('Sof_', '').replace('_%yyyy%mm%dd%', '')
        print(part1 + ';' + part2, end="")
Save the above in a file, say script.py, and then run it with:
python3 script.py inputfile
Through Perl:
$ perl -pe 's/^(?:[^;]*;){7}(*SKIP)(*F)|(?:_%yyyy%mm%dd%|Sof_)//g' file
92520536843;Sof_voya_Faible_Email_am;EMAIL;28/01/2015;1;0;0;voya_Faible_Email_am;30/01/2015;voya_Faible_Email_Relance_am
92515196529;Sof_trav_Fort_Email_pm_%yyyy%mm%dd%;EMAIL;05/02/2015;1;0;0;trav_Fort_Email_pm;09/02/2015;Export Trav_Fort Postal
In Python you would use the re and csv modules to do this:
import re
import csv

with open(fn) as fin:
    r = csv.reader(fin, delimiter=';')
    for line in r:
        result = line[:7]
        for field in line[7:]:
            field = re.sub(r'Sof_', 'replacement for Sof_', field)
            field = re.sub(r'_%yyyy%mm%dd%', 'replacement for _%yyyy%mm%dd%', field)
            result.append(field)
        print ';'.join(result)
This might work for you (GNU sed):
sed -r ':a;s/^(([^;]*;){7}.*)(Sof_|_%yyyy%mm%dd%)/\1/;ta' file
This stores the first seven fields and following strings (that do not match the required strings) in the first backreference, then replaces the required strings by the said backreference.
Assuming you want the whole line from the input file; note that this starts at field 7, while your data also appears earlier in each line.
awk -F";" '{ for (i=7; i<=NF; i++)
{gsub(/Sof_/,"newstring", ($i) } ;
print $0} ' filename
will replace Sof_ with "newstring". I'm not positive this is what you are looking for.
Corrected syntax error - removed errant character - thanks.
Here is another way using perl's -F -a and autosplit:
perl -F";" -anE 'for ( #F[7..$#F] ) { $_ =~ s/Sof_|_%yyyy%mm%dd%//g }
print join ";", #F;' file.txt
This grabs elements 7 to last ($#F) of the autocreated @F array and removes/substitutes the text.
There is a file (query.txt) which has some keywords/phrases which are to be matched with other files using grep. The last three lines of the following code work perfectly, but when the same command is used inside the while loop it goes into an infinite loop or something (i.e., it doesn't respond).
import os
f=open('query.txt','r')
b=f.readline()
while b:
    cmd = 'grep %s my2.txt' % b  # my2 is the file in which we are looking for b
    os.system(cmd)
    b = f.readline()
f.close()
a='He is'
cmd='grep %s my2.txt'%a
os.system(cmd)
First of all, you are not iterating over the file properly. You can simply use for b in f: without the .readline() stuff.
Then, your code will blow up in your face as soon as a query line contains any characters which have a special meaning in the shell. Use subprocess.call instead of os.system() and pass an argument list.
Here's a fixed version:
import subprocess

with open('query.txt', 'r') as f:
    for line in f:
        line = line.rstrip()  # remove trailing whitespace such as '\n'
        subprocess.call(['/bin/grep', line, 'my2.txt'])
However, you can improve your code even more by not calling grep at all.
Read my2.txt into a string instead and then use the re module to perform the search. In case you do not need a regex at all, you can simply use if line in my2_content.
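For instance, a minimal sketch of that idea (my2_content is simply the whole file read into one string; re.escape guards against query lines that contain regex metacharacters):

import re

with open('my2.txt') as f:
    my2_content = f.read()

with open('query.txt') as f:
    for line in f:
        line = line.rstrip()
        if re.search(re.escape(line), my2_content):
            print line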
Your code scans the whole my2.txt file for each query in query.txt.
You want to:
read all queries into a list
iterate once over all lines of the text file and check each line against all queries.
Try this code:
with open('query.txt', 'r') as f:
    queries = [l.strip() for l in f]

with open('my2.txt', 'r') as f:
    for line in f:
        for query in queries:
            if query in line:
                print query, line
This isn't actually a good way to use Python, but if you have to do something like that, then do it correctly:
from __future__ import with_statement
import subprocess
def grep_lines(filename, query_filename):
    with open(query_filename, "rb") as myfile:
        for line in myfile:
            subprocess.call(["/bin/grep", line.strip(), filename])

grep_lines("my2.txt", "query.txt")
And hope that your file doesn't contain any characters which have special meanings in regular expressions =)
Also, you might be able to do this with grep alone:
grep -f query.txt my2.txt
It works like this:
~ $ cat my2.txt
One two
two two
two three
~ $ cat query.txt
two two
three
~ $ python bar.py
two two
two three
$ grep -wFf query.txt my2.txt > out.txt
This will match all the keywords in query.txt against my2.txt and save the output in out.txt: -f reads the patterns from a file, -F treats them as fixed strings rather than regexes, and -w matches whole words only.
Read man grep for a description of all the possible arguments.
I have the following lines (many, many):
...
gfnfgnfgnf: 5656756734
arvervfdsa: 1343453563
particular: 4685685685
erveveersd: 3453454545
verveversf: 7896789567
..
What I'd like to do is to find the line starting with 'particular' (whatever number is after the ':')
and replace that number with '111222333'. How can I do that using Python regular expressions?
for line in input:
    key, val = line.split(':')
    if key == 'particular':
        val = '111222333'
I'm not sure regex would be of any value in this specific case. My guess is they'd be slower. That said, it can be done. Here's one way:
import re

for line in input:
    line = re.sub(r'^particular: .*', 'particular: 111222333', line)
There are subtleties involved in this, and this is almost certainly not what you'd want in production code. You need to check all of the re module constants to make sure the regex is acting the way you expect, etc. You might be surprised at the flexibility you find in dealing with problems like this in Python if you try not to use re (of course, this isn't to say re isn't useful) ;-)
Are you sure you need a regular expression?
other_number = '111222333'
some_text, some_number = line.split(': ')
new_line = ': '.join([some_text, other_number])
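Applied over a whole file, that could look something like the sketch below (the filename data.txt is a placeholder; only the line whose key is 'particular' is rewritten):

other_number = '111222333'
for line in open('data.txt'):
    line = line.rstrip()
    some_text, some_number = line.split(': ')
    if some_text == 'particular':
        line = ': '.join([some_text, other_number])
    print line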
#!/usr/bin/env python
import re
text = '''gfnfgnfgnf: 5656756734
arvervfdsa: 1343453563
particular: 4685685685
erveveersd: 3453454545
verveversf: 7896789567'''
print(re.sub(r'(particular: )[0-9]+', r'\g<1>111222333', text))
input = """gfnfgnfgnf: 5656756734
arvervfdsa: 1343453563
particular: 4685685685
erveveersd: 3453454545
verveversf: 7896789567"""
entries = re.split("\n+", input)
for entry in entries:
    if entry.startswith("particular"):
        entry = re.sub(r'[0-9]+', r'111222333', entry)
or with sed:
sed -e 's/^particular: [0-9].*$/particular: 111222333/g' file
An important point here is that if you have a lot of lines, you want to process them one by one. That is, instead of reading in all the lines, replacing them, and writing them out again, you should read a line at a time and write a line at a time. (This would be inefficient if you were actually reading one line at a time from the disk; however, Python's IO is competent and will buffer the file for you.)
with open(...) as infile, open(...) as outfile:
    for line in infile:
        if line.startswith("particular"):
            outfile.write("particular: 111222333\n")
        else:
            outfile.write(line)
This will be speed- and memory-efficient.
Your sed example forces me to say neat!
python -c "import re, sys; print ''.join(re.sub(r'^(particular:) \d+', r'\1 111222333', l) for l in open(sys.argv[1]))" file
I have a bunch of files. Some have Unix line endings, many have DOS ones. I'd like to test each file to see whether it is DOS-formatted before I switch the line endings.
How would I do this? Is there a flag I can test for? Something similar?
Python can automatically detect what newline convention is used in a file, thanks to the "universal newline mode" (U), and you can access Python's guess through the newlines attribute of file objects:
f = open('myfile.txt', 'U')
f.readline() # Reads a line
# The following now contains the newline ending of the first line:
# It can be "\r\n" (Windows), "\n" (Unix), "\r" (Mac OS pre-OS X).
# If no newline is found, it contains None.
print repr(f.newlines)
This gives the newline ending of the first line (Unix, DOS, etc.), if any.
As John M. pointed out, if by any chance you have a pathological file that uses more than one newline coding, f.newlines is a tuple with all the newline codings found so far, after reading many lines.
Reference: http://docs.python.org/2/library/functions.html#open
If you just want to convert a file, you can simply do:
with open('myfile.txt', 'U') as infile:
    text = infile.read()  # Automatic ("universal read") conversion of newlines to "\n"
with open('myfile.txt', 'w') as outfile:
    outfile.write(text)  # Writes the newlines for the platform running the program
You could search the string for \r\n. That's DOS style line ending.
EDIT: Take a look at this
(Python 2 only:) If you just want to read text files, either DOS or Unix-formatted, this works:
print open('myfile.txt', 'U').read()
That is, Python's "universal" file reader will automatically use all the different end of line markers, translating them to "\n".
http://docs.python.org/library/functions.html#open
(Thanks handle!)
As a complete Python newbie & just for fun, I tried to find some minimalistic way of checking this for one file. This seems to work:
if "\r\n" in open("/path/file.txt","rb").read():
print "DOS line endings found"
Edit: simplified as per John Machin's comment (no need to use regular expressions).
DOS line breaks are \r\n; Unix uses only \n. So just search for \r\n.
Using grep & bash:
grep -c -m 1 $'\r$' file
echo $'\r\n\r\n' | grep -c $'\r$' # test
echo $'\r\n\r\n' | grep -c -m 1 $'\r$'
You can use the following function (which should work in Python 2 and Python 3) to get the newline representation used in an existing text file. All three possible kinds are recognized. The function reads the file only up to the first newline to decide, which is faster and uses less memory on larger text files, but it does not detect mixed newline endings.
In Python 3, you can then pass the output of this function to the newline parameter of the open function when writing the file. This way you can alter the content of a text file without changing its newline representation.
def get_newline(filename):
    with open(filename, "rb") as f:
        while True:
            c = f.read(1)
            if not c or c == b'\n':
                break
            if c == b'\r':
                if f.read(1) == b'\n':
                    return '\r\n'
                return '\r'
    return '\n'
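For example, to rewrite a file without changing its newline convention, you could read it with universal newlines and write it back with the detected ending. A sketch (Python 3; assumes the file fits in memory):

newline = get_newline("myfile.txt")
with open("myfile.txt", "r") as f:
    text = f.read()  # universal newlines: every ending becomes "\n" in memory
with open("myfile.txt", "w", newline=newline) as f:
    f.write(text)    # each "\n" is written back as the detected ending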