Use grep on file in Python

I have searched the grep answers on here and cannot find an answer. They all seem to search for a single string in a file, not a list of strings from a file. I already have a search function that works, but grep does it WAY faster. I have a list of strings in a file sn.txt (one string per line, no delimiters). I want to search another file (Merge_EXP.exp) for lines that match any of them and write those lines out to a new file. The file I am searching has half a million lines, so searching it for a few thousand strings takes hours without grep.
When I run it from command prompt in windows, it does it in minutes:
grep --file=sn.txt Merge_EXP.exp > Merge_EXP_Out.exp
How can I call this same process from Python? I don't really want alternatives in Python because I already have one that works but takes a while. Unless you think you can significantly improve the performance of that:
def match_SN(serialnumb, Exp_Merge, output_exp):
    fout = open(output_exp, 'a')
    f = open(Exp_Merge, 'r')
    # skip first line
    f.readline()
    for record in f:
        record = record.strip().rstrip('\n')
        if serialnumb in record:
            fout.write(record + '\n')
    f.close()
    fout.close()
def main(Output_CSV, Exp_Merge, updated_exp):
    # create a blank output
    fout = open(updated_exp, 'w')
    # copy header records
    f = open(Exp_Merge, 'r')
    header1 = f.readline()
    fout.write(header1)
    header2 = f.readline()
    fout.write(header2)
    fout.close()
    f.close()
    f_csv = open(Output_CSV, 'r')
    f_csv.readline()
    for rec in f_csv:
        rec_list = rec.split(",")
        sn = rec_list[2]
        sn = sn.strip().rstrip('\n')
        match_SN(sn, Exp_Merge, updated_exp)

Here is an optimized version in pure Python:
def main(Output_CSV, Exp_Merge, updated_exp):
    output_list = []
    # copy header records
    records = open(Exp_Merge, 'r').readlines()
    output_list = records[0:2]
    serials = open(Output_CSV, 'r').readlines()
    serials = [x.split(",")[2].strip().rstrip('\n') for x in serials]
    for s in serials:
        items = [x for x in records if s in x]
        output_list.extend(items)
    open(updated_exp, "w").write("".join(output_list))

main("sn.txt", "merge_exp.exp", "outx.txt")
Input
sn.txt:
x,y,0011
x,y,0002
merge_exp.exp:
Header1
Header2
0011abc
0011bcd
5000n
5600m
6530j
0034k
2000lg
0002gg
Output
Header1
Header2
0011abc
0011bcd
0002gg
Try this out and see how much time it takes...
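If that is still too slow, a single pass over the big file with one combined regular expression (closer in spirit to what grep does with --file) avoids rescanning the records once per serial. This is only a sketch; it assumes the same sn.txt layout as the example above, with the serial in the third comma-separated field:
import re

def filter_records(serial_csv, exp_file, out_file):
    # Pull the serial out of each line (third comma-separated field,
    # mirroring the split in the code above) and build one combined
    # pattern; re.escape makes each serial match literally.
    with open(serial_csv) as f:
        serials = [line.split(",")[2].strip() for line in f if line.strip()]
    pattern = re.compile("|".join(re.escape(s) for s in serials))
    with open(exp_file) as fin, open(out_file, "w") as fout:
        # copy the two header lines unchanged
        fout.write(fin.readline())
        fout.write(fin.readline())
        # single pass: keep every record that contains any serial
        for record in fin:
            if pattern.search(record):
                fout.write(record)

filter_records("sn.txt", "merge_exp.exp", "outx.txt")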

When I use the full path to the grep executable it works (I pass it grep_loc, Serial_List, and Export):
import os

Export_Dir = os.path.dirname(Export)
Export_Name = os.path.basename(Export)
# os.path.join avoids relying on "\O" not being a string escape
Output = os.path.join(Export_Dir, "Output_" + Export_Name)
print "\nOutput: " + Output + "\n"
cmd = grep_loc + " --file=" + Serial_List + " " + Export + " > " + Output
print "grep usage: \n" + cmd + "\n"
os.system(cmd)
print "Output created\n"
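If the shell string gives you trouble (spaces in paths, quoting), the subprocess module can run grep directly and let Python handle the redirection. A minimal sketch using the same variables as above:
import subprocess

# Sketch: no shell string to build; grep_loc, Serial_List and
# Export are the same variables as in the snippet above.
with open(Output, "w") as out:
    subprocess.call([grep_loc, "--file=" + Serial_List, Export], stdout=out)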

I think you have not chosen the right title for your question: What you want to do is the equivalent of a database JOIN. You can use grep for that in this particular instance, because one of your files only has keys and no other information. However, I think it is likely (but of course I don't know your case) that in the future your sn.txt may also contain extra information.
So I would solve the generic case. There are multiple solutions:
import all data into a database, then do a LEFT JOIN (in SQL) or equivalent
use a Python large-data tool
For the latter, you could try numpy or, recommended because you are working with strings, pandas. Pandas has an optimized merge routine, which is very fast in my experience (it uses Cython under the hood).
Here is pandas PSEUDO code to solve your problem. It is close to real code, but I would need to know the names of the columns you want to match on. I assumed here that the one column in sn.txt is called key, and the matching column in merge_exp.exp is called sn. I also see you have two header lines in merge_exp.exp; read the docs for that (and see the sketch after the code).
# PSEUDO CODE (but close)
import pandas
left = pandas.read_csv('sn.txt')
right = pandas.read_csv('merge_exp.exp')
out = pandas.merge(left, right, left_on="key", right_on="sn", how='left')
out.to_csv("outx.txt")
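For completeness, a sketch of how those two header lines might be handled; the column names (key, sn) are assumptions, not taken from your files:
import pandas

# Sketch with assumed column names: sn.txt holds only keys, and the
# two header lines of merge_exp.exp are discarded with skiprows.
# Caveat: merge matches keys exactly, unlike grep's substring search.
left = pandas.read_csv("sn.txt", header=None, names=["key"])
right = pandas.read_csv("merge_exp.exp", skiprows=2, header=None, names=["sn"])
out = pandas.merge(left, right, left_on="key", right_on="sn", how="left")
out.to_csv("outx.txt", index=False)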

Related

Python print and write output end in ". . ." rather than the complete line

I have tried moving around the strings and variables I am concatenating, using while loops, moving the line and method where I open the outfile, etc. No matter what I do, my output prints/writes "curl" plus my url variable, and from there it ends in "..." ex: curl "https://examplesite/...
Does this have something to do with a buffer or slicing problem? Thank you for any and all help. Full code below.
import pandas as pd

# file = open("output.txt","wt")
header_list = ["COLA", "COLB"]
df = pd.read_csv("curl_data.csv", names=header_list)
df_length = len(df)
iterator = 0
with open("output.txt", "w") as file:
    for row in df.iterrows():
        url = '"https://examplesite'
        lic = df.COLA  # use %20 instead of spaces
        name = df.COLB  # use %20 instead of spaces
        group = "example group"  # use %20 instead of spaces
        command = "curl " + url + "license=" + lic + "&name=" + name + "&group=" + group + '"'
        print(command)
        file.write(str(command))
        iterator += 1
        if iterator == 1:
            break
file.close()
Solved. As Imre Kerr suggested in the comments, the problem was with the length of the output.
I changed my for loop to for i in range(len(df)): so it only loops through the dataframe once (as per Barmar's suggestion) and changed the column references in my code from df.COLA to df.loc[i, "COLA"] so that it did not print the whole column every time. This fixed the problem of the lines being too long, and thus I was able to see the full line for each outputted string.
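For reference, a minimal sketch of that fix (the URL is truncated in the question, so the same placeholder is kept here):
import pandas as pd

# Sketch of the fix: df.loc[i, "COLA"] yields one value per row
# instead of the whole column, so each output line stays short.
header_list = ["COLA", "COLB"]
df = pd.read_csv("curl_data.csv", names=header_list)
with open("output.txt", "w") as out:
    for i in range(len(df)):
        url = '"https://examplesite'  # placeholder, truncated in the question
        lic = str(df.loc[i, "COLA"])
        name = str(df.loc[i, "COLB"])
        command = "curl " + url + "license=" + lic + "&name=" + name + "&group=example group" + '"'
        print(command)
        out.write(command + "\n")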

Drawing multiple sequences from 1 file, based on shared fields in another file

I'm trying to run a Python script to draw sequences from one file (merged.fas), with respect to a list (gene_fams_eggnog.txt) I have as output from another program.
The code is as follows:
from Bio import SeqIO
import os, sys, re
from collections import defaultdict

sequences = "merged.fas"
all_seqs = SeqIO.index(sequences, "fasta")
gene_fams = defaultdict(list)
gene_fams_file = open("gene_fams_eggnog.txt")
for line in gene_fams_file:
    fields = re.split("\t", line.rstrip())
    gene_fams[fields[0]].append[fields[1]]
for fam in gene_fams.keys():
    output_filename = str(fam) + ".fasta"
    outh = open(output_filename, "w")
    for id in gene_fams[fam]:
        if id in all_seqs:
            outh.write(">" + all_seqs[id].description + "\n" + str(all_seqs[id].seq) + "\n")
        else:
            print "Uh oh! Sequence with ID " + str(id) + " is not in the all_seqs file!"
            quit()
    outh.close()
The list looks like this:
1 Saccharomycescerevisiae_DAA09367.1
1 bieneu_EED42827.1
1 Asp_XP_749186.1
1 Mag_XP_003717339.1
2 Mag_XP_003716586.1
2 Mag_XP_003709453.1
3 Asp_XP_749329.1
Field 0 denotes a grouping based on similarity between the sequences. The script is meant to take all the sequences from merged.fas that correspond to the code in field 1 and write them into a file based on field 0.
So in the case of the portion of the list I have shown, all the sequences that have a 1 in field 0 (Saccharomycescerevisiae_DAA09367.1, bieneu_EED42827.1, Asp_XP_749186.1, Mag_XP_003717339.1) would be written into a file called 1.fasta. This should continue from 2.fasta through however many groups there are.
This has worked, except that it doesn't include all the sequences in each group; it only includes the last one listed for that group. Using my example above, I'd get a file (1.fasta) with only one sequence (Mag_XP_003717339.1) instead of all four.
Any and all help is appreciated,
Thanks,
JT
Although I didn't spot the cause of the issue you complained about, I'm surprised your code runs at all with this error:
gene_fams[fields[0]].append[fields[1]]
i.e. append[...] instead of append(...). But perhaps that's also "not there in the actual script I'm running". I rewrote your script below, and it works fine for me. One issue was your use of the variable name id, which shadows a Python builtin. You'll see I go to an extreme to avoid such errors:
from Bio import SeqIO
from collections import defaultdict

SEQUENCE_FILE_NAME = "merged.fas"
FAMILY_FILE_NAME = "gene_families_eggnog.txt"

all_sequences = SeqIO.index(SEQUENCE_FILE_NAME, "fasta")
gene_families = defaultdict(list)
with open(FAMILY_FILE_NAME) as gene_families_file:
    for line in gene_families_file:
        family_id, gene_id = line.rstrip().split()
        gene_families[family_id].append(gene_id)
for family_id, gene_ids in gene_families.items():
    output_filename = family_id + ".fasta"
    with open(output_filename, "w") as output:
        for gene_id in gene_ids:
            assert gene_id in all_sequences, "Sequence {} is not in {}!".format(gene_id, SEQUENCE_FILE_NAME)
            output.write(all_sequences[gene_id].format("fasta"))

Reading data from Excel sheets and building SQL statements, writing to output file in Python

I have an Excel workbook with a couple of sheets. Each sheet has two columns with PersonID and LegacyID. We are basically trying to update some records in the database based on PersonID. This is relatively easy to do in T-SQL, and I might even be able to get it done pretty quickly in PowerShell, but since I have been trying to learn Python, I thought I would try it in Python. I used the xlrd module and was able to print update statements. Below is my code:
import xlrd

book = xlrd.open_workbook(r'D:\Scripts\UpdateID01.xls')  # raw string so the backslashes are safe
sheet = book.sheet_by_index(0)
myList = []
for i in range(sheet.nrows):
    myList.append(sheet.row_values(i))
outFile = open(r'D:\Scripts\update.txt', 'wb')
for i in myList:
    outFile.write("\nUPDATE PERSON SET LegacyID = " + "'" + str(i[1]) + "'" +
                  " WHERE personid = " + "'" + str(i[0]) + "'")
Two problems: when I read the output file, I see the LegacyID printed as a float. How do I get rid of the .0 at the end of each ID? Second problem: Python doesn't print each update statement on a new line in the output text file. How do I format it?
Edit: please ignore the format issue. It did print on new lines when I opened the output file in Notepad++. The float issue still remains.
Can you turn the LegacyID into ints?
i[1] = int(i[1])
outFile.write("\nUPDATE PERSON SET LegacyID = " + "'" + str(i[1]) + "'" +
              " WHERE personid = " + "'" + str(i[0]) + "'")
try this:
# use 'a' if you want to append to your text file
outFile = open(r'D:\Scripts\update.txt', 'a')
for i in myList:
    outFile.write("\nUPDATE PERSON SET LegacyID = '%s' WHERE personid = '%s'" % (int(i[1]), str(i[0])))
Since you are learning Python (which is very laudable!) you should start reading about string formatting in the Python docs. That is the best place to start whenever you have a question like this.
Hint: You may want to convert the float items to integers using int().

python & pyparsing newb: how to open a file

Paul McGuire, the author of pyparsing, was kind enough to help a lot with a problem I'm trying to solve. We're on 1st down with a yard to goal, but I can't even punt it across the goal line. Confucius said if he gave a student 1/4 of the solution and he did not return with the other 3/4, then he would not teach that student again. So it is after almost a week of frustration, and with great anxiety, that I ask this...
How do I open an input file for pyparsing and print the output to another file?
Here is what I've got so far, but it's really all his work:
from pyparsing import *

datafile = open('test.txt')

# Backus-Naur Form
num = Word(nums)
accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
accessionNumber = Combine("S" + num + "-" + num)("accNum")
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
gleason = Group("GLEASON" + Optional("SCORE:") + num("left") + "+" + num("right") + "=" + num("total"))
patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
partMatch = patientData("patientData") | gleason("gleason")
lastPatientData = None

# PARSE ACTIONS
def patientRecord(datafile):
    for match in partMatch.searchString(datafile):
        if match.patientData:
            lastPatientData = match
        elif match.gleason:
            if lastPatientData is None:
                print "bad!"
                continue
            print "{0.accDate}: {0.accNum} {0.patientNum} Gleason({1.left}+{1.right}={1.total})".format(
                lastPatientData.patientData, match.gleason
            )

patientData.setParseAction(lastPatientData)

# MAIN PROGRAM
if __name__ == "__main__":
    patientRecord()
It looks like you need to call datafile.read() in order to read the contents of the file. Right now you are trying to call searchString on the file object itself, not the text in the file. You should really look at the Python tutorial (particularly this section) to get up to speed on how to read files, etc.
It seems like you need some help putting it together. The advice of @BrenBarn is spot-on: work on a problem of simpler complexity before you put it all together. I can help by giving you a minimal example of what you are trying to do, with a much simpler grammar. You can use this as a template to learn how to read/write a file in Python. Consider the input text file data.txt:
cat 3
dog 5
foo 7
Let's parse this file and output the results. To have some fun, let's multiply the second column by 2:
from pyparsing import *

# Read the input data
filename = "data.txt"
FIN = open(filename)
TEXT = FIN.read()

# Define a simple grammar for the text; multiply the second col by 2
digits = Word(nums)
digits.setParseAction(lambda x: int(x[0]) * 2)
blocks = Group(Word(alphas) + digits)
grammar = OneOrMore(blocks)

# Parse the results
result = grammar.parseString(TEXT)

# This gives a list of lists
# [['cat', 6], ['dog', 10], ['foo', 14]]

# Open up a new file for the output
filename2 = "data2.txt"
FOUT = open(filename2, 'w')

# Walk through the results and write to the file
for item in result:
    print item
    FOUT.write("%s %i\n" % (item[0], item[1]))
FOUT.close()
This gives in data2.txt:
cat 6
dog 10
foo 14
Break each piece down until you understand it. From here, you can slowly adapt this minimal example to your more complex problem above. It's OK to read the file in (as long as it is relatively small) since Paul himself notes:
parseFile is really just a simple shortcut around parseString, pretty
much the equivalent of expr.parseString(open(filename).read()).
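So, applying that note to the example above, these two calls should behave the same:
# Two equivalent ways to parse the file, per the quote above
result = grammar.parseFile("data.txt")
result = grammar.parseString(open("data.txt").read())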

Function append lines to .csv

It has been a while since I have written functions with for loops and writing to files, so bear with my ignorance.
This function is given an IP address read from a text file; it pings the IP, searches the output for the received-packet count, and then appends it to a .csv.
My question is: Is there a better or an easier way to write this?
def pingS(IPadd4):
    fTmp = "tmp"
    os.system("ping " + IPadd4 + " -n 500 > tmp")  # note the space before -n
    sName = siteNF  # sys.argv[1]
    scrap = open(fTmp, "r")
    nF = file(sName, "a")  # appends
    nF.write(IPadd4 + ",")
    for line in scrap:
        if line.startswith(" Packets"):
            arrT = line.split(" ")
            nF.write(arrT[10] + " \n")
    scrap.close()
    nF.close()
Note: If you need the full script I can supply that as well.
This in my opinion at least makes what is going on a bit more obvious. The len('Received = ') could obviously be replaced by a constant.
def pingS(IPadd4):
    fTmp = "tmp"
    os.system("ping " + IPadd4 + " -n 500 > tmp")  # space added before -n
    sName = siteNF  # sys.argv[1]
    scrap = open(fTmp, "r")
    nF = file(sName, "a")  # appends
    ip_string = scrap.read()
    # slice from just after 'Received = ' up to the next comma, so
    # multi-digit counts are captured (indexing a single character
    # would only work for counts of 0-9)
    start = ip_string.find('Received = ') + len('Received = ')
    recvd = ip_string[start:ip_string.find(',', start)]
    nF.write(IPadd4 + ',' + recvd + '\n')
You could also try looking at the Python csv module for writing to the csv. In this case it's pretty trivial though.
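For reference, a sketch of that route, reusing sName, IPadd4 and recvd from the code above:
import csv

# Sketch: csv.writer handles quoting and the row ending; 'ab' keeps
# the append behavior (the Python 2 csv module wants binary mode).
with open(sName, "ab") as f:
    csv.writer(f).writerow([IPadd4, recvd])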
This may not be a direct answer, but you may get some performance increase from using StringIO. I have had some dramatic speedups in IO with this. I'm a bioinformatics guy, so I spend a lot of time shooting large text files out of my code.
http://www.skymind.com/~ocrow/python_string/
I use method 5. Didn't require many changes. There are some fancier methods in there, but they didn't appeal to me as much.
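Without vouching for the article's numbering, the general buffering idea it describes looks like this sketch:
from cStringIO import StringIO  # io.StringIO on Python 3

# Sketch: collect output in memory and write the file once at the
# end, instead of making many small write() calls.
buf = StringIO()
for line in lines:  # 'lines' stands in for whatever your code generates
    buf.write(line + "\n")
open("out.txt", "w").write(buf.getvalue())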
