I am trying to create a .bed file after searching through DNA sequences for two regular expressions. Ideally, I'd like to generate a tab-separated file which contains the sequence description, the start location of the first regex and the end location of the second regex. I know that the regex section works, it's just creating the \t separated file I am struggling with.
I was hoping that I could open/create a file and simply print a new line for each iteration of the for loop that contains this information, like so:
with open("Mimp_hits.bed", "a+") as file_object:
    for line in file_object:
        print(f'{sequence.description}\t{h.start()}\t{h_rc.end()}')
    file_object.close()
But this doesn't seem to work (it creates an empty file). I have also tried file_object.write, but that creates an empty file too.
This is all of the code I have including searching for the regexes:
import re, sys
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
infile = sys.argv[1]
for sequence in SeqIO.parse(infile, "fasta"):
    hit = re.finditer(r"CAGTGGG..GCAA[TA]AA", str(sequence.seq))
    mimp_length = 400
    for h in hit:
        h_start = h.start()
        hit_rc = re.finditer(r"TT[TA]TTGC..CCCACTG", str(sequence.seq))
        for h_rc in hit_rc:
            h_rc_end = h_rc.end()
            length = h_rc_end - h_start
            if length > 0:
                if length < mimp_length:
                    with open("Mimp_hits.bed", "a+") as file_object:
                        for line in file_object:
                            print(sequence.description, h.start(), h_rc.end())
                        file_object.close()
This is the desired output:
Focub_II5_mimp_1__contig_1.16(656599:656809) 2 208
Focub_II5_mimp_2__contig_1.47(41315:41540) 2 223
Focub_II5_mimp_3__contig_1.65(13656:13882) 2 224
Focub_II5_mimp_4__contig_1.70(61591:61809) 2 216
This is example input:
>Focub_II5_mimp_1__contig_1.16(656599:656809)
TACAGTGGGATGCAAAAAGTATTCGCAGGTGTGTAGAGAGATTTGTTGCTCGGAAGCTAGTTAGGTGTAGCTTGTCAGGTTCTCAGTACCCTATATTACACCGAGATCAGCGGGATAATCTAGTCTCGAGTACATAAGCTAAGTTAAGCTACTAACTAGCGCAGCTGACACAACTTACACACCTGCAAATACTTTTTGCATCCCACTGTA
>Focub_II5_mimp_2__contig_1.47(41315:41540)
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTCTGCCGCTAGCCCATTTTAACAGCTAGAGTGTGTATATTAACCTCACACATAGCTATCTCTTATACTAATTGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTGTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
>Focub_II5_mimp_3__contig_1.65(13656:13882)
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTTCTGCCGCTAGCCTATTTTAATAGTTAGAGTGTGCATATTAACCTCACACATAGCTATCTTATATACTAATCGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTCTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
>Focub_II5_mimp_4__contig_1.70(61591:61809)
TACAGTGGGATGCAATAAGTTTGAATGCAGGCTGAAGTACCAGCTGTTGTAATCTAGCTCCTGTATACAACGCTTTAGCTTGATAAAGTAAGCGCTAAGCTGTATCAGGCAAAAGGCTATCCCGATTGGGGTATTGCTACGTAGGGAACTGGTCTTACCTTGGTTAGTCAGTGAATGTGTACTTGAGTTTGGATTCAAACTTATTGCATCCCACTGTA
Is anybody able to help?
Thank you :)
To write a line to a file, you would do something like this:
with open("file.txt", "a") as f:
    print("new line", file=f)
If you want the fields tab-separated, you can also pass sep="\t". This is one reason Python 3 made print a function: you can use the sep, end, file, and flush keyword arguments. :)
Opening a file for appending puts the file pointer at the end of the file. Writing to it therefore doesn't overwrite any data (new content is appended to the end), and iterating over it (or otherwise reading from it) yields nothing, because you have already reached the end of the file.
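A small sketch of that behaviour, assuming Python 3. The helper name lines_seen_in_append_mode is made up for illustration; it returns what iteration sees right after opening in "a+" mode and what it sees after rewinding with seek(0):

```python
def lines_seen_in_append_mode(path):
    """Return (lines read immediately, lines read after seek(0))."""
    with open(path, "a+") as f:
        at_end = list(f)      # pointer starts at the end: nothing to read
        f.seek(0)             # rewind to the start of the file
        from_start = list(f)  # now the existing lines are visible
    return at_end, from_start
```

On a file that already has content, the first list should come back empty, which is why the for loop in the question never runs its body.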
So instead of iterating over the lines of the file you would just write the single line to it:
with open("Mimp_hits.bed", "a") as file_object:
    print(sequence.description, h.start(), h_rc.end(), sep="\t", file=file_object)
You can also consider opening the file once, before the loop, since opening it once and writing multiple times is more efficient than reopening it for every hit. Also, the with block closes the file automatically, so there is no need to call close() explicitly.
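As a rough sketch of the "open once, write many times" idea, assuming Python 3: the function name write_bed_rows and the (name, start, end) tuple layout are hypothetical, chosen only for this example.

```python
def write_bed_rows(path, rows):
    """Append (name, start, end) tuples to `path` as tab-separated lines."""
    with open(path, "a") as bed:  # opened a single time for all rows
        for name, start, end in rows:
            # sep="\t" puts tabs between the fields; file=bed redirects print
            print(name, start, end, sep="\t", file=bed)
```

The with block closes the file when the loop finishes, so no explicit close() is needed.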
You are trying to open the file in "a+" mode, and loop over lines from it (which will not find anything because the file is positioned at the end when you do that). In any case, if this is an output file only, then you would open it in "a" mode to append to it.
Probably you just want to open the file once for appending, and inside the with statement, do your main loop, using file_object.write(...) when you want to actually append strings to the file. Note that there is no need for file_object.close() when using this with construct.
with open("Mimp_hits.bed", "a") as file_object:
    for sequence in SeqIO.parse(infile, "fasta"):
        # ... etc per original code ...
                    if length < mimp_length:
                        file_object.write("{}\t{}\t{}\n".format(
                            sequence.description, h.start(), h_rc.end()))
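Putting the pieces together, here is one possible sketch in Python 3. It reuses the two regexes and the 400-base limit from the question, but the function names find_mimps and write_hits are invented for this example, and it takes plain (description, sequence) pairs so that it runs without Biopython:

```python
import re

FORWARD = re.compile(r"CAGTGGG..GCAA[TA]AA")
REVERSE = re.compile(r"TT[TA]TTGC..CCCACTG")

def find_mimps(description, seq, max_length=400):
    """Yield (description, start, end) for each forward/reverse pair
    whose span is positive and shorter than max_length."""
    reverse_ends = [m.end() for m in REVERSE.finditer(seq)]
    for fwd in FORWARD.finditer(seq):
        for end in reverse_ends:
            if 0 < end - fwd.start() < max_length:
                yield description, fwd.start(), end

def write_hits(records, out_path):
    """records: iterable of (description, sequence string) pairs."""
    with open(out_path, "a") as bed:  # opened once, before any looping
        for description, seq in records:
            for hit in find_mimps(description, seq):
                bed.write("{}\t{}\t{}\n".format(*hit))
```

With Biopython, the records could be supplied as ((r.description, str(r.seq)) for r in SeqIO.parse(infile, "fasta")).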
I have a very large text file (50,000+ lines) that should always be in the same sequence. In Python I want to search the text file for each of the $INGGA lines and join this line with the subsequent $INHDT line to create a new text file. I need to do this without reading the whole file into memory, as that causes it to crash every time. I can find and return the $INGGA line, but I'm not sure of the best memory-efficient way to then get the next line and join the two into a new string.
Thanks
Phil
=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.05.06 09:11:34 =~=~=~=~=~=~=~=~=~=~=~= > $PRDID,2.15,-0.10,31.87*6E
$INGGA,091124.00,5249.8336,N,00120.9619,W,1,20,0.6,95.0,M,49.4,M,,*50
$INHDT,31.9,T*1E $INZDA,091124.0055,06,05,2016,,*7F
$INVTG,22.0,T,,M,4.4,N,8.1,K,A*24 $PRDID,2.13,-0.06,34.09*6C
$INGGA,091124.20,5249.8338,N,00120.9618,W,1,20,0.6,95.0,M,49.4,M,,*5D
$INHDT,34.1,T*13 $INZDA,091124.2055,06,05,2016,,*7D
$INVTG,24.9,T,,M,4.4,N,8.1,K,A*2B $PRDID,2.16,-0.03,36.24*61
$INGGA,091124.40,5249.8340,N,00120.9616,W,1,20,0.6,95.0,M,49.4,M,,*5A
$INHDT,36.3,T*13 $INZDA,091124.4055,06,05,2016,,*7B
$INVTG,27.3,T,,M,4.4,N,8.1,K,A*22 $PRDID,2.11,-0.05,38.33*68
$INGGA,091124.60,5249.8343,N,00120.9614,W,1,20,0.6,95.1,M,49.4,M,,*58
$INHDT,38.4,T*1A $INZDA,091124.6055,06,05,2016,,*79
$INVTG,29.5,T,,M,4.4,N,8.1,K,A*2A $PRDID,2.09,-0.02,40.37*6D
$INGGA,091124.80,5249.8345,N,00120.9612,W,1,20,0.6,95.1,M,49.4,M,,*56
$INHDT,40.4,T*15 $INZDA,091124.8055,06,05,2016,,*77
$INVTG,31.7,T,,M,4.4,N,8.1,K,A*21 $PRDID,2.09,0.02,42.42*40
$INGGA,091125.00,5249.8347,N,00120.9610,W,1,20,0.6,95.1,M,49.4,M,,*5F
$INHDT,42.4,T*17
You can just read the file one line at a time and write to a new file, like this:
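import re
# open the new file for appending and the source file for reading
with open('newfile', 'at') as nf, open('file', 'rt') as f:
    ingga = None
    for line in f:
        if re.match(r'\$INGGA', line):
            # remember the $INGGA line until its $INHDT line arrives
            ingga = line.rstrip('\n')
        elif ingga is not None and re.match(r'\$INHDT', line):
            # join the pair and write it as one line
            nf.write(ingga + ' ' + line)
            ingga = None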
Use 'at' to open a file for appending text and 'rt' to open it for reading text.
I have a 150,000-line file, and this approach runs well on it.
I suggest using a simple regex that will parse and capture the parts you care about. Here is an example that will capture the piece you care about:
(\$INGGA.*\n\$INHDT.*\n)
https://regex101.com/r/tK1hF0/3
As in my above link, you'll notice that I used the "global" g setting on the regex, telling it to capture all groups that match. Otherwise, it'll stop after the first match.
I also had trouble determining where the actual line breaks exist in your above example file, so you can tweak the above to match exactly where the breaks occur.
Here is some starter python example code:
import re
test_str = ""  # load your file contents here
# plain raw string: the ur'' prefix is not valid in Python 3
p = re.compile(r'(\$INGGA.*\n\$INHDT.*\n)')
matches = re.findall(p, test_str)
In the example PuTTY log you give, it's all one line, separated by spaces.
In that case you can replace each space with a newline to get a new file:
cat large_file | sed 's/ /\n/g' > new_large_file
To iterate over the newline-separated file, run this:
cat new_large_file | python your_script.py
Your script receives the input line by line, so your computer should not crash.
your_script.py:
import sys
INGGA_line = ""
for line in sys.stdin:
    line_striped = line.strip()
    if line_striped.startswith("$INGGA"):
        # remember the most recent $INGGA sentence
        INGGA_line = line_striped
    elif line_striped.startswith("$INHDT"):
        # join the stored $INGGA sentence with its $INHDT sentence
        print(INGGA_line + " " + line_striped)
    else:
        print(line_striped)
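If sed is not available, the splitting step could also be sketched in Python 3 without loading the whole log. This is only an illustration under the same assumption that sentences are separated by whitespace; the names iter_sentences and join_ingga_inhdt and the chunk size are made up for the example.

```python
def iter_sentences(fileobj, chunk_size=65536):
    """Yield whitespace-separated sentences, reading fixed-size chunks."""
    tail = ""  # incomplete sentence carried over from the previous chunk
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parts = (tail + chunk).split()
        if chunk[-1].isspace():
            tail = ""
        else:
            tail = parts.pop()  # last piece may continue in the next chunk
        for part in parts:
            yield part
    if tail:
        yield tail

def join_ingga_inhdt(sentences):
    """Yield each $INGGA sentence joined with the next $INHDT sentence."""
    ingga = None
    for s in sentences:
        if s.startswith("$INGGA"):
            ingga = s
        elif s.startswith("$INHDT") and ingga is not None:
            yield ingga + " " + s
            ingga = None
```

A caller would open the log in text mode, pass the file object to iter_sentences, and write each joined string to the output file as it is produced.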
This answer is aimed at python 3.
According to this other answer (and the docs), you can iterate your file line-by-line memory-efficiently:
with open(filename, 'r') as f:
    for line in f:
        ...  # process the line
An example of how you could fulfill your above criteria could be
# Target file write-only, source file read-only
with open(targetfile, 'w') as tf, open(sourcefile, 'r') as sf:
    # Flag for whether we are looking for 1st or 2nd part
    look_for_ingga = True
    for line in sf:
        if look_for_ingga:
            if line.startswith('$INGGA,'):
                tf.write(line)
                look_for_ingga = False
        elif line.startswith('$INHDT,'):
            tf.write(line)
            look_for_ingga = True
In the case where you have multiple '$INGGA,' prior to the '$INHDT,', this grabs the first one and disregards the rest. In case you want to take only the last '$INGGA,' before the '$INHDT,', store the last '$INGGA,' in a variable instead of writing it to disk. Then, when you find your '$INHDT,', store both.
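A possible sketch of that "keep only the last '$INGGA,'" variant, in Python 3. The function names pair_last_ingga and write_pairs are hypothetical; sourcefile and targetfile are the same placeholders used above.

```python
def pair_last_ingga(source_lines):
    """Yield (ingga_line, inhdt_line) pairs, keeping only the most recent
    '$INGGA,' line seen before each '$INHDT,' line."""
    last_ingga = None
    for line in source_lines:
        if line.startswith('$INGGA,'):
            last_ingga = line  # overwrite any earlier candidate
        elif line.startswith('$INHDT,') and last_ingga is not None:
            yield last_ingga, line
            last_ingga = None

def write_pairs(sourcefile, targetfile):
    """Stream the source file and write each pair to the target file."""
    with open(targetfile, 'w') as tf, open(sourcefile, 'r') as sf:
        for ingga, inhdt in pair_last_ingga(sf):
            tf.write(ingga)
            tf.write(inhdt)
```

Only one '$INGGA,' line is held in memory at a time, so this stays as memory-efficient as the flag-based loop.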
In case you meant that you want to write to a separate new file for each INGGA-INHDT pair, the target file with-statement should be nested inside for line in sf instead, or the results should be buffered in a list for later storage.
Refer to the docs for introductions to with-statements and file reading/writing.
I have a file where each line starts with a number. The user can delete a row by typing in the number of the row the user would like to delete.
The issue I'm having is setting the mode for opening it. When I use a+, the original content is still there. However, tacked onto the end of the file are the lines that I want to keep. On the other hand, when I use w+, the entire file is deleted. I'm sure there is a better way than opening it with w+ mode, deleting everything, and then re-opening it and appending the lines.
def DeleteToDo(self):
    print "Which Item Do You Want To Delete?"
    DeleteItem = raw_input(">") #select a line number to delete
    print "Are You Sure You Want To Delete Number" + DeleteItem + "(y/n)"
    VerifyDelete = str.lower(raw_input(">"))
    if VerifyDelete == "y":
        FILE = open(ToDo.filename,"a+") #open the file (tried w+ as well, entire file is deleted)
        FileLines = FILE.readlines() #read and display the lines
        for line in FileLines:
            FILE.truncate()
            if line[0:1] != DeleteItem: #if the number (first character) of the current line doesn't equal the number to be deleted, re-write that line
                FILE.write(line)
    else:
        print "Nothing Deleted"
This is what a typical file may look like
1. info here
2. more stuff here
3. even more stuff here
When you open a file for writing, you clobber the file (delete its current contents and start a new file). You can find this out by reading documentation for the open() command.
When you open a file for appending, you do not clobber the file. But how can you delete just one line? A file is a sequence of bytes stored on a storage device; there is no way for you to delete one line and have all the other lines automatically "slide down" into new positions on the storage device.
(If your data was stored in a database, you could actually delete just one "row" from the database; but a file is not a database.)
So, the traditional way to solve this: you read from the original file, and you copy it to a new output file. As you copy, you perform any desired edits; for example, you can delete a line simply by not copying that one line; or you can insert a line by writing it in the new file.
Then, once you have successfully written the new file, and successfully closed it, if there is no error, you go ahead and rename the new file back to the same name as the old file (which clobbers the old file).
In Python, your code should be something like this:
import os
# "num_to_delete" was specified by the user earlier.
# I'm assuming that the number to delete is set off from
# the rest of the line with a space.
s_to_delete = str(num_to_delete) + ' '
def want_input_line(line):
    return not line.startswith(s_to_delete)
in_fname = "original_input_filename.txt"
out_fname = "temporary_filename.txt"
with open(in_fname) as in_f, open(out_fname, "w") as out_f:
    for line in in_f:
        if want_input_line(line):
            out_f.write(line)
os.rename(out_fname, in_fname)
Note that if you happen to have a file called temporary_filename.txt it will be clobbered by this code. Really we don't care what the filename is, and we can ask Python to make up some unique filename for us, using the tempfile module.
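A minimal sketch of the tempfile variant, assuming Python 3.3 or later (for os.replace). The function name delete_matching_lines and its prefix argument are made up for illustration:

```python
import os
import tempfile

def delete_matching_lines(in_fname, prefix):
    """Rewrite in_fname without the lines that start with prefix."""
    dir_name = os.path.dirname(os.path.abspath(in_fname))
    # Create the temp file next to the original so the final rename
    # stays on the same filesystem.
    with tempfile.NamedTemporaryFile(mode="w", dir=dir_name,
                                     delete=False) as out_f:
        tmp_name = out_f.name
        with open(in_fname) as in_f:
            for line in in_f:
                if not line.startswith(prefix):
                    out_f.write(line)
    # Both files are closed here; replace the original with the new copy.
    os.replace(tmp_name, in_fname)
```

Because tempfile picks a unique name, an existing temporary_filename.txt can no longer be clobbered by accident.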
Any recent version of Python will let you use multiple statements in a single with statement, but if you happen to be using Python 2.6 or something you can nest two with statements to get the same effect:
with open(in_fname) as in_f:
    with open(out_fname, "w") as out_f:
        for line in in_f:
            ... # do the rest of the code
Also, note that I did not use the .readlines() method to get the input lines, because .readlines() reads the entire contents of the file into memory, all at once, and if the file is very large this will be slow or might not even work. You can simply write a for loop using the "file object" you get back from open(); this will give you one line at a time, and your program will work with even really large files.
EDIT: Note that my answer is assuming that you just want to do one editing step. As @jdi noted in comments for another answer, if you want to allow for "interactive" editing where the user can delete multiple lines, or insert lines, or whatever, then the easiest way is in fact to read all the lines into memory using .readlines(), insert/delete/update/whatever on the resulting list, and then only write out the list to a file a single time when editing is all done.
def DeleteToDo():
    print ("Which Item Do You Want To Delete?")
    DeleteItem = raw_input(">") #select a line number to delete
    print ("Are You Sure You Want To Delete Number" + DeleteItem + "(y/n)")
    DeleteItem=int(DeleteItem)
    VerifyDelete = str.lower(raw_input(">"))
    if VerifyDelete == "y":
        FILE = open('data.txt',"r") #open the file for reading
        lines=[x.strip() for x in FILE if int(x[:x.index('.')])!=DeleteItem] #read all the lines first except the line which matches the line number to be deleted
        FILE.close()
        FILE = open('data.txt',"w") #open the file again, this time for writing
        for x in lines:
            FILE.write(x+'\n') #write the data to the file
        FILE.close() #close so the data is flushed to disk
    else:
        print ("Nothing Deleted")
DeleteToDo()
Instead of writing out all lines one by one to the file, delete the line from memory (to which you read the file using readlines()) and then write the memory back to disk in one shot. That way you will get the result you want, and you won't have to clog the I/O.
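Something along these lines, as a sketch in Python 3. The function name delete_item is hypothetical, and it assumes each line begins with its number followed by a period, as in the sample file:

```python
def delete_item(filename, item_number):
    """Remove lines whose leading number equals item_number.

    Returns how many lines were removed.
    """
    with open(filename) as f:
        lines = f.readlines()  # whole file in memory
    kept = [ln for ln in lines
            if ln.split(".", 1)[0].strip() != str(item_number)]
    with open(filename, "w") as f:
        f.writelines(kept)     # one write pass back to disk
    return len(lines) - len(kept)
```

This is fine for a to-do list; for very large files the line-by-line copy approach uses less memory.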
You could mmap the file... after having read the suitable documentation...
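For what it's worth, a rough sketch of what an mmap-based deletion might look like in Python 3. The function name delete_line_in_place is invented here, line_index is zero-based, and the approach (slide the tail of the file over the deleted line, then truncate) is just one way to do it:

```python
import mmap
import os

def delete_line_in_place(path, line_index):
    """Remove the zero-based line `line_index`; return True on success."""
    with open(path, "r+b") as f:
        size = os.path.getsize(path)
        if size == 0:
            return False  # an empty file cannot be memory-mapped
        mm = mmap.mmap(f.fileno(), 0)
        try:
            start = 0
            for _ in range(line_index):
                nl = mm.find(b"\n", start)
                if nl == -1:
                    return False  # fewer lines than requested
                start = nl + 1
            if start >= size:
                return False
            nl = mm.find(b"\n", start)
            end = size if nl == -1 else nl + 1
            if size - end > 0:
                # Slide everything after the line down over it.
                mm.move(start, end, size - end)
            mm.flush()
        finally:
            mm.close()
        # Cut off the now-duplicated bytes at the end of the file.
        f.truncate(size - (end - start))
    return True
```

Unlike the copy-and-rename approach, a crash partway through could leave the file in a half-edited state, so this trades safety for avoiding a second file.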
You don't need to check for the line numbers in your file; you can do something like this:
def DeleteToDo(self):
    print "Which Item Do You Want To Delete?"
    DeleteItem = int(raw_input(">")) - 1  # convert to a zero-based index
    # show the number the user typed, not the zero-based index
    print "Are You Sure You Want To Delete Number" + str(DeleteItem + 1) + "(y/n)"
    VerifyDelete = str.lower(raw_input(">"))
    if VerifyDelete == "y":
        with open(ToDo.filename,"r") as f:
            lines = ''.join([a for i,a in enumerate(f) if i != DeleteItem])
        with open(ToDo.filename, "w") as f:
            f.write(lines)
    else:
        print "Nothing Deleted"