I am trying to create a .bed file after searching through DNA sequences for two regular expressions. Ideally, I'd like to generate a tab-separated file which contains the sequence description, the start location of the first regex and the end location of the second regex. I know that the regex section works, it's just creating the \t separated file I am struggling with.
I was hoping that I could open/create a file and simply print a new line for each iteration of the for loop that contains this information, like so:
with open("Mimp_hits.bed", "a+") as file_object:
    for line in file_object:
        print(f'{sequence.description}\t{h.start()}\t{h_rc.end()}')
file_object.close()
But this doesn't seem to work (creates empty file). I have also tried to use file_object.write, but again this creates an empty file too.
This is all of the code I have including searching for the regexes:
import re, sys
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
infile = sys.argv[1]
for sequence in SeqIO.parse(infile, "fasta"):
    hit = re.finditer(r"CAGTGGG..GCAA[TA]AA", str(sequence.seq))
    mimp_length = 400
    for h in hit:
        h_start = h.start()
        hit_rc = re.finditer(r"TT[TA]TTGC..CCCACTG", str(sequence.seq))
        for h_rc in hit_rc:
            h_rc_end = h_rc.end()
            length = h_rc_end - h_start
            if length > 0:
                if length < mimp_length:
                    with open("Mimp_hits.bed", "a+") as file_object:
                        for line in file_object:
                            print(sequence.description, h.start(), h_rc.end())
                    file_object.close()
This is the desired output:
Focub_II5_mimp_1__contig_1.16(656599:656809) 2 208
Focub_II5_mimp_2__contig_1.47(41315:41540) 2 223
Focub_II5_mimp_3__contig_1.65(13656:13882) 2 224
Focub_II5_mimp_4__contig_1.70(61591:61809) 2 216
This is example input:
>Focub_II5_mimp_1__contig_1.16(656599:656809)
TACAGTGGGATGCAAAAAGTATTCGCAGGTGTGTAGAGAGATTTGTTGCTCGGAAGCTAGTTAGGTGTAGCTTGTCAGGTTCTCAGTACCCTATATTACACCGAGATCAGCGGGATAATCTAGTCTCGAGTACATAAGCTAAGTTAAGCTACTAACTAGCGCAGCTGACACAACTTACACACCTGCAAATACTTTTTGCATCCCACTGTA
>Focub_II5_mimp_2__contig_1.47(41315:41540)
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTCTGCCGCTAGCCCATTTTAACAGCTAGAGTGTGTATATTAACCTCACACATAGCTATCTCTTATACTAATTGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTGTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
>Focub_II5_mimp_3__contig_1.65(13656:13882)
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTTCTGCCGCTAGCCTATTTTAATAGTTAGAGTGTGCATATTAACCTCACACATAGCTATCTTATATACTAATCGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTCTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
>Focub_II5_mimp_4__contig_1.70(61591:61809)
TACAGTGGGATGCAATAAGTTTGAATGCAGGCTGAAGTACCAGCTGTTGTAATCTAGCTCCTGTATACAACGCTTTAGCTTGATAAAGTAAGCGCTAAGCTGTATCAGGCAAAAGGCTATCCCGATTGGGGTATTGCTACGTAGGGAACTGGTCTTACCTTGGTTAGTCAGTGAATGTGTACTTGAGTTTGGATTCAAACTTATTGCATCCCACTGTA
Is anybody able to help?
Thank you :)
To write a line to a file you would do something like this:
with open("file.txt", "a") as f:
    print("new line", file=f)
If you want it tab-separated you can also pass sep="\t". This is why Python 3 made print a function: so that you can use the sep, end, file, and flush keyword arguments. :)
Opening a file for appending means the file pointer starts at the end of the file. Writing to it therefore doesn't overwrite any data (new text is appended to the end of the file), and iterating over it (or otherwise reading from it) yields nothing, as if you had already reached the end of the file.
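The behaviour described above can be checked with a small experiment. This is only a sketch: the scratch file name `demo.bed` and its contents are made up for the demonstration, and it assumes Python 3.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.bed")

with open(path, "w") as f:
    f.write("first\tline\n")

with open(path, "a+") as f:
    lines_at_end = f.readlines()      # position starts at end of file: nothing to read
    f.seek(0)                         # rewind to the beginning
    lines_after_seek = f.readlines()  # now the existing line is visible

print(lines_at_end)
print(lines_after_seek)
```

The first list is empty because reading starts at the end of the file, which is why the loop in the question never runs its body.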
So instead of iterating over the lines of the file you would just write the single line to it:
with open("Mimp_hits.bed", "a") as file_object:
    print(sequence.description, h.start(), h_rc.end(), file=file_object)
You could also open the file once, before the loops start, since opening it once and writing many times is more efficient than reopening it for every hit. The with block also closes the file automatically, so there is no need to call close() explicitly.
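As a rough sketch of the "open once, write many times" idea, the example below runs the two regexes from the question over an in-memory list instead of a FASTA file, so it does not need Biopython. The `records` list, the sequence in it, and the temporary output path are made up for illustration; with real data you would loop over `SeqIO.parse(infile, "fasta")` and use `sequence.description` and `str(sequence.seq)` instead. It opens the file with "w" so that repeated runs start from a clean file, which is a choice made for this sketch rather than something the question requires.

```python
import os
import re
import tempfile

# Hypothetical stand-in for parsed FASTA records: (description, sequence).
records = [
    ("demo_seq_1", "TACAGTGGGATGCAAAAAGTACTTTTTGCATCCCACTGTA"),
]

hit_fwd = re.compile(r"CAGTGGG..GCAA[TA]AA")
hit_rev = re.compile(r"TT[TA]TTGC..CCCACTG")
mimp_length = 400

out_path = os.path.join(tempfile.mkdtemp(), "Mimp_hits.bed")

# Open the output once, then write one tab-separated line per accepted pair.
with open(out_path, "w") as bed:
    for description, seq in records:
        for h in hit_fwd.finditer(seq):
            for h_rc in hit_rev.finditer(seq):
                length = h_rc.end() - h.start()
                if 0 < length < mimp_length:
                    print(description, h.start(), h_rc.end(), sep="\t", file=bed)

with open(out_path) as bed:
    written = bed.read()
```

Compiling the patterns once and keeping the file open for the whole run avoids repeating that work inside the innermost loop.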
You are trying to open the file in "a+" mode, and loop over lines from it (which will not find anything because the file is positioned at the end when you do that). In any case, if this is an output file only, then you would open it in "a" mode to append to it.
Probably you just want to open the file once for appending, and inside the with statement, do your main loop, using file_object.write(...) when you want to actually append strings to the file. Note that there is no need for file_object.close() when using this with construct.
with open("Mimp_hits.bed", "a") as file_object:
    for sequence in SeqIO.parse(infile, "fasta"):
        # ... etc per original code ...
                    if length < mimp_length:
                        file_object.write("{}\t{}\t{}\n".format(
                            sequence.description, h.start(), h_rc.end()))
I have a very large text file (50,000+ lines) that should always be in the same sequence. In Python I want to search the text file for each of the $INGGA lines and join each one with the subsequent $INHDT line to create a new text file. I need to do this without reading the whole file into memory, as that causes a crash every time. I can find and return the $INGGA line, but I'm not sure of the best memory-efficient way of then getting the next line and joining the two into a new string.
Thanks
Phil
=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.05.06 09:11:34 =~=~=~=~=~=~=~=~=~=~=~= > $PRDID,2.15,-0.10,31.87*6E
$INGGA,091124.00,5249.8336,N,00120.9619,W,1,20,0.6,95.0,M,49.4,M,,*50
$INHDT,31.9,T*1E $INZDA,091124.0055,06,05,2016,,*7F
$INVTG,22.0,T,,M,4.4,N,8.1,K,A*24 $PRDID,2.13,-0.06,34.09*6C
$INGGA,091124.20,5249.8338,N,00120.9618,W,1,20,0.6,95.0,M,49.4,M,,*5D
$INHDT,34.1,T*13 $INZDA,091124.2055,06,05,2016,,*7D
$INVTG,24.9,T,,M,4.4,N,8.1,K,A*2B $PRDID,2.16,-0.03,36.24*61
$INGGA,091124.40,5249.8340,N,00120.9616,W,1,20,0.6,95.0,M,49.4,M,,*5A
$INHDT,36.3,T*13 $INZDA,091124.4055,06,05,2016,,*7B
$INVTG,27.3,T,,M,4.4,N,8.1,K,A*22 $PRDID,2.11,-0.05,38.33*68
$INGGA,091124.60,5249.8343,N,00120.9614,W,1,20,0.6,95.1,M,49.4,M,,*58
$INHDT,38.4,T*1A $INZDA,091124.6055,06,05,2016,,*79
$INVTG,29.5,T,,M,4.4,N,8.1,K,A*2A $PRDID,2.09,-0.02,40.37*6D
$INGGA,091124.80,5249.8345,N,00120.9612,W,1,20,0.6,95.1,M,49.4,M,,*56
$INHDT,40.4,T*15 $INZDA,091124.8055,06,05,2016,,*77
$INVTG,31.7,T,,M,4.4,N,8.1,K,A*21 $PRDID,2.09,0.02,42.42*40
$INGGA,091125.00,5249.8347,N,00120.9610,W,1,20,0.6,95.1,M,49.4,M,,*5F
$INHDT,42.4,T*17
You can simply read the file one line at a time and write the lines you want to a new file.
Like this:
import re

# open the new file for appending
nf = open('newfile', 'at')
# open the source file for reading
with open('file', 'rt') as f:
    for line in f:
        # keep each $INGGA line and the $INHDT line that follows it
        if re.match(r'\$INGGA|\$INHDT', line) is not None:
            nf.write(line)
nf.close()
'at' opens the output file for appending in text mode, and 'rt' opens the input file for reading in text mode.
I ran this on a 150,000-line file and it worked well.
I suggest using a simple regex that will parse and capture the parts you care about. Here is an example that will capture the piece you care about:
(\$INGGA.*\n\$INHDT.*\n)
https://regex101.com/r/tK1hF0/3
As in my above link, you'll notice that I used the "global" g setting on the regex, telling it to capture all groups that match. Otherwise, it'll stop after the first match.
I also had trouble determining where the actual line breaks exist in your above example file, so you can tweak the above to match exactly where the breaks occur.
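Here is some starter Python example code:
import re

# load your file here (replace 'file' with your path); note that this reads the whole file into memory
with open('file') as f:
    test_str = f.read()
p = re.compile(r'(\$INGGA.*\n\$INHDT.*\n)')
matches = re.findall(p, test_str)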
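To make the pattern's behaviour concrete, here is a small sketch that applies the same regex to a few sentences copied from the log in the question. It assumes each sentence sits on its own line, which, as noted above, may not match the real file.

```python
import re

# Four sentences from the question's log, one per line (an assumption about the layout).
sample = (
    "$PRDID,2.15,-0.10,31.87*6E\n"
    "$INGGA,091124.00,5249.8336,N,00120.9619,W,1,20,0.6,95.0,M,49.4,M,,*50\n"
    "$INHDT,31.9,T*1E\n"
    "$INZDA,091124.0055,06,05,2016,,*7F\n"
)

pattern = re.compile(r'\$INGGA.*\n\$INHDT.*\n')

# finditer yields the matches one by one; '.' does not cross a newline by default.
pairs = [m.group(0) for m in pattern.finditer(sample)]
print(pairs)
```

Each element of `pairs` is one $INGGA line followed directly by its $INHDT line.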
In the example PuTTY log you give, everything is on one line, separated by spaces.
In that case you can replace the spaces with newlines to get a new file:
cat large_file | sed 's/ /\n/g' > new_large_file
To iterate over the newline-separated file, run:
cat new_large_file | python your_script.py
Your script receives the input line by line, so your computer should not crash.
your_script.py:
import sys
INGGA_line = ""
for line in sys.stdin:
    line_striped = line.strip()
    if line_striped.startswith("$INGGA"):
        # remember the $INGGA sentence until its partner arrives
        INGGA_line = line_striped
    elif line_striped.startswith("$INHDT"):
        # join the stored $INGGA sentence with the $INHDT sentence that follows it
        print(INGGA_line, line_striped)
    else:
        print(line_striped)
This answer is aimed at python 3.
According to this other answer (and the docs), you can iterate your file line-by-line memory-efficiently:
with open(filename, 'r') as f:
    for line in f:
        ...  # process the line here
An example of how you could fulfill your above criteria could be
# Target file write-only, source file read-only
with open(targetfile, 'w') as tf, open(sourcefile, 'r') as sf:
    # Flag for whether we are looking for 1st or 2nd part
    look_for_ingga = True
    for line in sf:
        if look_for_ingga:
            if line.startswith('$INGGA,'):
                tf.write(line)
                look_for_ingga = False
        elif line.startswith('$INHDT,'):
            tf.write(line)
            look_for_ingga = True
In the case where you have multiple '$INGGA,' prior to the '$INHDT,', this grabs the first one and disregards the rest. In case you want to take only the last '$INGGA,' before the '$INHDT,', store the last '$INGGA,' in a variable instead of writing it to disk. Then, when you find your '$INHDT,', store both.
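The "keep only the last '$INGGA,'" variant described above could look roughly like this. It is a sketch: the function name `pair_last_ingga` and the shortened sample sentences are invented for the example, and `io.StringIO` objects stand in for the real source and target files.

```python
import io

def pair_last_ingga(source, target):
    """Write the last $INGGA line seen before each $INHDT line, then that $INHDT line."""
    last_ingga = None
    for line in source:
        if line.startswith('$INGGA,'):
            last_ingga = line          # overwrite: only the most recent one is kept
        elif line.startswith('$INHDT,') and last_ingga is not None:
            target.write(last_ingga)
            target.write(line)
            last_ingga = None          # wait for the next $INGGA before pairing again

# Made-up, shortened sentences standing in for the real log.
sample = io.StringIO(
    "$INGGA,1\n"
    "$INGGA,2\n"
    "$INHDT,a\n"
    "$INVTG,x\n"
    "$INGGA,3\n"
    "$INHDT,b\n"
)
result = io.StringIO()
pair_last_ingga(sample, result)
print(result.getvalue())
```

Only one line is held in memory at a time, so the same function should work on open file objects of any size.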
In case you meant that you want to write to a separate new file for each INGGA-INHDT pair, the target file with-statement should be nested inside for line in sf instead, or the results should be buffered in a list for later storage.
Refer to the docs for introductions to with-statements and file reading/writing.
I am trying to add a line to the end of a txt file. I have been reading some posts here and trying different options but, for some reason, the new line is never added after the last one; it is just appended next to the last one.
So I was wondering what I am doing wrong. Here are my tests:
TEST 1:
#newProt is a new data entered by the user in this case 12345
exists = False
f = open('protocols.txt', 'a+')
for line in f:
    if newProt == line:
        exists = True
if not exists:
    f.write(newProt)
f.close()
txt file after this code:
2sde45
21145
we34z12345
TEST 2:
exists = False
with open('protocols.txt', 'r+') as f:
    for line in f:
        if newProt == line:
            exists = True
    if not exists:
        f.write(newProt)
txt file after this code: exactly the same as above...
And, like this, I have tested some combinations of letters to open the file, rb+, w, etc but for some reason I never get the desired output txt file:
2sde45
21145
we34z
12345
So I do not know what I am doing wrong. I am following some examples I got from other posts here.
Try this:
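exists = False
f = open('protocols.txt', 'a+')
f.seek(0)  # 'a+' starts at the end of the file, so rewind before reading
for line in f:
    # each line read from the file still ends with '\n', so strip it before comparing
    if newProt == line.rstrip('\n'):
        exists = True
if not exists:
    f.write('\n' + newProt)
f.close()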
This adds the new line character to the end of the file then adds 'newProt'.
EDIT:
The reason your code did not produce the desired result is that you were writing only the string itself to the file, with no line break before it. A new line is not a separate structure in a text file. The file is literally a sequence of bytes (characters), and applications such as text editors show line breaks because they interpret certain characters as formatting instructions rather than letters or digits.
'\n' is one such control character (it is defined in the ASCII standard), and it tells your text editor to start a new line. There are others, such as '\t', which produces a tab.
Have a look at the wiki article on Newline character for more info
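Putting these points together, one possible helper is sketched below. The function name `append_if_missing` and the temporary path are made up for the example. It assumes Python 3, and it adds the separating '\n' only when the file does not already end with one.

```python
import os
import tempfile

def append_if_missing(path, new_entry):
    """Append new_entry on its own line unless an identical line already exists."""
    with open(path, 'a+') as f:
        f.seek(0)                       # 'a+' starts at the end, so rewind to read
        content = f.read()
        if new_entry in content.splitlines():
            return False
        # add a separator only when the file does not already end with a newline
        if content and not content.endswith('\n'):
            f.write('\n')
        f.write(new_entry + '\n')
        return True

# Demonstration on a scratch copy of the file from the question.
demo_path = os.path.join(tempfile.mkdtemp(), 'protocols.txt')
with open(demo_path, 'w') as f:
    f.write('2sde45\n21145\nwe34z')     # no trailing newline, as in the question
first = append_if_missing(demo_path, '12345')
second = append_if_missing(demo_path, '12345')
with open(demo_path) as f:
    final = f.read()
```

Ending every entry with '\n' means the next append never has to guess whether a separator is needed.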
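You can also use f.seek(offset, whence) to move to the end of the file (for example f.seek(0, 2), where a whence of 2 means "relative to the end of the file") and then call f.write().
Otherwise, as I understand it, if you open a file in "a" (append) mode, anything you write goes to the end of the file anyway.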
Refer to this link: Appending line to a existing file having extra new line in Python
I have a 'key' file that looks like this (MyKeyFile):
afdasdfa ghjdfghd wrtwertwt asdf (these are in a column, but I never figured out the formatting, sorry)
I call these keys and they are identical to the first word of the lines that I want to extract from a 'source' file. So the source file (MySourceFile) would look something like this (again, bad formatting, but 1st column = the key, following columns = data):
afdasdfa (several tab delimited columns)
.
.
ghjdfghd ( several tab delimited columns)
.
wrtwertwt
.
.
asdf
And the '.' would indicate lines of no interest currently.
I am an absolute novice in Python and this is how far I've come:
with open('MyKeyFile','r') as infile, \
     open('MyOutFile','w') as outfile:
    for line in infile:
        for runner in source:
            # pick up the first word of the line in source
            # if match, print the entire line to MyOutFile
            # here I need help
outfile.close()
I realize there may be better ways to do this. All feedback is appreciated - along my way of solving it, or along more sophisticated ones.
Thanks
jd
I think that this would be a cleaner way of doing it, assuming that your "key" file is called "key_file.txt" and your main file is called "main_file.txt"
keys = []
my_file = open("key_file.txt", "r")  # "r" is for reading files, "w" is for writing to them
for line in my_file:
    key = line.strip()  # drop the trailing newline so the key can match inside a longer line
    if key:
        keys.append(key)
my_file.close()
# now you have a list of strings called keys
# take each line from the main text file and check whether it contains any of the keys
new_file = open("main_file.txt", "r")
for line in new_file:
    for key in keys:
        if line.find(key) > -1:
            print("I FOUND A LINE THAT CONTAINS THE TEXT OF SOME KEY", line)
new_file.close()
You can modify the print function or get rid of it to do what you want with the desired line that contains the text of some key. Let me know if this works
As I understand it (correct me in the comments if I am wrong), you have 3 files:
MySourceFile
MyKeyFile
MyOutFile
And you want to:
Read keys from MyKeyFile
Read source from MySourceFile
Iterate over lines in the source
If line's first word is in keys: append that line to MyOutFile
Close MyOutFile
So here is the Code:
with open('MySourceFile', 'r') as sourcefile:
    source = sourcefile.read().splitlines()
with open('MyKeyFile', 'r') as keyfile:
    keys = keyfile.read().split()
with open('MyOutFile', 'w') as outfile:
    for line in source:
        if line.split():
            if line.split()[0] in keys:
                outfile.write(line + "\n")
# MyOutFile is closed automatically when the with block ends
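If the source file is large, a variant that streams it line by line and keeps the keys in a set may be preferable. The sketch below assumes the keys are whitespace-separated and that the first whitespace-delimited word of a source line is its key. The function name `extract_lines` and the demo file contents are invented for illustration.

```python
import os
import tempfile

def extract_lines(key_path, source_path, out_path):
    """Copy every source line whose first word is a key to out_path."""
    with open(key_path) as keyfile:
        keys = set(keyfile.read().split())   # a set gives fast membership tests
    with open(source_path) as source, open(out_path, 'w') as outfile:
        for line in source:                  # one line in memory at a time
            fields = line.split()
            if fields and fields[0] in keys:
                outfile.write(line)

# Small demonstration with made-up file contents in a temporary directory.
tmp = tempfile.mkdtemp()
key_path = os.path.join(tmp, 'MyKeyFile')
source_path = os.path.join(tmp, 'MySourceFile')
out_path = os.path.join(tmp, 'MyOutFile')
with open(key_path, 'w') as f:
    f.write('afdasdfa\nasdf\n')
with open(source_path, 'w') as f:
    f.write('afdasdfa\t1\t2\nother\t3\t4\nasdf\t5\t6\n')
extract_lines(key_path, source_path, out_path)
with open(out_path) as f:
    extracted = f.read()
```

Matching on the first word only, rather than searching the whole line, avoids false hits when a key happens to appear inside a data column.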
I need to edit my file and save it so that I can use it with another program. First I need to put "," between every word, and then add a word at the end of every line.
To put "," between every word, I used this:
for line in open('myfile','r+') :
    for word in line.split():
        new = ",".join(map(str,word))
        print(new)
I'm not sure how to overwrite the original file, or alternatively create a new output file for the edited version. I tried something like this:
with open('myfile','r+') as f:
    for line in f:
        for word in line.split():
            new = ",".join(map(str,word))
            f.write(new)
The output is not what I wanted (it differs from what print(new) shows).
Second, I need to add a word at the end of every line. So I tried this:
source = open('myfile','r')
output = open('out','a')
output.write(source.read().replace("\n", "yes\n"))
The code to add the new word works perfectly. But I was thinking there should be an easier way to open a file, make both edits in one go, and save it. I'm not sure how. I've spent a tremendous amount of time trying to figure out how to overwrite the file, and it's about time I asked for help.
Here you go:
source = open('myfile', 'r')
output = open('out','w')
output.write('yes\n'.join(','.join(line.split()) for line in source.read().split('\n')))
output.close()
One-liner:
open('out', 'w').write('yes\n'.join(','.join(line.split()) for line in open('myfile', 'r').read().split('\n')))
Or more legibly:
source = open('myfile', 'r')
processed_lines = []
for line in source:
    # split() drops the trailing newline, so add the word and the newline back explicitly
    line = ','.join(line.split()) + 'yes\n'
    processed_lines.append(line)
output = open('out', 'w')
output.write(''.join(processed_lines))
output.close()
EDIT
Apparently I misread everything, lol.
#It looks like you are writing the word yes to all of the lines, then splitting
#each word into letters and listing those words' letters on their own line?
source = open('myfile','r')
output = open('out','w')
for line in source:
    for word in line.split():
        new = ",".join(word)
        print(new, file=output)
    print('y,e,s', file=output)
output.close()
How big is this file?
Maybe you could create a temporary list containing everything from the file you want to edit, with each element representing one line.
Editing a list of strings is pretty simple.
After your changes you can just open your file again with
writable = open('configuration', 'w')
and then write each changed line to the file with
writable.write(currentLine + '\n')
Hope that helps, even a little bit. ;)
For the first problem, you could read all the lines in f before overwriting f, assuming f is opened in 'r+' mode. Append all the results into a string, then execute:
f.seek(0) # reset file pointer back to start of file
f.write(new) # new should contain all concatenated lines
f.truncate() # get rid of any extra stuff from the old file
f.close()
For the second problem, the solution is similar: Read the entire file, make your edits, call f.seek(0), write the contents, f.truncate() and f.close().
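The read, seek, write, truncate sequence described above might be combined into a single pass like this. It is a sketch under a few assumptions: the file fits in memory, the word to append is 'yes', and, as in the question's own replace() call, the word is attached directly to the end of each line without a separating comma. The function name `edit_in_place` and the demo contents are made up.

```python
import os
import tempfile

def edit_in_place(path, word='yes'):
    """Comma-join the words on each line, append word, and overwrite the file."""
    with open(path, 'r+') as f:
        lines = f.read().splitlines()     # read everything before writing
        new = ''.join(','.join(line.split()) + word + '\n' for line in lines)
        f.seek(0)                         # back to the start of the file
        f.write(new)
        f.truncate()                      # drop leftovers if the new text is shorter

# Demonstration on a scratch file.
demo_path = os.path.join(tempfile.mkdtemp(), 'myfile')
with open(demo_path, 'w') as f:
    f.write('a b c\nd e\n')
edit_in_place(demo_path)
with open(demo_path) as f:
    edited = f.read()
```

Doing both edits on the in-memory lines and writing once avoids interleaving reads and writes on the same file, which is what made the 'r+' attempt in the question misbehave.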