Related
I am trying to create a .bed file after searching through DNA sequences for two regular expressions. Ideally, I'd like to generate a tab-separated file which contains the sequence description, the start location of the first regex and the end location of the second regex. I know that the regex section works, it's just creating the \t separated file I am struggling with.
I was hoping that I could open/create a file and simply print a new line for each iteration of the for loop that contains this information, like so:
with open("Mimp_hits.bed", "a+") as file_object:
for line in file_object:
print(f'{sequence.description}\t{h.start()}\t{h_rc.end()}')
file_object.close()
But this doesn't seem to work (creates empty file). I have also tried to use file_object.write, but again this creates an empty file too.
This is all of the code I have including searching for the regexes:
import re, sys
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
infile = sys.argv[1]
for sequence in SeqIO.parse(infile, "fasta"):
hit = re.finditer(r"CAGTGGG..GCAA[TA]AA", str(sequence.seq))
mimp_length = 400
for h in hit:
h_start = h.start()
hit_rc = re.finditer(r"TT[TA]TTGC..CCCACTG", str(sequence.seq))
for h_rc in hit_rc:
h_rc_end = h_rc.end()
length = h_rc_end - h_start
if length > 0:
if length < mimp_length:
with open("Mimp_hits.bed", "a+") as file_object:
for line in file_object:
print(sequence.description, h.start(), h_rc.end())
file_object.close()
This is the desired output:
Focub_II5_mimp_1__contig_1.16(656599:656809) 2 208
Focub_II5_mimp_2__contig_1.47(41315:41540) 2 223
Focub_II5_mimp_3__contig_1.65(13656:13882) 2 224
Focub_II5_mimp_4__contig_1.70(61591:61809) 2 216
This is example input:
>Focub_II5_mimp_1__contig_1.16(656599:656809)
TACAGTGGGATGCAAAAAGTATTCGCAGGTGTGTAGAGAGATTTGTTGCTCGGAAGCTAGTTAGGTGTAGCTTGTCAGGTTCTCAGTACCCTATATTACACCGAGATCAGCGGGATAATCTAGTCTCGAGTACATAAGCTAAGTTAAGCTACTAACTAGCGCAGCTGACACAACTTACACACCTGCAAATACTTTTTGCATCCCACTGTA
>Focub_II5_mimp_2__contig_1.47(41315:41540)
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTCTGCCGCTAGCCCATTTTAACAGCTAGAGTGTGTATATTAACCTCACACATAGCTATCTCTTATACTAATTGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTGTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
>Focub_II5_mimp_3__contig_1.65(13656:13882)
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTTCTGCCGCTAGCCTATTTTAATAGTTAGAGTGTGCATATTAACCTCACACATAGCTATCTTATATACTAATCGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTCTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
>Focub_II5_mimp_4__contig_1.70(61591:61809)
TACAGTGGGATGCAATAAGTTTGAATGCAGGCTGAAGTACCAGCTGTTGTAATCTAGCTCCTGTATACAACGCTTTAGCTTGATAAAGTAAGCGCTAAGCTGTATCAGGCAAAAGGCTATCCCGATTGGGGTATTGCTACGTAGGGAACTGGTCTTACCTTGGTTAGTCAGTGAATGTGTACTTGAGTTTGGATTCAAACTTATTGCATCCCACTGTA
Is anybody able to help?
Thank you :)
to write a line to a file you would do something like this:
with open("file.txt", "a") as f:
print("new line", file=f)
and if you want it tab separated you can also add sep="\t", this is why python 3 made print a function so you can use sep, end, file, and flush keyword arguments. :)
opening a file for appending means the file pointer starts at the end of the file which means that writing to it doesn't override any data (gets appended to the end of the file) and iterating over it (or otherwise reading from it) gives nothing like you already reached the end of the file.
So instead of iterating over the lines of the file you would just write the single line to it:
with open("Mimp_hits.bed", "a") as file_object:
print(sequence.description, h.start(), h_rc.end(), file=file_object)
you can also consider just opening the file near the beginning of the loop since opening it once and writing multiple times is more efficient than opening it multiple times, also the with block automatically closes the file so no need to do that explicitly.
You are trying to open the file in "a+" mode, and loop over lines from it (which will not find anything because the file is positioned at the end when you do that). In any case, if this is an output file only, then you would open it in "a" mode to append to it.
Probably you just want to open the file once for appending, and inside the with statement, do your main loop, using file_object.write(...) when you want to actually append strings to the file. Note that there is no need for file_object.close() when using this with construct.
with open("Mimp_hits.bed", "a") as file_object:
for sequence in SeqIO.parse(infile, "fasta"):
# ... etc per original code ...
if length < mimp_length:
file_object.write("{}\t{}\t{}\n".format(
sequence.description, h.start(), h_rc.end()))
I have a database.txt file the first column is for usernames the second passwords and the rest 5 recovery question and answers alternating. I want to allow the user to be able to change the password of their details, without affecting another users username as they may be the same. I have found a way to delete the previous one and append the new line of modified details to the file. However, the is always a string or unknown characters at the start of the appended line. AND other characters are being changed not the second value in the list. Please help me find a way to avoid this.
https://repl.it/repls/NecessaryBoldButtonsYou can find the code here changing it will affect everyone, so please copy it elsewhere.
https://onlinegdb.com/BJbsn9-cL
I just need the password to be changed on a user input not other strings, the reason for all this code is that when changing a person's password another username could be changed.This is the original file
This is what happens afterwards, the second string in the list of the line which where data[0] = "bye" should only be changed to newpass, not all of the others
'''
import linecache
f = open("database.txt" , "r+")
for loop in range(3):
line = f.readline()
data = line.split(",")
if data[1] == "bye":
print(data[1]) #These are to help me understand what is happening
print(data[0])
b = data[0]
newpass = "Hi"
a = data[1]
fn = 'database.txt'
e = open(fn)
output = []
str="happy"
for line in e:
if not line.startswith(str):
output.append(line)
e.close()
print(output)
e = open(fn, 'w')
e.writelines(output)
e.close()
line1 = linecache.getline("database.txt" ,loop+1)
print(line)
password = True
print("Password Valid\n")
write = (line1.replace(a, newpass))
write = f.write(line1.replace(a, newpass))
f.close()
'''
This is the file in text:
username,password,Recovery1,Answer1,Recovery2,Answer2,Recovery3,Answer3,Recovery4,Answer4,
Recovery5,Answer5,o,o,o,o,o,o,o,o,o,o,
happy,bye,o,o,o,o,o,o,o,o,o,o,
bye,happy,o,o,o,o,o,o,o,o,o,o,
Support is very much appreciated
Feel free to change the code as much as you need to, as it is already a mess
Thanks in Advance
This should be pretty easy. The basic idea is:
open input file for reading
open output file for writing
for each line in input file
if password = "happy"
change user name in line
write line to output file
It should be pretty easy to convert that to python.
From comments, and by examining your code, I get the feeling that you're trying to update a line in-place. That is, it looks like your expectation is that given the file "database.txt" that contains this:
username,password,Recovery1,Answer1,Recovery2,Answer2,Recovery3,Answer3, Recovery4,Answer4,Recovery5,Answer5,
o,o,o,o,o,o,o,o,o,o,
happy,bye,o,o,o,o,o,o,o,o,o,o,
bye,happy,o,o,o,o,o,o,o,o,o,o,
When you make the change, your new "database.txt" will contain this:
username,password,Recovery1,Answer1,Recovery2,Answer2,Recovery3,Answer3, Recovery4,Answer4,Recovery5,Answer5,
o,o,o,o,o,o,o,o,o,o,
happy,Hi,o,o,o,o,o,o,o,o,o,o,
bye,happy,o,o,o,o,o,o,o,o,o,o,
You can do that, but you can't do it in-place. You have to write all the lines of the file, including the changed line, to a new temporary file. Then you can delete the old "database.txt" and rename the temporary file.
You can't update a line in a text file, because if you change the length of the line then you'll either end up with extra space at the end of the line you changed (because the new line has fewer characters than the old line), or you'll overwrite the beginning of the next line (the new line is longer than the old line).
The only other option is to load all of the lines into memory and close the file. Then change the line or lines you want to change, in memory. Finally, open the "database.txt" file for writing and output all of the lines from memory to the file.
I have a very large text file (50,000+ lines) that should always be in the same sequence. In python I want to search the text file for each of the $INGGA lines and join this line with the subsequent $INHDT to create a new text file. I need to do this without reading into memory as this causes it to crash every time. I can find return the $INGGA line but I'm not sure of the best way of then getting the next line and joining into a new string that is memory efficient
Thanks
Phil
=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.05.06 09:11:34 =~=~=~=~=~=~=~=~=~=~=~= > $PRDID,2.15,-0.10,31.87*6E
$INGGA,091124.00,5249.8336,N,00120.9619,W,1,20,0.6,95.0,M,49.4,M,,*50
$INHDT,31.9,T*1E $INZDA,091124.0055,06,05,2016,,*7F
$INVTG,22.0,T,,M,4.4,N,8.1,K,A*24 $PRDID,2.13,-0.06,34.09*6C
$INGGA,091124.20,5249.8338,N,00120.9618,W,1,20,0.6,95.0,M,49.4,M,,*5D
$INHDT,34.1,T*13 $INZDA,091124.2055,06,05,2016,,*7D
$INVTG,24.9,T,,M,4.4,N,8.1,K,A*2B $PRDID,2.16,-0.03,36.24*61
$INGGA,091124.40,5249.8340,N,00120.9616,W,1,20,0.6,95.0,M,49.4,M,,*5A
$INHDT,36.3,T*13 $INZDA,091124.4055,06,05,2016,,*7B
$INVTG,27.3,T,,M,4.4,N,8.1,K,A*22 $PRDID,2.11,-0.05,38.33*68
$INGGA,091124.60,5249.8343,N,00120.9614,W,1,20,0.6,95.1,M,49.4,M,,*58
$INHDT,38.4,T*1A $INZDA,091124.6055,06,05,2016,,*79
$INVTG,29.5,T,,M,4.4,N,8.1,K,A*2A $PRDID,2.09,-0.02,40.37*6D
$INGGA,091124.80,5249.8345,N,00120.9612,W,1,20,0.6,95.1,M,49.4,M,,*56
$INHDT,40.4,T*15 $INZDA,091124.8055,06,05,2016,,*77
$INVTG,31.7,T,,M,4.4,N,8.1,K,A*21 $PRDID,2.09,0.02,42.42*40
$INGGA,091125.00,5249.8347,N,00120.9610,W,1,20,0.6,95.1,M,49.4,M,,*5F
$INHDT,42.4,T*17
You can just read a line of file and write to another new file.
Like this:
import re
#open new file with append
nf = open('newfile', 'at')
#open file with read
with open('file', 'rt') as f:
for line in f:
r = re.match(r'\$INGGA', line)
if r is not None:
nf.write(line)
nf.write("$INHDT,31.9,T*1E" + '\n')
You can use at to append write and wt to read line!
I have 150,000 lines file, It's run well!
I suggest using a simple regex that will parse and capture the parts you care about. Here is an example that will capture the piece you care about:
(\$INGGA.*\n\$INHDT.*\n)
https://regex101.com/r/tK1hF0/3
As in my above link, you'll notice that I used the "global" g setting on the regex, telling it to capture all groups that match. Otherwise, it'll stop after the first match.
I also had trouble determining where the actual line breaks exist in your above example file, so you can tweak the above to match exactly where the breaks occur.
Here is some starter python example code:
import re
test_str = # load your file here
p = re.compile(ur'(\$INGGA.*\n\$INHDT.*\n)')
matches = re.findall(p, test_str)
In the example PuTTY log you give, its all one line separated with space.
So in this case you can use this to replace the space with new line and gets new file -
cat large_file | sed 's/ /\n/g' > new_large_file
To iterate over the file separated with new line, run this -
cat new_large_file | python your_script.py
Your script get line by line so your computer should not crash.
your_script.py -
import sys
INGGA_line = ""
for line in sys.stdin:
line_striped = line.strip()
if line_striped.startswith("$INGGA"):
INGGA_line = line_striped
elif line_striped.startswith("$INZDA"):
print line_striped, INGGA_line
else:
print line_striped
This answer is aimed at python 3.
According to this other answer (and the docs), you can iterate your file line-by-line memory-efficiently:
with open(filename, 'r') as f:
for line in f:
...process...
An example of how you could fulfill your above criteria could be
# Target file write-only, source file read-only
with open(targetfile, 'w') as tf, open(sourcefile, 'r') as sf:
# Flag for whether we are looking for 1st or 2nd part
look_for_ingga = True
for line in sf:
if look_for_ingga:
if line.startswith('$INGGA,'):
tf.write(line)
look_for_ingga = False
elif line.startswith('$INHDT,'):
tf.write(line)
look_for_ingga = True
In the case where you have multiple '$INGGA,' prior to the '$INHDT,', this grabs the first one and disregards the rest. In case you want to take only the last '$INGGA,' before the '$INHDT,', store the last '$INGGA,' in a variable instead of writing it to disk. Then, when you find your '$INHDT,', store both.
In case you meant that you want to write to a separate new file for each INGGA-INHDT pair, the target file with-statement should be nested inside for line in sf instead, or the results should be buffered in a list for later storage.
Refer to the docs for introductions to with-statements and file reading/writing.
I have a text file that contains key value pairs separated by a tab like this:
KEY\tVALUE
I have opened this file in append mode(a+) so I can both read and write. Now it may happen that a particular key has more than 1 value. For that I want to be able to go to that particular key and write the next value beside original one separated by a some delimiter(or ,).
Here is what I wish to do:
import io
ft = io.open("test.txt",'a+')
ft.seek(0)
for line in ft:
if (line.split('\t')[0] == "querykey"):
ft.write(unicode("nextvalue"));#Write the another key value beside the original one
Now there are two problems with it:
I will iterate through the file to see on which line the key is present(Is there a faster way?)
I will write a string to the end of that line.
I would be grateful if I can get help with the second point.
The write function always writes at the end of file. How should I write to the end of a specific line? I have searched and have not got very clear answers as to how to do that
You can read whole of file content, do your edit and write edited content to file.
with open('test.txt') as f:
lines = f.readlines()
f= open('test.txt', 'w')#open file for write
for line in lines:
if line.split('\t')[0] == "querykey":
line = line + ',newkey'
f.write('\n'.join(lines))
I have two text files (that are not equal in number of lines/size). I would like to compare each line of the shorter text file with every line of the longer text file. As it compares, if there are any duplicate strings, I would like to have those removed. Lastly, I would like write the result to a new text file and print the contents.
Is there a simply script that can do this for me?
Any help would be much appreciated.
The text files are not very large. One has about 10 lines and the other has about 5. The code I have tried (that failed miserably) is below:
for line in file2:
line1 = line
for line in file1:
requested3 = file('request2.txt','a')
if fnmatch.fnmatch(line1,line):
line2 = line.replace(line,"")
requested3.write(line2)
if not fnmatch.fnmatch(line1,line):
requested3.write(line+'\n')
requested3.close()
with open(longfilename) as longfile, open(shortfilename) as shortfile, open(newfilename, 'w') as newfile:
newfile.writelines(line for line in shortfile if line not in set(longfile))
It's as simple as that. This will copy lines from shortfile to newfile, without having to keep them all in memory, if they also exist in longfile.
If you're on Python 2.6 or older, you would need to nest the with statements:
with open(longfilename) as longfile:
with open(shortfilename) as shortfile:
with open(newfilename, 'w') as newfile:
If you're on Python 2.5, you need to either:
from __future__ import with_statement
at the very top of your file, or just use
longfile = open(longfilename)
etc. and close each file yourself.
If you need to manipulate the lines, an explicit for loop is fine, the important part is set(). Looking up an item in a set is fast, looking up a line in a long list is slow.
longlines = set(line.strip_or_whatever() for line in longfile)
for line in shortfile:
if line not in longlines:
newfile.write(line)
Assuming the files are both plain text, each string is on a new line delimited with \n newline characters:
small_file = open('file1.txt','r')
long_file = open('file2.txt','r')
output_file = open('output_file.txt','w')
try:
small_lines = small_file.readlines()
long_lines = long_file.readlines()
small_lines_cleaned = [line.rstrip().lower() for line in small_lines]
long_file_lines = long_file.readlines()
long_lines_cleaned = [line.rstrip().lower() for line in long_lines]
for line in small_lines_cleaned:
if line not in long_lines_cleaned:
output_file.writelines(line + '\n')
finally:
small_file.close()
long_file.close()
output_file.close()
Explanation:
Since you can't get 'with' statements working, we open the files first using regular open functions, then use a try...finally clause to close them at the end of the program.
We take the small file and the long file and first remove any trailing '\n' (newline) characters with .rstrip(), then make all the characters lower-case with .lower(). If you have two sentences identical in every aspect except one has upper case letters and the other doesn't, they wont' match. Forcing them lower case avoids that; if you prefer a case-sensitive compare, remove the .lower() method.
We go line by line in small_lines_cleaned (for line in...) and see if it is in the larger file.
Output each line if it is not in the longer file; we add the '\n' newline character so that each line will appear on a new line, insteadOfOneGiantLongSetOfStrings
I'd use difflib, it makes it easy to do comparisons/diffs. There is a nice tutorial for it here. If you just wanted the lines that were unique to the shorter file:
from difflib import ndiff
short = open('short.txt').readlines()
long = open('long.txt').readlines()
with open('unique.txt', 'w') as f:
f.write(''.join(x[2:] for x in ndiff(short, long) if x.startswith('-')))
Your code as it stands checks each line against the line in the other file. But that's not what you want. For each line in the first file, you need to check whether any line in the other file matches and then print it out if there are no matches.
The following code reads file two and checks it against file one.Anything that's in file one but not in file two will get printed and also written to a new text file.
If you wanted to do the opposite, you'd just get rid of the "not" from if statement below. So it'd print anything that's in file one and in file two.
It works by putting the contents of the shorter file (file two) in a variable and then reading the longer file (file one) line by line. Each line is checked against the variable and then the line is either written or not written to the text file according to it's presence in the variable.
(Remember to remove the stars surrounding the not statement if you wish to use it, or removing the not statement all together if you want it to print the matching words.)
fileOne = open("LONG FILE.ext","r")
fileTwo = open("SHORT FILE.ext","r")
fileThree = open("Results.txt","a+")
contents = fileTwo.read()
numLines = sum(1 for line in fileOne)
for i in range (numLines):
if **not** fileOne.readline(i) in contents:
print (fileOne.readline(i))
fileThree.write (fileOne.readline(i))
fileOne.close()
fileTwo.close()
fileThree.close()