I am trying to create a .bed file after searching through DNA sequences for two regular expressions. Ideally, I'd like to generate a tab-separated file which contains the sequence description, the start location of the first regex and the end location of the second regex. I know that the regex section works, it's just creating the \t separated file I am struggling with.
I was hoping that I could open/create a file and simply print a new line for each iteration of the for loop that contains this information, like so:
with open("Mimp_hits.bed", "a+") as file_object:
for line in file_object:
print(f'{sequence.description}\t{h.start()}\t{h_rc.end()}')
file_object.close()
But this doesn't seem to work (creates empty file). I have also tried to use file_object.write, but again this creates an empty file too.
This is all of the code I have including searching for the regexes:
import re, sys
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
infile = sys.argv[1]
for sequence in SeqIO.parse(infile, "fasta"):
hit = re.finditer(r"CAGTGGG..GCAA[TA]AA", str(sequence.seq))
mimp_length = 400
for h in hit:
h_start = h.start()
hit_rc = re.finditer(r"TT[TA]TTGC..CCCACTG", str(sequence.seq))
for h_rc in hit_rc:
h_rc_end = h_rc.end()
length = h_rc_end - h_start
if length > 0:
if length < mimp_length:
with open("Mimp_hits.bed", "a+") as file_object:
for line in file_object:
print(sequence.description, h.start(), h_rc.end())
file_object.close()
This is the desired output:
Focub_II5_mimp_1__contig_1.16(656599:656809) 2 208
Focub_II5_mimp_2__contig_1.47(41315:41540) 2 223
Focub_II5_mimp_3__contig_1.65(13656:13882) 2 224
Focub_II5_mimp_4__contig_1.70(61591:61809) 2 216
This is example input:
>Focub_II5_mimp_1__contig_1.16(656599:656809)
TACAGTGGGATGCAAAAAGTATTCGCAGGTGTGTAGAGAGATTTGTTGCTCGGAAGCTAGTTAGGTGTAGCTTGTCAGGTTCTCAGTACCCTATATTACACCGAGATCAGCGGGATAATCTAGTCTCGAGTACATAAGCTAAGTTAAGCTACTAACTAGCGCAGCTGACACAACTTACACACCTGCAAATACTTTTTGCATCCCACTGTA
>Focub_II5_mimp_2__contig_1.47(41315:41540)
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTCTGCCGCTAGCCCATTTTAACAGCTAGAGTGTGTATATTAACCTCACACATAGCTATCTCTTATACTAATTGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTGTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
>Focub_II5_mimp_3__contig_1.65(13656:13882)
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTTCTGCCGCTAGCCTATTTTAATAGTTAGAGTGTGCATATTAACCTCACACATAGCTATCTTATATACTAATCGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTCTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
>Focub_II5_mimp_4__contig_1.70(61591:61809)
TACAGTGGGATGCAATAAGTTTGAATGCAGGCTGAAGTACCAGCTGTTGTAATCTAGCTCCTGTATACAACGCTTTAGCTTGATAAAGTAAGCGCTAAGCTGTATCAGGCAAAAGGCTATCCCGATTGGGGTATTGCTACGTAGGGAACTGGTCTTACCTTGGTTAGTCAGTGAATGTGTACTTGAGTTTGGATTCAAACTTATTGCATCCCACTGTA
Is anybody able to help?
Thank you :)
to write a line to a file you would do something like this:
with open("file.txt", "a") as f:
print("new line", file=f)
and if you want it tab separated you can also add sep="\t", this is why python 3 made print a function so you can use sep, end, file, and flush keyword arguments. :)
opening a file for appending means the file pointer starts at the end of the file which means that writing to it doesn't override any data (gets appended to the end of the file) and iterating over it (or otherwise reading from it) gives nothing like you already reached the end of the file.
So instead of iterating over the lines of the file you would just write the single line to it:
with open("Mimp_hits.bed", "a") as file_object:
print(sequence.description, h.start(), h_rc.end(), file=file_object)
you can also consider just opening the file near the beginning of the loop since opening it once and writing multiple times is more efficient than opening it multiple times, also the with block automatically closes the file so no need to do that explicitly.
You are trying to open the file in "a+" mode, and loop over lines from it (which will not find anything because the file is positioned at the end when you do that). In any case, if this is an output file only, then you would open it in "a" mode to append to it.
Probably you just want to open the file once for appending, and inside the with statement, do your main loop, using file_object.write(...) when you want to actually append strings to the file. Note that there is no need for file_object.close() when using this with construct.
with open("Mimp_hits.bed", "a") as file_object:
for sequence in SeqIO.parse(infile, "fasta"):
# ... etc per original code ...
if length < mimp_length:
file_object.write("{}\t{}\t{}\n".format(
sequence.description, h.start(), h_rc.end()))
I have a folder full of .GPS files, e.g. 1.GPS, 2.GPS, etc...
Within each file is the following five lines:
Trace #1 at position 0.004610
$GNGSA,A,3,02,06,12,19,24,25,,,,,,,2.2,1.0,2.0*21
$GNGSA,A,3,75,86,87,,,,,,,,,,2.2,1.0,2.0*2C
$GNVTG,39.0304,T,39.0304,M,0.029,N,0.054,K,D*32
$GNGGA,233701.00,3731.1972590,S,14544.3073733,E,4,09,1.0,514.675,M,,,0.49,3023*27
...followed by the same data structure, with different values, over the next five lines:
Trace #6 at position 0.249839
$GNGSA,A,3,02,06,12,19,24,25,,,,,,,2.2,1.0,2.0*21
$GNGSA,A,3,75,86,87,,,,,,,,,,2.2,1.0,2.0*2C
$GNVTG,247.2375,T,247.2375,M,0.081,N,0.149,K,D*3D
$GNGGA,233706.00,3731.1971997,S,14544.3075178,E,4,09,1.0,514.689,M,,,0.71,3023*2F
(I realise the values after the $GNGSA lines don't vary in the above example. This is just a bad example... in the real dataset they do vary!)
I need to remove the lines that begin with "$GNGSA" and "$GNVTG" (i.e. I need to delete lines 2, 3, and 4 from each group of five lines within each .GPS file).
This five-line pattern continues for a varying number of times throughout each file (for some files, there might only be two five-line groups, while other files might have hundreds of the five-line groups). Hence, deleting these lines based on the line number will not work (because the line number would be variable).
The problem I am having (as seen in the above examples) is that the text that follows the "$GNGSA" or "$GNVTG" varies.
I'm currently learning Python (I'm using v3.5), so figured this would make for a good project for me to learn a few new tricks...
What I've tried already:
So far, I've managed to create the code to loop through the entire folder:
import os
indir = '/Users/dhunter/GRID01/' # input directory
for i in os.listdir(indir): # for each "i" (iteration) within the indir variable directory...
if i.endswith('.GPS'): # if the filename of an iteration ends with .GPS, then...
print(i + ' loaded') # print the filename to CLI, simply for debugging purposes.
with open(indir + i, 'r') as my_file: # open the iteration file
file_lines = my_file.readlines() # uses the readlines method to create a list of all lines in the file.
print(file_lines) # this prints the entire contents of each file to CLI for debugging purposes.
Everything in the above works perfectly.
What I need help with:
How do I detect and delete the lines themselves, and then save the file (to the same location; there is no need to save to a different filename)?
The filenames - which usually end with ".GPS" - sometimes end with ".gps" instead (the only difference being the case). My above code will only work with the uppercase files. Besides completely duplicating the code and changing the endswith argument, how do I make it work with both cases?
In the end, my file needs to look something like this:
Trace #1 at position 0.004610
$GNGGA,233701.00,3731.1972590,S,14544.3073733,E,4,09,1.0,514.675,M,,,0.49,3023*27
Trace #6 at position 0.249839
$GNGGA,233706.00,3731.1971997,S,14544.3075178,E,4,09,1.0,514.689,M,,,0.71,3023*2F
Any suggestions, please? Thanks in advance. :)
You're almost there.
import os
indir = '/Users/dhunter/GRID01/' # input directory
for i in os.listdir(indir): # for each "i" (iteration) within the indir variable directory...
if i.endswith('.GPS'): # if the filename of an iteration ends with .GPS, then...
print(i + ' loaded') # print the filename to CLI, simply for debugging purposes.
with open(indir + i, 'r') as my_file: # open the iteration file
for line in my_file:
if not line.startswith('$GNGSA') and not line.startswith('$GNVTG'):
print(line)
As per what the others have said, you're on the right track! Where you're going wrong is in the case-sensitive file extension check, and in reading in the entire file contents at once (this isn't per se wrong, but it's probably adding complexity we won't need).
I've commented your code, removing all the debug stuff for simplicity, to illustrate what I mean:
import os
indir = '/path/to/files'
for i in os.listdir(indir):
if i.endswith('.GPS'): #This CASE SENSITIVELY checks the file extension
with open(indir + i, 'r') as my_file: # Opens the file
file_lines = my_file.readlines() # This reads the ENTIRE file at once into an array of lines
So we need to fix the case sensitivity issue, and instead of reading in all the lines, we'll instead read the file line-by-line, check each line to see if we want to discard it or not, and write the lines we're interested in into an output file.
So, incorporating #tdelaney's case-insensitive fix for file name, we replace line #5 with
if i.lower().endswith('.gps'): # Case-insensitively check the file name
and instead of reading in the entire file at once, we'll instead iterate over the file stream and print each desired line out
with open(indir + i) as in_file, open(indir + i + 'new.gps') as out_file: # Open the input file for reading and creates + opens a new output file for writing - thanks #tdelaney once again!
for line in in_file # This reads each line one-by-one from the in file
if not line.startswith('$GNGSA') and not line.startswith('$GNVTG'): # Check the line has what we want (thanks Avinash)
out_file.write(line + "\n") # Write the line to the new output file
Note that you should make certain that you open the output file OUTSIDE of the 'for line in in_file' loop, or else the file will be overwritten on every iteration which will erase what you've already written to it so far (I suspect this is the issue you've had with the previous answers). Open both files at the same time and you can't go wrong.
Alternatively, you can specify the file access mode when you open the file, as per
with open(indir + i + 'new.gps', 'a'):
which will open the file in append-mode, which is a specialised from of write-mode that preserves the original contents of the file, and appends new data to it instead of overwriting existing data.
Ok, based on suggestions by Avinash Raj, tdelaney, and Sampson Oliver, here on Stack Overflow, and another friend who helped privately, here is the solution that is now working:
import os
indir = '/Users/dhunter/GRID01/' # input directory
for i in os.listdir(indir): # for each "i" (iteration) within the indir variable directory...
if i.lower().endswith('.gps'): # if the filename of an iteration ends with .GPS, then...
if not i.lower().endswith('.gpsnew.gps'): # if the filename does not end with .gpsnew.gps, then...
print(i + ' loaded') # print the filename to CLI.
with open (indir + i, 'r') as my_file:
for line in my_file:
if not line.startswith('$GNGSA'):
if not line.startswith('$GNVTG'):
with open(indir + i + 'new.gps', 'a') as outputfile:
outputfile.write(line)
outputfile.write('\r\n')
(You'll see I had to add in another layer of if statement to stop it from using the output files from previous uses of the script "if not i.lower().endswith('.gpsnew.gps'):", but this line can easily be deleted for anyone who uses these instructions in future)
We switched the open mode on the third-last line to "a" for append, so that it would save all the right lines to the file, rather than overwriting each time.
We also added in the final line to add a line break at the end of each line.
Thanks everyone for their help, explanations, and suggestions. Hopefully this solution will be useful to someone in future. :)
2. The filenames:
The if accepts any expression returning a truth value, and you can combine expressions with the standart boolean operators: if i.endswith('.GPS') or i.endswith('.gps').
You can also put the ... and ... expression after the if in brackets, to feel more sure, but it's not neccessary.
Alternatively, as a less universal solution, (but since you wanted to learn a few tricks :)) you can use string manipulation in this case: an object of type string has a lot of methods. '.gps'.upper() gives '.GPS' -- try, if you can make use of this! (even a printed string is a string object, but your variables behave the same).
1. Finding the Lines:
As you can see in the other solution, you need not read out all of your lines, you can check if want to have them 'on the fly'. But I will stick to your approach with readlines. It gives you a list, and lists support indexing and slicing. Try:
anylist[stratindex, endindex, stride], for any values, so for example try: newlist = range(100)[1::5].
It's always helpfull to try out the easy basic operations in interactive mode, or at the beginning of your script. Here range(100) is just some sample list. Here you see, how the python for-syntax works, differently than in other languages: you can iterate over any list, and if you just need integers, you create a list with integers with range().
So this will work the same with any other list -- e.g. the one you get from readlines()
This selects a slice from the list, beginnig with the second element, ending at the end (since the end index is omitted), and taking every 5th element. Now you have this sub-list, you can just revome it from the original. So for the example with the range:
a = range(100)
del(a[1::5])
print a
So you see, that the appropriate items have been removed. Now do the same with your file_lines, and then proceed to remove the other lines you want to remove.
Then, in a new with block, open the file for writing and do writelines(file_lines), so the remainig lines are written back to the file.
Of course you can also take the approach to look for the content of each line with a for loop over your list and startswith(). Or you can combine the approaches, and check, if deleting lines by number leaves the right starts, so you can print an error if something is unexpected...
3. Saving the file
You can close your file after you have the lines saved in the readlines(). In fact this is done automatically at the end of the with-block. Then just open it in 'w' mode instead of 'r' and do yourfilename.writelines(yourlist). You don't need to save, it's saven on closing.
I'm looking for some help with my code which is rigth below :
for file in file_name :
if os.path.isfile(file):
for line_number, line in enumerate(fileinput.input(file, inplace=1)):
print file
os.system("pause")
if line_number ==1:
line = line.replace('Object','#Object')
sys.stdout.write(line)
I wanted to modify some previous extracted files in order to plot them with matplotlib. To do so, I remove some lines, comment some others.
My problem is the following :
Using for line_number, line in enumerate(fileinput.input(file, inplace=1)): gives me only 4 out of 5 previous extracted files (when looking file_name list contains 5 references !)
Using for line_number, line in enumerate(file): gives me the 5 previous extracted file, BUT I don't know how to make modifications using the same file without creating another one...
Did you have an idea on this issue? Is this a normal issue?
There a number of things that might help you.
Firstly file_name appears to be a list of file names. It might be better named file_names and then you could use file_name for each one. You have verified that this does hold 5 entries.
The enumerate() function is used to help when enumerating a list of items to provide both an index and the item for each loop. This saves you having to use a separate counter variable, e.g.
for index, item in enumerate(["item1", "item2", "item3"]):
print index, item
would print:
0 item1
1 item2
2 item3
This is not really required, as you have chosen to use the fileinput library. This is designed to take a list of files and iterate over all of the lines in all of the files in one single loop. As such you need to tweak your approach a bit, assuming your list of files is called file_names then you write something as follows:
# Keep only files in the file list
file_names = [file_name for file_name in file_names if os.path.isfile(file_name)]
# Iterate all lines in all files
for line in fileinput.input(file_names, inplace=1):
if fileinput.filelineno() == 1:
line = line.replace('Object','#Object')
sys.stdout.write(line)
The main point here being that it is better to pre filter any non-filenames before passing the list to fileinput. I will leave it up to you to fix the output.
fileinput provides a number of functions to help you figure out which file or line number is currently being processed.
Assuming you're still having trouble, my typical approach is to open a file read-only, read its contents into a variable, close the file, make an edited variable, open the file to write (wiping out original file), and finally write the edited contents.
I like this approach since I can simply change the file_name that gets written out if I want to test my edits without wiping out the original file.
Also, I recommend naming containers using plural nouns, like #Martin Evans suggests.
import os
file_names = ['file_1.txt', 'file_2.txt', 'file_3.txt', 'file_4.txt', 'file_5.txt']
file_names = [x for x in file_names if os.path.isfile(x)] # see #Martin's answer again
for file_name in file_names:
# Open read-only and put contents into a list of line strings
with open(file_name, 'r') as f_in:
lines = f_in.read().splitlines()
# Put the lines you want to write out in out_lines
out_lines = []
for index_no, line in enumerate(lines):
if index_no == 1:
out_lines.append(line.replace('Object', '#Object'))
elif ...
else:
out_lines.append(line)
# Uncomment to write to different file name for edits testing
# with open(file_name + '.out', 'w') as f_out:
# f_out.write('\n'.join(out_lines))
# Write out the file, clobbering the original
with open(file_name, 'w') as f_out:
f_out.write('\n'.join(out_lines))
Downside with this approach is that each file needs to be small enough to fit into memory twice (lines + out_lines).
Best of luck!
I am trying to remove duplicates of 3-column tab-delimited txt file, but as long as the first two columns are duplicates, then it should be removed even if the two has different 3rd column.
from operator import itemgetter
import sys
input = sys.argv[1]
output = sys.argv[2]
#Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0,1)
seen = set()
data = []
for line in input.splitlines():
key = ig(line.split())
if key not in seen:
data.append(line)
seen.add(key)
file = open(output, "w")
file.write(data)
file.close()
First, I get error
key = ig(line.split())
IndexError: list index out of range
Also, I can't see how to save the result to output.txt
People say saving to output.txt is a really basic matter. But no tutorial helped.
I tried methods that use codec, those that use with, those that use file.write(data) and all didn't help.
I could learn MatLab quite easily. The online tutorial was fantastic and a series of Googling always helped a lot.
But I can't find a helpful tutorial of Python yet. This is obviously because I am a complete novice. For complete novices like me, what would be the best tutorial with 1) comprehensiveness AND 2) lots of examples 3) line by line explanation that dosen't leave any line without explanation?
And why is the above code causing error and not saving result?
I'm assuming since you assign input to the first command line argument with input = sys.argv[1] and output to the second, you intend those to be your input and output file names. But you're never opening any file for the input data, so you're callling .splitlines() on a file name, not on file contents.
Next, splitlines() is the wrong approach here anyway. To iterate over a file line-by-line, simply use for line in f, where f is an open file. Those lines will include the newline at the end of the line, so it needs to be stripped if it's not supposed to be part of the third columns data.
Then you're opening and closing the file inside your loop, which means you'll try to write the entire contents of data to the file every iteration, effectively overwriting any data written to the file before. Therefore I moved that block out of the loop.
It's good practice to use the with statement for opening files. with open(out_fn, "w") as outfile will open the file named out_fn and assign the open file to outfile, and close it for you as soon as you exit that indented block.
input is a builtin function in Python. I therefore renamed your variables so no builtin names get shadowed.
You're trying to directly write data to the output file. This won't work since data is a list of lines. You need to join those lines first in order to turn them in a single string again before writing it to a file.
So here's your code with all those issues addressed:
from operator import itemgetter
import sys
in_fn = sys.argv[1]
out_fn = sys.argv[2]
getkey = itemgetter(0, 1)
seen = set()
data = []
with open(in_fn, 'r') as infile:
for line in infile:
line = line.strip()
key = getkey(line.split())
if key not in seen:
data.append(line)
seen.add(key)
with open(out_fn, "w") as outfile:
outfile.write('\n'.join(data))
Why is the above code causing error?
Because you haven't opened the file, you are trying to work with the string input.txtrather than with the file. Then when you try to access your item, you get a list index out of range because line.split() returns ['input.txt'].
How to fix that: open the file and then work with it, not with its name.
For example, you can do (I tried to stay as close to your code as possible)
input = sys.argv[1]
infile = open(input, 'r')
(...)
lines = infile.readlines()
infile.close()
for line in lines:
(...)
Why is this not saving result?
Because you are opening/closing the file inside the loop. What you need to do is write the data once you're out of the loop. Also, you cannot write directly a list to a file. Hence, you need to do something like (outside of your loop):
outfile = open(output, "w")
for item in data:
outfile.write(item)
outfile.close()
All together
There are other ways of reading/writing files, and it is pretty well documented on the internet but I tried to stay close to your code so that you would understand better what was wrong with it
from operator import itemgetter
import sys
input = sys.argv[1]
infile = open(input, 'r')
output = sys.argv[2]
#Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0,1)
seen = set()
data = []
lines = infile.readlines()
infile.close()
for line in lines:
print line
key = ig(line.split())
if key not in seen:
data.append(line)
seen.add(key)
print data
outfile = open(output, "w")
for item in data:
outfile.write(item)
outfile.close()
PS: it seems to produce the result that you needed there Python to remove duplicates using only some, not all, columns
I can do this using a separate file, but how do I append a line to the beginning of a file?
f=open('log.txt','a')
f.seek(0) #get to the first position
f.write("text")
f.close()
This starts writing from the end of the file since the file is opened in append mode.
In modes 'a' or 'a+', any writing is done at the end of the file, even if at the current moment when the write() function is triggered the file's pointer is not at the end of the file: the pointer is moved to the end of file before any writing. You can do what you want in two manners.
1st way, can be used if there are no issues to load the file into memory:
def line_prepender(filename, line):
with open(filename, 'r+') as f:
content = f.read()
f.seek(0, 0)
f.write(line.rstrip('\r\n') + '\n' + content)
2nd way:
def line_pre_adder(filename, line_to_prepend):
f = fileinput.input(filename, inplace=1)
for xline in f:
if f.isfirstline():
print line_to_prepend.rstrip('\r\n') + '\n' + xline,
else:
print xline,
I don't know how this method works under the hood and if it can be employed on big big file. The argument 1 passed to input is what allows to rewrite a line in place; the following lines must be moved forwards or backwards in order that the inplace operation takes place, but I don't know the mechanism
In all filesystems that I am familiar with, you can't do this in-place. You have to use an auxiliary file (which you can then rename to take the name of the original file).
To put code to NPE's answer, I think the most efficient way to do this is:
def insert(originalfile,string):
with open(originalfile,'r') as f:
with open('newfile.txt','w') as f2:
f2.write(string)
f2.write(f.read())
os.remove(originalfile)
os.rename('newfile.txt',originalfile)
Different Idea:
(1) You save the original file as a variable.
(2) You overwrite the original file with new information.
(3) You append the original file in the data below the new information.
Code:
with open(<filename>,'r') as contents:
save = contents.read()
with open(<filename>,'w') as contents:
contents.write(< New Information >)
with open(<filename>,'a') as contents:
contents.write(save)
The clear way to do this is as follows if you do not mind writing the file again
with open("a.txt", 'r+') as fp:
lines = fp.readlines() # lines is list of line, each element '...\n'
lines.insert(0, one_line) # you can use any index if you know the line index
fp.seek(0) # file pointer locates at the beginning to write the whole file again
fp.writelines(lines) # write whole lists again to the same file
Note that this is not in-place replacement. It's writing a file again.
In summary, you read a file and save it to a list and modify the list and write the list again to a new file with the same filename.
num = [1, 2, 3] #List containing Integers
with open("ex3.txt", 'r+') as file:
readcontent = file.read() # store the read value of exe.txt into
# readcontent
file.seek(0, 0) #Takes the cursor to top line
for i in num: # writing content of list One by One.
file.write(str(i) + "\n") #convert int to str since write() deals
# with str
file.write(readcontent) #after content of string are written, I return
#back content that were in the file
There's no way to do this with any built-in functions, because it would be terribly inefficient. You'd need to shift the existing contents of the file down each time you add a line at the front.
There's a Unix/Linux utility tail which can read from the end of a file. Perhaps you can find that useful in your application.
If the file is the too big to use as a list, and you simply want to reverse the file, you can initially write the file in reversed order and then read one line at the time from the file's end (and write it to another file) with file-read-backwards module
An improvement over the existing solution provided by #eyquem is as below:
def prepend_text(filename: Union[str, Path], text: str):
with fileinput.input(filename, inplace=True) as file:
for line in file:
if file.isfirstline():
print(text)
print(line, end="")
It is typed, neat, more readable, and uses some improvements python got in recent years like context managers :)
I tried a different approach:
I wrote first line into a header.csv file. body.csv was the second file. Used Windows type command to concatenate them one by one into final.csv.
import os
os.system('type c:\\\header.csv c:\\\body.csv > c:\\\final.csv')
with open("fruits.txt", "r+") as file:
file.write("bab111y")
file.seek(0)
content = file.read()
print(content)