I'm downloading mtDNA records from NCBI and trying to extract lines from them using Python. The lines I'm trying to extract either start with or contain certain keywords, such as 'haplogroup', 'nationality', or 'locality'. I've tried the following code:
import re

infile = open('sequence.txt', 'r')    # input file to read
outfile = open('results.txt', 'a')    # output file to append to

for line in infile:
    if re.findall("(.*)haplogroup(.*)", line):
        outfile.write(line)
        outfile.write(infile.readline())

infile.close()
outfile.close()
The output here only contains the first line containing 'haplogroup', but misses, for example, the following line from the input file:
/haplogroup="T2b20"
I've also tried the following:
keep_phrases = ["ACCESSION", "haplogroup"]
for line in infile:
    for phrase in keep_phrases:
        if phrase in line:
            outfile.write(line)
            outfile.write(infile.readline())
But this doesn't give me all of the lines containing ACCESSION and haplogroup.
line.startswith works but I can't use this for lines where the word is in the middle of the line.
Could anyone give me an example piece of code that writes the following line to my output because it contains 'locality':
/note="origin_locality:Wales"
Any other advice for how I can extract lines containing certain words is also appreciated.
Edit:
/haplogroup="L2a1l2a"
/note="ethnicity:Ashkenazic Jewish;
origin_locality:Poland: Warsaw; origin_coordinates:52.21 N
21.05 E"
/note="TAA stop codon is completed by the addition of 3' A
residues to the mRNA"
/note="codons recognized: UCN"
In this case, using Peter's code, the first three lines are written to the outfile, but not the line containing 21.05 E". How can I make an exception for /note=" and copy all of the lines up to the second quotation mark, without copying the /note lines that start with /note="TAA or /note="codons?
edit2:
This is my current solution which is working for me.
stuff_to_write = []
multiline = False
with open('sequences.txt') as f:
    for line in f.readlines():
        if any(phrase in line for phrase in keep_phrases) or multiline:
            do_not_write = False
            if multiline and line.count('"') >= 1:
                multiline = False
            if 'note' in line:
                if any(phrase in line.split('note')[1] for phrase in remove_phrases):
                    do_not_write = True
                elif line.count('"') < 2:
                    multiline = True
            if not do_not_write:
                stuff_to_write.append(line)
This will search a file for matching phrases and will write those lines to a new file assuming anything after "note" doesn't match anything in remove_phrases.
It will read the input line by line to check if anything matches the words in keep_phrases, store all the values in a list, then write them to a new file on separate lines. Unless you need to write the new file line by line as the matches are found, it should be a lot faster this way since everything is written at the same time.
If you don't want it to be case sensitive, change any(phrase in line for phrase in keep_phrases) to any(phrase.lower() in line.lower() for phrase in keep_phrases).
keep_phrases = ["ACCESSION", "haplogroup", "locality"]
remove_phrases = ['codon', 'TAA']
stuff_to_write = []

with open('C:/a.txt') as f:
    for line in f.readlines():
        if any(phrase in line for phrase in keep_phrases):
            do_not_write = False
            if 'note' in line:
                if any(phrase in line.split('note')[1] for phrase in remove_phrases):
                    do_not_write = True
            if not do_not_write:
                stuff_to_write.append(line)

with open('C:/b.txt', 'w') as f:
    f.write('\r\n'.join(stuff_to_write))
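If you prefer not to track the multiline state by hand, an alternative sketch (assuming the whole record fits in memory) is to read the file in one go and let a regular expression capture each quoted field value, since re.DOTALL lets a match span newlines. extract_fields is a made-up helper name; it mirrors the remove_phrases filter above and is not a full GenBank parser.

```python
import re

def extract_fields(text, remove_phrases=("codon", "TAA")):
    # re.DOTALL lets ".*?" cross newlines, so a /note="..." value that
    # continues on the next line is still captured up to its closing quote.
    fields = re.findall(r'/(?:haplogroup|note)="(.*?)"', text, re.DOTALL)
    # Drop any field that mentions one of the remove_phrases.
    return [f for f in fields if not any(p in f for p in remove_phrases)]

# For a real run: text = open('sequences.txt').read()
sample = '''/haplogroup="L2a1l2a"
/note="ethnicity:Ashkenazic Jewish;
origin_locality:Poland: Warsaw; origin_coordinates:52.21 N
21.05 E"
/note="TAA stop codon is completed by the addition of 3' A
residues to the mRNA"
/note="codons recognized: UCN"'''
```

On the sample above this keeps the haplogroup and the multi-line origin_locality note while discarding both of the unwanted /note fields.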
Related
I have the following problem. I am supposed to open a CSV file (it's an Excel table) and read it without using any library.
I have already tried a lot, and now have the first row as a tuple inside a list - but only the first line, the header, and no other rows.
This is what I have so far.
with open(path, 'r+') as file:
    results=[]
    text = file.readline()
    while text != '':
        for line in text.split('\n'):
            a=line.split(',')
            b=tuple(a)
            results.append(b)
            return results
The output should: be every line in a tuple and all the tuples in a list.
My question is now, how can I read the other lines in python?
I am really sorry, I am new to programming all together and so I have a real hard time finding my mistake.
Thank you very much in advance for helping me out!
This problem has appeared many times on Stack Overflow, so you should be able to find working code.
But it is much better to use the csv module for this.
You have wrong indentation, and you use return results right after reading the first line, so it exits the function and never tries to read the other lines.
But even after changing this there are still other problems, so it still will not read the next lines.
You use readline(), so you read only the first line, and your loop works on the same line the whole time - and it may never end, because you never set text = ''.
You should use read() to get all the text, which you can later split into lines using split('\n'); or you could use readlines() to get all the lines as a list, and then you don't need split(). Or you can use for line in file:. In all these situations you don't need while.
def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        text = file.read()
        for line in text.split('\n'):
            items = line.split(',')
            results.append(tuple(items))
        # after for-loop
        return results
def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        lines = file.readlines()
        for line in lines:
            line = line.rstrip('\n')  # remove `\n` at the end of line
            items = line.split(',')
            results.append(tuple(items))
        # after for-loop
        return results
def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        for line in file:
            line = line.rstrip('\n')  # remove `\n` at the end of line
            items = line.split(',')
            results.append(tuple(items))
        # after for-loop
        return results
None of these versions will work correctly if you have '\n' or ',' inside an item, where it shouldn't be treated as the end of a row or as a separator between items. Such items are wrapped in " ", which also makes it tricky to strip the quotes by hand. All of these problems can be resolved using the standard csv module.
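As a sketch of what that looks like (csv.reader accepts any iterable of lines, so the same helper works on a list or an open file; parse_csv is an illustrative name):

```python
import csv

def parse_csv(lines):
    # csv.reader understands quoting, so a ',' inside "..." is kept
    # as part of the field instead of splitting the row.
    return [tuple(row) for row in csv.reader(lines)]

# For a file: with open(path, newline='') as f: rows = parse_csv(f)
demo = ['name,quote', 'Ann,"hello, world"']
```

Note the newline='' argument when opening the file, which lets the csv module handle embedded newlines correctly.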
Your code is pretty good and you are near the goal:
with open(path, 'r+') as file:
    results=[]
    text = file.read()
    #while text != '':
    for line in text.split('\n'):
        a=line.split(',')
        b=tuple(a)
        results.append(b)
    return results
Your code:
with open(path, 'r+') as file:
    results=[]
    text = file.readline()
    while text != '':
        for line in text.split('\n'):
            a=line.split(',')
            b=tuple(a)
            results.append(b)
            return results
So enjoy learning :)
One caveat is that the CSV must not end with a blank line, as this would result in an ugly tuple at the end of the list like ('',) (which looks like a smiley).
To prevent this you have to check for empty lines: an if line != '': after the for line will do the trick.
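Applied to the read() version, the guard looks like this (split_rows is just an illustrative name; it works on the already-read text):

```python
def split_rows(text):
    results = []
    for line in text.split('\n'):
        if line != '':  # skip the empty string left by a trailing newline
            results.append(tuple(line.split(',')))
    return results
```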
I have a function that loops through a file that looks like this:
"#" XDI/1.0 XDAC/1.4 Athena/0.9.25
"#" Column.4: pre_edge
Content
That is to say, after the "#" there is a comment. My function aims to read each line and, if it starts with a specific word, select what comes after the ":".
For example, given these two lines, I would like to read through them, and if a line starts with "#" and contains the word "Column.4", the word "pre_edge" should be stored.
An example of my current approach follows:
with open(file, "r") as f:
    for line in f:
        if line.startswith('#'):
            word = line.split(" Column.4:")[1]
        else:
            print("n")
I think my trouble is specifically this: after finding a line that starts with "#", how can I parse/search through it and save its content if it contains the desired word?
In case the # comment contains the string Column.4: as stated above, you could parse it this way:
with open(filepath) as f:
    for line in f:
        if line.startswith('#'):
            # Here you process comment lines
            if 'Column.4' in line:
                first, remainder = line.split('Column.4: ')
                # remainder contains everything after '# Column.4: ',
                # so to get the first word ->
                word = remainder.split()[0]
        else:
            # Here you can process lines that are not comments
            pass
Note
It is also good practice to use the for line in f: statement instead of f.readlines() (as mentioned in other answers), because this way you don't load all the lines into memory but process them one by one.
You should start by reading the file into a list and then work through that instead:
file = 'test.txt'  # <- call the file whatever you want
with open(file, "r") as f:
    txt = f.readlines()

for line in txt:
    if line.startswith('"#"'):
        word = line.split(" Column.4: ")
        try:
            print(word[1])
        except IndexError:
            print(word)
    else:
        print("n")
Output:
>>> ['"#" XDI/1.0 XDAC/1.4 Athena/0.9.25\n']
>>> pre_edge
A try/except is used because the first line also starts with "#" and we can't split it with your current logic.
Also, as a side note, in the question the file's lines start with "#" including the quotation marks, so the startswith() call was altered accordingly.
with open('stuff.txt', 'r+') as f:
    data = f.readlines()

for line in data:
    words = line.split()
    if words and ('#' in words[0]) and ("Column.4:" in words):
        print(words[-1])
        # pre_edge
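The same idea generalizes to any header key; a sketch (parse_header is a hypothetical helper, and the sample lines are the ones from the question):

```python
def parse_header(lines, key):
    # Comment lines may start with '#' or, as in the question, '"#"'.
    for line in lines:
        if line.startswith('#') or line.startswith('"#"'):
            if key + ':' in line:
                # Take the first word after 'key:'.
                return line.split(key + ':', 1)[1].split()[0]
    return None  # key not found on any comment line

header = [
    '"#" XDI/1.0 XDAC/1.4 Athena/0.9.25',
    '"#" Column.4: pre_edge',
    'Content',
]
```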
Basically I want to remove a line in my bank.txt file, which holds names, account numbers, and balances.
So far I know how to set up the file and how to check that the information is in it; I'm just not sure how to remove a certain line when the information matches what the input looks for.
Any help is appreciated, and sorry if I've formatted either the question or the code incorrectly - I don't really use this site much so far.
Thanks in advance.
filename = "bank.txt"
word1 = str(input("What is your name?"))

with open(filename) as f_obj:
    for line in f_obj:
        if word1 in line:
            print(line.rstrip())
            print("True")
        else:
            print("False")
First, let's open your file and load its contents into a list:
with open("bank.txt", 'r') as f:
    lines = f.readlines()
Now that we have all the lines stored as a list, you can iterate through them and remove the ones you don't want. For example, let's say that I want to remove all lines with the word 'bank':
new_lines = []
for line in lines:
    if 'bank' not in line:  # test the line itself, not the whole list
        new_lines.append(line)
new_lines is now a list of all the lines that we actually want. So we can go back and update our file
with open("bank.txt", 'w+') as f:
    to_write = ''.join(new_lines)  # convert the list into a single string we can write
    f.write(to_write)
Now no lines in the text file have the word 'bank'
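The same filter can be written more compactly as a list comprehension (drop_lines is a made-up helper name):

```python
def drop_lines(lines, word):
    # Keep only the lines that do not contain the given word.
    return [line for line in lines if word not in line]

# For the bank example:
# with open("bank.txt") as f:
#     new_lines = drop_lines(f.readlines(), "bank")
```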
This code works in case you want to delete multiple lines, not just the first one. I also tried to stay as close to your code as possible: the user input is treated the same way, the deleted lines are printed, and "True" or "False" is printed depending on whether a line was deleted or not.
deleted = False  # becomes True if a line is removed

# get lines from file
f = open("bank.txt", "r")
lines = f.readlines()
f.close()

# get name from user
word1 = str(input("What is your name? "))

# open for writing
f = open("bank.txt", "w")

# rewrite all lines except the matching ones
for i in range(len(lines)):
    if word1 in lines[i]:
        print(lines[i])
        deleted = True
    else:
        f.write(lines[i])

# close file
f.close()

print(str(deleted))
I have a file with one word per line and a set of words, and I want to write the words from the set that are not already in the file (collected in a set called 'out') to the file. Here is part of my code:
def createNextU(self):
    print "adding words to final file"
    if not os.path.exists(self.finalFile):
        open(self.finalFile, 'a').close()
    fin = open(self.finalFile, "r")
    out = set()
    for line in self.lines_seen:  # lines_seen is a set with words
        if line not in fin:
            out.add(line)
        else:
            print line
    fin.close()
    fout = open(self.finalFile, "a+")
    for line in out:
        fout.write(line)
But it only matches some of the truly equal words. I run it with the same dictionary of words, and it adds repeated words to the file on each run. What am I doing wrong? What is happening? I tried the '==' and 'is' comparators and got the same result.
Edit 1: I am working with huge files (finalFile), which can't be fully loaded into RAM, so I think I should read the file line by line.
Edit 2: Found a big problem with the file pointer:
def createNextU(self):
    print "adding words to final file"
    if not os.path.exists(self.finalFile):
        open(self.finalFile, 'a').close()
    out = set()
    out.clear()
    with open(self.finalFile, "r") as fin:
        for word in self.lines_seen:
            fin.seek(0, 0)  # with this line, speed drops to 40 lines/second; without it, it doesn't work
            if word in fin:
                self.totalmatches = self.totalmatches + 1
            else:
                out.add(word)
                self.totalLines = self.totalLines + 1
    fout = open(self.finalFile, "a+")
    for line in out:
        fout.write(line)
If I put the lines_seen loop before opening the file, I open the file once for each line in lines_seen, but speed only goes up to 30k lines/second. With set() I get 200k lines/second at worst, so I think I will load the file in parts and compare them using sets. Any better solution?
Edit 3: Done!
fin is a file handle, so you can't compare against it with if line not in fin. The content needs to be read first.
with open(self.finalFile, "r") as fh:
    fin = fh.read().splitlines()  # fin is now a list of words from finalFile

for line in self.lines_seen:  # lines_seen is a set with words
    if line not in fin:
        out.add(line)
    else:
        print line

# remove fin.close()
EDIT:
Since lines_seen is a set, try to create a new set with the words from finalFile then diff the sets?
file_set = set()
with open(self.finalFile, "r") as fh:
    for f_line in fh:
        file_set.add(f_line.strip())

# This will give you all the words in finalFile that are not in lines_seen.
print file_set.difference(self.lines_seen)
Your comparison is likely not working because the lines read from the file will have a newline at the end, so you are comparing 'word\n' to 'word'. Using 'rstrip' will help remove the trailing newlines:
>>> foo = 'hello\n'
>>> foo
'hello\n'
>>> foo.rstrip()
'hello'
I would also iterate over the file, rather than iterate over the variable containing the words you would like to check against. If I've understood your code, you would like to write anything that is in self.lines_seen to self.finalFile, if it is not already in it. If you use 'if line not in fin' as you have, this will not work as you're expecting. For example, if your file contains:
lineone
linetwo
linethree
and the set lines_seen, being unordered, yields 'linethree' and then 'linetwo', then the following will match 'linethree' but not 'linetwo', because the file object has already read past it:
with open(self.finalFile, "r") as fin:
    for line in self.lines_seen:
        if line not in fin:
            print line
Instead, consider using a counter:
from collections import Counter

linecount = Counter()
# using 'with' means you don't have to worry about closing the file once the block ends
with open(self.finalFile, "r") as fin:
    for line in fin:
        line = line.rstrip()  # remove the right-most whitespace/newline
        linecount[line] += 1

for word in self.lines_seen:
    if word not in linecount:
        out.add(word)
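Putting the rstrip() advice and the set idea together, the whole check reduces to a set difference; a sketch (new_words is a hypothetical helper name):

```python
def new_words(lines_seen, file_lines):
    # Strip the trailing newline so 'word\n' from the file
    # compares equal to 'word' from the set.
    existing = set(line.rstrip('\n') for line in file_lines)
    return lines_seen - existing  # words not yet in the file
```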
Assume I have a text file.
My goal is to write a function which receives a line number, goes over the text file, and returns a list containing exactly one word per cell from that line.
Any idea how to do this?
Thanks.
If you are working with small files:
def get_words(mifile, my_line_number):
    with open(mifile) as f:
        lines = f.readlines()
    myline = lines[my_line_number]  # first line is 0
    return myline.split()
You get all the file's lines in the list lines. This is not very efficient for VERY big files; in that case it would probably be better to iterate line by line until you arrive at the chosen line.
Given the filename and the line number (lineno), you could extract the words on that line this way:
Assuming the lineno is not too large:
import linecache
line = linecache.getline(filename, lineno)
words = line.split()
Or, if the lineno is large:
import itertools

with open(filename, 'r') as f:
    line = next(itertools.islice(f, lineno - 1, None))
words = line.split()
This, of course, assumes that words are separated by spaces - which may not be the case in hard-to-parse text.
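A plain enumerate loop does the same job without extra modules, reading line by line so large files are never fully loaded (words_on_line is an illustrative name; line numbers are 1-based, as in the islice version):

```python
def words_on_line(lines, lineno):
    # 'lines' can be an open file object or any iterable of strings.
    for i, line in enumerate(lines, start=1):
        if i == lineno:
            return line.split()  # split on whitespace
    return []  # lineno is past the end of the file
```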