strings in a file do not match strings in a set - python

I have a file with one word per line and a set of words, and I want to add the words from the set that are not already in the file (collected in a set called 'out') to the file. Here is part of my code:
def createNextU(self):
    print "adding words to final file"
    if not os.path.exists(self.finalFile):
        open(self.finalFile, 'a').close()
    fin = open(self.finalFile, "r")
    out = set()
    for line in self.lines_seen:  # lines_seen is a set with words
        if line not in fin:
            out.add(line)
        else:
            print line
    fin.close()
    fout = open(self.finalFile, "a+")
    for line in out:
        fout.write(line)
but it only matches a few of the genuinely equal words. I test with the same dictionary of words, and it adds repeated words to the file on each run. What am I doing wrong? What is happening? I tried the '==' and 'is' comparators and got the same result.
Edit 1: I am working with huge files (finalFile) which can't be fully loaded into RAM, so I think I should read the file line by line.
Edit 2: I found a big problem with the file pointer:
def createNextU(self):
    print "adding words to final file"
    if not os.path.exists(self.finalFile):
        open(self.finalFile, 'a').close()
    out = set()
    out.clear()
    with open(self.finalFile, "r") as fin:
        for word in self.lines_seen:
            fin.seek(0, 0)  # with this line speed drops to 40 lines/second; without it, it doesn't work
            if word in fin:
                self.totalmatches = self.totalmatches + 1
            else:
                out.add(word)
            self.totalLines = self.totalLines + 1
    fout = open(self.finalFile, "a+")
    for line in out:
        fout.write(line)
If I put the lines_seen loop before opening the file, I open the file once for each line in lines_seen, but speed only goes up to 30k lines/second. With set() I get 200k lines/second at worst, so I think I will load the file in parts and compare it using sets. Any better solution?
Edit 3: Done!
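For reference, the load-the-file-by-parts idea from Edit 2 could look something like this sketch (illustrative only, not the actual Edit 3 code; chunk_size is arbitrary, and lines_seen is assumed to hold words without trailing newlines):
def createNextU(self, chunk_size=100000):
    # Build a set from finalFile one chunk of lines at a time,
    # so the whole file never has to sit in memory at once.
    found = set()
    chunk = set()
    with open(self.finalFile, "r") as fin:
        for line in fin:
            chunk.add(line.rstrip())
            if len(chunk) >= chunk_size:
                found |= self.lines_seen & chunk  # set intersection per chunk
                chunk.clear()
        found |= self.lines_seen & chunk  # last, partial chunk
    out = self.lines_seen - found  # words not yet in the file
    with open(self.finalFile, "a") as fout:
        for word in out:
            fout.write(word + "\n")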

fin is a file handle, so you can't compare against it with 'if line not in fin'; the content needs to be read first.
with open(self.finalFile, "r") as fh:
    fin = fh.read().splitlines()  # fin is now a list of words from finalFile

for line in self.lines_seen:  # lines_seen is a set with words
    if line not in fin:
        out.add(line)
    else:
        print line
# remove fin.close()
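Note that fin is now a list, so every 'line not in fin' test is a linear scan over all the file's words; for the huge files mentioned in Edit 1, the set-based version in the EDIT below will be much faster.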
EDIT:
Since lines_seen is a set, try creating a new set with the words from finalFile and then diffing the sets?
file_set = set()
with open(self.finalFile, "r") as fh:
    for f_line in fh:
        file_set.add(f_line.strip())
# This will give you all the words in finalFile that are not in lines_seen.
print file_set.difference(self.lines_seen)
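Mind the direction of the difference: as written, this prints the finalFile words missing from lines_seen. For the set the asker calls out (words from lines_seen not yet in the file), flip the call:
out = self.lines_seen.difference(file_set)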

Your comparison is likely not working because the lines read from the file will have a newline at the end, so you are comparing 'word\n' to 'word'. Using 'rstrip' will help remove the trailing newlines:
>>> foo = 'hello\n'
>>> foo
'hello\n'
>>> foo.rstrip()
'hello'
I would also iterate over the file, rather than iterate over the variable containing the words you would like to check against. If I've understood your code, you would like to write anything that is in self.lines_seen to self.finalFile, if it is not already in it. If you use 'if line not in fin' as you have, this will not work as you're expecting. For example, if your file contains:
lineone
linetwo
linethree
and the set lines_seen, being unordered, happens to yield 'linethree' and then 'linetwo', then the following will match 'linethree' but not 'linetwo', because the file object has already read past it:
with open(self.finalFile, "r") as fin:
    for line in self.lines_seen:
        if line not in fin:
            print line
Instead, consider using a counter:
from collections import Counter

linecount = Counter()
# using 'with' means you don't have to worry about closing the file once the block ends
with open(self.finalFile, "r") as fin:
    for line in fin:
        line = line.rstrip()  # remove the right-most whitespace/newline
        linecount[line] += 1

for word in self.lines_seen:
    if word not in linecount:
        out.add(word)
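Since the counts are never used here, a plain set filled the same way would do the same membership job; Counter only earns its keep if you also want to know how often each word occurs. A minimal variant sketch, reusing the out set from above:
words_in_file = set()
with open(self.finalFile, "r") as fin:
    for line in fin:
        words_in_file.add(line.rstrip())

for word in self.lines_seen:
    if word not in words_in_file:
        out.add(word)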

Related

How to open a file in python, read the comments ("#"), find a word after the comments and select the word after it?

I have a function that loops through a file that looks like this:
"#" XDI/1.0 XDAC/1.4 Athena/0.9.25
"#" Column.4: pre_edge
Content
That is to say, after the "#" there is a comment. My function aims to read each line and, if it starts with a specific word, select what comes after the ":".
For example, if I had these two lines, I would like to read through them, and if a line starts with "#" and contains the word "Column.4", the word "pre_edge" should be stored.
An example of my current approach follows:
with open(file, "r") as f:
    for line in f:
        if line.startswith('#'):
            word = line.split(" Column.4:")[1]
        else:
            print("n")
I think my trouble is specifically this: after finding a line that starts with "#", how can I parse/search through it and save its content if it contains the desired word?
In case the # comment contains the string 'Column.4:' as stated above, you could parse it this way:
with open(filepath) as f:
    for line in f:
        if line.startswith('#'):
            # Here you process comment lines
            if 'Column.4' in line:
                first, remainder = line.split('Column.4: ')
                # remainder contains everything after '# Column.4: '
                # So if you want to get the first word ->
                word = remainder.split()[0]
        else:
            # Here you can process lines that are not comments
            pass
Note
Also, it is good practice to use the for line in f: statement instead of f.readlines() (as mentioned in other answers), because this way you don't load all the lines into memory but process them one by one.
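A slightly more defensive variant of the same parsing idea uses str.partition, which splits on the first occurrence only and tolerates a missing space after the colon (a sketch; filepath is the same placeholder as above, and like that answer it assumes the lines start with a bare #):
with open(filepath) as f:
    for line in f:
        if line.startswith('#') and 'Column.4:' in line:
            # take everything after the marker, regardless of surrounding spaces
            remainder = line.partition('Column.4:')[2]
            word = remainder.split()[0]  # 'pre_edge' for the sample line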
You should start by reading the file into a list and then work through that instead:
file = 'test.txt'  # <- call file whatever you want
with open(file, "r") as f:
    txt = f.readlines()

for line in txt:
    if line.startswith('"#"'):
        word = line.split(" Column.4: ")
        try:
            print(word[1])
        except IndexError:
            print(word)
    else:
        print("n")
Output:
>>> ['"#" XDI/1.0 XDAC/1.4 Athena/0.9.25\n']
>>> pre_edge
A try/except was used because the first line also starts with "#" and can't be split with your current logic.
Also, as a side note: in the question the file's lines start with "#" including the quotation marks, so the startswith() call was altered accordingly.
with open('stuff.txt', 'r+') as f:
    data = f.readlines()

for line in data:
    words = line.split()
    if words and ('#' in words[0]) and ("Column.4:" in words):
        print(words[-1])
        # pre_edge

for loop reading file lines and filtering based on a list - remove unnecessary empty lines

I am reading a file, taking the first element from the start of each line, and comparing it to my list; if it is found, I append the line to a new output file that is supposed to be exactly like the input file in terms of structure.
my_id_list = [
    '4985439',
    '5605471',
    '6144703',
]
input file:
4985439 16:0.0719814
5303698 6:0.09407 19:0.132581
5605471 5:0.0486076
5808678 8:0.130536
6144703 5:0.193785 19:0.0492507
6368619 3:0.242678 6:0.041733
my attempt:
import numpy as np

output_file = []
input_file = open('input_file', 'r')
for line in input_file:
    my_line = np.array(line.split())
    id = str(my_line[0])
    if id in my_id_list:
        output_file.append(line)
np.savetxt("output_file", output_file, fmt='%s')
Question is:
It is currently adding an extra empty line after each line written to the output file. How can I fix it? Or is there a more efficient way to do it?
Update:
For this example, the output file should be:
4985439 16:0.0719814
5605471 5:0.0486076
6144703 5:0.193785 19:0.0492507
Try something like this:
# read lines and strip trailing newline characters
with open('input_file', 'r') as f:
    input_lines = [line.strip() for line in f.readlines()]

# collect all the lines that match your id list
output_file = [line for line in input_lines if line.split()[0] in my_id_list]

# write to the output file
with open('output_file', 'w') as f:
    f.write('\n'.join(output_file))
I don't know what numpy does to the text when reading it, but this is how you could do it without numpy:
my_id_list = {'4985439', '5605471', '6144703'}  # a set of strings is faster for membership testing

with open('input_file') as input_file:
    # Your problem is most likely related to line endings, so here
    # we read the input file into a list of lines with intact line endings.
    # To preserve the input exactly, you would need to open the files
    # in binary mode ('rb' for the input file, and 'wb' for the output
    # file below).
    lines = input_file.read().splitlines(keepends=True)

with open('output_file', 'w') as output_file:
    for line in lines:
        first_word = line.split()[0]
        if first_word in my_id_list:
            output_file.write(line)
Getting the first word of each line this way is wasteful, since this:
first_word = line.split()[0]
creates a list of all "words" in the line when we just need the first one.
If you know that the columns are separated by spaces you can make it more efficient by only splitting on the first space:
first_word = line.split(' ', 1)[0]
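Putting the pieces together (string IDs in a set, a single split, and writing lines unchanged so no extra newlines appear), a compact sketch of the whole loop:
my_id_list = {'4985439', '5605471', '6144703'}  # strings, to match what split() returns

with open('input_file') as fin, open('output_file', 'w') as fout:
    for line in fin:
        if line.split(' ', 1)[0] in my_id_list:
            fout.write(line)  # the line keeps its own newline, so no blank lines appear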

Python - file lines - palindrome

I have been doing Python tasks for learning, and I came across this task where I have to read a file that includes a few words and print a line if it is a palindrome (the same when written backwards: lol > lol).
So I tried this code, but it doesn't print anything to the terminal:
with open("words.txt") as f:
for line in f:
if line == line[::-1]:
print line
But if I print like this, without an if condition, it prints the words:
with open("words.txt") as f:
for line in f:
print line
I wonder why it won't print the words that I've written in the file:
sefes
kurwa
rawuk
lol
bollob
This is because those lines contain "\n" at the end. "\n" means newline. Therefore none of those are palindromes according to Python.
You can strip off the "\n" first by doing:
with open("words.txt") as f:
for line in f:
if line.strip() == line.strip()[::-1]:
print line
The last character of each line is a newline character ("\n"). You need to strip trailing newlines ("foo\n".strip()) before checking whether the line is a palindrome.
When you read a line from a file like this, you also get the newline character. So, e.g., you're seeing 'sefes\n', which when reversed is '\nsefes'. These two lines are of course not equal. One way to solve this is to use rstrip to remove these newlines:
with open("words.txt") as f:
for line in f:
line = line.rstrip()
if line == line[::-1]:
print line
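A quick interactive session shows why the unstripped version never matches (assuming a line read from the file):
>>> line = 'lol\n'
>>> line[::-1]
'\nlol'
>>> line == line[::-1]
False
>>> line.rstrip() == line.rstrip()[::-1]
True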

Extracting lines from txt file with Python

I'm downloading mtDNA records off NCBI and trying to extract lines from them using Python. The lines I'm trying to extract either start with or contain certain keywords such as 'haplotype' and 'nationality' or 'locality'. I've tried the following code:
import re

infile = open('sequence.txt', 'r')   # open the input file for reading
outfile = open('results.txt', 'a')   # open the output file for appending
for line in infile:
    if re.findall("(.*)haplogroup(.*)", line):
        outfile.write(line)
        outfile.write(infile.readline())
infile.close()
outfile.close()
The output here contains only the first line containing 'haplogroup', and not, for example, the following line from the input file:
/haplogroup="T2b20"
I've also tried the following:
keep_phrases = ["ACCESSION", "haplogroup"]
for line in infile:
for phrase in keep_phrases:
if phrase in line:
outfile.write(line)
outfile.write(infile.readline())
But this doesn't give me all of the lines containing ACCESSION and haplogroup.
line.startswith works, but I can't use it for lines where the word is in the middle of the line.
Could anyone give me an example piece of code that prints the following line to my output because it contains 'locality':
/note="origin_locality:Wales"
Any other advice for how I can extract lines containing certain words is also appreciated.
Edit:
/haplogroup="L2a1l2a"
/note="ethnicity:Ashkenazic Jewish;
origin_locality:Poland: Warsaw; origin_coordinates:52.21 N
21.05 E"
/note="TAA stop codon is completed by the addition of 3' A
residues to the mRNA"
/note="codons recognized: UCN"
In this case, using Peter's code, the first three lines are written to the outfile, but not the line containing 21.05 E". How can I make an exception for /note=" and copy all of the lines until the second set of quotation marks, without copying the /note lines containing /note="TAA or /note="codons?
Edit 2:
This is my current solution, which works for me.
stuff_to_write = []
multiline = False
with open('sequences.txt') as f:
    for line in f.readlines():
        if any(phrase in line for phrase in keep_phrases) or multiline:
            do_not_write = False
            if multiline and line.count('"') >= 1:
                multiline = False
            if 'note' in line:
                if any(phrase in line.split('note')[1] for phrase in remove_phrases):
                    do_not_write = True
                elif line.count('"') < 2:
                    multiline = True
            if not do_not_write:
                stuff_to_write.append(line)
This will search a file for matching phrases and write those lines to a new file, assuming anything after "note" doesn't match anything in remove_phrases.
It reads the input line by line to check whether anything matches the words in keep_phrases, stores the matching lines in a list, and then writes them all to a new file on separate lines. Unless you need the new file written line by line as the matches are found, this should be a lot faster since everything is written at once.
If you don't want it to be case sensitive, change any(phrase in line ...) to any(phrase.lower() in line.lower() ...).
keep_phrases = ["ACCESSION", "haplogroup", "locality"]
remove_phrases = ['codon', 'TAA']
stuff_to_write = []
with open('C:/a.txt') as f:
for line in f.readlines():
if any(phrase in line for phrase in keep_phrases):
do_not_write = False
if 'note' in line:
if any(phrase in line.split('note')[1] for phrase in remove_phrases):
do_not_write = True
if not do_not_write:
stuff_to_write.append(line)
with open('C:/b.txt','w') as f:
f.write('\r\n'.join(stuff_to_write))
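For the multi-line /note="..." fields from the edit, another option is to read the whole file at once and let a regular expression capture everything between the quotes, newlines included (a sketch, assuming the file fits in memory and remove_phrases as defined above):
import re

with open('sequences.txt') as f:
    text = f.read()

# [^"]* also matches newlines, so multi-line notes are captured whole
for note in re.findall(r'/note="([^"]*)"', text):
    if not any(phrase in note for phrase in remove_phrases):
        print note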

Adding words from a text file to a list (python 2.7)

Assuming I have a text file: my goal is to write a function which receives a line number to look up in the text file and returns a list, each cell of the list containing exactly one word from that line.
Any idea how to do this?
Thanks
If you are working with small files:
def get_words(mifile, my_line_number):
    with open(mifile) as f:
        lines = f.readlines()
        myline = lines[my_line_number]  # first line is 0
        return myline.split()
You get all the file lines in the list lines. This is not very efficient for VERY big files; in that case it would probably be better to iterate line by line until you arrive at the chosen line.
Given the filename and the line number (lineno), you could extract the words on that line this way:
Assuming the lineno is not too large:
import linecache

line = linecache.getline(filename, lineno)
words = line.split()
Or, if the lineno is large:
import itertools

with open(filename, 'r') as f:
    line = next(itertools.islice(f, lineno - 1, None))
words = line.split()
This, of course, assumes that words are separated by spaces, which may not be the case in hard-to-parse text.
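For very big files, here is a sketch of the iterate-until-the-line approach mentioned in the first answer (assuming the line number is 0-based, like the first snippet):
def get_words_from_big_file(filename, my_line_number):
    # only one line is held in memory at a time
    with open(filename) as f:
        for i, line in enumerate(f):
            if i == my_line_number:
                return line.split()
    return []  # line number is past the end of the file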
