Python - How to find all word matches between two text files

I have two text files comprised of around 370k words each, one word per line. One of the files is a list of English words and the other is a list of random gibberish words. I basically want to check if any of the random words are in fact real words, so I want to compare every line in one file with every line in the other file.
I've tried the following:
f1 = open("file1.txt", "r")
f2 = open("file2.txt", "r")
for line in f1:
    if line in f2:
        print(line)
and this gives me about 3 results before the program inexplicably ends without an error.
Is there a better way of doing this?

From what I understood, you want the intersection of both lists. Your version stops early because f2 is a file object: each membership test (line in f2) reads it forward, and once the file has been consumed nothing else can match. You can try this:
f1 = open("file1.txt", "r")
f2 = open("file2.txt", "r")
print(set(f1.readlines()).intersection(f2.readlines()))
f1.close()
f2.close()
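Note that readlines() keeps the trailing newline on every line, so a word on the last line of a file (which may not end with a newline) or a file with different line endings will not match its counterpart. A slightly more robust sketch, stripping the line endings before intersecting:
# build sets of stripped lines so newline differences don't hide matches
with open("file1.txt") as f1, open("file2.txt") as f2:
    words1 = set(line.strip() for line in f1)
    words2 = set(line.strip() for line in f2)

for word in sorted(words1 & words2):
    print(word)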

Related

Comparing two lines from two text files according to a single part of the text file

I have two text files and I want to write out two new text files according to whether there is a common section to each line in the two original text files.
The format of the text files is as follows:
commontextinallcases uniquetext2 potentiallycommontext uniquetext4
There are more than 4 columns but you get the idea. I want to check the 'potentiallycommontext' part in each text file and if they are the same write out the whole line of each text file to a new text file for each with its own unique text still in place.
Splitting it is fairly easy, just using the .split() method when reading it in. I have found the following code:
with open('some_file_1.txt', 'r') as file1:
    with open('some_file_2.txt', 'r') as file2:
        same = set(file1).intersection(file2)

same.discard('\n')

with open('some_output_file.txt', 'w') as file_out:
    for line in same:
        file_out.write(line)
But I am not sure this would work for my case where I need to split the lines. Is there a way to do this I am missing?
Thanks
I don't think that this set approach is suitable for your case.
I'd try something like:
with open('some_file_1.txt', 'r') as file1, open('some_file_2.txt', 'r') as file2, open('some_output_file.txt', 'w') as file_out:
    for line1, line2 in zip(file1, file2):
        if line1.split()[2] == line2.split()[2]:
            file_out.write(line1)
            file_out.write(line2)
There might be shorter solutions, but this should work.
PCT_IDX = _  # find which index of line.split() corresponds to potentiallycommontext

def lines(filename):
    with open(filename, 'r') as file:
        for line in file:
            line = line.rstrip('\n')
            yield line

lines_1 = lines('some_file_1.txt')
lines_2 = lines('some_file_2.txt')

with open('some_output_file.txt', 'w') as file_out:
    for (line_1, line_2) in zip(lines_1, lines_2):
        maybe_cmn1 = line_1.split()[PCT_IDX]
        maybe_cmn2 = line_2.split()[PCT_IDX]
        if maybe_cmn1 == maybe_cmn2:
            # the generator strips the newlines, so add them back when writing
            file_out.write(line_1 + '\n')
            file_out.write(line_2 + '\n')
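If the matching lines are not guaranteed to sit at the same position in both files, zip will pair the wrong lines. A rough alternative sketch, assuming index 2 really is the potentiallycommontext column (as in the first answer) and that it is unique per line in the first file:
PCT_IDX = 2  # hypothetical index of potentiallycommontext; adjust for your format

# map the common column to the full line of the first file
common_to_line = {}
with open('some_file_1.txt') as file1:
    for line in file1:
        common_to_line[line.split()[PCT_IDX]] = line

# stream the second file and emit both lines whenever the column matches
with open('some_file_2.txt') as file2, open('some_output_file.txt', 'w') as file_out:
    for line in file2:
        match = common_to_line.get(line.split()[PCT_IDX])
        if match is not None:
            file_out.write(match)
            file_out.write(line)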

Python - How to compare two files and output only the different lines in a third file

I have searched for similar questions on SO but didn't find anything that worked for me.
I have two large files, they should be the same but one of the files is 60 lines longer than the other. I want to know what these lines are and where I can find them.
I've read that one can use difflib to do this but I can't figure out how to go about it. I always get + and - in the file but I don't want that. I just want to scan through both files and report the uncommon 60 lines into a third file.
I wrote this code but it does not print out the different lines.
f1 = open('file1.txt','r')
f2 = open('file2.txt','r')
f3 = open('file3.txt','w')
diff = set(f1).difference(f2)
same.discard('\n')
for line in same:
    f3.write(line)
Well, you could do something like this:
with open('file1.txt') as infile:
    f1 = infile.readlines()
with open('file2.txt') as infile:
    f2 = infile.readlines()

only_in_f1 = [i for i in f1 if i not in f2]
only_in_f2 = [i for i in f2 if i not in f1]

with open('file3.txt', 'w') as outfile:
    if only_in_f1:
        outfile.write('Lines only in file 1:\n')
        for line in only_in_f1:
            outfile.write(line)
    if only_in_f2:
        outfile.write('Lines only in file 2:\n')
        for line in only_in_f2:
            outfile.write(line)
Note: same content in different lines is treated as a difference
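Since the question mentions difflib, here is a rough sketch that uses it and simply drops the markers: difflib.ndiff prefixes every line of its output with two characters, so lines that occur in only one of the files start with '- ' or '+ '. Unlike the set-based approaches, this keeps the original line order.
import difflib

with open('file1.txt') as f1, open('file2.txt') as f2:
    lines1 = f1.readlines()
    lines2 = f2.readlines()

with open('file3.txt', 'w') as outfile:
    for delta in difflib.ndiff(lines1, lines2):
        # keep only lines unique to one file, without the '+ '/'- ' marker
        if delta.startswith('- ') or delta.startswith('+ '):
            outfile.write(delta[2:])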
You can easily solve this using sets.
set1 = set()
with open(file1) as f:
    for line in f:
        set1.add(line.strip())

# Repeat for set 2

with open(diff_file, 'w') as f:
    for line in set2 - set1:
        f.write(line + '\n')
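If you want the lines that are unique to either file, not just those missing from set1, the symmetric difference covers both directions in one expression; a minimal follow-up, assuming set1 and set2 were built as above:
# lines that occur in exactly one of the two files
with open(diff_file, 'w') as f:
    for line in set1 ^ set2:
        f.write(line + '\n')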

Python - Comparing files delimiting characters in line

Hi there.
I'm a beginner in Python and I'm struggling to do the following:
I have a file like this (+10k lines):
EgrG_000095700 /product="ubiquitin carboxyl terminal hydrolase 5"
EgrG_000095800 /product="DNA polymerase epsilon subunit 3"
EgrG_000095850 /product="crossover junction endonuclease EME1"
EgrG_000095900 /product="lysine specific histone demethylase 1A"
EgrG_000096000 /product="charged multivesicular body protein 6"
EgrG_000096100 /product="NADH ubiquinone oxidoreductase subunit 10"
and this one (+600 lines):
EgrG_000076200.1
EgrG_000131300.1
EgrG_000524000.1
EgrG_000733100.1
EgrG_000781600.1
EgrG_000094950.1
All the IDs of the second file are in the first one, so I want the lines of the first file corresponding to the IDs of the second one.
I wrote the following script:
f1 = open('egranulosus_v3_2014_05_27.tsv').readlines()
f2 = open('eg_es_final_ids').readlines()
fr = open('res.tsv','w')
for line in f1:
    if line[0:14] == f2[0:14]:
        fr.write('%s'%(line))
fr.close()
print "Done!"
My idea was to search the IDs by delimiting the characters on each line, matching the EgrG_XXXX of one file to the other, and then write the lines to a new file.
I tried some modifications; that's just the "core" of my idea.
I got nothing. In one of the modifications, I got just one line.
I'd store the ids from f2 in a set and then check f1 against that.
id_set = set()
with open('eg_es_final_ids') as f2:
    for line in f2:
        id_set.add(line.rstrip('\n')[:-2])  # get rid of the .1

with open('egranulosus_v3_2014_05_27.tsv') as f1:
    with open('res.tsv', 'w') as fr:
        for line in f1:
            if line[:14] in id_set:
                fr.write(line)
with open('egranulosus_v3_2014_05_27.txt', 'r') as infile:
    line_storage = {}
    for line in infile:
        data = line.split()
        key = data[0]                    # the EgrG_XXXX id
        value = line.replace('\n', '')
        line_storage[key] = value        # map each id to its full line

with open('eg_es_final_ids.txt', 'r') as infile, open('my_output.txt', 'w') as outfile:
    for line in infile:
        lookup_key = line.split('.')[0]  # drop the trailing .1
        match = line_storage.get(lookup_key)
        outfile.write(''.join([str(match), '\n']))
f2 is a list of the lines in file 2. Where are you iterating over that list, the way you do for the lines in file 1 (f1)? That seems to be the problem.

comparing two text files and extracting text in python

I have two text files, one of them contains a list of ids with numbers, and the other one contains a list of ids with text. I want to compare the two files, and for the lines having the same id, print the text inside parentheses. This is what I have so far:
import fileinput
import sys

def clean(file1):
    with open(sys.argv[1], 'r') as file1: #file ppx
        for line in file1:
            words=line.split()
            id1=words[-1]
    with open(sys.argv[2], 'r') as file2: #file ids
        for line in file2:
            words2=line.split()
            id2=words2[0]
    for line in file1:
        if id1==id2[0]:
            text=s[s.find("(")+1:s.find(")")]
            print text
The first file looks like this: http://pastebin.com/PCU6f7vz
The second file looks like this: http://pastebin.com/Y2F3gkQv
But it does not work. Can somebody tell me why?
def clean(file1):
    with open(sys.argv[1], "r") as file1:
        file1_lines = file1.readlines()
    id1 = [line.strip().split() for line in file1_lines]
    with open(sys.argv[2], "r") as file2:
        file2_lines = file2.readlines()
    id2 = [line.strip().split() for line in file2_lines]
    id2_dict = {i[-1]: i[:-1] for i in id2}
    # You can print id2_dict and id1.
    # print id2_dict
    # print id1
    for index, line in enumerate(file1_lines):
        key1 = id1[index][-1].strip("(").strip(")")
        if key1 in id2_dict:
            text = line[line.find("(")+1:line.find(")")]
            print text
    # or:
    # text_lines = [line[line.find("(")+1:line.find(")")] for index, line in enumerate(file1_lines) if id1[index][-1] in id2_dict]
    # print text_lines
I'm not sure exactly what output you want, so I assumed you wanted text_lines.
file1 is an iterator that is exhausted after all the lines in the file have been read (which will happen during the first for loop). Therefore, the following loop
for line in file1:
will never run. But even if it did, the condition
if id1==id2[0]:
will never be true because you're comparing the entire id1 to the first character of id2. Furthermore, you'd be doing exactly the same comparison over and over again since those variables aren't even connected to the iterable.
And in your first two loops, you're constantly overwriting the exact same variables.
I think you need to read up on Python basics, especially the chapter on loops in the Python tutorial...
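A minimal sketch of the fix those points suggest, assuming (as in your own code) that the id is the last word of each line in the first file, the first word of each line in the second file, and that the text you want sits inside parentheses in the first file:
import sys

# collect the ids from the second file once
ids = set()
with open(sys.argv[2]) as file2:
    for line in file2:
        words2 = line.split()
        if words2:
            ids.add(words2[0])

# scan the first file and print the text in parentheses for matching ids
with open(sys.argv[1]) as file1:
    for line in file1:
        words = line.split()
        if words and words[-1] in ids:
            print(line[line.find("(") + 1:line.find(")")])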
To compare the same line numbers in the two files:
file1 = open(sys.argv[1], "r")
file2 = open(sys.argv[2], "r")
for line1, line2 in zip(file1, file2):
    if line1.split()[-1] == line2.split()[0]:
        print line1  # use regex to extract the information needed
file1.close()
file2.close()
Make sure to close the files after use.

Two files combination

The first file looks something like this:
writing
writing
writing
writing
eating
eating
eating
doing
doing
doing
...
The second file looks this way:
writing write wrote written
eating eat ate
doing do does done
...
So basically, I need to add words (word by word) from the second file to each line of a first file (sequentially one word per line) and save it in a third file which would look like this:
writing writing
writing write
writing wrote
writing written
eating eating
eating eat
eating ate
doing doing
doing do
doing does
doing done
...
I tried this code but it does not do the job:
infile = open("first.txt", 'r') # open file for reading
infile2 = open("second.txt", 'r') # open file for reading
outfile = open("third.txt","w") # open file for writing
line = infile.readline()
line2 = infile2.readline() # Invokes readline() method on file
while line:
    outfile.write(line.strip(' ')+line2.strip("\n")+'\n')
    line = infile.readline()
    line2 = infile2.readline()
infile.close()
outfile.close()
infile2.close()
Why do you even need the first file?
infile2 = open('second.txt', 'r')
outfile = open('third.txt', 'w')
for line in infile2:
    words = line.split()
    outfile.write('\n'.join('%s %s' % (words[0], w) for w in words) + '\n')
outfile.close()
infile2.close()
To put your two files together, I would read both completely, split them in different ways to get your words, and then put them together.
Load the first file. In the first file there is one word per line, so read each line and store it in a list:
words_first = []
with open('first.txt') as f:
    for line in f:
        words_first.append(line.strip())  # strip the newline so it doesn't end up in the output
Load the second file. The second file has multiple words per line and multiple lines, so read each line and split it into words and store it into a list:
words_second = []
with open('second.txt') as f:
    for line in f:
        words_second.extend(line.strip().split(" "))  # strip the newline before splitting on spaces
Store into the new file. Now you have two lists of words, so use zip to pack them together and store them in the file:
with open('third.txt', 'w') as f:
    for first, second in zip(words_first, words_second):
        f.write("{0} {1}\n".format(first, second))
This version uses split() with no arguments, which splits on all whitespace (newlines and spaces), so you can split the complete files at once and get one list of words per file:
def get_words(file_path):
    with open(file_path) as f:
        return f.read().split()

with open('third.txt', 'w') as f:
    for first, second in zip(get_words("first.txt"), get_words("second.txt")):
        f.write("{0} {1}\n".format(first, second))
