Hi there.
I'm a beginner in Python and I'm struggling to do the following:
I have a file like this (10k+ lines):
EgrG_000095700 /product="ubiquitin carboxyl terminal hydrolase 5"
EgrG_000095800 /product="DNA polymerase epsilon subunit 3"
EgrG_000095850 /product="crossover junction endonuclease EME1"
EgrG_000095900 /product="lysine specific histone demethylase 1A"
EgrG_000096000 /product="charged multivesicular body protein 6"
EgrG_000096100 /product="NADH ubiquinone oxidoreductase subunit 10"
and this one (600+ lines):
EgrG_000076200.1
EgrG_000131300.1
EgrG_000524000.1
EgrG_000733100.1
EgrG_000781600.1
EgrG_000094950.1
All the IDs in the second file appear in the first one, so I want to extract the lines of the first file that correspond to the IDs of the second one.
I wrote the following script:
f1 = open('egranulosus_v3_2014_05_27.tsv').readlines()
f2 = open('eg_es_final_ids').readlines()
fr = open('res.tsv','w')

for line in f1:
    if line[0:14] == f2[0:14]:
        fr.write('%s' % (line))

fr.close()
print "Done!"
My idea was to search for the IDs by slicing a fixed number of characters from each line, to match the EgrG_XXXX part of one file against the other, and then write the matching lines to a new file.
I tried some modifications; the above is just the "core" of my idea.
I got nothing. With one of the modifications, I got just one line.
I'd store the ids from f2 in a set and then check f1 against that.
id_set = set()
with open('eg_es_final_ids') as f2:
    for line in f2:
        id_set.add(line.strip()[:-2])  # strip the newline, then drop the trailing ".1"

with open('egranulosus_v3_2014_05_27.tsv') as f1:
    with open('res.tsv', 'w') as fr:
        for line in f1:
            if line[:14] in id_set:  # the EgrG_XXXXXXXXX id is the first 14 characters
                fr.write(line)
line_storage = {}
with open('egranulosus_v3_2014_05_27.txt', 'r') as infile:
    for line in infile:
        data = line.split()
        key = data[0]
        value = line.replace('\n', '')
        line_storage[key] = value

with open('eg_es_final_ids.txt', 'r') as infile, open('my_output.txt', 'w') as outfile:
    for line in infile:
        lookup_key = line.split('.')[0]  # id without the ".1" suffix
        match = line_storage.get(lookup_key)
        outfile.write(''.join([str(match), '\n']))  # writes "None" if the id was not found
f2 is a list of the lines in file 2. Where are you iterating over that list, the way you iterate over the lines of file 1 (f1)? As written, line[0:14] (a string) is compared against f2[0:14] (a slice of the list of lines), so the condition can never be true.
That seems to be the problem.
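For instance, a minimal sketch of the script with that comparison fixed (same file names as in the question); the set makes each membership test cheap:

f1 = open('egranulosus_v3_2014_05_27.tsv').readlines()
f2 = open('eg_es_final_ids').readlines()
fr = open('res.tsv', 'w')

ids = set(line[:14] for line in f2)  # "EgrG_000076200", without the ".1"

for line in f1:
    if line[0:14] in ids:
        fr.write(line)

fr.close()
print "Done!"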
Related
I have two text files of around 370k words each, one word per line. One of the files is a list of English words and the other is a list of random gibberish words. I basically want to check whether any of the random words are in fact real words, so I want to compare every line in one file with every line in the other file.
I've tried the following:
f1 = open("file1.txt", "r")
f2 = open("file2.txt", "r")
for line in f1:
if line in f2:
print(line)
and this gives me about 3 results before the program inexplicably ends without an error.
Is there a better way of doing this?
From what I understood, you want the intersection of both lists; you can try this:
f1 = open("file1.txt", "r")
f2 = open("file2.txt", "r")
print(set(f1.readlines()).intersection(f2.readlines()))
f1.close()
f2.close()
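As to why the original loop stops after a few results: f2 is a file iterator, and the test "line in f2" consumes it, so once the iterator reaches the end of the file nothing more can ever match. Reading one file into a set first avoids that and makes each lookup fast. A minimal sketch, using the same file names:

with open("file1.txt", "r") as f1:
    words = set(line.strip() for line in f1)

with open("file2.txt", "r") as f2:
    for line in f2:
        if line.strip() in words:
            print(line.strip())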
I have two text files and I want to write out two new text files according to whether there is a common section to each line in the two original text files.
The format of the text files is as follows:
commontextinallcases uniquetext2 potentiallycommontext uniquetext4
There are more than 4 columns, but you get the idea. I want to check the 'potentiallycommontext' part in each text file and, if they are the same, write out the whole line of each text file to a new text file for each, with its own unique text still in place.
Splitting it is fairly easy, just using the .split() method when reading the file in. I have found the following code:
with open('some_file_1.txt', 'r') as file1:
    with open('some_file_2.txt', 'r') as file2:
        same = set(file1).intersection(file2)

same.discard('\n')

with open('some_output_file.txt', 'w') as file_out:
    for line in same:
        file_out.write(line)
But I am not sure this would work for my case where I need to split the lines. Is there a way to do this I am missing?
Thanks
I don't think this set approach is suitable for your case.
I'd try something like:
with open('some_file_1.txt', 'r') as file1, open('some_file_2.txt', 'r') as file2, open('some_output_file.txt', 'w') as file_out:
    for line1, line2 in zip(file1, file2):
        if line1.split()[2] == line2.split()[2]:
            file_out.write(line1)
            file_out.write(line2)
There might be shorter solutions but this should work
PCT_IDX = _  # find which index of line.split() corresponds to potentiallycommontext

def lines(filename):
    with open(filename, 'r') as file:
        for line in file:
            line = line.rstrip('\n')
            yield line

lines_1 = lines('some_file_1.txt')
lines_2 = lines('some_file_2.txt')

with open('some_output_file.txt', 'w') as file_out:
    for (line_1, line_2) in zip(lines_1, lines_2):
        maybe_cmn1 = line_1.split()[PCT_IDX]
        maybe_cmn2 = line_2.split()[PCT_IDX]
        if maybe_cmn1 == maybe_cmn2:
            file_out.write(line_1 + '\n')  # the generator stripped the newlines
            file_out.write(line_2 + '\n')
I have searched for similar questions on SO but didn't find anything that worked for me.
I have two large files, they should be the same but one of the files is 60 lines longer than the other. I want to know what these lines are and where I can find them.
I've read that one can use difflib to do this but I can't figure out how to go about it. I always get + and - in the file but I don't want that. I just want to scan through both files and report the uncommon 60 lines into a third file.
I wrote this code but it does not print out the different lines.
f1 = open('file1.txt','r')
f2 = open('file2.txt','r')
f3 = open('file3.txt','w')

diff = set(f1).difference(f2)
same.discard('\n')

for line in same:
    f3.write(line)
Well, you could do something like this:
with open('file1.txt') as infile:
    f1 = infile.readlines()
with open('file2.txt') as infile:
    f2 = infile.readlines()

only_in_f1 = [i for i in f1 if i not in f2]
only_in_f2 = [i for i in f2 if i not in f1]

with open('file3.txt', 'w') as outfile:
    if only_in_f1:
        outfile.write('Lines only in file 1:\n')
        for line in only_in_f1:
            outfile.write(line)
    if only_in_f2:
        outfile.write('Lines only in file 2:\n')
        for line in only_in_f2:
            outfile.write(line)
Note: lines are compared by content, not position, so identical content on different line numbers is not reported as a difference, while lines differing only in whitespace are.
You can easily solve this using sets.
set1 = set()
with open(file1) as f:
    for line in f:
        set1.add(line.strip())

# Repeat the same loop to build set2 from the second file

with open(diff_file, 'w') as f:
    for line in set2 - set1:  # lines in set2 that are not in set1
        f.write(line + '\n')
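Since the question mentions difflib, here is a minimal sketch using difflib.unified_diff that keeps only the lines missing from the second file and strips the +/- markers the asker doesn't want (file names as in the question; n=0 suppresses the context lines):

import difflib

with open('file1.txt') as f1, open('file2.txt') as f2:
    lines1 = f1.readlines()
    lines2 = f2.readlines()

with open('file3.txt', 'w') as f3:
    for line in difflib.unified_diff(lines1, lines2, n=0):
        # keep only lines present in file1 but not file2; skip the '---' header
        if line.startswith('-') and not line.startswith('---'):
            f3.write(line[1:])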
I have two text files; one of them contains a list of IDs with numbers, and the other contains a list of IDs with text. I want to compare the two files and, for the lines having the same ID, print the text inside the parentheses. This is what I have so far:
import fileinput
import sys

def clean(file1):
    with open(sys.argv[1], 'r') as file1:  # file ppx
        for line in file1:
            words = line.split()
            id1 = words[-1]
    with open(sys.argv[2], 'r') as file2:  # file ids
        for line in file2:
            words2 = line.split()
            id2 = words2[0]
    for line in file1:
        if id1 == id2[0]:
            text = s[s.find("(")+1:s.find(")")]
            print text
The first file looks like this: http://pastebin.com/PCU6f7vz
The second file looks like this: http://pastebin.com/Y2F3gkQv
But it does not work. Can somebody tell me why?
import sys

def clean(file1):
    with open(sys.argv[1], "r") as file1:
        file1_lines = file1.readlines()
    id1 = [line.strip().split() for line in file1_lines]
    with open(sys.argv[2], "r") as file2:
        file2_lines = file2.readlines()
    id2 = [line.strip().split() for line in file2_lines]
    id2_dict = {i[-1]: i[:-1] for i in id2}
    # You can print id2_dict and id1:
    # print id2_dict
    # print id1
    for index, line in enumerate(file1_lines):
        line_id = id1[index][-1].strip("(").strip(")")  # the id of this line, parentheses removed
        if line_id in id2_dict:
            text = line[line.find("(")+1:line.find(")")]
            print text
    # or:
    # text_lines = [line[line.find("(")+1:line.find(")")] for index, line in enumerate(file1_lines)
    #               if id1[index][-1].strip("(").strip(")") in id2_dict]
    # print text_lines
I am not sure what output you expect from the program, so I assumed you wanted to get text_lines.
file1 is an iterator that is exhausted once all the lines in the file have been read (which happens during the first for loop), and the file is closed when its with block ends. Therefore, the following loop
for line in file1:
will never do anything useful (iterating over a closed file actually raises a ValueError). But even if it did, the condition
if id1==id2[0]:
would never be true, because you're comparing the entire id1 to the first character of id2. Furthermore, you'd be doing exactly the same comparison over and over again, since those variables are never reassigned inside that loop.
And in your first two loops, you're constantly overwriting the exact same variables.
I think you need to read up on Python basics, especially the chapter on loops in the Python tutorial...
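A minimal sketch of how the two loops could be combined instead, assuming (as in the question's code) that the ID is the last word of each line in the first file and the first word of each line in the second:

import sys

def clean():
    # Collect the ids from the second file into a set
    with open(sys.argv[2], 'r') as file2:
        ids = set(line.split()[0] for line in file2 if line.split())
    # Print the text inside the parentheses for every matching line of the first file
    with open(sys.argv[1], 'r') as file1:
        for line in file1:
            words = line.split()
            if words and words[-1] in ids:
                print line[line.find("(") + 1:line.find(")")]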
To compare the same lines (by line number) in the two files:
file1 = open(sys.argv[1], "r")
file2 = open(sys.argv[2], "r")

for line1, line2 in zip(file1, file2):
    if line1.split()[-1] == line2.split()[0]:
        print line1  # use a regex to extract the information needed

file1.close()
file2.close()
Make sure to close the files after use.
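If closing the files manually is easy to forget, a with statement does it automatically, even when the loop raises an error (same logic as above):

with open(sys.argv[1], "r") as file1, open(sys.argv[2], "r") as file2:
    for line1, line2 in zip(file1, file2):
        if line1.split()[-1] == line2.split()[0]:
            print line1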
I've been trying to do this task all day, and I really want to learn how to do it using Python. I want to take two tab-delimited files, one with an ID only and the other with the same ID and some description. I can easily merge these files on the shared ID field with unix join, but for that I need to sort both, and I want to keep the ordering of the first file.
I've tried some code below, and my method has been to try to add things to a tuple since, from my understanding, they keep their order as you add to them. I haven't been able to get anything to work, though. Can anyone help?
Sample files:
file1 ->
111889
1437390
123
27998
2525778
12
1345
file2 ->
2525778'\t'item778
1345'\t'item110
123'\t'item1000
12'\t'item8889
111889'\t'item1111
1437390'\t'item222
27998'\t'item12
output ->
111889'\t'item1111
1437390'\t'item222
123'\t'item1000
27998'\t'item12
2525778'\t'item778
12'\t'item8889
1345'\t'item110
This is what I have so far:
import sys

add_list = ()
with open(sys.argv[1], 'rb') as file1, open(sys.argv[2], 'rb') as file2:
    for line2 in file2:
        f1, f2, f3 = line2.split('\t')
        #print f1, f2, f3
        for row in file1:
            #print row
            if row != f1:
                break
            else:
                add_list.append(f1, f2, '\n')
                break
The key is to use Python dictionaries; they are perfect for this task…
Here is a complete answer:
import sys

# Each id is mapped to its item name
# (split() splits at whitespace (including tabulation and newline), with no empty output strings):
items = dict(line.split() for line in open(sys.argv[2]))  # Inspired by mgilson's answer

with open(sys.argv[1]) as ids:
    for line in ids:
        id = line.rstrip()  # newline removed
        print '{}\t{}'.format(id, items[id])
Here is the result:
% python out.py file1.txt file2.txt
111889 item1111
1437390 item222
123 item1000
27998 item12
2525778 item778
12 item8889
1345 item110
PS: Note that I did not open the files in rb mode, as there is no need to keep the original newline bytes here, since we get rid of the trailing newlines anyway.
I would create a dictionary which maps the ID to the field value from the second file:
with open('file2') as fin:
    d = dict(x.strip().split(None, 1) for x in fin)  # strip the newline so the stored value is clean
Then I would use the first file to construct the output in order from the dictionary:
with open('file1') as fin, open('output', 'w') as fout:
    for line in fin:
        key = line.strip()
        fout.write('{key}\t{value}\n'.format(key=key, value=d[key]))
out = {}
with open(sys.argv[1], 'rb') as file1, open(sys.argv[2], 'rb') as file2:
    d2 = {}
    for line in file2:
        (key, val) = line.split('\t')
        d2[key] = val.strip()  # drop the trailing newline
    lines = file1.readlines()
    out = {x.strip(): d2[x.strip()] for x in lines}  # strip the newlines so the keys match
I am not sure about your sorting basis.