comparing two text files and extracting text in python

comparing two text files and extracting text in python - python

I have two text files, one of them contains list of ids with numbers, and the other one contains list of ids with text. I want to compare two files, and for the lines having the same id print the text inside parentheses. This is what I have so file:
import fileinput
import sys
def clean(file1):
with open(sys.argv[1], 'r') as file1: #file ppx
for line in file1:
words=line.split()
id1=words[-1]
with open(sys.argv[2], 'r') as file2: #file ids
for line in file2:
words2=line.split()
id2=words2[0]
for line in file1:
if id1==id2[0]:
text=s[s.find("(")+1:s.find(")")]
print text
The first file looks like this: http://pastebin.com/PCU6f7vz
The second file looks like this: http://pastebin.com/Y2F3gkQv
But it does not work. Can somebody tell me why?

def clean(file1):
with open(sys.argv[1], "r") as file1:
file1_lines = file1.readlines()
id1 = [line.strip().split() for line in file1_lines]
with open(sys.argv[2], "r") as file2:
file2_lines = file2.readlines()
id2 = [line.strip().split() for line in file2_lines]
id2_dict = {i[-1]:i[:-1] for i in id2}
#You can print id2_dict and id1.
#print id2_idct,
#print id1
for index, line in enumerate(file1_lines):
id1 = id1[index].strip("(").strip(")")
if id1 in id2_dict:
text = line[line.find("(")+1:line.find(")")]
print text
#or:
#text_lines = [line[line.find("(")+1:line.find(")")] for index, line in enumerate(file1_lines) if id1 in id2_dict]
#print text_lines
I don't know your mind about the output of programming, so I just think you wanted to get text_lines

file1 is an iterator that is exhausted after all the lines in the file have been read (which will happen during the first for loop). Therefore, the following loop
for line in file1:
will never run. But even if it did, the condition
if id1==id2[0]:
will never be true because you're comparing the entire id1 to the first character of id2. Furthermore, you'd be doing exactly the same comparison over and over again since those variables aren't even connected to the iterable.
And in your first two loops, you're constantly overwriting the exact same variables.
I think you need to read up on Python basics, especially the chapter on loops in the Python tutorial...

To compare the same lines(line no) in the two files:
file1 = open(sys.argv[1], "r")
file2 = open(sys.argv[2], "r")
for line1, line2 in file1,file2:
if(line1.split()[-1] == line2.split()[0]):
print line1 # use regex to extract the infromation needed
file1.close()
file2.close()
Make sure to close the files after use.

Related

Comparing two lines from two text files according to a single part of the text file

I have two text files and I want to write out two new text files according to whether there is a common section to each line in the two original text files.
The format of the text files is as follows:
commontextinallcases uniquetext2 potentiallycommontext uniquetext4
There are more than 4 columns but you get the idea. I want to check the 'potentiallycommontext' part in each text file and if they are the same write out the whole line of each text file to a new text file for each with its own unique text still in place.
Spliting it is fairly easy just using the .split() command when reading it in. I have found the following code:
with open('some_file_1.txt', 'r') as file1:
with open('some_file_2.txt', 'r') as file2:
same = set(file1).intersection(file2)
same.discard('\n')
with open('some_output_file.txt', 'w') as file_out:
for line in same:
file_out.write(line)
But I am not sure this would work for my case where I need to split the lines. Is there a way to do this I am missing?
Thanks

I don't think, that this set-approach is suitable for your case.
I'd try like
with open('some_file_1.txt', 'r') as file1, open('some_file_2.txt', 'r') as file2, open('some_output_file.txt', 'w') as file_out:
for line1, line2 in zip(file1, file2):
if line1.split()[2] == line2.split()[2]:
file_out.write(line1)
file_out.write(line2)

There might be shorter solutions but this should work
PCT_IDX = _ # find which index of line.split() corresponds to potentiallycommontext
def lines(filename):
with open(filename, 'r') as file:
for line in file:
line = line.rstrip('\n')
yield line
lines_1 = lines('some_file_1.txt')
lines_2 = lines('some_file_2.txt')
with open('some_output_file.txt', 'w') as file_out:
for (line_1, line_2) in zip(lines_1, lines_2):
maybe_cmn1 = line_1.split()[PCT_IDX]
maybe_cmn2 = line_2.split()[PCT_IDX]
if maybe_cmn1 == maybe_cmn2:
file_out.write(line_1)
file_out.write(line_2)

-python - concatenate text file and store it into variable

Is there a way to concatenate two text file without writting the result into a another file, but just store it in a variable?
What I'm looking for is something like
my_fun(cat(file1,file2))
where my_fun will read the result of the concatenation cat and use it as if it was a real text file.
In other word, I'd like to do
with open(my_fileOut,'w') as outfile:
for file in [file_1,file_2]:
with open(file,'r') as infile:
for line in infile:
outfile.write(line)
and remplace my_fileOut with a variable and therefore not make outfile.write(line) but store the result in memory
Thanks a lot in advance for any help or piece of advice,
Regards
PS : Sorry if my english is not very good

It looks like you just want to write a file in the end. So why not?:
def cat(f1, f2):
with open(f1, 'r') as f:
f1txt = f.read()
with open(f2, 'r') as f:
f2txt = f.read()
return f1txt + f2txt
def my_fun(f3, text):
with open(f3, 'w') as f:
f.write(text)
out = '/path/to/some/file'
file1 = '/path/to/file1'
file2 = '/path/to/file2'
my_fun(cat(file1, file2))
This will read all the data inside file1, then file2 and then add all the data from file2 to the end of the file1 data. If you mean to concatenate another way, please specify.

You can use itertools.chain():
from itertools import chain
def my_fun(f):
for line in f:
print(line.rstrip())
with open('file1') as file1, open('file2') as file2:
my_fun(chain(file1, file2))
This works because file objects are iterable, and chain() effectively concatenates one or more iterables.

Python - Comparing files delimiting characters in line

there.
I'm a begginer in python and I'm struggling to do the following:
I have a file like this (+10k line):
EgrG_000095700 /product="ubiquitin carboxyl terminal hydrolase 5"
EgrG_000095800 /product="DNA polymerase epsilon subunit 3"
EgrG_000095850 /product="crossover junction endonuclease EME1"
EgrG_000095900 /product="lysine specific histone demethylase 1A"
EgrG_000096000 /product="charged multivesicular body protein 6"
EgrG_000096100 /product="NADH ubiquinone oxidoreductase subunit 10"
and this one (+600 lines):
EgrG_000076200.1
EgrG_000131300.1
EgrG_000524000.1
EgrG_000733100.1
EgrG_000781600.1
EgrG_000094950.1
All the ID's of the second file are in the first one,so I want the lines of the first file corresponding to ID's of the second one.
I wrote the following script:
f1 = open('egranulosus_v3_2014_05_27.tsv').readlines()
f2 = open('eg_es_final_ids').readlines()
fr = open('res.tsv','w')
for line in f1:
if line[0:14] == f2[0:14]:
fr.write('%s'%(line))
fr.close()
print "Done!"
My idea was to search the id's delimiting the characters on each line to match EgrG_XXXX of one file to the other, an then, write the lines to a new file.
I tried some modifications, that's just the "core" of my idea.
I got nothing. In one of the modifications, I got just one line.

I'd store the ids from f2 in a set and then check f1 against that.
id_set = set()
with open('eg_es_final_ids') as f2:
for line in f2:
id_set.add(line[:-2]) #get rid of the .1
with open('egranulosus_v3_2014_05_27.tsv') as f1:
with open('res.tsv', 'w') as fr:
for line in f1:
if line[:14] in id_set:
fr.write(line)

with open('egranulosus_v3_2014_05_27.txt', 'r') as infile:
line_storage = {}
for line in infile:
data = line.split()
key = data[0]
value = line.replace('\n', '')
line_storage[key] = value
with open('eg_es_final_ids.txt', 'r') as infile, open('my_output.txt', 'w') as outfile:
for line in infile:
lookup_key = line.split('.')[0]
match = line_storage.get(lookup_key)
outfile.write(''.join([str(match), '\n']))

f2 is a list of lines in file-2. Where are you iterating over the list, like you are doing for lines in file-1 (f1).
That seems to be the problem.

How to get the same part of file content?

I have two files , and I want to display the content exist in file 1 and file 2 in the screen. But it seems nothing display ( but it should display オレンジ) . What is the problem?
thanks
File 1
リンゴ
バナナ
オレンジ
File 2
オレンジ
Here is my code
import sys
File1 = open(sys.argv[1],"r",encoding="UTF-8")
F1_Content = File1.readlines()
F1_Content = map(lambda e : e.rstrip("\n"),F1_Content)
File2 = open(sys.argv[2],"r",encoding="UTF-8")
F2_Content = File2.readlines()
F2_Content = map(lambda e : e.rstrip("\n"),F2_Content)
for line in F1_Content:
print(repr(line))
if line in F2_Content:
print(line)
File1.close()
File2.close()
Output
'\ufeff
''
''

You probably have more whitespace in one of the files than just a newline. You could loop over either F1_Content and F2_Content, printing the representation of that line with print(repr(line)) or print(line.encode('unicode_escape')) to make it easier to spot how the lines differ.
I'd strip the lines entirely. Also, use a set for the lines of one file, testing will be much more efficient:
with open(sys.argv[1], "r") as file1:
f1_content = {line.strip() for line in file1}
with open open(sys.argv[2], "r") af file2:
for line in file2:
if line.strip() in file2:
print(line)
Looping directly over the file itself reads the file line by line, letting you handle file lines without having to read the whole file into memory.
Note also the use of with statements here; file objects are context managers, and when the context closes (the with block ends) the file object is automatically closed for you.
With Katakana, there is also the possibility that one of your files uses decomposition for the ZI character while the other does not; you can either express it as \u30B8 or as \u30B7\u3099; (SI + COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK):
>>> print('\u30B8 != \u30B7\u3099:', '\u30B8' != '\u30B7\u3099')
ジ != ジ: True
You can use unicodedata.normalize() to normalize all your lines to either composed or decomposed forms. Here I force all data to use composed forms:
from unicodedata import normalize
with open(sys.argv[1], "r") as file1:
f1_content = {normalize(line.strip(), 'NFKC') for line in file1}
with open open(sys.argv[2], "r") af file2:
for line in file2:
if normalize(line.strip(), 'NFKC') in file2:
print(line)

Python: Sorting a two files based on the order of one

I've been trying to do this task all day, and I really want to learn how to do it using Python. I want to take two tab-delimited files, one with an ID only and the other with the same ID and some description. I can easily merge these files on the shared ID field with unix join, but for that I need to sort both and I want to keep the ordering of the first file.
Ive tried some code below, and my method has been to try and add things to a tuple, as from my understanding, they will keep their order as you add to it. I havent been able to get anything to work though. Can anyone help?
Sample files:
file1 ->
111889
1437390
123
27998
2525778
12
1345
file2 ->
2525778'\t'item778
1345'\t'item110
123'\t'item1000
12'\t'item8889
111889'\t'item1111
1437390'\t'item222
27998'\t'item12
output ->
111889'\t'item1111
1437390'\t'item222
123'\t'item1000
27998'\t'item12
2525778'\t'item778
12'\t'item8889
1345'\t'item110
This what I have so far:
import sys
add_list = ()
with open(sys.argv[1], 'rb') as file1, open(sys.argv[2], 'rb') as file2:
for line2 in file2:
f1, f2, f3 = line2.split('\t')
#print f1, f2, f3
for row in file1:
#print row
if row != f1:
break
else:
add_list.append(f1,f2,'\n')
break

The key is to use Python dictionaries, they are perfect for this task…
Here is a complete answer:
import sys
# Each id is mapped to its item name
# (split() splits at whitespaces (including tabulation and newline), with no empty output strings):
items = dict(line.split() for line in open(sys.argv[2])) # Inspired by mgilson's answer
with open(sys.argv[1]) as ids:
for line in ids:
id = line.rstrip() # newline removed
print '{}\t{}'.format(id, items[id])
Here is the result:
% python out.py file1.txt file2.txt
111889 item1111
1437390 item222
123 item1000
27998 item12
2525778 item778
12 item8889
1345 item110
PS: Note that I did not open the files in rb mode, as there is no need to keep the original newline bytes, here, since we get rid of trailing newlines.

I would create a dictionary which maps the ID to the field value from the second file:
with open('file2') as fin:
d = dict(x.split(None, 1) for x in fin)
Then I would use the first file to construct the output in order from the dictionary:
with open('file1') as fin, open('output', 'w') as fout:
for line in fin:
key = line.strip()
fout.write('{key}\t{value}\n'.format(key=key, value=d[key])

out = {}
with open(sys.argv[1], 'rb') as file1, open(sys.argv[2], 'rb') as file2:
d2 = {}
for line in file2:
(key, val) = line.split('\t')
d2[key] = val
lines = file1.readlines()
out = { x:d2[x] for x in lines }
I am not sure about your sorting basis.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

comparing two text files and extracting text in python - python

Related

Comparing two lines from two text files according to a single part of the text file

-python - concatenate text file and store it into variable

Python - Comparing files delimiting characters in line

How to get the same part of file content?

Python: Sorting a two files based on the order of one

Categories

Resources