Intersecting regions of two files and printing combined result - python

I have two big files. I want to find the names that appear in both column 1 of file1 and column 2 of file2. The script below finds the common names, but I also want to print the corresponding data from file1 in the output, and that part does not work. How can I fix it?
file1.txt
GRMZM5G888627_P01 GO:0003674 molecular_function
GRMZM5G888620_P01 GO:0008150 biological_process
GRMZM5G888625_P03 GO:0008152 metabolic process
file2.txt
contig1 GRMZM5G888627_P01
contig2 AT2G41790.1
contig3 GRMZM5G888625_P03
Desired output:
contig1 GRMZM5G888627_P01 GO:0003674 molecular_function
contig3 GRMZM5G888625_P03 GO:0008152 metabolic process
Script:
f1 = open('file1.txt', 'r')
f2 = open('file2.txt', 'r')
output = open('result.txt', 'w')
dictA = dict()
for line1 in f1:
    listA = line1.rstrip('\n').split('\t')
    dictA[listA[0]] = listA
for line1 in f2:
    new_list = line1.rstrip('\n').split('\t')
    query = new_list[0]
    subject = new_list[1]
    new_list.append(query)
    new_list.append(subject)
    if subject in dictA:
        output.writelines(query + '\t' + subject + '\t' + str(listA[1]) + str(listA[2]) + '\n')
output.close()

Inside the
for line1 in f2:
loop, listA is not mapped to the entry associated with that f2 line; it still holds whatever the first loop last assigned to it. You stored the rows in dictA. Once you test that the subject is in dictA, you need to retrieve the proper listA:
for line1 in f2:
    new_list = line1.rstrip('\n').split('\t')
    query = new_list[0]
    subject = new_list[1]
    new_list.append(query)
    new_list.append(subject)
    if subject in dictA:
        listA = dictA[subject]
        output.writelines(query + '\t' + subject + '\t' + str(listA[1]) + str(listA[2]) + '\n')
output.close()
I don't understand why you are appending to new_list in here:
query=new_list[0]
subject=new_list[1]
new_list.append(query)
new_list.append(subject)
When processing the first line, you read in:
contig1 GRMZM5G888627_P01
Into new_list, giving you essentially:
new_list == ['contig1', 'GRMZM5G888627_P01']
Then you set query and subject to the two items in the list. Then append them back onto it, giving you:
new_list == ['contig1', 'GRMZM5G888627_P01', 'contig1', 'GRMZM5G888627_P01']
Which you never use. You should be able to just have:
for line1 in f2:
    new_list = line1.rstrip('\n').split('\t')
    subject = new_list[1]
    if subject in dictA:
        listA = dictA[subject]
        output.writelines(new_list[0] + '\t' + subject + '\t' + str(listA[1]) + str(listA[2]) + '\n')
output.close()
Also, you are writing only one line at a time, so output.write is fine. String addition is usually bad practice, so I replaced it with format. Your listA already stores strings, so I eliminated the str() calls.
for line1 in f2:
    new_list = line1.rstrip('\n').split('\t')
    subject = new_list[1]
    if subject in dictA:
        listA = dictA[subject]
        output.write("{}\t{}\t{}{}\n".format(new_list[0], subject, listA[1], listA[2]))
output.close()
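For completeness, the same join can be written with with blocks so the file handles are closed automatically. This is a sketch, not the answer's code: it reuses the filenames from the question and adds a tab between the two file1 fields, which the versions above omit.

```python
def join_files(path1, path2, out_path):
    # map column 1 of file1 to its full split line
    lookup = {}
    with open(path1) as f1:
        for line in f1:
            fields = line.rstrip('\n').split('\t')
            lookup[fields[0]] = fields
    # stream file2, writing a joined row for every ID found in file1
    with open(path2) as f2, open(out_path, 'w') as out:
        for line in f2:
            parts = line.rstrip('\n').split('\t')
            if len(parts) < 2:
                continue
            query, subject = parts[0], parts[1]
            if subject in lookup:
                fields = lookup[subject]
                out.write('{}\t{}\t{}\t{}\n'.format(query, subject, fields[1], fields[2]))
```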

try this:
ins = open('file1.txt', 'r')
d = {}  # avoid shadowing the built-in name "dict"
for line in ins:
    arrayline = line.split()
    d[arrayline[0]] = '\t'.join(arrayline)
file2 = open('file2.txt', 'r')
output = open('result.txt', 'w')
for line in file2:
    array2 = line.split()
    try:
        v = d[array2[1]]
        output.write('\n' + array2[0] + '\t' + v)
    except KeyError:  # the ID is not present in file1
        pass
output.close()

use sets
In [1]: list1=[1,2,3,4,5,6,7,8,9]
In [2]: list2=[1,2,3,10,11,12,13]
In [3]: list1=set(list1)
In [4]: list1.intersection(list2)
Out[4]: {1, 2, 3}
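A sketch of the same idea applied to the files in the question: intersect the ID columns, then use the common IDs to look up the file1 data. The dicts here stand in for the parsed files.

```python
# dicts standing in for the parsed files: ID -> annotation, ID -> contig
rows1 = {'GRMZM5G888627_P01': ['GO:0003674', 'molecular_function'],
         'GRMZM5G888625_P03': ['GO:0008152', 'metabolic process']}
rows2 = {'GRMZM5G888627_P01': 'contig1', 'AT2G41790.1': 'contig2'}

# iterating a dict yields its keys, so set() gives the ID columns directly
common = set(rows1) & set(rows2)
for key in sorted(common):
    print('\t'.join([rows2[key], key] + rows1[key]))
```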

Related

Read lines in one file and find all strings starting with 4-letter strings listed in another txt file

I have 2 txt files (file_a.txt and file_b.txt).
file_a.txt contains a long list of 4-letter combinations (one combination per line):
aaaa
bcsg
aacd
gdee
aadw
hwer
etc.
file_b.txt contains a list of letter combinations of various length (some with spaces):
aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake
etc.
I am looking for a python script that would allow me to do the following:
read file_a.txt line by line
take each 4-letter combination (e.g. aaai)
read file_b.txt and find all the various-length letter combinations starting with the 4-letter combination (eg. aaaibjkes, aaailoiersaaageehikjaaa, aaailoiuwegoiglkjaaaike etc.)
print the results of each search in a separate txt file named with the 4-letter combination.
File aaai.txt:
aaaibjkes
aaailoiersaaageehikjaaa
aaailoiuwegoiglkjaaake
etc.
File bcsi.txt:
bcspwiopiejowih
bcsiweyoieotpwe
etc.
I'm sorry I'm a newbie. Can someone point me in the right direction, please. So far I've got only:
#I presume I will have to use regex at some point
import re
file1 = open('file_a.txt', 'r').readlines()
file2 = open('file_b.txt', 'r').readlines()
#Should I look into findall()?
I hope this helps:
file1 = open('file_a.txt', 'r')
file2 = open('file_b.txt', 'r')
# get every line of the second file into a list
mylist = file2.readlines()
# read each line in the first file
for line in file1:
    searchStr = line.strip()
    # find this string anywhere in the lines of the second file
    exists = [s for s in mylist if searchStr in s]
    if exists:
        # if matches exist, create a file named after the search string
        fileNew = open(searchStr + '.txt', 'w')
        for found in exists:
            fileNew.write(found)
        fileNew.close()
file1.close()
file2.close()
Note that the original version used while file1.readline() and then called readline() again inside the loop, which skips every other line; iterating the file object directly avoids that.
What you can do is open both files and walk through them line by line using nested for loops.
The outer loop reads file_a.txt, since you only need to pass over it once. The inner loop rescans file_b.txt looking for lines that start with the current string.
To do the search you can use .find(). Since the match must be at the start of the line, the returned index should be 0 (str.startswith would also work).
file_a = open("file_a.txt", "r")
file_b = open("file_b.txt", "r")

for a_line in file_a:
    # This result value will be written into your new file
    result = ""
    # This is what we will search with
    search_val = a_line.strip("\n")
    print "---- Using " + search_val + " from file_a to search. ----"
    for b_line in file_b:
        print "Searching file_b using " + b_line.strip("\n")
        if b_line.strip("\n").find(search_val) == 0:
            result += b_line
    print "---- Search ended ----"
    # Set the read pointer to the start of the file again
    file_b.seek(0, 0)
    if result:
        # Write the contents of "result" into a file named after "search_val"
        with open(search_val + ".txt", "a") as f:
            f.write(result)

file_a.close()
file_b.close()
Test Cases:
I am using the test cases in your question:
file_a.txt
aaaa
bcsg
aacd
gdee
aadw
hwer
file_b.txt
aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake
The program produces an output file bcsg.txt as it is supposed to with bcsgiweyoieotpwe inside.
Try this:
f1 = open("a.txt", "r").readlines()
f2 = open("b.txt", "r").readlines()

file1 = [word.replace("\n", "") for word in f1]
file2 = [word.replace("\n", "") for word in f2]

data = []
data_dict = {}
for short_word in file1:
    data += [[short_word, w] for w in file2 if w.startswith(short_word)]

for single_data in data:
    if single_data[0] in data_dict:
        data_dict[single_data[0]].append(single_data[1])
    else:
        data_dict[single_data[0]] = [single_data[1]]

for key, val in data_dict.iteritems():
    open(key + ".txt", "w").writelines("\n".join(val))
    print(key + ".txt created")
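Since the answers above rescan file_b.txt once per prefix, a faster alternative (a sketch, not from the original answers) is to read file_b.txt a single time and bucket each line under its first four characters. Because every search string is exactly four letters, a dict lookup on the prefix is equivalent to the startswith test:

```python
from collections import defaultdict

def bucket_by_prefix(lines, prefix_len=4):
    # group each line under its first prefix_len characters, so each
    # prefix lookup becomes a dict access instead of a full file scan
    buckets = defaultdict(list)
    for line in lines:
        line = line.rstrip('\n')
        if len(line) >= prefix_len:
            buckets[line[:prefix_len]].append(line)
    return buckets

# usage: for each 4-letter code, buckets[code] holds every matching line
buckets = bucket_by_prefix(['aaaibjkes', 'bcsgiweyoieotpwe', 'aaailoiersaaageehikjaaa'])
```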

How can I compare files quicker in Python?

Is there any way to make this script faster? I'm using one file to compare against another file and printing the lines whose ID columns are equal.
import csv

output = []
a = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Phase1_missing.vcf', 'r')
list1 = a.readlines()
reader1 = a.read()
b = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais.vcf', 'r')
list2 = b.readlines()
reader2 = b.read()
f3 = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais_and_YRI.vcf', 'w')
for line1 in list1:
    separar = line1.split("\t")
    gene = separar[2]
    for line2 in list2:
        separar2 = line2.split("\t")
        gene2 = separar2[2]
        if gene == gene2:
            print line1
            f3.write(line1)
Input example (for both files):
1 14107321 rs187821037 C T 100 PASS AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout
1 14107321 rs187821037 C T 100 PASS AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout
1 14107321 rs187821037 C T 100 PASS AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout
The command line below works equally for same purpose in bash:
awk 'FNR==NR {a[$3]; next} $3 in a' Neandertais.vcf Phase1_missing.vcf > teste.vcf
How can I improve this Python script?
If you store your lines in dictionaries that are keyed by the column that you are interested in, you can easily use Python's built-in set functions (which run at C speed) to find the matching lines. I tested a slightly modified version of this (filenames changed, and changed split('\t') to split() because of stackoverflow formatting) and it seems to work fine:
import collections

# Use 'rb' to open files
infn1 = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Phase1_missing.vcf'
infn2 = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais.vcf'
outfn = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais_and_YRI.vcf'

def readfile(fname):
    '''
    Read in a file and return a dictionary of lines, keyed by the field at index 2
    '''
    results = collections.defaultdict(list)
    # Read in binary mode -- it's quicker
    with open(fname, 'rb') as f:
        for line in f:
            parts = line.split("\t")
            if len(parts) < 3:
                continue
            gene = parts[2]
            results[gene].append(line)
    return results

dict1 = readfile(infn1)
dict2 = readfile(infn2)

with open(outfn, 'wb') as outf:
    # Find keys that appear in both files
    for key in set(dict1) & set(dict2):
        # For these keys, print all the matching
        # lines in the first file
        for line in dict1[key]:
            print(line.rstrip())
            outf.write(line)
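If only the matching lines of one file are needed, which is what the awk one-liner does, building dictionaries of both files is optional; a set of keys from the second file is enough. A sketch with hypothetical file paths:

```python
def filter_by_key(path_keys, path_data, out_path, col=2):
    # collect the key column (index 2, as in the script above) from the first file
    keys = set()
    with open(path_keys) as f:
        for line in f:
            parts = line.split('\t')
            if len(parts) > col:
                keys.add(parts[col])
    # stream the second file, keeping only lines whose key column is in the set
    with open(path_data) as f, open(out_path, 'w') as out:
        for line in f:
            parts = line.split('\t')
            if len(parts) > col and parts[col] in keys:
                out.write(line)
```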

Merge 3 Textfiles with python

I'm really new to programming and couldn't find a satisfying answer so far. I'm using Python and I want to merge three text files to receive all possible word combinations. I have 3 files:
First file:
line1
line2
line3
Second file(prefix):
pretext1
pretext2
pretext3
Third file(suffix):
suftext1
suftext2
suftext3
I already used .read() and have variables containing the list for each text file. Now I want to write a function that merges these 3 files into 1, and the output should look like this:
outputfile:
pretext1 line1 suftext1 #this is ONE line(str)
pretext2 line1 suftext1
pretext3 line1 suftext1
pretext1 line1 suftext2
pretext1 line1 suftext3
and so on, you get the idea
I want all possible combinations in 1 textfile as output. I guess I have to use a loop within a loop?!
Here it is, if I got your question right.
First you have to change into the folder containing the files, using the os module.
import os
os.chdir("The_path_of_the_folder_containing_the_files")
Then you open your three files and put the lines into lists:
file_1 = open("file_1.txt")
file_1 = file_1.read()
file_1 = file_1.split("\n")
file_2 = open("file_2.txt")
file_2 = file_2.read()
file_2 = file_2.split("\n")
file_3 = open("file_3.txt")
file_3 = file_3.read()
file_3 = file_3.split("\n")
You create the text you want in your output file with loops:
text_output = ""
for i in range(len(file_2)):
    for j in range(len(file_1)):
        for k in range(len(file_3)):
            text_output += file_2[i] + " " + file_1[j] + " " + file_3[k] + "\n"
Then you write that text into your output file (if the file does not exist, it will be created):
file_output = open("file_output.txt","w")
file_output.write(text_output)
file_output.close()
While the existing answer may be correct, I think this is a case where bringing in a library function is definitely the way to go.
import itertools

with open('lines.txt') as line_file, open('pretext.txt') as prefix_file, open('suftext.txt') as suffix_file:
    lines = [l.strip() for l in line_file.readlines()]
    prefixes = [p.strip() for p in prefix_file.readlines()]
    suffixes = [s.strip() for s in suffix_file.readlines()]

combos = ['%s %s %s' % (x[1], x[0], x[2])
          for x in itertools.product(lines, prefixes, suffixes)]
for c in combos:
    print c
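Since the question asks for a text file rather than printed output, the product can also be written to disk directly. A sketch; the name outputfile.txt is taken from the question:

```python
import itertools

def write_combinations(lines, prefixes, suffixes, out_path):
    # write every prefix/line/suffix combination, one per output line
    with open(out_path, 'w') as out:
        for line, pre, suf in itertools.product(lines, prefixes, suffixes):
            out.write('{} {} {}\n'.format(pre, line, suf))

write_combinations(['line1'], ['pretext1', 'pretext2'], ['suftext1'], 'outputfile.txt')
```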

Finding common IDs in different .txt files and appending additional corrisponding lines

I am trying to find the common IDs present in two files and print the result to a new file, appending the additional data corresponding to those IDs. How can I do this?
Input file1.txt
F775_23607 EMT15298 GO:0003674 molecular_function PF08268 345
F775_00510 EMT20601 GO:0005515 protein binding PF08268 456
F775_00510 EMT23774 GO:0003674 molecular_function PF00646 134
F775_00510 EMT23774 GO:0005515 protein binding PF03106 888
F775_23182 EMT33502 GO:0003677 DNA binding PF03106 789
Input file2.txt
contig15 EMT15298 95.27 148
contig18 EMT04099 97.95 293
contig18 EMT20601 92.83 293
contig18 EMT23062 93.17 293
Desired output file (I want to be able to decide which lines to print and which not)
EMT15298 GO:0003674 molecular_function PF08268
EMT20601 GO:0005515 protein binding PF08268
My script (which, at the moment, prints only the column that the files have in common):
fileA = open("file1.txt", 'r')
fileB = open("file2.txt", 'r')
output = open("results.txt", 'w')

fileA.next()

setA = set()
for line1 in fileA:
    listA = line1.split('\t')
    setA.add(listA[1])

setB = set()
for line1 in fileB:
    listB = line1.split('\t')
    setB.add(listB[1])

for key in setA & setB:
    output.writelines(key + '\n')
Since your first text file contains all of the "fields" for the output we can reduce the logic and number of steps slightly.
First we open the two input files and read them into lists:
with open('file1.txt', 'r') as a, open('file2.txt', 'r') as b:
    fileA = [l.rstrip('\n').split('\t')[1:5] for l in a.readlines()]
    fileB = [l.rstrip('\n').split('\t')[1:] for l in b.readlines()]
So now we have two lists, fileA and fileB. You'll notice the slice notation on both of them. Since fileA has all of the values you want for the output, it is ready as-is; it just needs to be filtered against the second list. I've also removed the first item from both lists so we can use the EMT... values for comparison.
Now we can check whether fileB contains (not in its entirety) each fileA entry and write the matches to the results file:
with open('results.txt', 'w') as o:
    for line in fileA:
        if any(line[0] in l for l in fileB):
            o.write('%s\n' % '\t'.join(line))
results.txt is once again tab-delimited with the corresponding matches:
EMT15298 GO:0003674 molecular_function PF08268
EMT20601 GO:0005515 protein binding PF08268
You can use dicts instead of sets:
fileA = open("file1.txt", 'r')
fileB = open("file2.txt", 'r')
output = open("results.txt", 'w')

dictA = dict()
for line1 in fileA:
    listA = line1.split('\t')
    dictA[listA[1]] = listA

dictB = dict()
for line1 in fileB:
    listB = line1.split('\t')
    dictB[listB[1]] = listB

for key in set(dictA).intersection(dictB):
    output.write(dictA[key][1] + '\t' + dictA[key][2] + '\t' + dictA[key][3] + '\t' + dictA[key][4] + '\n')
If you just want to do a "join" operation you can use the Unix join command, specifying the join column; for a tab-delimited file it would be:
join file1.txt file2.txt -j2
The rows need to be sorted on the join field, otherwise it will not work; you can use the sort command for that.
In addition, to select the columns you want, you can pipe the output to the cut command:
join file1.txt file2.txt -j2 | cut -f2,3,4,5

Compare two different files line by line in python

I have two different files and I want to compare their contents line by line, and write their common contents to a different file. Note that both of them contain some blank lines.
Here is my pseudo code:
file1 = open('some_file_1.txt', 'r')
file2 = open('some_file_2.txt', 'r')
FO = open('some_output_file.txt', 'w')

for line1 in file1:
    for line2 in file2:
        if line1 == line2:
            FO.write("%s\n" % (line1))

FO.close()
file1.close()
file2.close()
However, by doing this, I got lots of blank lines in my FO file. It seems the common blank lines are also written. I want to write only the text part. Can somebody please help me?
For example: my first file (file1) contains data:
Config:
Hostname = TUVALU
BT:
TS_Ball_Update_Threshold = 0.2
BT:
TS_Player_Search_Radius = 4
BT:
Ball_Template_Update = 0
while second file (file2) contains data:
Pole_ID = 2
Width = 1280
Height = 1024
Color_Mode = 0
Sensor_Scale = 1
Tracking_ROI_Size = 4
Ball_Template_Update = 0
If you notice, the last line of each file is the same, and hence I want to write that line to my FO file. But the problem with my approach is that it writes the common blank lines as well. Should I use regex for this problem? I do not have experience with regex.
This solution reads both files in one pass, excludes blank lines, and prints common lines regardless of their position in the file:
with open('some_file_1.txt', 'r') as file1:
    with open('some_file_2.txt', 'r') as file2:
        same = set(file1).intersection(file2)

same.discard('\n')

with open('some_output_file.txt', 'w') as file_out:
    for line in same:
        file_out.write(line)
Yet another example...
from __future__ import print_function  # Only for Python 2

with open('file1.txt') as f1, open('file2.txt') as f2, open('outfile.txt', 'w') as outfile:
    for line1, line2 in zip(f1, f2):
        if line1 == line2:
            print(line1, end='', file=outfile)
And if you want to eliminate common blank lines, just change the if statement to:
if line1.strip() and line1 == line2:
.strip() removes all leading and trailing whitespace, so if that's all that's on a line, it will become an empty string "", which is considered false.
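A quick illustration of that behaviour, using a line from the question's data:

```python
# a line containing only whitespace becomes falsy after .strip()
blank = "   \n"
text = "Ball_Template_Update = 0\n"
print(bool(blank.strip()))  # whitespace-only line -> skipped by the if test
print(bool(text.strip()))   # real content -> compared as usual
```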
If you are specifically looking for getting the difference between two files, then this might help:
with open('first_file', 'r') as file1:
    with open('second_file', 'r') as file2:
        difference = set(file1).difference(file2)

difference.discard('\n')

with open('diff.txt', 'w') as file_out:
    for line in difference:
        file_out.write(line)
If order is preserved between files you might also prefer difflib. Although Robᵩ's result is the bona-fide standard for intersections you might actually be looking for a rough diff-like:
from difflib import Differ

with open('cfg1.txt') as f1, open('cfg2.txt') as f2:
    differ = Differ()
    for line in differ.compare(f1.readlines(), f2.readlines()):
        if line.startswith("  "):
            print(line[2:], end="")
That said, this has a different behaviour to what you asked for (order is important) even though in this instance the same output is produced.
Once the file object is iterated, it is exhausted.
>>> f = open('1.txt', 'w')
>>> f.write('1\n2\n3\n')
>>> f.close()
>>> f = open('1.txt', 'r')
>>> for line in f: print line
...
1
2
3
# exhausted, another iteration does not produce anything.
>>> for line in f: print line
...
>>>
Use file.seek (or close/open the file) to rewind the file:
>>> f.seek(0)
>>> for line in f: print line
...
1
2
3
Try this:
from __future__ import with_statement

filename1 = "G:\\test1.TXT"
filename2 = "G:\\test2.TXT"

with open(filename1) as f1:
    with open(filename2) as f2:
        file1list = f1.read().splitlines()
        file2list = f2.read().splitlines()

list1length = len(file1list)
list2length = len(file2list)

if list1length == list2length:
    for index in range(len(file1list)):
        if file1list[index] == file2list[index]:
            print file1list[index] + " == " + file2list[index]
        else:
            print file1list[index] + " != " + file2list[index] + " Not-Equal"
else:
    print "difference in the size of the file and number of lines"
I have just been faced with the same challenge, but I thought "Why program this in Python when you can solve it with a simple grep?", which led to the following Python code:
import subprocess
from subprocess import PIPE

try:
    output1, errors1 = subprocess.Popen(["c:\\cygwin\\bin\\grep", "-Fvf", "c:\\file1.txt", "c:\\file2.txt"], shell=True, stdout=PIPE, stderr=PIPE).communicate()
    output2, errors2 = subprocess.Popen(["c:\\cygwin\\bin\\grep", "-Fvf", "c:\\file2.txt", "c:\\file1.txt"], shell=True, stdout=PIPE, stderr=PIPE).communicate()
    if len(output1) + len(output2) + len(errors1) + len(errors2) > 0:
        print("Compare result : There are differences:")
        if len(output1) + len(output2) > 0:
            print(" Output differences : ")
            print(output1)
            print(output2)
        if len(errors1) + len(errors2) > 0:
            print(" Errors : ")
            print(errors1)
            print(errors2)
    else:
        print("Compare result : Both files are equal")
except Exception as ex:
    print("Compare result : Exception during comparison")
    print(ex)
    raise
The trick behind this is the following:
grep -Fvf file1.txt file2.txt prints the lines of file2.txt that do not appear in file1.txt, so an empty result in both directions means the contents of both files are "equal". I put "equal" in quotes because duplicate lines are disregarded by this way of working.
Obviously, this is just an example: you can replace grep with any command-line file comparison tool.
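The same two-way check can be done in pure Python with sets, with the same caveat about duplicate lines. A sketch with hypothetical paths:

```python
def files_equal_as_sets(path1, path2):
    # compare the sets of lines, mirroring the two-way grep -Fvf check;
    # like the grep approach, duplicates and ordering are disregarded
    with open(path1) as f1, open(path2) as f2:
        return set(f1) == set(f2)
```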
difflib is well worth the effort, with nice condensed output.
from pathlib import Path
import difflib

mypath = '/Users/x/lib/python3'
file17c = Path(mypath, 'oop17c.py')
file18c = Path(mypath, 'oop18c.py')

with open(file17c) as file_1:
    file1 = file_1.readlines()
with open(file18c) as file_2:
    file2 = file_2.readlines()

for line in difflib.unified_diff(
        file1, file2, fromfile=str(file17c), tofile=str(file18c), lineterm=''):
    print(line)
output
+ ... unique stuff present in file18c
- ... stuff absent in file18c but present in file17c
