I have two text files, each approximately 1 GB, where each line has 60 columns. Six of those columns are the keys on which the two files should be compared.
Example:
file1:
4|null|null|null|null|null|3590740374739|20077|7739662|75414741|
file2:
4|null|11|333|asdsd|null|3590740374739|20077|7739662|75414741|
Here the two lines are considered equal because columns 7, 8, 9, and 10 (the keys) are the same in both files.
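For illustration, a minimal sketch of extracting that key from a line (1-based columns 7-10 become indices 6-9 after splitting on '|'):

line = '4|null|null|null|null|null|3590740374739|20077|7739662|75414741|'
key = tuple(line.split('|')[6:10])
print(key)  # ('3590740374739', '20077', '7739662', '75414741')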
I tried a sample that compares the files without considering keys, which works fine, but I need to compare based on the keys, not character by character on each line.
Here is the code sample I wrote to compare without considering keys.
matched = open('matchedrecords.txt', 'w')
with open('srcone.txt') as b:
    blines = set(b)
with open('srctwo.txt') as a:
    alines = set(a)
with open('notInfirstSource.txt', 'w') as result:
    for line in alines:
        if line not in blines:
            result.write(line)
        else:
            matched.write(line)
with open('notInsecondSource.txt', 'w') as non:
    for lin in blines:
        if lin not in alines:
            non.write(lin)
matched.close()
This is one way you can compare the lines based on keys/columns, but I am not sure how efficient it is (the nested loops below make it O(n×m) pairwise comparisons).
matched = open('matchedrecords.txt', 'w')
with open('srcone.txt') as b:
    blines = set(b)
with open('srctwo.txt') as a:
    alines = set(a)
# List of columns or keys to compare
list_of_columns_to_compare = [7, 8, 9]
for blin in blines:
    for alin in alines:
        # Collect the key columns of each line (reset for every pair;
        # otherwise the lists keep growing and only the first pair could ever match)
        b_columns = [blin.split('|')[column_no] for column_no in list_of_columns_to_compare]
        a_columns = [alin.split('|')[column_no] for column_no in list_of_columns_to_compare]
        if a_columns == b_columns:
            matched.write(blin + " = " + alin)
matched.close()
Taking a cue from a recipe for KeyedSets on ActiveState, you can build a set and then simply use set intersection and set difference to produce your results:
import collections.abc

class Set(collections.abc.Set):
    @staticmethod
    def key(s): return tuple(s.split('|')[6:10])
    def __init__(self, it): self._dict = {self.key(s): s for s in it}
    def __len__(self): return len(self._dict)
    def __iter__(self): return iter(self._dict.values())
    def __contains__(self, value): return self.key(value) in self._dict

data = {}
for filename in 'srcone.txt', 'srctwo.txt':
    with open(filename) as f:
        data[filename] = Set(f)
with open('notInFirstSource.txt', 'w') as f:
    for line in data['srctwo.txt'] - data['srcone.txt']:
        f.write(line)
with open('notInSecondSource.txt', 'w') as f:
    for line in data['srcone.txt'] - data['srctwo.txt']:
        f.write(line)
with open('matchedrecords.txt', 'w') as f:
    for line in data['srcone.txt'] & data['srctwo.txt']:
        f.write(line)
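As a quick sanity check, here is a hypothetical demo using the two example lines from the question; the containment test is driven entirely by the key columns 7-10, so the differing columns 3-5 do not matter:

s = Set(['4|null|null|null|null|null|3590740374739|20077|7739662|75414741|\n'])
line2 = '4|null|11|333|asdsd|null|3590740374739|20077|7739662|75414741|\n'
print(line2 in s)  # True: the key columns match, the rest are ignored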
Finally, I could achieve this in much less time using dictionaries: a 370 MB file compared against a 270 MB file in at most 50 seconds (using a tuple as the key).
This is the script:
reader = open("fileA", 'r')
reader2 = open("fileB", 'r')
TmpDict = {}
TmpDict2 = {}
for line in reader:
    line = line.strip()
    TmpArr = line.split('|')
    # Forming a dictionary with the below columns as keys
    TmpDict[TmpArr[2], TmpArr[3], TmpArr[11], TmpArr[12], TmpArr[13], TmpArr[14]] = line
for line in reader2:
    line = line.strip()
    TmpArr = line.split('|')
    TmpDict2[TmpArr[2], TmpArr[3], TmpArr[11], TmpArr[12], TmpArr[13], TmpArr[14]] = line
outfile = open('MatchedRecords.txt', 'w')
outfileNonMatchedB = open('notInB', 'w')
outfileNonMatchedA = open('notInA', 'w')
for k, v in TmpDict.items():
    if k in TmpDict2:
        outfile.write(v + '\n')
    else:
        outfileNonMatchedB.write(v + '\n')
outfile.close()
outfileNonMatchedB.close()
for k, v in TmpDict2.items():
    if k not in TmpDict:
        outfileNonMatchedA.write(v + '\n')
outfileNonMatchedA.close()
Can any improvements be made to this? Suggestions are welcome. Thanks!
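One possible improvement, as a sketch (assuming the same key columns as above and unique keys within each file): only the first file's dictionary has to be held in memory. The second file can be streamed line by line, which roughly halves peak memory while still producing all three outputs:

TmpDict = {}
with open("fileA") as reader:
    for line in reader:
        TmpArr = line.strip().split('|')
        key = (TmpArr[2], TmpArr[3], TmpArr[11], TmpArr[12], TmpArr[13], TmpArr[14])
        TmpDict[key] = line.strip()

matched_keys = set()
with open("fileB") as reader2, \
        open('MatchedRecords.txt', 'w') as outfile, \
        open('notInA', 'w') as outfileNonMatchedA:
    for line in reader2:
        TmpArr = line.strip().split('|')
        key = (TmpArr[2], TmpArr[3], TmpArr[11], TmpArr[12], TmpArr[13], TmpArr[14])
        if key in TmpDict:
            outfile.write(TmpDict[key] + '\n')
            matched_keys.add(key)
        else:
            outfileNonMatchedA.write(line.strip() + '\n')

# fileA records whose key never appeared in fileB
with open('notInB', 'w') as outfileNonMatchedB:
    for key, v in TmpDict.items():
        if key not in matched_keys:
            outfileNonMatchedB.write(v + '\n')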
I am new to Python and I have been stuck on one problem for a few days now. I made a script that:
- takes data from a CSV file
- sorts it by matching values in the first column of the data file
- inserts the sorted data at a specified line in a different template text file
- saves the file in as many copies as there are distinct values in the first column of the data file
But there are two more things I need to do. When one of the separate files produced above would contain several rows with the same value in the second column of the data file, that file should get the values from the third column appended instead of repeating the same value from the second column. What I also need is to add somewhere the value of the first column from the data file, separated by "_".
Here is the data file:
111_0,3005,QWE
111_0,3006,SDE
111_0,3006,LFR
111_1,3005,QWE
111_1,5345,JTR
112_0,3103,JPP
112_0,3343,PDK
113_0,2137,TRE
113_0,2137,OMG
And here is the code I made:
import shutil

with open("data.csv") as f:
    contents = f.read()
contents = contents.splitlines()
values_per_baseline = dict()
for line in contents:
    key = line.split(',')[0]
    values = line.split(',')[1:]
    if key not in values_per_baseline:
        values_per_baseline[key] = []
    values_per_baseline[key].append(values)
for file in values_per_baseline.keys():
    x = 3
    filename = f"of_{file}.txt"
    shutil.copyfile("of.txt", filename)
    for values in values_per_baseline[file]:
        with open(filename, "r") as f:
            contents = f.readlines()
        contents.insert(x, ' o = ' + values[0] + '\n ' + 'a = ' + values[1] + '\n')
        with open(filename, "w") as f:
            f.write("".join(contents))
I have been trying to make something like a dictionary of dictionaries of lists, but I can't implement it correctly. Any help or suggestion would be much appreciated.
You could try the following:
import csv
from collections import defaultdict

values_per_baseline = defaultdict(lambda: defaultdict(list))
with open("data.csv", "r") as file:
    for key1, key2, value in csv.reader(file):
        values_per_baseline[key1][key2].append(value)

x = 3
for filekey, content in values_per_baseline.items():
    with open("of.txt", "r") as fin,\
         open(f"of_{filekey}.txt", "w") as fout:
        fout.writelines(next(fin) for _ in range(x))
        for key, values in content.items():
            fout.write(
                f' o = {key}\n'
                + ' a = '
                + ' <COMMA> '.join(values)
                + '\n'
            )
        fout.writelines(fin)
The input-reading part uses the csv module from the standard library (for convenience) and a defaultdict. The file is read into a nested dictionary.
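For the sample data from the question, the resulting nested structure looks like this (shown as plain dicts for readability; the actual objects are defaultdicts):

{'111_0': {'3005': ['QWE'], '3006': ['SDE', 'LFR']},
 '111_1': {'3005': ['QWE'], '5345': ['JTR']},
 '112_0': {'3103': ['JPP'], '3343': ['PDK']},
 '113_0': {'2137': ['TRE', 'OMG']}}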
Content of datafile.csv:
111_0,3005,QWE
111_0,3006,SDE
111_0,3006,LFR
111_1,3005,QWE
111_1,5345,JTR
112_0,3103,JPP
112_0,3343,PDK
113_0,2137,TRE
113_0,2137,OMG
A possible solution is the following:
def nested_list_to_dict(lst):
    result = {}
    subgroup = {}
    if all(len(l) == 3 for l in lst):
        for first, second, third in lst:
            result.setdefault(first, []).append((second, third))
        for k, v in result.items():
            for item1, item2 in v:
                subgroup.setdefault(item1, []).append(item2.strip())
            result[k] = subgroup
            subgroup = {}
    else:
        print("Input data must have 3 items like '111_0,3005,QWE'")
    return result

with open("datafile.csv", "r", encoding="utf-8") as f:
    content = f.read().splitlines()

data = nested_list_to_dict([line.split(',') for line in content])
print(data)
# ... rest of your code ....
Prints
{'111_0': {'3005': ['QWE'], '3006': ['SDE', 'LFR']},
'111_1': {'3005': ['QWE'], '5345': ['JTR']},
'112_0': {'3103': ['JPP'], '3343': ['PDK']},
'113_0': {'2137': ['TRE', 'OMG']}}
Python noob here. I've been smashing my head trying to do this; I've tried several Unix tools and I'm convinced that Python is the way to go.
I have two files. File1 has headers and numbers, like this:
>id1
77
>id2
2
>id3
2
>id4
22
...
Note that each id is unique, but the number assigned to it may repeat. I have several files like this, all with the same number of headers (~500).
File2 has all of File1's numbers, each followed by a sequence:
1
ATCGTCATA
2
ATCGTCGTA
...
22
CCCGTCGTA
...
77
ATCGTCATA
...
Note that each sequence id is unique, as are all the sequences that follow them. I have the same number of File2-style files as File1-style files, but the number of sequences within each File2 may vary (~150).
My desired output is File1 with the sequences from File2 filled in; it is important that File1 maintains its original order.
>id1
ATCGTCATA
>id2
ATCGTCGTA
>id3
ATCGTCGTA
>id4
CCCGTCGTA
My approach is to extract the numbers from File1 and use them as patterns to match in File2. First I am trying to make this work with only a pair of files. Here is what I achieved:
#!/usr/bin/env python

datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'

with open(datafile, 'r') as f:
    datafile_lines = set(line.strip() for line in f)  # maybe I could use regex to get only the lines with a number as the pattern?
print(datafile_lines)

outputlist = []
with open(schemaseqs, 'r') as f:
    for line in f:
        seqs = line.split(',')[0]
        if seqs[1:-1] in datafile_lines:
            outputlist.append(line)
print(outputlist)
This outputs a mix of patterns from File1 and sequences from File2. Any help is appreciated.
PS: I am open to modifications of the file structure; I tried substituting the \n in File2 with "," to no avail.
datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'

# Map each header line to the number on the following line
d = {}
prev = None
with open(datafile, 'r') as f:
    i = 0
    for line in f:
        if i % 2 == 0:
            d[line.strip()] = 0
            prev = line.strip()
        else:
            d[prev] = line.strip()
        i += 1

# Map each number to the sequence on the following line
new_d = {}
with open(schemaseqs, 'r') as f:
    i = 0
    prev = None
    for line in f:
        if i % 2 == 0:
            new_d[line.strip()] = 0
            prev = line.strip()
        else:
            new_d[prev] = line.strip()
        i += 1

# Replace each header's number with the matching sequence
for key, value in d.items():
    if value in new_d:
        d[key] = new_d[value]
print(d)

with open(datafile, 'w') as filee:
    for k, v in d.items():
        filee.writelines(k)
        filee.writelines('\n')
        filee.writelines(v)
        filee.writelines('\n')
Creating two dictionaries is easy, and you can then map between the two dictionaries' values.
Since the files are so neatly organized, I wouldn't use a set to store the lines. Sets don't enforce order, and the order of these lines conveys a lot of information. I also wouldn't use Regex; it's probably overkill for the task of parsing individual lines, but not powerful enough to keep track of which ID corresponds to each gene sequence.
Instead, I would read the files in the opposite order. First, read the file with the gene sequences and build a mapping of IDs to genes. Then read in the first file and replace each id with the corresponding value in that mapping.
If the IDs are a continuous sequence (1, 2, 3... n, n+1), then a list is probably the easiest way to store them. If the file is already in order, you don't even have to pay attention to the ID numbers; you can just skip every other row and append each gene sequence to an array in order (a sketch of that variant follows the example below). If they aren't continuous, you can use a dictionary with the IDs as keys. I'll use the dictionary approach for this example:
id_to_gene_map = {}
with open(file2, 'r') as id_to_gene_file:
    for line_number, line in enumerate(id_to_gene_file, start=1):
        if line_number % 2 == 1:  # Update ID on odd-numbered lines, including line 1
            current_id = line
        else:
            id_to_gene_map[current_id] = line  # Map previous line's ID to this line's value

with open(file1, 'r') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        if not line.startswith(">"):  # Keep ">id1" lines unchanged
            line = id_to_gene_map[line]  # Otherwise, replace with the corresponding gene
        output_file.write(line)
In this case, the IDs and values both have trailing newlines. You can strip them out, but since you'll want to add them back in for writing the output file, it's probably easiest to leave them alone.
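For completeness, here is a minimal sketch of the list-based variant mentioned above. It assumes the IDs in File2 are exactly 1, 2, 3, ... in order (an assumption; the dictionary version above does not need it):

genes = []  # genes[0] holds the sequence for ID 1, genes[1] for ID 2, ...
with open(file2, 'r') as id_to_gene_file:
    for line_number, line in enumerate(id_to_gene_file, start=1):
        if line_number % 2 == 0:  # Even-numbered lines are sequences; the IDs are implicit
            genes.append(line)

with open(file1, 'r') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        if not line.startswith(">"):
            line = genes[int(line) - 1]  # int() ignores the trailing newline; ID n maps to index n - 1
        output_file.write(line)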
I want to remove reversed-order string tuples from my large text file (>16M lines).
For example, if I have the following two lines in my file:
352_0F, 352_1F, 0.913
352_1F, 352_0F, 0.913
The expected output would be to keep just one of those lines (instead of both):
352_0F, 352_1F, 0.913
FYI: the third column (col3) will be the same for a tuple and its reversed-order tuple.
I tried the following code, but it is not working as expected.
from collections import defaultdict

data = defaultdict(list)
with open("OUTPUT.txt", "w") as output:
    for fileName in ["Large_INPUT.txt"]:
        with open(fileName, 'r') as file1:
            for line in file1:
                col1, col2, value = line.split(",")
                if (col1, col2) not in data:
                    if (col2, col1) not in data:
                        data[(col1, col2, value)]
                        output.write(f"{col1},{col2} {value}\n")
Can anybody please help me with this?
Seeing that your code loops over a list containing a single file, I assume you are generalizing it to work with multiple files. In that case there is something you did not mention: do you want the de-duplication to persist across files? You are close with your implementation. Instead of using a dictionary to get O(1) lookups, you can use the simpler structure, a set, and still get O(1) lookups.
Persistent across the list of files:
found_combinations = set()
with open("OUTPUT.txt", "w") as output:
    for fileName in ["Large_INPUT.txt"]:
        with open(fileName, 'r') as file1:
            for line in file1:
                cols = [col.strip() for col in line.strip().split(',')]
                new_combination = frozenset(cols)
                if new_combination not in found_combinations:
                    found_combinations.add(new_combination)
                    out = ', '.join(cols) + '\n'
                    output.write(out)
Not persistent across files:
with open("OUTPUT.txt", "w") as output:
for fileName in ["Large_INPUT.txt"]:
found_combinations = set()
with open(fileName, 'r') as file1:
for line in file1:
cols = [col.strip() for col in line.strip().split(',')]
new_combination = frozenset(cols)
if new_combination not in found_combinations:
found_combinations.add(new_combination)
out = ', '.join(cols) + '\n'
output.write(out)
Note that the only difference between the two versions is the placement of found_combinations = set()
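A closely related alternative (my own sketch, not part of either version above) is to key on a sorted tuple of the first two columns instead of a frozenset of the whole row. The key then ignores the third column entirely, and it also survives a self-pair such as x, x, 0.9, which a frozenset would collapse to a single element:

found_pairs = set()
with open("OUTPUT.txt", "w") as output:
    with open("Large_INPUT.txt", 'r') as file1:
        for line in file1:
            col1, col2, value = [c.strip() for c in line.strip().split(',')]
            pair = tuple(sorted((col1, col2)))  # (A, B) and (B, A) produce the same key
            if pair not in found_pairs:
                found_pairs.add(pair)
                output.write(f"{col1}, {col2}, {value}\n")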
Is there any way to make this script faster? I'm using one file to compare against another file, printing lines where the third column is equal in both.
import csv

output = []
a = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Phase1_missing.vcf', 'r')
list1 = a.readlines()
reader1 = a.read()
b = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais.vcf', 'r')
list2 = b.readlines()
reader2 = b.read()
f3 = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais_and_YRI.vcf', 'w')

for line1 in list1:
    separar = line1.split("\t")
    gene = separar[2]
    for line2 in list2:
        separar2 = line2.split("\t")
        gene2 = separar2[2]
        if gene == gene2:
            print(line1)
            f3.write(line1)
Input example (for both files):
1 14107321 rs187821037 C T 100 PASS AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout
1 14107321 rs187821037 C T 100 PASS AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout
1 14107321 rs187821037 C T 100 PASS AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout
The command line below achieves the same purpose in bash:
awk 'FNR==NR {a[$3]; next} $3 in a' Neandertais.vcf Phase1_missing.vcf > teste.vcf
How can I improve this Python script?
If you store your lines in dictionaries keyed by the column you are interested in, you can easily use Python's built-in set operations (which run at C speed) to find the matching lines. I tested a slightly modified version of this (filenames changed, and split('\t') changed to split() because of Stack Overflow formatting) and it seems to work fine:
import collections

# Use 'rb' to open files
infn1 = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Phase1_missing.vcf'
infn2 = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais.vcf'
outfn = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais_and_YRI.vcf'

def readfile(fname):
    '''
    Read in a file and return a dictionary of lines, keyed by the item in the third column
    '''
    results = collections.defaultdict(list)
    # Read in binary mode -- it's quicker
    with open(fname, 'rb') as f:
        for line in f:
            parts = line.split(b"\t")  # bytes separator, since the file is opened in binary mode
            if not parts:
                continue
            gene = parts[2]
            results[gene].append(line)
    return results

dict1 = readfile(infn1)
dict2 = readfile(infn2)

with open(outfn, 'wb') as outf:
    # Find keys that appear in both files
    for key in set(dict1) & set(dict2):
        # For these keys, print all the matching
        # lines in the first file
        for line in dict1[key]:
            print(line.rstrip())
            outf.write(line)
I've been trying to do this task all day, and I really want to learn how to do it using Python. I want to take two tab-delimited files, one with an ID only and the other with the same ID and some description. I can easily merge these files on the shared ID field with Unix join, but that requires sorting both files, and I want to keep the ordering of the first file.
I've tried some code below; my method has been to try to add things to a tuple since, from my understanding, tuples keep their order as you add to them. I haven't been able to get anything to work, though. Can anyone help?
Sample files:
file1 ->
111889
1437390
123
27998
2525778
12
1345
file2 ->
2525778'\t'item778
1345'\t'item110
123'\t'item1000
12'\t'item8889
111889'\t'item1111
1437390'\t'item222
27998'\t'item12
output ->
111889'\t'item1111
1437390'\t'item222
123'\t'item1000
27998'\t'item12
2525778'\t'item778
12'\t'item8889
1345'\t'item110
This is what I have so far:
import sys

add_list = ()
with open(sys.argv[1], 'rb') as file1, open(sys.argv[2], 'rb') as file2:
    for line2 in file2:
        f1, f2, f3 = line2.split('\t')
        # print f1, f2, f3
        for row in file1:
            # print row
            if row != f1:
                break
            else:
                add_list.append(f1, f2, '\n')
                break
The key is to use Python dictionaries; they are perfect for this task…
Here is a complete answer:
import sys

# Each id is mapped to its item name
# (split() splits at whitespace (including tabs and newlines), with no empty output strings):
items = dict(line.split() for line in open(sys.argv[2]))  # Inspired by mgilson's answer

with open(sys.argv[1]) as ids:
    for line in ids:
        id = line.rstrip()  # newline removed
        print('{}\t{}'.format(id, items[id]))
Here is the result:
% python out.py file1.txt file2.txt
111889 item1111
1437390 item222
123 item1000
27998 item12
2525778 item778
12 item8889
1345 item110
PS: Note that I did not open the files in rb mode, as there is no need to keep the original newline bytes here, since we get rid of trailing newlines.
I would create a dictionary which maps the ID to the field value from the second file:
with open('file2') as fin:
    # rstrip the line first, or split(None, 1) leaves a trailing newline on the value
    d = dict(x.rstrip('\n').split(None, 1) for x in fin)
Then I would use the first file to construct the output in order from the dictionary:
with open('file1') as fin, open('output', 'w') as fout:
    for line in fin:
        key = line.strip()
        fout.write('{key}\t{value}\n'.format(key=key, value=d[key]))
import sys

with open(sys.argv[1], 'r') as file1, open(sys.argv[2], 'r') as file2:
    d2 = {}
    for line in file2:
        key, val = line.split('\t')
        d2[key] = val
    # Strip newlines so the IDs from file1 match the dictionary keys
    lines = [line.strip() for line in file1]
    out = {x: d2[x] for x in lines}
I am not sure what your sorting basis is.
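To finish this off, a sketch under the assumption that Python 3.7+ dict insertion order is acceptable: file1 is read in order, so out preserves that order and the merged result can be written directly (note that duplicate IDs in file1 would collapse to a single entry):

with open('output', 'w') as fout:
    for key, val in out.items():
        fout.write('{}\t{}'.format(key, val))  # val still carries its trailing newline from file2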