Efficiency of comparing two dictionaries - python

The following program has been running for about 22 hours on two files (txt, ~10 MB each). Each file has about 100K rows. Can someone give me an indication of how inefficient my code is and perhaps suggest a faster method? The input dicts are ordered and preserving that order is necessary:
import collections

def uniq(input):
    output = []
    for x in input:
        if x not in output:
            output.append(x)
    return output
Su = {}
with open('Sucrose_rivacombined.txt') as f:
    for line in f:
        (key, val) = line.split('\t')
        Su[key] = val
Su_OD = collections.OrderedDict(Su)
Su_keys = Su_OD.keys()

Et = {}
with open('Ethanol_rivacombined.txt') as g:
    for line in g:
        (key, val) = line.split('\t')
        Et[key] = val
Et_OD = collections.OrderedDict(Et)
Et_keys = Et_OD.keys()

merged_keys = Su_keys + Et_keys
merged_keys = uniq(merged_keys)

d3 = collections.OrderedDict()
output_doc = open("compare.txt", "w+")
for chr_local in merged_keys:
    line_output = chr_local
    if Et.has_key(chr_local):
        line_output = line_output + "\t" + Et[chr_local]
    else:
        line_output = line_output + "\t" + "ND"
    if Su.has_key(chr_local):
        line_output = line_output + "\t" + Su[chr_local]
    else:
        line_output = line_output + "\t" + "ND"
    output_doc.write(line_output + "\n")
The input files are as follows (not every key is present in both files):
Su:
chr1:3266359 80.64516129
chr1:3409983 100
chr1:3837894 75.70093458
chr1:3967565 100
chr1:3977957 100
Et:
chr1:3266359 95
chr1:3456683 78
chr1:3837894 54.93395855
chr1:3967565 100
chr1:3976722 23
I would like the output to look as follows:
chr1:3266359 80.645 95
chr1:3456683 ND 78

Replace uniq with this as the inputs are hashable:
def uniq(input):
    output = []
    s = set()
    for x in input:
        if x not in s:
            output.append(x)
            s.add(x)
    return output
This will reduce a nearly O(n^2) process to nearly O(n).
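For intuition, here is a minimal micro-benchmark sketch comparing the two versions (the key format and list sizes are made up for illustration, not your real data):

import timeit

keys = ['chr1:%d' % i for i in range(5000)] * 2  # 10K entries, half duplicates

def uniq_list(input):
    output = []
    for x in input:
        if x not in output:  # scans the whole list each time
            output.append(x)
    return output

def uniq_set(input):
    output = []
    s = set()
    for x in input:
        if x not in s:  # constant-time membership test
            output.append(x)
            s.add(x)
    return output

print(timeit.timeit(lambda: uniq_list(keys), number=1))  # list version: typically seconds
print(timeit.timeit(lambda: uniq_set(keys), number=1))   # set version: typically milliseconds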

You don't need your uniq function. Pseudo code like:
- read file 2 into an OrderedDict
- process file 1, writing out its items (they are already ordered correctly)
- pop, with a default, from the file 2 dict for the last part of each output line
- after file 1 is consumed, write out what remains of the OrderedDict from file 2
Also, love list comprehensions... you can read the whole file with a single generator expression:
OrderedDict(line.strip().split('\t') for line in open('Ethanol_rivacombined.txt'))
Only one ordered dict is needed, and 'Sucrose_rivacombined.txt' never even makes it into memory. It should be super fast.
EDIT: complete code (not sure about your exact output line format):
from collections import OrderedDict

Et_OD = OrderedDict(line.strip().split('\t') for line in open('Ethanol_rivacombined.txt'))
with open("compare.txt", "w+") as output_doc:
    for line in open('Sucrose_rivacombined.txt'):
        key, val = line.strip().split('\t')
        line_out = '\t'.join((key, val, Et_OD.pop(key, 'ND')))
        output_doc.write(line_out + '\n')
    for key, val in Et_OD.items():
        line_out = '\t'.join((key, 'ND', val))
        output_doc.write(line_out + '\n')

Your output is a list, but your input is dictionaries: dictionary keys are guaranteed unique, but your "x not in output" test has to compare against every element of the list, which makes it quadratic (you're doing on the order of n^2 comparisons because of that check).
You could probably completely replace uniq with:
Su_OD.update(Et_OD)
This works for me:
from collections import OrderedDict

one = OrderedDict()
two = OrderedDict()
one['hello'] = 'one'
one['world'] = 'one'
two['world'] = 'two'
two['cats'] = 'two'
one.update(two)
for k, v in one.items():
    print(k, v)
output:
hello one
world two
cats two
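As an aside, whichever approach you choose, each has_key/else pair in the original loop can be collapsed with dict.get, which accepts a default; a minimal sketch reusing the question's names:

# Sketch: dict.get(key, default) replaces each has_key/else branch
for chr_local in merged_keys:
    line_output = '\t'.join((chr_local,
                             Et.get(chr_local, 'ND'),
                             Su.get(chr_local, 'ND')))
    output_doc.write(line_output + '\n')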


Finding specific values in a txt file and adding them up with Python

I have a txt file which looks like this:
[Chapter.Title1]
Irrevelent=90 B
Volt=0.10 ienl
Watt=2 W
Ampere=3 A
Irrevelent=91 C
[Chapter.Title2]
Irrevelent=999
Irrevelent=999
[Chapter.Title3]
Irrevelent=92 B
Volt=0.20 ienl
Watt=5 W
Ampere=6 A
Irrevelent=93 C
What I want is to catch "Title1" and the values "0.1", "2" and "3", then add them up (which would be 5.1).
I don't care about the lines starting with "Irrevelent".
And then the same with the third block: catching "Title3" and adding "0.2", "5" and "6".
The second block with "Title2" does not contain "Volt", "Watt" and "Ampere" and is therefore not relevant.
Can anyone please help me out with this?
Thank you and cheers
You can use regular expressions to get the values and the titles in lists, then use them.
txt = """[Chapter.Title1]
Irrevelent=90 B
Volt=1 V
Watt=2 W
Ampere=3 A
Irrevelent=91 C
[Chapter.Title2]
Irrevelent=92 B
Volt=4 V
Watt=5 W
Ampere=6 A
Irrevelent=93 C"""
#that's just the text
import re
rx1=r'Chapter.(.*?)\]'
rxv1=r'Volt=(\d+)'
rxv2=r'Watt=(\d+)'
rxv3=r'Ampere=(\d+)'
res1 = re.findall(rx1, txt)
resv1 = re.findall(rxv1, txt)
resv2 = re.findall(rxv2, txt)
resv3 = re.findall(rxv3, txt)
print(res1)
print(resv1)
print(resv2)
print(resv3)
Here you get the titles and the interesting values you want:
['Title1', 'Title2']
['1', '4']
['2', '5']
['3', '6']
You can then use them as you want, for example:
for title_index in range(len(res1)):
    print(res1[title_index])
    value = int(resv1[title_index]) + int(resv2[title_index]) + int(resv3[title_index])
    # use float() instead of int() if you have non-integer values
    print("the value is:", value)
You get:
Title1
the value is: 6
Title2
the value is: 15
Or you can store them in a dictionary or another structure, for example:
#dict(zip(keys, values))
data= dict(zip(res1, [int(resv1[i])+int(resv2[i])+int(resv3[i]) for i in range(len(res1))] ))
print(data)
You get:
{'Title1': 6, 'Title2': 15}
Edit: added opening of the file
import re

with open('filename.txt', 'r') as file:
    txt = file.read()

rx1 = r'Chapter.(.*?)\]'
rxv1 = r'Volt=([0-9]+(?:\.[0-9]+)?)'
rxv2 = r'Watt=([0-9]+(?:\.[0-9]+)?)'
rxv3 = r'Ampere=([0-9]+(?:\.[0-9]+)?)'
res1 = re.findall(rx1, txt)
resv1 = re.findall(rxv1, txt)
resv2 = re.findall(rxv2, txt)
resv3 = re.findall(rxv3, txt)
data = dict(zip(res1, [float(resv1[i]) + float(resv2[i]) + float(resv3[i]) for i in range(len(res1))]))
print(data)
Edit 2: ignoring missing values
import re

with open('filename.txt', 'r') as file:
    txt = file.read()

# divide the text into chunks, each starting with "Chapter"
substr = "Chapter"
chunks_idex = [_.start() for _ in re.finditer(substr, txt)]
chunks = [txt[chunks_idex[i]:chunks_idex[i+1]-1] for i in range(len(chunks_idex)-1)]
chunks.append(txt[chunks_idex[-1]:])  # add the last chunk
#print(chunks)

keys = []
values = []
rx1 = r'Chapter.(.*?)\]'
rxv1 = r'Volt=([0-9]+(?:\.[0-9]+)?)'
rxv2 = r'Watt=([0-9]+(?:\.[0-9]+)?)'
rxv3 = r'Ampere=([0-9]+(?:\.[0-9]+)?)'
for chunk in chunks:
    res1 = re.findall(rx1, chunk)
    resv1 = re.findall(rxv1, chunk)
    resv2 = re.findall(rxv2, chunk)
    resv3 = re.findall(rxv3, chunk)
    # check that we found all of them: all four lists must be non-empty
    if res1 and resv1 and resv2 and resv3:
        keys.append(res1[0])
        values.append(float(resv1[0]) + float(resv2[0]) + float(resv3[0]))

data = dict(zip(keys, values))
print(data)
Here's a quick and dirty way to do this, reading line by line, if the input file is predictable enough.
In the example I just print out the titles and the values; you can of course process them however you want.
f = open('file.dat', 'r')
for line in f.readlines():
    ## Catch the title of the line:
    if '[Chapter' in line:
        print(line[9:-2])
    ## Catch the values of the Volt, Watt, Ampere parameters
    elif line[:4] in ['Volt', 'Watt', 'Ampe']:
        value = line[line.index('=')+1:line.index(' ')]
        print(value)
    ## If the line is "Irrevelent", or blank, do nothing
f.close()
There are many ways to achieve this. Here's one:
d = dict()
V = {'Volt', 'Watt', 'Ampere'}
with open('chapter.txt', encoding='utf-8') as f:
    key = None
    for line in f:
        if line.startswith('[Chapter'):
            d[key := line.strip()] = 0
        elif key and len(t := line.split('=')) > 1 and t[0] in V:
            d[key] += float(t[1].split()[0])

for k, v in d.items():
    if v > 0:
        print(f'Total for {k} = {v}')
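Note: the := assignment expressions used here require Python 3.8 or newer.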
Output:
Total for [Chapter.Title1] = 6
Total for [Chapter.Title2] = 15

CSV Python Outputting: Outputting non-matching field once rather than once for every item in list

I've been trying to figure this out for about a year now and I'm really burnt out on it so please excuse me if this explanation is a bit rough.
I cannot include job data, but it would be accurate to imagine 2 csv files both with the first column populated with values (Serial numbers/phone numbers/names, doesn't matter - just values). Between both csv files, some values would match while other values would only be contained in one or the other (Timmy is in both files and is a match, Robert is only in file 1 and does not match any name in file 2).
I can successfully output a csv value ONCE that exists in both csv files (i.e. both files contain "Value78", output file will contain "Value78" only once).
When I try to tack on an else statement to my if condition, to handle non-matching items, the program will output 1 entry for every item it does not match with (makes 100% sense, matches happen once but every other comparison result besides the match is a non-match).
I cannot envision a structure or method to hold the fields that don't match back so that they can be output once and not overrun my terminal or output file.
My goal is to output two csv files, matches and non-matches, with the non-matches having only one entry per value.
Anyways, onto the code:
import csv

MYUNITS = 'MyUnits.csv'
VENDORUNITS = 'VendorUnits.csv'
MATCHES = 'Matches.csv'
NONMATCHES = 'NonMatches.csv'

with open(MYUNITS, mode='r') as MFile, \
     open(VENDORUNITS, mode='r') as VFile, \
     open(MATCHES, mode='w') as OFile, \
     open(NONMATCHES, mode='w') as NFile:
    MyReader = csv.reader(MFile, delimiter=',', quotechar='"')
    MyList = list(MyReader)
    VendorReader = csv.reader(VFile, delimiter=',', quotechar='"')
    VList = list(VendorReader)
    for x in range(len(MyList)):
        for y in range(len(VList)):
            if str(MyList[x][0]) == str(VList[y][0]):
                OFile.write(MyList[x][0] + '\n')
            else:
                pass
The "else: pass" is where the logic of filtering out non-matches is escaping me. Outputting from this else statement will write the non-matching value (len(VList) - 1) times for an iteration that DOES produce 1 match, the entire len(VList) for an iteration with no match. I've tried using a counter and only outputting if the counter equals the len(VList), (incrementing in the else statement, writing output under the scope of the second for loop), but received the same output as if I tried outputting non-matches.
Below is one way you might go about deduplicating and then writing to a file:
import csv

MYUNITS = 'MyUnits.csv'
VENDORUNITS = 'VendorUnits.csv'
MATCHES = 'Matches.csv'
NONMATCHES = 'NonMatches.csv'

list_of_non_matches = []
with open(MYUNITS, mode='r') as MFile, \
     open(VENDORUNITS, mode='r') as VFile, \
     open(MATCHES, mode='w') as OFile, \
     open(NONMATCHES, mode='w') as NFile:
    MyReader = csv.reader(MFile, delimiter=',', quotechar='"')
    MyList = list(MyReader)
    VendorReader = csv.reader(VFile, delimiter=',', quotechar='"')
    VList = list(VendorReader)
    for x in range(len(MyList)):
        for y in range(len(VList)):
            if str(MyList[x][0]) == str(VList[y][0]):
                OFile.write(MyList[x][0] + '\n')
            else:
                list_of_non_matches.append(MyList[x][0])

    # Remove duplicates from the non-matches, preserving order
    new_list = []
    for x in list_of_non_matches:
        if x not in new_list:
            new_list.append(x)

    # Write the deduplicated list to the non-matches file
    for i in new_list:
        NFile.write(i + '\n')
Does this work?
import csv

MYUNITS = 'MyUnits.csv'
VENDORUNITS = 'VendorUnits.csv'
MATCHES = 'Matches.csv'
NONMATCHES = 'NonMatches.csv'

with open(MYUNITS, 'r') as MFile, \
     open(VENDORUNITS, 'r') as VFile, \
     open(MATCHES, 'w') as OFile, \
     open(NONMATCHES, 'w') as NFile:
    MyReader = csv.reader(MFile, delimiter=',', quotechar='"')
    MyList = list(MyReader)
    MyVals = [x[0] for x in MyList]
    VendorReader = csv.reader(VFile, delimiter=',', quotechar='"')
    VList = list(VendorReader)
    vVals = [x[0] for x in VList]
    for val in MyVals:
        if val in vVals:
            OFile.write(val + '\n')
        else:
            NFile.write(val + '\n')
    # Original nested-loop version, for reference:
    # for x in range(len(MyList)):
    #     for y in range(len(VList)):
    #         if str(MyList[x][0]) == str(VList[y][0]):
    #             OFile.write(MyList[x][0] + '\n')
    #         else:
    #             pass
Sorry, I had some issues with my PC. I was able to solve my own question the night I posted. The solution I used is so simple I'm kicking myself for not figuring it out way sooner:
import csv

MYUNITS = 'MyUnits.csv'
VENDORUNITS = 'VendorUnits.csv'
MATCHES = 'Matches.csv'
NONMATCHES = 'NonMatches.csv'

with open(MYUNITS, mode='r') as MFile, \
     open(VENDORUNITS, mode='r') as VFile, \
     open(MATCHES, mode='w') as OFile, \
     open(NONMATCHES, mode='w') as NFile:
    MyReader = csv.reader(MFile, delimiter=',', quotechar='"')
    MyList = list(MyReader)
    VendorReader = csv.reader(VFile, delimiter=',', quotechar='"')
    VList = list(VendorReader)
    for x in range(len(MyList)):
        tmpStr = ''
        for y in range(len(VList)):
            if str(MyList[x][0]) == str(VList[y][0]):
                tmpStr = ''  # Set to blank so the comparison below fails; works because of the break
                OFile.write(MyList[x][0] + '\n')
                break
            else:
                tmpStr = str(MyList[x][0])
        if tmpStr != '':
            NFile.write(tmpStr + '\n')
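For reference, since the heart of the problem is membership testing, a set gives O(1) lookups and removes the nested loop entirely. A sketch along those lines (filenames from the question; the first column is assumed to hold the values):

import csv

# Read the first column of each file; a set for the vendor side gives O(1) lookups
with open('MyUnits.csv', newline='') as mfile, \
     open('VendorUnits.csv', newline='') as vfile:
    my_vals = [row[0] for row in csv.reader(mfile) if row]
    vendor_vals = {row[0] for row in csv.reader(vfile) if row}

# Each of my values is written exactly once, to one file or the other
with open('Matches.csv', 'w') as ofile, open('NonMatches.csv', 'w') as nfile:
    for val in my_vals:
        if val in vendor_vals:
            ofile.write(val + '\n')
        else:
            nfile.write(val + '\n')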

Python: print filename and header

I have files (fasta files with a sequence) that look like this:
File1.fasta
>1
GTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTT
CTGATACCTCCAGCAACCCTCACAGGCCACCTTCGCAGGCTTACAGAACGCTCCCCTACC
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA
AGCCAACATCCTGGCTGTCTGG
>2
AAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAA
ACCATGCACCGAAGCTGCGGCAGCGACACTCAGGTGTTGTTGGGTAGGGGAGCGTTCTGT
AAGCCTGTGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACAT
AAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGACCAAGGGTTCCTGTCCAAC
GTTAATCGGGGCAGG
File2.fasta
>1
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT
>2
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA
With my script, I count all the 5-mers in these files. My code is as follows:
import operator
import glob

def printSeq(name, seq):
    kmers = {}
    k = 5
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i+k]
        if kmer in kmers:
            kmers[kmer] += 1
        else:
            kmers[kmer] = 1
    for kmer, count in kmers.items():
        print(kmer + "\t" + str(count))
    sortedKmer = sorted(kmers.items(), reverse=True)
    for item in sortedKmer:
        print(item[0] + "\t" + str(item[1]))
for name in glob.glob('*.fasta'):
    with open(name, 'r') as f:
        seq = ""
        key = ""
        for line in f.readlines():
            if line.startswith(">"):
                if key and seq:
                    printSeq(key, seq)
                key = line[1:].strip()
                seq = ""
            else:
                seq += line.strip()
        printSeq(key, seq)
The output is now the 5-mer followed by the count.
I want to adjust my output so that each output line has the filename, then the header, then the 5-mer and then the count, like this:
File1 1 GTCTT 1
File1 1 TCTTC 1
File1 1 CTTCC 1
....
File2 2 TTCTA 1
How can I achieve that?
Additional question
I want to add the reverse complement sequence of the data and count that together with the previous data. My code to get the reverse complement is as follows:
import glob
from Bio import SeqIO

for fasta_file in glob.glob('*.fasta'):
    for record in SeqIO.parse(fasta_file, "fasta"):
        reverse_complement = ">" + record.id + "\n" + record.seq.reverse_complement()
So the reverse_complement of file one, header >1, has to be counted together with the previous one, etc. How can I include this data with my previous files and count them together?
My reverse_complement data is
File1.fasta (reverse_complement)
>1
CCAGACAGCCAGGATGTTGGCTTAGAAGCAGCCATCATTTAAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAAACCATGCACCGAAGCTGCGGCAGCGACGTTTATGCGTTGTTGGGTAGGGGAGCGTTCTGTAAGCCTGCGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACATAAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGAC
>2
CCTGCCCCGATTAACGTTGGACAGGAACCCTTGGTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTTCTGATACCTCCAGCAACCCTCACAGGCCACCTTCACAGGCTTACAGAACGCTCCCCTACCCAACAACACCTGAGTGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTTCCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTT
This could also be done using a Counter() as follows:
from collections import Counter
from itertools import groupby
import glob
import os

for fasta_file in glob.glob('*.fasta'):
    basename = os.path.splitext(os.path.basename(fasta_file))[0]
    with open(fasta_file) as f_fasta:
        for k, g in groupby(f_fasta, lambda x: x.startswith('>')):
            if k:
                sequence = next(g).strip('>\n')
            else:
                d = list(''.join(line.strip() for line in g))
                counts = Counter()
                while len(d) >= 5:
                    five_mer = '{}{}{}{}{}'.format(d[0], d[1], d[2], d[3], d[4])
                    counts[five_mer] += 1
                    del d[0]
                for five_mer, count in sorted(counts.items(), key=lambda x: (-x[1], x[0])):
                    print("{} {} {} {}".format(basename, sequence, five_mer, count))
This would give you output with the largest counts first and then alphabetically:
File1 1 CAGGC 3
File1 1 CGCAG 3
File1 1 GCTTT 3
File1 1 AACGC 2
File1 1 ACATC 2
File1 1 ACGCT 2
File1 1 AGGCC 2
It uses Python's groupby() function to read groups of lines together: either a single header line (starting with >) or the group of sequence lines that follows it. k is the result of the startswith() call, so when k is False, take all the lines returned, remove the newline from each and join them together to make a single string of characters.
It then reads the first 5 characters from the list, joins them back together and adds them as a key to a Counter(). It then removes the first character from the list and repeats until there are fewer than 5 characters remaining.
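For what it's worth, the same sliding window can be written with plain string slicing, which avoids the repeated del; a sketch of an equivalent counting step (same variables as above):

# Join the group of sequence lines once, then count every 5-character slice
seq = ''.join(line.strip() for line in g)
counts = Counter(seq[i:i+5] for i in range(len(seq) - 4))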
For just alphabetical ordering:
for five_mer, count in sorted(counts.items()):
A Counter() works the same way as a dictionary, so .items() would give a list of key value pairs. These are sorted before being displayed.
You change the signature of
def printSeq(name, seq)
to
def printSeq(file, header, name, seq):
and incorporate the new variables in the print statements, e.g.
print(item[0] + "\t" + str(item[1]))
becomes
print(file + "\t" + header + "\t" + item[0] + "\t" + str(item[1]))
Then, in your loop, you pass the information to this function:
- You have the file name available in the loop, stored in the variable name.
- You parse the header in the lines where you detect it, and store it in a variable for later use; the later use is when you call the printSeq function.
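A minimal sketch of how those pieces might fit together, keeping the question's structure (the splitext call is only there so the output shows File1 rather than File1.fasta):

import glob
import os

def printSeq(filename, header, seq):
    # count 5-mers in seq and print: filename, header, 5-mer, count
    kmers = {}
    k = 5
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i+k]
        kmers[kmer] = kmers.get(kmer, 0) + 1
    for kmer, count in sorted(kmers.items()):
        print(filename + "\t" + header + "\t" + kmer + "\t" + str(count))

for name in glob.glob('*.fasta'):
    base = os.path.splitext(name)[0]
    with open(name) as f:
        header = ""
        seq = ""
        for line in f:
            if line.startswith(">"):
                if header and seq:
                    printSeq(base, header, seq)
                header = line[1:].strip()
                seq = ""
            else:
                seq += line.strip()
        if header and seq:
            printSeq(base, header, seq)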

How do you sort by ascending order from a txt file in python?

I am new to python and have a hopefully simple question. I have a txt file with road names and lengths. i.e. Main street 56 miles. I am stumped on how to call on that file in Python and sort all the road lengths in ascending order. Thanks for your help.
Let's assume that there is a "miles" after every number. (This started as untested code, so the idea matters more than the details.)
EDIT: THIS IS TESTED
import collections

input_file = 'roads.txt'           # assumed input path
output_file = 'sorted_roads.txt'   # assumed output path

originaldict = {}
newdict = collections.OrderedDict()

def isnum(string):
    try:
        if string == ".":
            return True
        float(string)
        return True
    except ValueError:
        return False

for line in open(input_file, "r"):
    string = line[:line.find("miles") - 1]
    print(string)
    # walk backwards from "miles", collecting the digits of the length
    curnum = ""
    for c in reversed(string):
        if not isnum(c):
            break
        curnum = c + curnum
    originaldict.setdefault(float(curnum), []).append(line)

# rebuild in ascending order of length
for num in sorted(originaldict):
    newdict[num] = list(originaldict[num])

with open(output_file, "a") as o:
    for value in newdict.values():
        for line in value:
            o.write(line.rstrip("\n") + "\n")
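For comparison, the whole task can also be done with sorted() and a key function; a short sketch (assuming each relevant line contains a number immediately before the word "miles", and 'roads.txt' is a made-up filename):

import re

def length_in_miles(line):
    # pull the number that precedes "miles"; lines without one sort last
    m = re.search(r'([\d.]+)\s*miles', line)
    return float(m.group(1)) if m else float('inf')

with open('roads.txt') as f:
    lines = f.readlines()

for line in sorted(lines, key=length_in_miles):
    print(line.rstrip())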

Dictionaries overwriting in Python

This program is to take the grammar rules found in Binary.txt and store them in a dictionary, where the rules are:
N = N D
N = D
D = 0
D = 1
but the current code returns {'D': 'D = 1', 'N': 'N = D'}, whereas I want N: N D, N: D, D: 0, D: 1
import sys
import string

# default length of 3
stringLength = 3

# get last argument of command line (file)
filename1 = sys.argv[-1]

# get a length from the user
try:
    stringLength = int(input('Length? '))
    filename = input('Filename: ')
except ValueError:
    print("Not a number")

# checks
print(stringLength)
print(filename)

def str2dict(filename="Binary.txt"):
    result = {}
    with open(filename, "r") as grammar:
        # read file
        lines = grammar.readlines()
        count = 0
        # loop through
        for line in lines:
            print(line)
            result[line[0]] = line
    print(result)
    return result

print(str2dict("Binary.txt"))
Firstly, your data structure of choice is wrong. A dictionary in Python is a simple key-to-value mapping. What you'd like is a map from a key to multiple values. For that you'll need:
from collections import defaultdict
result = defaultdict(list)
Next, where are you splitting on '='? You'll need to do that in order to get the proper key/value pair you are looking for:
key, value = line.split('=', 1)  # split returns a list, which gets unpacked into 2 variables
Putting the above two together, you'd go about it in the following way:

from collections import defaultdict

def str2dict(filename="Binary.txt"):
    result = defaultdict(list)
    with open(filename, "r") as grammar:
        # read file
        lines = grammar.readlines()
        # loop through
        for line in lines:
            print(line)
            key, value = line.split('=', 1)
            result[key.strip()].append(value.strip())
    return result
Dictionaries, by definition, cannot have duplicate keys. Therefore there can only ever be a single 'D' key. You could, however, store a list of values at that key if you'd like. Ex:

from collections import defaultdict

# rest of your code...

def str2dict(filename="Binary.txt"):
    result = defaultdict(list)  # defaultdict so an insert to a missing key creates a new list automatically
    with open(filename, "r") as grammar:
        # read file
        lines = grammar.readlines()
        # loop through
        for line in lines:
            print(line)
            result[line[0]].append(line.strip())
    print(result)
    return result
This will result in something like:
{"D" : ["D = N D", "D = 0", "D = 1"], "N" : ["N = D"]}
