Performance task in Python

We have IT olympics in my country. Normally the tasks are solved in Java, C or C++. I guess for a year or so they have also allowed other languages, like Python.
I tried to solve a task from previous years in Python, called Letters, and I'm constantly failing. The task is to write code that counts the minimum number of swaps of neighboring letters needed to turn one string into another.
As input you get the number of letters in one string, followed by two strings containing the same letters but in a different order. The length of one string is from 2 to 1,000,000 letters. There are only capital letters; they can repeat, and they may or may not be sorted.
Here's an example:
7
AABCDDD
DDDBCAA
The correct output should be 16.
As output you have to return a single value, which is the minimum number of swaps. It has to produce the output in under 5 seconds.
I made it calculate the correct output, but on longer strings (like 800,000 letters) it starts to slow down. The longest inputs return a value in about 30 seconds. There's also one input with 900,000 letters per word that takes 30 minutes!
Under the link below you can find all the input files for the tests:
https://oi.edu.pl/l/19oi_ksiazeczka/
Click on this link to download the files for the "Letters" task:
XIX OI testy i rozwiązania - zad. LIT (I etap) (3.5 MB) ("19th OI tests and solutions - task LIT, stage I")
Below is my code. How can I speed it up?
# import time
import sys
# start = time.time()

def file_reader():
    standard_input = ""
    try:
        data = sys.stdin.readlines()
        for line in data:
            standard_input += line
    except:
        print("An exception occurred")
    return standard_input

def mergeSortInversions(arr):
    if len(arr) == 1:
        return arr, 0
    else:
        a = arr[:len(arr)//2]
        b = arr[len(arr)//2:]
        a, ai = mergeSortInversions(a)
        b, bi = mergeSortInversions(b)
        c = []
        i = 0
        j = 0
        inversions = 0 + ai + bi
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:
                c.append(a[i])
                i += 1
            else:
                c.append(b[j])
                j += 1
                inversions += (len(a)-i)
        c += a[i:]
        c += b[j:]
        return c, inversions

def literki():
    words = (file_reader()).replace("\n", "")
    number = int("".join((map(str, ([int(i) for i in list(words) if i.isdigit()])))))
    all_letters = [x for x in list(words) if x not in str(number)]
    name = all_letters[:number]
    anagram = all_letters[number:]
    p = []
    index = list(range(len(anagram)))
    anagram_dict = {index[i]: anagram[i] for i in range(len(index))}
    new_dict = {}
    anagram_counts = {}
    for key, value in anagram_dict.items():
        if value in new_dict:
            new_dict[value].append(key)
        else:
            new_dict[value] = [key]
    for i in new_dict:
        anagram_counts.update({i: new_dict[i]})
    for letter in name:
        a = anagram_counts[letter]
        p.append(a.pop(0))
    print(mergeSortInversions(p)[1])

#>>
literki()
# end = time.time()
# print(start-end)
So, to explain what it does in parts:
file_reader: simply reads the input from standard input.
mergeSortInversions(arr): normally it would sort a list, but here I wanted it to return the number of inversions. I'm not smart enough to have figured it out by myself; I found it on the web, but it does the job. Unfortunately, for strings of 1 million letters it takes 10 seconds or so.
In the literki function: first, I divided the input into the number of signs and the two equal-length words, as lists. Then I made something similar in function to a stack array (not sure if that's what it's called in English): basically a dictionary with every letter as a key and the indexes of that letter as a list in the value (if a letter occurs more than once, the value is the list of all indexes for that letter). The last thing I did before "the slow part": for every letter in the name variable I extracted the corresponding index. Up to that point, all operations for every input were taking around 2 seconds. And now the two lines that generate the rest of the time needed to calculate the outcome:
- I append the index to the p=[] list and at the same time pop it from the list in the dictionary, so it won't be read again for another occurrence of the same letter.
- I calculate the number of moves (inversions) with mergeSortInversions(arr) based on the p=[...] list and print it as the output.
I know that popping from the front is slow, but on the other hand I would have to build the lists of indexes from the bottom (so that I could pop each index from the end), and that took even longer. I've also tried converting a=[...] to a deque, but it was also too slow.
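(For reference, the "lists built from the bottom" variant mentioned above, where every pop is an O(1) pop from the end instead of an O(n) pop(0) from the front, would look roughly like this; the names are hypothetical and it covers only the index-extraction step:)
from collections import defaultdict

def permutation_indices(name, anagram):
    # For each letter, store its positions in `anagram` in decreasing order,
    # so taking the next unused (smallest) position is an O(1) pop from the end
    # instead of an O(n) pop(0) from the front.
    positions = defaultdict(list)
    for i in range(len(anagram) - 1, -1, -1):
        positions[anagram[i]].append(i)
    return [positions[letter].pop() for letter in name]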

I think I'd try a genetic algorithm for this problem. GAs don't always come up with an optimal solution, but they are very good at getting an acceptable solution in a reasonable amount of time. And for small inputs, they can be optimal.
The gist is to come up with:
1) A fitness function that assigns a number indicating how good a particular candidate solution is
2) A sexual reproduction function, that combines, in a simple way, part of two candidate solutions
3) A mutation function, that introduces one small change to a candidate solution.
So you just let those functions go to town, creating solution after solution, and keeping the best ones - not the best one, the best ones.
Then after a while, the best solution found is your answer.
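To make that concrete, here is a bare-bones sketch of the loop those three pieces plug into (the random_candidate, fitness, crossover and mutate callables are placeholders you would have to design for the swap-count problem; lower fitness is assumed to be better):
import random

def genetic_search(random_candidate, fitness, crossover, mutate,
                   population_size=100, generations=500, keep=20):
    # Generic GA loop: the problem-specific pieces (how to make a random
    # candidate, score it, combine two parents, tweak a child) are passed in.
    population = [random_candidate() for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness)     # best candidates first
        survivors = population[:keep]    # keep the best ones, not just the best one
        children = []
        while len(children) < population_size - keep:
            parent_a, parent_b = random.sample(survivors, 2)
            children.append(mutate(crossover(parent_a, parent_b)))
        population = survivors + children
    return min(population, key=fitness)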
Here's an example of using a GA for another hard problem, called The House Robber Problem. It's in Python:
http://stromberg.dnsalias.org/~strombrg/house-robber-problem/

Related

Need to reduce the number of recursive calls in this function

The problem: given a string S and an integer k < len(S), we need to find the string that is highest in dictionary order after removing any k characters, while maintaining the relative ordering of the remaining characters.
This is what I have so far:
def allPossibleCombinations(k, s, strings):
    if k == 0:
        strings.append(s)
        return strings
    for i in range(len(s)):
        new_str = s[:i] + s[i+1:]
        strings = allPossibleCombinations(k-1, new_str, strings)
    return strings

def stringReduction(k, s):
    strings = []
    combs = allPossibleCombinations(k, s, strings)
    return sorted(combs)[-1]
This is working for a few test cases, but it says that I have too many recursive calls for other test cases. I don't know the test cases.
This should get you started -
from itertools import combinations

def all_possible_combinations(k = 0, s = ""):
    yield from combinations(s, len(s) - k)
Now for a given k=2, and s="abcde", we show all combinations of s with k characters removed -
for c in all_possible_combinations(2, "abcde"):
    print("".join(c))
# abc
# abd
# abe
# acd
# ace
# ade
# bcd
# bce
# bde
# cde
it says that I have too many recursive calls for other testcases
I'm surprised that it failed on recursive calls before it failed on taking too long to come up with an answer. The recursion depth is the same as k, so k would have had to reach 1000 for default Python to choke on it. However, your code takes 4 minutes to solve what appears to be a simple example:
print(stringReduction(8, "dermosynovitis"))
The amount of time is a function of k and string length. The problem as I see it, recursively, is this code:
for i in range(len(s)):
    new_str = s[:i] + s[i+1:]
    strings = allPossibleCombinations(k-1, new_str, strings, depth + 1)
Once we've removed the first character, say, and tried all the combinations without it, there's nothing stopping the recursive call that drops out the second character from removing the first character again and trying all those combinations once more. We're (re)testing far too many strings!
The basic problem is that you need to prune (i.e. avoid) candidates as you go, rather than generate all possibilities and test them. If a candidate's first letter is less than that of the best string you've seen so far, no manipulation of the remaining characters in that candidate is going to improve it.
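Taken to its limit, that pruning idea becomes the usual greedy, single-pass approach to this particular problem: keep a stack of characters and discard a kept character whenever a larger one arrives while removals remain. A rough sketch (my naming; it runs in linear time and assumes plain character comparison is the ordering you want):
def string_reduction_greedy(k, s):
    # Keep len(s) - k characters: scan left to right and drop an
    # already-kept character whenever a bigger one arrives and we
    # still have removals left. Each character is pushed and popped
    # at most once, so this is linear in len(s).
    stack = []
    removals = k
    for ch in s:
        while stack and removals > 0 and stack[-1] < ch:
            stack.pop()
            removals -= 1
        stack.append(ch)
    # If removals remain, the kept characters are non-increasing,
    # so dropping from the end keeps the largest possible prefix.
    if removals:
        stack = stack[:-removals]
    return "".join(stack)

print(string_reduction_greedy(2, "abcde"))           # cde
print(string_reduction_greedy(8, "dermosynovitis"))  # yvitis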

need to decrease the run time of my program

I had a question where I had to find contiguous substrings of a string, with the condition that the first and last letters of the substring had to be the same. I tried doing it, but the runtime exceeded the time limit for several test cases. I tried using map for a for loop, but I have no idea what to do for the nested for loop. Can anyone please help me decrease the runtime of this program?
n = int(raw_input())
string = str(raw_input())

def get_substrings(string):
    length = len(string)
    list = []
    for i in range(length):
        for j in range(i, length):
            list.append(string[i:j + 1])
    return list

substrings = get_substrings(string)
contiguous = filter(lambda x: (x[0] == x[len(x) - 1]), substrings)
print len(contiguous)
If I understand the question properly (please let me know if that's not the case), try this:
I'm not sure how much it will speed up the runtime, but I believe this algorithm will help, especially for longer strings, since it eliminates the nested loop. Iterate through the string once, storing the index (position) of each character in a data structure with constant-time lookup (a hashmap, or an array if set up properly). When finished you should have a data structure storing all the different locations of every character. Using this you can easily retrieve the substrings.
Example:
codingisfun
Take the letter i, for example: after doing what I said above, you look it up in the hashmap and see that it occurs at indexes 3 and 6. That means you can do something like substring(3, 6) to get it.
Not the best code, but it seems reasonable as a starting point... you may be able to eliminate a loop with some creative thinking:
import string
import itertools

my_string = 'helloilovetocode'

mappings = dict()
for index, char in enumerate(my_string):
    if not mappings.has_key(char):
        mappings[char] = list()
    mappings[char].append(index)
    print char

for char in mappings:
    if len(mappings[char]) > 1:
        for subset in itertools.combinations(mappings[char], 2):
            print my_string[subset[0]:(subset[1]+1)]
The problem is that your code is far too inefficient in terms of algorithmic complexity.
Here's an alternative (a cleaner but slightly slower version of soliman's I believe)
import collections

def index_str(s):
    """
    returns the indices characters show up at
    """
    indices = collections.defaultdict(list)
    for index, char in enumerate(s):
        indices[char].append(index)
    return indices

def get_substrings(s):
    indices = index_str(s)
    for key, index_lst in indices.items():
        num_indices = len(index_lst)
        for i in range(num_indices):
            for j in range(i, num_indices):
                yield s[index_lst[i]: index_lst[j] + 1]
The algorithmic problem with your solution is that you blindly check each possible substring, when you can easily determine what the actual pairs are in a single, linear-time pass. If you only want the count, that can be determined easily in O(MN) time, for a string of length N and M unique characters (given the number of occurrences of a char, you can mathematically figure out how many substrings there are). Of course, in the worst case (all chars are the same), your code will have the same complexity as ours, but in the average case yours is much worse, since you have a nested for loop (n^2 time).
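To make the counting remark concrete, here is a sketch of a count-only version (hypothetical name): every pair of positions i <= j with s[i] == s[j] corresponds to exactly one qualifying substring s[i:j+1], so a character that occurs c times contributes c*(c+1)/2.
from collections import Counter

def count_same_end_substrings(s):
    # Each character occurring c times gives c*(c+1)//2 position pairs i <= j
    # with s[i] == s[j]; single characters count, since their first and last
    # letter coincide.
    return sum(c * (c + 1) // 2 for c in Counter(s).values())

print(count_same_end_substrings("abcab"))  # 7: five single letters, "abca", "bcab"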

Find string matches count in list using Python

I read a text file for some analysis; each word is appended to a list and given an id:
#!/usr/bin/python3
with fi as myfile:
    for line in myfile:
        for item in line.split(' '):
            db[0].append(id_+1)
            db[2].append(item)
            # ...more stuff
Then I search for each word through the list to find its matches, and store the count as sim1. If a match is found, I test if the next word matches the consecutive one as well, and store its count as sim2. Similarly for sim3. My code looks like:
for i in range(id_-3):
    sim1 = 0
    sim2 = 0
    sim3 = 0
    for j in range(id_-3):
        if i == j: continue
        if db[2][i] == db[2][j]:
            sim1 += 1
            if db[2][i+1] == db[2][j+1]:
                sim2 += 1
                if db[2][i+2] == db[2][j+2]:
                    sim3 += 1
    db[3].append(sim1)
    db[4].append(sim2)
    db[5].append(sim3)
This works, but it's too slow!
I believe Python provides faster search methods, but I'm still a Python newbie!
The slowness in your algorithm mainly comes from the fact that you have an inner loop which iterates len(db[2]) times contained within an outer loop which also iterates len(db[2]) times. This means the inner code is executing len(db[2])^2 times. If your file is large and you are parsing 5000 words, for example, then the code runs 5000^2 = 25,000,000 times!
So, the angle of attack to solve the problem is to find a way to eliminate or significantly reduce the cost of that inner loop. Below is an example solution which only needs to iterate through len(db[2]) one time, and then does a second separate loop which iterates through a much smaller set of items. There are a few inner loops within the second iteration, but they run an even smaller number of times and have almost inconsequential cost.
I timed your algorithm and my algorithm using a text file which weighed in at about 48kb. Your algorithm averaged about 14 seconds on my computer and my algorithm averaged 0.6 seconds. So, by taking away that inner loop, the algorithm is now over 23 times faster. I also made some other minor optimizations, such as changing the comparison to be between numbers rather than text, and creating the storage arrays at full size from the start in order to avoid using append(). Append() causes the interpreter to dynamically increase the array's size as needed, which is slower.
from collections import defaultdict

# Create zero-filled sim1, sim2, sim3 arrays to avoid append() overhead
len_ = len(db[2]) - 2
for _ in range(3):
    db.append([0] * len_)

# Create dictionary, containing d['word'] = [count, [indexes]]
# Do just one full iteration, and make good use of it by calculating
# sim1 (as 'count') and storing an array of number indexes for each word,
# allowing for a very efficient loop coming up...
d = defaultdict(lambda: [0, []])
for index, word in enumerate(db[2]):
    if index < len_:
        # Accumulate sim1
        d[word][0] += 1
    # Store all db[2] indexes where this word exists
    d[word][1].append(index)

# Now loop only through words which occur more than once (smaller loop)
for word, (count, indexes) in d.iteritems():
    if count > 1:
        for i in indexes:
            if i < len_:
                # Place the sim1 values into the db[3] array
                db[3][i] = count - 1
                # Look for sim2 matches by using index numbers
                next_word = db[2][i+1]
                for next_word_index in d[next_word][1]:
                    if next_word_index - 1 != i and next_word_index - 1 in indexes:
                        # Accumulate sim2 value in db[4]
                        db[4][i] += 1
                        # Look for sim3 matches
                        third_word = db[2][i+2]
                        if third_word == db[2][next_word_index + 1]:
                            # Accumulate sim3 value in db[5]
                            db[5][i] += 1
Yep, you're performing a string compare. That's really slow.
What you want is to compile your string as a regular expression pattern. :)
Have a look at the re library from Python.
Python: re

Finding the most similar numbers across multiple lists in Python

In Python, I have 3 lists of floating-point numbers (angles), in the range 0-360, and the lists are not the same length. I need to find the triplet (with 1 number from each list) in which the numbers are the closest. (It's highly unlikely that any of the numbers will be identical, since this is real-world data.) I was thinking of using a simple lowest-standard-deviation method to measure agreement, but I'm not sure of a good way to implement this. I could loop through each list, comparing the standard deviation of every possible combination using nested for loops, and have a temporary variable save the indices of the triplet that agrees the best, but I was wondering if anyone had a better or more elegant way to do something like this. Thanks!
I wouldn't be surprised if there is an established algorithm for doing this, and if so, you should use it. But I don't know of one, so I'm going to speculate a little.
If I had to do it, the first thing I would try would be just to loop through all possible combinations of all the numbers and see how long it takes. If your data set is small enough, it's not worth the time to invent a clever algorithm. To demonstrate the setup, I'll include the sample code:
# setup
import itertools

def distance(nplet):
    '''Takes a pair or triplet (an "n-plet") as a list, and returns its distance.
    A smaller return value means better agreement.'''
    # your choice of implementation here. Example:
    return variance(nplet)

# algorithm
def brute_force(*lists):
    return min(itertools.product(*lists), key = distance)
For a large data set, I would try something like this: first create one triplet for each number in the first list, with its first entry set to that number. Then go through this list of partially-filled triplets and for each one, pick the number from the second list that is closest to the number from the first list and set that as the second member of the triplet. Then go through the list of triplets and for each one, pick the number from the third list that is closest to the first two numbers (as measured by your agreement metric). Finally, take the best of the bunch. This sample code demonstrates how you could try to keep the runtime linear in the length of the lists.
def item_selection(listA, listB, listC):
    # make the list of partially-filled triplets
    triplets = [[a] for a in listA]
    iT = 0
    iB = 0
    while iT < len(triplets):
        # make iB the index of the first value in listB not below triplets[iT][0]
        while iB < len(listB) and listB[iB] < triplets[iT][0]:
            iB += 1
        if iB == 0:
            triplets[iT].append(listB[0])
        elif iB == len(listB):
            triplets[iT].append(listB[-1])
        else:
            # look at the values in listB just below and just above triplets[iT][0]
            # and add the closer one as the second member of the triplet
            dist_lower = distance([triplets[iT][0], listB[iB - 1]])
            dist_upper = distance([triplets[iT][0], listB[iB]])
            if dist_lower < dist_upper:
                triplets[iT].append(listB[iB - 1])
            elif dist_lower > dist_upper:
                triplets[iT].append(listB[iB])
            else:
                # if they are equidistant, add both
                triplets[iT].append(listB[iB - 1])
                iT += 1
                triplets[iT:iT] = [[triplets[iT-1][0], listB[iB]]]
        iT += 1
    # then another loop while iT < len(triplets) to add in the numbers from listC
    return min(triplets, key = distance)
The thing is, I can imagine situations where this wouldn't actually find the best triplet, for instance if a number from the first list is close to one from the second list but not at all close to anything in the third list. So something you could try is to run this algorithm for all 6 possible orderings of the lists. I can't think of a specific situation where that would fail to find the best triplet, but one might still exist. In any case the algorithm will still be O(N) if you use a clever implementation, assuming the lists are sorted.
def symmetrized_item_selection(listA, listB, listC):
    best_results = []
    for ordering in itertools.permutations([listA, listB, listC]):
        best_results.append(item_selection(*ordering))
    return min(best_results, key = distance)
Another option might be to compute all possible pairs of numbers between list 1 and list 2, between list 1 and list 3, and between list 2 and list 3. Then sort all three lists of pairs together, from best to worst agreement between the two numbers. Starting with the closest pair, go through the list pair by pair and any time you encounter a pair which shares a number with one you've already seen, merge them into a triplet. For a suitable measure of agreement, once you find your first triplet, that will give you a maximum pair distance that you need to iterate up to, and once you get up to it, you just choose the closest triplet of the ones you've found. I think that should consistently find the best possible triplet, but it will be O(N^2 log N) because of the requirement for sorting the lists of pairs.
import collections

def pair_sorting(listA, listB, listC):
    # make all possible pairs of values from two lists
    # each pair has the structure ((number, origin_list), (number, origin_list))
    # so we know which lists the numbers came from
    all_pairs = []
    all_pairs += [((nA,0), (nB,1)) for (nA,nB) in itertools.product(listA,listB)]
    all_pairs += [((nA,0), (nC,2)) for (nA,nC) in itertools.product(listA,listC)]
    all_pairs += [((nB,1), (nC,2)) for (nB,nC) in itertools.product(listB,listC)]
    all_pairs.sort(key = lambda p: distance([p[0][0], p[1][0]]))
    # make a dict to track which (number, origin_list)s we've already seen
    pairs_by_number_and_list = collections.defaultdict(list)
    min_distance = float('inf')
    min_triplet = None
    # start with the closest pair
    for pair in all_pairs:
        # for the first value of the current pair, see if we've seen that particular
        # (number, origin_list) combination before
        for pair2 in pairs_by_number_and_list[pair[0]]:
            # if so, that means the current pair shares its first value with
            # another pair, so put the 3 unique values together to make a triplet
            this_triplet = (pair[1][0], pair2[0][0], pair2[1][0])
            # check if the triplet agrees more than the previous best triplet
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # do the same thing but checking the second element of the current pair
        for pair2 in pairs_by_number_and_list[pair[1]]:
            this_triplet = (pair[0][0], pair2[0][0], pair2[1][0])
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # finally, add the current pair to the list of pairs we've seen
        pairs_by_number_and_list[pair[0]].append(pair)
        pairs_by_number_and_list[pair[1]].append(pair)
    return min_triplet
N.B. I've written all the code samples in this answer out a little more explicitly than you'd do it in practice to help you to understand how they work. But when doing it for real, you'd use more list comprehensions and such things.
N.B.2. No guarantees that the code works :-P but it should get the rough idea across.

Python, Huge Iteration Performance Problem

I'm iterating through 3 words, each about 5 million characters long, and I want to find sequences of 20 characters that identify each word. That is, I want to find all sequences of length 20 in one word that are unique to that word. My problem is that the code I've written takes an extremely long time to run. I've never even completed one word running my program overnight.
The function below takes a list containing dictionaries, where each dictionary contains every possible 20-character sequence and its location, taken from one of the 5-million-character words.
If anybody has an idea how to optimize this I would be really thankful, I don't have a clue how to continue...
here's a sample of my code:
def findUnique(list):
    # Takes a list with dictionaries and compares each element in the dictionaries
    # with the others and puts all unique elements in new dictionaries and finally
    # puts the new dictionaries in a list.
    # The result is a list with (in this case) 3 dictionaries containing all unique
    # sequences and their locations from each string.
    dicList = []
    listlength = len(list)
    s = 0
    valuelist = []
    for i in list:
        j = i.values()
        valuelist.append(j)
    while s < listlength:
        currdic = list[s]
        dic = {}
        for key in currdic:
            currval = currdic[key]
            test = True
            n = 0
            while n < listlength:
                if n != s:
                    if currval in valuelist[n]:  # this is where it takes too much time
                        n = listlength
                        test = False
                    else:
                        n += 1
                else:
                    n += 1
            if test:
                dic[key] = currval
        dicList.append(dic)
        s += 1
    return dicList
def slices(seq, length, prefer_last=False):
    unique = {}
    if prefer_last:  # this doesn't have to be a parameter, just choose one
        for start in xrange(len(seq) - length + 1):
            unique[seq[start:start+length]] = start
    else:  # prefer first
        for start in xrange(len(seq) - length, -1, -1):
            unique[seq[start:start+length]] = start
    return unique

# or find all locations for each slice:
import collections

def slices(seq, length):
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[seq[start:start+length]].append(start)
    return unique
This function (currently in my iter_util module) is O(n) (n being the length of each word) and you would use set(slices(..)) (with set operations such as difference) to get slices unique across all words (example below). You could also write the function to return a set, if you don't want to track locations. Memory usage will be high (though still O(n), just a large factor), possibly mitigated (though not by much if length is only 20) with a special "lazy slice" class that stores the base sequence (the string) plus start and stop (or start and length).
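The "lazy slice" idea might look roughly like this (just a sketch; hashing and equality still build the window temporarily, so the saving is only in what the set or dict keeps around long-term):
class LazySlice(object):
    """Refer to a window of a base string instead of copying it."""
    __slots__ = ("seq", "start", "length")

    def __init__(self, seq, start, length):
        self.seq, self.start, self.length = seq, start, length

    def materialize(self):
        return self.seq[self.start:self.start + self.length]

    def __hash__(self):  # builds the window temporarily
        return hash(self.materialize())

    def __eq__(self, other):
        return self.materialize() == other.materialize()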
Printing unique slices:
a = set(slices("aab", 2)) # {"aa", "ab"}
b = set(slices("abb", 2)) # {"ab", "bb"}
c = set(slices("abc", 2)) # {"ab", "bc"}
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (x for x in all if x is not a), a)
print a_unique # {"aa"}
Including locations:
a = slices("aab", 2)
b = slices("abb", 2)
c = slices("abc", 2)
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (set(x) for x in all if x is not a), set(a))
# a_unique is only the keys so far
a_unique = dict((k, a[k]) for k in a_unique)
# now it's a dict of slice -> location(s)
print a_unique # {"aa": 0} or {"aa": [0]}
# (depending on which slices function used)
In a test script closer to your conditions, using randomly generated words of 5m characters and a slice length of 20, memory usage was so high that my test script quickly hit my 1G main-memory limit and started thrashing virtual memory. At that point Python spent very little time on the CPU and I killed it. After reducing either the slice length or the word length (since I used completely random words, which reduces duplicates and increases memory use) so that everything fit within main memory, it ran in under a minute. This situation, plus the O(n**2) behaviour of your original code, will take forever, and is why algorithmic time and space complexity are both important.
import operator
import random
import string

def slices(seq, length):
    unique = {}
    for start in xrange(len(seq) - length, -1, -1):
        unique[seq[start:start+length]] = start
    return unique

def sample_with_repeat(population, length, choice=random.choice):
    return "".join(choice(population) for _ in xrange(length))

word_length = 5*1000*1000
words = [sample_with_repeat(string.lowercase, word_length) for _ in xrange(3)]
slice_length = 20
words_slices_sets = [set(slices(x, slice_length)) for x in words]
unique_words_slices = [reduce(operator.sub,
                              (x for x in words_slices_sets if x is not n),
                              n)
                       for n in words_slices_sets]
print [len(x) for x in unique_words_slices]
You say you have a "word" 5 million characters long, but I find it hard to believe this is a word in the usual sense.
If you can provide more information about your input data, a specific solution might be available.
For example, English text (or any other written language) might be sufficiently repetitive that a trie would be useable. In the worst case however, it would run out of memory constructing all 256^20 keys. Knowing your inputs makes all the difference.
edit
I took a look at some genome data to see how this idea stacked up, using a hardcoded [acgt]->[0123] mapping and 4 children per trie node.
adenovirus 2: 35,937bp -> 35,899 distinct 20-base sequences using 469,339 trie nodes
enterobacteria phage lambda: 48,502bp -> 40,921 distinct 20-base sequences using 529,384 trie nodes.
I didn't get any collisions, either within or between the two data sets, although maybe there is more redundancy and/or overlap in your data. You'd have to try it to see.
If you do get a useful number of collisions, you could try walking the three inputs together, building a single trie, recording the origin of each leaf and pruning collisions from the trie as you go.
If you can't find some way to prune the keys, you could try using a more compact representation. For example you only need 2 bits to store [acgt]/[0123], which might save you space at the cost of slightly more complex code.
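To make the 2-bit representation concrete, here is a rough sketch (it assumes clean lowercase [acgt] input and nothing else) that turns every 20-base window into a 40-bit integer, which can then serve as a compact dict key or trie path:
CODE = {"a": 0, "c": 1, "g": 2, "t": 3}  # assumes clean lowercase [acgt] input

def packed_kmers(seq, length=20):
    # Yields (start, code) for every window of `length` bases, where code is a
    # 2*length-bit integer (40 bits for length 20), built incrementally instead
    # of re-encoding all 20 characters for each window.
    mask = (1 << (2 * length)) - 1
    packed = 0
    for i, base in enumerate(seq):
        packed = ((packed << 2) | CODE[base]) & mask
        if i >= length - 1:
            yield i - length + 1, packed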
I don't think you can just brute force this though - you need to find some way to reduce the scale of the problem, and that depends on your domain knowledge.
Let me build off Roger Pate's answer. If memory is an issue, I'd suggest that instead of using the strings as the keys to the dictionary, you could use hashed values of the strings. This would save the cost of storing an extra copy of the strings as the keys (at worst, 20 times the storage of an individual "word").
import collections

def hashed_slices(seq, length, hasher=None):
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[hasher(seq[start:start+length])].append(start)
    return unique
(If you really want to get fancy, you can use a rolling hash, though you'll need to change the function.)
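Since the rolling hash isn't spelled out above, here is one minimal sketch (a plain polynomial hash; the base and modulus are arbitrary placeholder choices) that updates the hash in O(1) per window instead of rehashing all 20 characters. It would replace the inner loop of hashed_slices rather than being passed in as hasher, since hasher is called on one already-sliced window at a time:
def rolling_hashes(seq, length, base=257, mod=(1 << 61) - 1):
    # Polynomial rolling hash of every `length`-character window: drop the
    # outgoing character's contribution, shift, add the incoming character.
    if len(seq) < length:
        return
    top = pow(base, length - 1, mod)  # weight of the window's first character
    h = 0
    for ch in seq[:length]:
        h = (h * base + ord(ch)) % mod
    yield 0, h
    for start in range(1, len(seq) - length + 1):
        outgoing, incoming = seq[start - 1], seq[start + length - 1]
        h = ((h - ord(outgoing) * top) * base + ord(incoming)) % mod
        yield start, h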
Now, we can combine all the hashes :
unique = []  # Unique words in first string
# create a dictionary of hash values -> word index -> start position
hashed_starts = [hashed_slices(word, 20, hashing_fcn) for word in words]
all_hashed = collections.defaultdict(dict)
for i, hashed in enumerate(hashed_starts) :
    for h, starts in hashed.iteritems() :
        # We only care about the first word
        if h in hashed_starts[0] :
            all_hashed[h][i] = starts

# Now check all hashes
for starts_by_word in all_hashed.itervalues() :
    if len(starts_by_word) == 1 :
        # if there's only one word for the hash, it's obviously valid
        unique.extend(words[0][i:i+20]
                      for starts in starts_by_word.values()
                      for i in starts)
    else :
        # we might have a hash collision
        candidates = {}
        for word_idx, starts in starts_by_word.iteritems() :
            candidates[word_idx] = set(words[word_idx][j:j+20] for j in starts)
        # Now that we have the candidate slices, find the unique ones
        valid = candidates[0]
        for word_idx, candidate_set in candidates.iteritems() :
            if word_idx != 0 :
                valid -= candidate_set
        unique.extend(valid)
(I tried extending it to do all three. It's possible, but the complications would detract from the algorithm.)
Be warned, I haven't tested this. Also, there's probably a lot you can do to simplify the code, but the algorithm makes sense. The hard part is choosing the hash. Too many collisions and you won't gain anything. Too few and you'll hit the memory problems. If you are dealing with just DNA base codes, you can hash the 20-character string to a 40-bit number and still have no collisions. So the slices will take up nearly a fourth of the memory. That would save roughly 250 MB of memory compared to Roger Pate's answer.
The code is still O(N^2), but the constant should be much lower.
Let's attempt to improve on Roger Pate's excellent answer.
Firstly, let's keep sets instead of dictionaries - they manage uniqueness anyway.
Secondly, since we are likely to run out of memory faster than we run out of CPU time (and patience), we can sacrifice CPU efficiency for the sake of memory efficiency. So perhaps try only the 20s starting with one particular letter. For DNA, this cuts the requirements down by 75%.
seqlen = 20
seqtrie = {}      # sequence -> list of word ids that contain it
letters = 'acgt'  # the possible starting letters, one pass per letter
maxlength = max([len(word) for word in words])
for startletter in letters:
    for letterid in range(maxlength):
        for wordid, word in enumerate(words):
            if letterid < len(word):
                letter = word[letterid]
                if letter == startletter:
                    seq = word[letterid:letterid+seqlen]
                    if seq not in seqtrie:
                        seqtrie[seq] = [wordid]
                    elif wordid not in seqtrie[seq]:
                        seqtrie[seq].append(wordid)
Or, if that's still too much memory, we can go through for each possible starting pair (16 passes instead of 4 for DNA), or every 3 (64 passes) etc.
