Find string matches count in list using Python


I read a text file for some analysis, each word is appended to a list and given an id
#!/usr/bin/python3
with fi as myfile:
    for line in myfile:
        for item in line.split(' '):
            db[0].append(id_+1)
            db[2].append(item)
            ...more stuff
Then I search for each word through the list to find its matches, and store the count as sim1. If a match is found, I test if the next word matches the consecutive one as well, and store its count as sim2. Similarly for sim3. My code looks like:
for i in range(id_-3):
    sim1=0
    sim2=0
    sim3=0
    for j in range(id_-3):
        if i==j: continue;
        if db[2][i] == db[2][j]:
            sim1 += 1
            if db[2][i+1] == db[2][j+1]:
                sim2 += 1
                if db[2][i+2] == db[2][j+2]:
                    sim3 += 1
    db[3].append(sim1)
    db[4].append(sim2)
    db[5].append(sim3)
This works, but it's too slow!
I believe python provides faster search methods, but I'm still a Py newbie!

The slowness in your algorithm mainly comes from the fact that you have an inner loop which iterates len(db[2]) times contained within an outer loop which also iterates len(db[2]) times. This means the inner code is executing len(db[2])^2 times. If your file is large and you are parsing 5000 words, for example, then the code runs 5000^2 = 25,000,000 times!
So, the angle of attack to solve the problem is to find a way to eliminate or significantly reduce the cost of that inner loop. Below is an example solution which only needs to iterate through len(db[2]) one time, and then does a second separate loop which iterates through a much smaller set of items. There are a few inner loops within the second iteration, but they run an even smaller number of times and have almost inconsequential cost.
I timed your algorithm and my algorithm using a text file which weighed in at about 48 KB. Your algorithm averaged about 14 seconds on my computer and my algorithm averaged 0.6 seconds. So, by taking away that inner loop, the algorithm is now over 23 times faster. I also made some other minor optimizations, such as changing the comparison to be between numbers rather than text, and creating the storage lists at full size from the start in order to avoid using append(). Calling append() makes the interpreter grow the list dynamically as needed, which is slower.
from collections import defaultdict
# Create zero-filled sim1, sim2, sim3 arrays to avoid append() overhead
len_ = len(db[2]) - 2
for _ in range(3):
    db.append([0] * len_)
# Create dictionary, containing d['word'] = [count, [indexes]]
# Do just one full iteration, and make good use of it by calculating
# sim1 (as 'count') and storing an array of number indexes for each word,
# allowing for a very efficient loop coming up...
d = defaultdict(lambda: [0, []])
for index, word in enumerate(db[2]):
    if index < len_:
        # Accumulate sim1
        d[word][0] += 1
    # Store all db[2] indexes where this word exists
    d[word][1].append(index)
# Now loop only through words which occur more than once (smaller loop)
# (items() works on both Python 2 and Python 3; iteritems() is Python 2 only)
for word, (count, indexes) in d.items():
    if count > 1:
        # Place the sim1 values into the db[3] array
        for i in indexes:
            if i < len_:
                db[3][i] = count - 1
                # Look for sim2 matches by using index numbers
                next_word = db[2][i+1]
                for next_word_index in d[next_word][1]:
                    if next_word_index - 1 != i and next_word_index - 1 in indexes:
                        # Accumulate sim2 value in db[4]
                        db[4][i] += 1
                        # Look for sim3 matches
                        third_word = db[2][i+2]
                        if third_word == db[2][next_word_index + 1]:
                            # Accumulate sim3 value in db[5]
                            db[5][i] += 1
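For comparison, here is a minimal sketch of a different way to remove the inner loop. It is not the code above: it counts single words, adjacent pairs and adjacent triples with collections.Counter. The function name ngram_match_counts is made up for illustration, and it assumes the same convention as above, that only positions with two following words are considered.

```python
from collections import Counter

def ngram_match_counts(words):
    """Return (sim1, sim2, sim3) lists for every position that has two following words."""
    n = max(len(words) - 2, 0)  # assumption: same cut-off as len_ in the answer above
    singles = Counter(words[i] for i in range(n))
    pairs = Counter((words[i], words[i + 1]) for i in range(n))
    triples = Counter((words[i], words[i + 1], words[i + 2]) for i in range(n))
    # Each position matches itself once, so subtract 1 to count only the other positions.
    sim1 = [singles[words[i]] - 1 for i in range(n)]
    sim2 = [pairs[(words[i], words[i + 1])] - 1 for i in range(n)]
    sim3 = [triples[(words[i], words[i + 1], words[i + 2])] - 1 for i in range(n)]
    return sim1, sim2, sim3

sim1, sim2, sim3 = ngram_match_counts("the cat sat the cat ran the cat sat".split())
print(sim1, sim2, sim3)
```

Each Counter is built in one pass, so the work grows linearly with the number of words rather than quadratically.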

Yep, you're performing a string compare. That's really slow.
What you want is to compile your string as a regular expression pattern. :)
Have a look at Python's re library.
Python: re

Related

Merge sorting algorithm in Python for two sorted lists - trouble constructing for-loop

I'm trying to create an algorithm to merge two ordered lists into a larger ordered list in Python. Essentially I began by trying to isolate the minimum elements in each list and then I compared them to see which was smallest, because that number would be smallest in the larger list as well. I then appended that element to the empty larger list, and then deleted it from the original list it came from. I then tried to loop through the original two lists doing the same thing. Inside the "if" statements, I've essentially tried to program the function to append the remainder of one list to the larger list if the other is/becomes empty, because there would be no point in asking which elements between the two lists are comparatively smaller then.
def merge_cabs(cab1, cab2):
    for (i <= all(j) for j in cab1):
        for (k <= all(l) for l in cab2):
            if cab1 == []:
                newcab.append(cab2)
            if cab2 == []:
                newcab.append(cab1)
            else:
                k = min(min(cab1), min(cab2))
                newcab.append(k)
                if min(cab1) < min(cab2):
                    cab1.remove(min(cab1))
                if min(cab2) < min(cab1):
                    cab2.remove(min(cab2))
    print(newcab)
cab1 = [1,2,5,6,8,9]
cab2 = [3,4,7,10,11]
newcab = []
merge_cabs(cab1, cab2)
I've had a bit of trouble constructing the for-loop unfortunately. One way I've tried to isolate the minimum values was as I wrote in the two "for" lines. Right now, Python is returning "SyntaxError: invalid syntax," pointing to the colon in the first "for" line. Another way I've tried to construct the for-loop was like this:
def merge_cabs(cabs1, cabs2):
    for min(i) in cab1:
        for min(j) in cab2:
I've also tried to write the expression all in one line like this:
def merge_cabs(cab1, cab2):
    for min(i) in cabs1 and min(j) in cabs2:
and to loop through a copy of the original lists rather than looping through the lists themselves, because searching through the site, I've found that it can sometimes be difficult to remove elements from a list you're looping through. I've also tried to protect the expressions after the "for" statements inside various configurations of parentheses. If someone sees where the problem(s) lies, it would really be great if you could point it out, or if you have any other observations that could help me better construct this function, I would really appreciate those too.

Here's a very simple-minded solution to this that uses only very basic Python operations:
def merge_cabs(cab1, cab2):
    len1 = len(cab1)
    len2 = len(cab2)
    i = 0
    j = 0
    newcab = []
    while i < len1 and j < len2:
        v1 = cab1[i]
        v2 = cab2[j]
        if v1 <= v2:
            newcab.append(v1)
            i += 1
        else:
            newcab.append(v2)
            j += 1
    while i < len1:
        newcab.append(cab1[i])
        i += 1
    while j < len2:
        newcab.append(cab2[j])
        j += 1
    return newcab
Things to keep in mind:
You should not have any nested loops. Merging two sorted lists is typically used to implement a merge sort, and the merge step should be linear. I.e., the algorithm should be O(n).
You need to walk both lists together, choosing the smallest value at each step, and advancing only the list that contains the smallest value. When one of the lists is consumed, the remaining elements from the unconsumed list are simply appended in order.
You should not be calling min or max etc. in your loop, since that will effectively introduce a nested loop, turning the merge into an O(n**2) algorithm, which ignores the fact that the lists are known to be sorted.
Similarly, you should not be calling any external sort function to do the merge, since that will result in an O(n*log(n)) merge (or worse, depending on the sort algorithm), and again ignores the fact that the lists are known to be sorted.

Firstly, there's a function in the (standard library) heapq module for doing exactly this, heapq.merge; if this is a real problem (rather than an exercise), you want to use that one instead.
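As a quick illustration of that standard-library route, here is a minimal sketch that uses the lists from the question. heapq.merge returns a lazy iterator, so it is wrapped in list() to get the merged list.

```python
import heapq

cab1 = [1, 2, 5, 6, 8, 9]
cab2 = [3, 4, 7, 10, 11]

# heapq.merge expects each input to be sorted already and leaves both inputs untouched.
merged = list(heapq.merge(cab1, cab2))
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```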
If this is an exercise, there are a couple of points:
You'll need to use a while loop rather than a for loop:
while cab1 or cab2:
This will keep repeating the body while there are any items in either of your source lists.
You probably shouldn't delete items from the source lists; that's a relatively expensive operation. In addition, on balance, having a merge_lists function destroy its arguments would be unexpected.
Within the loop you'll refer to cab1[i1] and cab2[i2] (and, in the condition, to i1 < len(cab1)).
(By the time I typed out the explanation, Tom Karzes typed out the corresponding code in another answer...)

Finding overlapping bits in a set of binary strings

I am working on a project in which I need to generate several identifiers for combinatorial pooling of different molecules. To do so, I assign each molecule an n-bit string (where n is the number of pools I have. In this case, 79 pools) and each string has 4 "on" bits (4 bits equal to 1) corresponding to which pools that molecule will appear in. Next, I want to pare down the number of strings such that no two molecules appear in the same pool more than twice (in other words, the greatest number of overlapping bits between two strings can be no greater than 2).
To do this, I: 1) compiled a list of all n-bit strings with k "on" bits, 2) generated a list of lists where each element is a list of indices where the bit is on using re.finditer and 3) iterate through the list to compare strings, adding only strings that meet my criteria into my final list of strings.
The code I use to compare strings:
drug_strings = [] #To store suitable strings for combinatorial pooling rules
class badString(Exception): pass
for k in range(len(strings)):
    bit_current = bit_list[k]
    try:
        for bits in bit_list[:k]:
            intersect = set.intersection(set(bit_current),set(bits))
            if len(intersect) > 2:
                raise badString() #pass on to next iteration if string has overlaps in previous set
        drug_strings.append(strings[k])
    except badString:
        pass
However, this code takes forever to run. I am running this with n=79-bit strings with k=4 "on" bits per string (~1.5M possible strings) so I assume that the long runtime is because I am comparing each string to every previous string. Is there an alternative/smarter way to go about doing this? An algorithm that would work faster and be more robust?
EDIT: I realized that the simpler way to approach this problem instead of identifying the entire subset of strings that would be suitable for my project was to just randomly sample the larger set of n-bit strings with k "on" bits, store only the strings that fit my criteria, and then once I have an appropriate amount of suitable strings, simply take as many as I need from those. New code is as follows:
my_strings = []
my_bits = []
for k in range(2000):
    random = np.random.randint(0, len(strings_77))
    string = strings_77.pop(random)
    bits = [m.start()+1 for m in re.finditer('1',string)]
    if all(len(set(bits) & set(my_bit)) <= 2
           for my_bit in my_bits[:k]):
        my_strings.append(string)
        my_bits.append(bits)
Now I only have to compare against strings I've already pulled (at most 1999 previous strings instead of up to 1 million). It runs much more quickly this way. Thanks for the help!

Raising exceptions is expensive. A complex data structure is created and the stack has to be unwound. In fact, setting up a try/except block is expensive.
Really you're wanting to check that all intersections have length less than or equal to two, and then append. There is no need for exceptions.
for k in range(len(strings)):
    bit_current = bit_list[k]
    if all(len(set(bit_current) & set(bits)) <= 2
           for bits in bit_list[:k]):
        drug_strings.append(strings[k])
Also, instead of having to look up the strings and bit_list index, you can iterate over all the parts you need at the same time. You still need the index for the bit_list slice:
for index, (drug_string, bit_current) in enumerate(zip(strings, bit_list)):
    if all(len(set(bit_current) & set(bits)) <= 2
           for bits in bit_list[:index]):
        drug_strings.append(drug_string)
You can also avoid creating the bit_current set with each loop:
for index, (drug_string, bit_current) in enumerate(zip(strings, bit_list)):
    bit_set = set(bit_current)
    if all(len(bit_set & set(bits)) <= 2
           for bits in bit_list[:index]):
        drug_strings.append(drug_string)
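Going one step further in the same direction, the sets for the earlier entries can also be built once up front instead of on every comparison. The following is a minimal sketch under that assumption; the function name filter_low_overlap and the max_overlap parameter are made up for illustration, and like the loops above it compares each string against all earlier entries of bit_list.

```python
from itertools import islice

def filter_low_overlap(strings, bit_list, max_overlap=2):
    """Keep strings whose 'on' bits overlap every earlier entry in at most max_overlap places."""
    # Build each set exactly once instead of inside the comparison loop.
    bit_sets = [frozenset(bits) for bits in bit_list]
    kept = []
    for index, (drug_string, bit_set) in enumerate(zip(strings, bit_sets)):
        # islice walks the earlier sets without copying the list on every iteration.
        if all(len(bit_set & earlier) <= max_overlap
               for earlier in islice(bit_sets, index)):
            kept.append(drug_string)
    return kept

strings = ["11110", "11101", "10011"]
bit_list = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 4, 5]]
print(filter_low_overlap(strings, bit_list))  # ['11110', '10011']
```

This is still quadratic in the number of strings; it only removes repeated set construction and list slicing.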

Some minor things that I would improve in your code that may be causing some overhead:
set(bit_current): move this outside the inner loop;
remove the raise/except part;
Since you have the condition if len(intersect) > 2:, you could implement the intersection yourself so that it stops as soon as that condition is met. That way you avoid unnecessary computation.
So the code would become:
for k in range(len(strings)):
    bit_current = set(bit_list[k])
    intersect = []
    for bits in bit_list[:k]:
        intersect = []
        b = set(bits)
        for i in bit_current:
            if i in b:
                intersect.append(i)
            if len(intersect) > 2:
                break
        if len(intersect) > 2:
            break
    if len(intersect) <= 2:
        drug_strings.append(strings[k])

Performance task in Python

We have IT olympiads in my country. Normally solutions are written in Java, C or C++. I guess for a year or so they have also allowed other languages like Python.
I tried to solve a task from previous years, called Letters, in Python, and I keep failing. The task is to write code that counts the minimum number of shifts between neighboring letters needed to turn one string into another.
As input you get the number of letters in one string, followed by two strings with the same letters in a different order. The length of one string is from 2 to 1 000 000 letters. There are only capital letters; they can but don't have to be sorted, and they can repeat.
Here's an example:
7
AABCDDD
DDDBCAA
Correct output should be 16
As output you have to return a single value, which is the minimum number of shifts. The output has to be calculated in under 5 seconds.
I made it calculate the correct output, but for longer strings (like 800 000 letters) it starts to slow down. The longest inputs return a value in about 30 seconds. There's also one input with 900 000 letters per word that takes 30 minutes to calculate!
Under link you can find all input files for tests:
https://oi.edu.pl/l/19oi_ksiazeczka/
Click on this link to download files for "Letters" task:
XIX OI testy i rozwiązania - zad. LIT (I etap) (3.5 MB)
Below is my code. How can I speed it up?
# import time
import sys
# start = time.time()
def file_reader():
    standard_input=""
    try:
        data = sys.stdin.readlines()
        for line in data:
            standard_input+=line
    except:
        print("An exception occurred")
    return standard_input
def mergeSortInversions(arr):
    if len(arr) == 1:
        return arr, 0
    else:
        a = arr[:len(arr)//2]
        b = arr[len(arr)//2:]
        a, ai = mergeSortInversions(a)
        b, bi = mergeSortInversions(b)
        c = []
        i = 0
        j = 0
        inversions = 0 + ai + bi
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            c.append(a[i])
            i += 1
        else:
            c.append(b[j])
            j += 1
            inversions += (len(a)-i)
    c += a[i:]
    c += b[j:]
    return c, inversions
def literki():
    words=(file_reader()).replace("\n", "")
    number = int("".join((map(str, ([int(i) for i in list(words) if i.isdigit()])))))
    all_letters = [x for x in list(words) if x not in str(number)]
    name = all_letters[:number]
    anagram = all_letters[number:]
    p=[]
    index=list(range(len(anagram)))
    anagram_dict = {index[i]: anagram[i] for i in range(len(index))}
    new_dict = {}
    anagram_counts={}
    for key, value in anagram_dict.items():
        if value in new_dict:
            new_dict[value].append(key)
        else:
            new_dict[value]=[key]
    for i in new_dict:
        anagram_counts.update({i:new_dict[i]})
    for letter in name:
        a=anagram_counts[letter]
        p.append(a.pop(0))
    print(mergeSortInversions(p)[1])
#>>
literki()
# end = time.time()
# print(start-end)
So, to explain what it does in parts: file_reader simply reads the input from standard input. mergeSortInversions(arr) would normally sort a list, but here I wanted it to return the number of inversions. I'm not smart enough to figure it out by myself; I found it on the web, but it does the job. Unfortunately, for lists of 1 million items it takes 10 seconds or so. In the literki function, I first split the input into the number of letters and the two equal-length words, stored as lists.
Then I made something similar to an array of stacks (not sure if that's what it's called in English). Basically, I made a dictionary with every letter as a key and a list of that letter's indexes as the value (if a letter occurs more than once, the value contains all indexes for that letter). The last thing I do before "the slow thing" is extract the corresponding index for every letter in the name variable. Up to that point, all operations take around 2 seconds for every input. And now the two lines that account for the rest of the calculation time: - I append the index to the p=[] list and at the same time pop it from the list in the dictionary, so it won't be read again for another occurrence of the same letter. - I calculate the number of moves (inversions) with mergeSortInversions(arr) based on the p=[...] list and print it as output.
I know that popping from the bottom is slow, but the alternative would be to create the lists of indexes in reverse (so I could pop from the top), and that took even longer. I've also tried converting a=[...] to a deque, but that was also too slow.
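The approach described above can be condensed into a small, self-contained sketch. The function names min_shifts and count_inversions are made up for illustration, and it has not been benchmarked against the contest inputs. It takes the two words as arguments rather than parsing standard input, and it avoids popping by keeping a "next unused" pointer per letter.

```python
from collections import defaultdict

def count_inversions(arr):
    """Return (sorted copy, inversion count) using merge sort."""
    if len(arr) <= 1:
        return list(arr), 0
    mid = len(arr) // 2
    left, left_inv = count_inversions(arr[:mid])
    right, right_inv = count_inversions(arr[mid:])
    merged = []
    i = j = 0
    inversions = left_inv + right_inv
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
            # every element still waiting in left is larger than right[j]
            inversions += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions

def min_shifts(name, anagram):
    """Count adjacent swaps by mapping each letter of name to its position in anagram."""
    positions = defaultdict(list)
    for index, letter in enumerate(anagram):
        positions[letter].append(index)
    next_unused = defaultdict(int)  # per-letter pointer instead of list.pop(0)
    p = []
    for letter in name:
        p.append(positions[letter][next_unused[letter]])
        next_unused[letter] += 1
    return count_inversions(p)[1]

print(min_shifts("AABCDDD", "DDDBCAA"))  # 16
```

For the example input the position list is [5, 6, 3, 4, 0, 1, 2], which contains 16 inversions.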

I think I'd try a genetic algorithm for this problem. GA's don't always come up with an optimal solution, but they are very good for getting an acceptable solution in a reasonable amount of time. And for small inputs, they can be optimal.
The gist is to come up with:
1) A fitness function that assigns a number indicating how good a particular candidate solution is
2) A sexual reproduction function, that combines, in a simple way, part of two candidate solutions
3) A mutation function, that introduces one small change to a candidate solution.
So you just let those functions go to town, creating solution after solution, and keeping the best ones - not the best one, the best ones.
Then after a while, the best solution found is your answer.
Here's an example of using a GA for another hard problem, called The House Robber Problem. It's in Python:
http://stromberg.dnsalias.org/~strombrg/house-robber-problem/

need to decrease the run time of my program

I had a question where I had to find contiguous substrings of a string, with the condition that the first and last letters of the substring had to be the same. I tried doing it, but the runtime exceeded the time limit for several test cases. I tried using map in place of a for loop, but I have no idea what to do for the nested for loop. Can anyone please help me decrease the runtime of this program?
n = int(raw_input())
string = str(raw_input())
def get_substrings(string):
    length = len(string)
    list = []
    for i in range(length):
        for j in range(i,length):
            list.append(string[i:j + 1])
    return list
substrings = get_substrings(string)
contiguous = filter(lambda x: (x[0] == x[len(x) - 1]), substrings)
print len(contiguous)

If I understand the question properly (please let me know if that's not the case), try this:
I'm not sure if this will speed up the runtime, but I believe this algorithm may, especially for longer strings (it eliminates the nested loop). Iterate through the string once, storing the index (position) of each character in a data structure with constant-time lookup (a hashmap, or an array if set up properly). When finished, you should have a data structure storing all the locations of every character. Using this you can easily retrieve the substrings.
Example:
codingisfun
Take the letter i, for example. After doing what I said above, you look it up in the hashmap and see that it occurs at index 3 and 6, meaning you can do something like substring(3, 6) to get it.
Not the best code, but it seems reasonable for a starting point. You may be able to eliminate a loop with some creative thinking:
import string
import itertools
my_string = 'helloilovetocode'
mappings = dict()
for index, char in enumerate(my_string):
    if not mappings.has_key(char):
        mappings[char] = list()
    mappings[char].append(index)
    print char
for char in mappings:
    if len(mappings[char]) > 1:
        for subset in itertools.combinations(mappings[char], 2):
            print my_string[subset[0]:(subset[1]+1)]

The problem is that your code is far too inefficient in terms of algorithmic complexity.
Here's an alternative (a cleaner but slightly slower version of soliman's I believe)
import collections
def index_str(s):
    """
    returns the indices characters show up at
    """
    indices = collections.defaultdict(list)
    for index, char in enumerate(s):
        indices[char].append(index)
    return indices
def get_substrings(s):
    indices = index_str(s)
    for key, index_lst in indices.items():
        num_indices = len(index_lst)
        for i in range(num_indices):
            for j in range(i, num_indices):
                yield s[index_lst[i]: index_lst[j] + 1]
The algorithmic problem with your solution is that you blindly check each possible substring, when you can easily determine which pairs actually match in a single, linear-time pass. If you only want the count, that can be determined easily in O(MN) time, for a string of length N and M unique characters (given the number of occurrences of a char, you can mathematically figure out how many substrings there are). Of course, in the worst case (all chars are the same), your code will have the same complexity as ours, but in the average case yours is much worse since you have a nested for loop (n^2 time).
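To make the counting remark concrete, here is a minimal sketch; the function name count_same_end_substrings is made up for illustration. A character that occurs c times can start and end c * (c + 1) / 2 substrings, counting the single-character ones as the question's code does.

```python
from collections import Counter

def count_same_end_substrings(s):
    """Count substrings whose first and last characters are equal."""
    # For a character seen c times: c single-character substrings
    # plus c * (c - 1) / 2 pairs of distinct positions.
    return sum(c * (c + 1) // 2 for c in Counter(s).values())

print(count_same_end_substrings("aba"))  # 4: 'a', 'aba', 'b', 'a'
```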

Python, Huge Iteration Performance Problem

I'm doing an iteration through 3 words, each about 5 million characters long, and I want to find sequences of 20 characters that identifies each word. That is, I want to find all sequences of length 20 in one word that is unique for that word. My problem is that the code I've written takes an extremely long time to run. I've never even completed one word running my program over night.
The function below takes a list containing dictionaries where each dictionary contains each possible word of 20 and its location from one of the 5 million long words.
If anybody has an idea how to optimize this I would be really thankful, I don't have a clue how to continue...
here's a sample of my code:
def findUnique(list):
    # Takes a list with dictionaries and compares each element in the dictionaries
    # with the others and puts all unique elements in new dictionaries and finally
    # puts the new dictionaries in a list.
    # The result is a list with (in this case) 3 dictionaries containing all unique
    # sequences and their locations from each string.
    dicList=[]
    listlength=len(list)
    s=0
    valuelist=[]
    for i in list:
        j=i.values()
        valuelist.append(j)
    while s<listlength:
        currdic=list[s]
        dic={}
        for key in currdic:
            currval=currdic[key]
            test=True
            n=0
            while n<listlength:
                if n!=s:
                    if currval in valuelist[n]: #this is where it takes too much time
                        n=listlength
                        test=False
                    else:
                        n+=1
                else:
                    n+=1
            if test:
                dic[key]=currval
        dicList.append(dic)
        s+=1
    return dicList

def slices(seq, length, prefer_last=False):
    unique = {}
    if prefer_last: # this doesn't have to be a parameter, just choose one
        for start in xrange(len(seq) - length + 1):
            unique[seq[start:start+length]] = start
    else: # prefer first
        for start in xrange(len(seq) - length, -1, -1):
            unique[seq[start:start+length]] = start
    return unique
# or find all locations for each slice:
import collections
def slices(seq, length):
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[seq[start:start+length]].append(start)
    return unique
This function (currently in my iter_util module) is O(n) (n being the length of each word) and you would use set(slices(..)) (with set operations such as difference) to get slices unique across all words (example below). You could also write the function to return a set, if you don't want to track locations. Memory usage will be high (though still O(n), just a large factor), possibly mitigated (though not by much if length is only 20) with a special "lazy slice" class that stores the base sequence (the string) plus start and stop (or start and length).
Printing unique slices:
a = set(slices("aab", 2)) # {"aa", "ab"}
b = set(slices("abb", 2)) # {"ab", "bb"}
c = set(slices("abc", 2)) # {"ab", "bc"}
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (x for x in all if x is not a), a)
print a_unique # {"aa"}
Including locations:
a = slices("aab", 2)
b = slices("abb", 2)
c = slices("abc", 2)
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (set(x) for x in all if x is not a), set(a))
# a_unique is only the keys so far
a_unique = dict((k, a[k]) for k in a_unique)
# now it's a dict of slice -> location(s)
print a_unique # {"aa": 0} or {"aa": [0]}
# (depending on which slices function used)
In a test script closer to your conditions, using randomly generated words of 5m characters and a slice length of 20, memory usage was so high that my test script quickly hit my 1G main memory limit and started thrashing virtual memory. At that point Python spent very little time on the CPU and I killed it. After reducing either the slice length or the word length to fit within main memory (I used completely random words, which reduces duplicates and increases memory use), it ran in under a minute. This situation plus O(n**2) in your original code will take forever, and is why algorithmic time and space complexity are both important.
import operator
import random
import string
def slices(seq, length):
    unique = {}
    for start in xrange(len(seq) - length, -1, -1):
        unique[seq[start:start+length]] = start
    return unique
def sample_with_repeat(population, length, choice=random.choice):
    return "".join(choice(population) for _ in xrange(length))
word_length = 5*1000*1000
words = [sample_with_repeat(string.lowercase, word_length) for _ in xrange(3)]
slice_length = 20
words_slices_sets = [set(slices(x, slice_length)) for x in words]
unique_words_slices = [reduce(operator.sub,
                              (x for x in words_slices_sets if x is not n),
                              n)
                       for n in words_slices_sets]
print [len(x) for x in unique_words_slices]

You say you have a "word" 5 million characters long, but I find it hard to believe this is a word in the usual sense.
If you can provide more information about your input data, a specific solution might be available.
For example, English text (or any other written language) might be sufficiently repetitive that a trie would be useable. In the worst case however, it would run out of memory constructing all 256^20 keys. Knowing your inputs makes all the difference.
edit
I took a look at some genome data to see how this idea stacked up, using a hardcoded [acgt]->[0123] mapping and 4 children per trie node.
adenovirus 2: 35,937bp -> 35,899 distinct 20-base sequences using 469,339 trie nodes
enterobacteria phage lambda: 48,502bp -> 40,921 distinct 20-base sequences using 529,384 trie nodes.
I didn't get any collisions, either within or between the two data sets, although maybe there is more redundancy and/or overlap in your data. You'd have to try it to see.
If you do get a useful number of collisions, you could try walking the three inputs together, building a single trie, recording the origin of each leaf and pruning collisions from the trie as you go.
If you can't find some way to prune the keys, you could try using a more compact representation. For example you only need 2 bits to store [acgt]/[0123], which might save you space at the cost of slightly more complex code.
I don't think you can just brute force this though - you need to find some way to reduce the scale of the problem, and that depends on your domain knowledge.
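To make the single-trie idea more concrete, here is a minimal sketch using nested dicts. It is not the code used for the measurements above; the names add_slices, unique_slices and ORIGINS are made up for illustration, and a dict-per-node trie in Python is memory-hungry, so it only demonstrates the bookkeeping of recording each leaf's origin.

```python
ORIGINS = "origins"  # longer than one character, so it cannot clash with a base key

def add_slices(trie, word_id, seq, length):
    """Insert every slice of the given length, recording which word it came from."""
    for start in range(len(seq) - length + 1):
        node = trie
        for base in seq[start:start + length]:
            node = node.setdefault(base, {})
        node.setdefault(ORIGINS, set()).add(word_id)

def unique_slices(trie, word_id, prefix=""):
    """Return the slices whose leaf was reached only by word_id."""
    found = []
    for key, child in trie.items():
        if key == ORIGINS:
            if child == {word_id}:
                found.append(prefix)
        else:
            found.extend(unique_slices(child, word_id, prefix + key))
    return found

trie = {}
for word_id, word in enumerate(["aab", "abb", "abc"]):
    add_slices(trie, word_id, word, 2)
print(unique_slices(trie, 0))  # ['aa']
```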

Let me build off Roger Pate's answer. If memory is an issue, I'd suggest that instead of using the strings as the keys to the dictionary, you could use a hashed value of the string. This would save the cost of storing the extra copy of the strings as the keys (at worst, 20 times the storage of an individual "word").
import collections
def hashed_slices(seq, length, hasher=hash):
    # hasher defaults to the built-in hash so the function is callable without one
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[hasher(seq[start:start+length])].append(start)
    return unique
(If you really want to get fancy, you can use a rolling hash, though you'll need to change the function.)
Now, we can combine all the hashes :
unique = [] # Unique words in first string
# create a dictionary of hash values -> word index -> start position
hashed_starts = [hashed_slices(word, 20, hashing_fcn) for word in words]
all_hashed = collections.defaultdict(dict)
for i, hashed in enumerate(hashed_starts) :
    for h, starts in hashed.iteritems() :
        # We only care about the first word
        if h in hashed_starts[0] :
            all_hashed[h][i]=starts
# Now check all hashes
for starts_by_word in all_hashed.itervalues() :
    if len(starts_by_word) == 1 :
        # if there's only one word for the hash, it's obviously valid
        # (that word is always word 0, because of the filter above)
        unique.extend(words[0][j:j+20] for j in starts_by_word[0])
    else :
        # we might have a hash collision
        candidates = {}
        for word_idx, starts in starts_by_word.iteritems() :
            candidates[word_idx] = set(words[word_idx][j:j+20] for j in starts)
        # Now that we have the candidate slices, find the unique ones
        valid = candidates[0]
        for word_idx, candidate_set in candidates.iteritems() :
            if word_idx != 0 :
                valid -= candidate_set
        unique.extend(valid)
(I tried extending it to do all three. It's possible, but the complications would detract from the algorithm.)
Be warned, I haven't tested this. Also, there's probably a lot you can do to simplify the code, but the algorithm makes sense. The hard part is choosing the hash. Too many collisions and you won't gain anything. Too few and you'll hit the memory problems. If you are dealing with just DNA base codes, you can hash the 20-character string to a 40-bit number, and still have no collisions. So the slices will take up nearly a fourth of the memory. That would save roughly 250 MB of memory in Roger Pate's answer.
The code is still O(N^2), but the constant should be much lower.
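As an illustration of the 40-bit idea for DNA, here is a minimal sketch of a collision-free encoding at two bits per base, computed in rolling fashion. The names BASE_CODES, encode_dna and rolling_encoded_slices are made up for illustration, and the sketch assumes the input contains only lowercase a, c, g and t.

```python
BASE_CODES = {"a": 0, "c": 1, "g": 2, "t": 3}  # assumption: lowercase DNA alphabet only

def encode_dna(piece):
    """Pack a DNA string into an integer, two bits per base (20 bases -> 40 bits)."""
    value = 0
    for base in piece:
        value = (value << 2) | BASE_CODES[base]
    return value

def rolling_encoded_slices(seq, length):
    """Yield (start, code) for every slice, updating the code in O(1) per step."""
    if len(seq) < length:
        return
    mask = (1 << (2 * length)) - 1  # keeps only the last `length` bases
    code = encode_dna(seq[:length])
    yield 0, code
    for start in range(1, len(seq) - length + 1):
        incoming = BASE_CODES[seq[start + length - 1]]
        code = ((code << 2) | incoming) & mask
        yield start, code

print(list(rolling_encoded_slices("acgta", 4)))  # [(0, 27), (1, 108)]
```

Because the mapping is one-to-one, two slices get the same code only if they are the same string, so the collision-handling branch would not be needed with this particular hasher.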

Let's attempt to improve on Roger Pate's excellent answer.
Firstly, let's keep sets instead of dictionaries - they manage uniqueness anyway.
Secondly, since we are likely to run out of memory faster than we run out of CPU time (and patience), we can sacrifice CPU efficiency for the sake of memory efficiency. So perhaps try only the 20s starting with one particular letter. For DNA, this cuts the requirements down by 75%.
seqlen = 20
maxlength = max([len(word) for word in words])
for startletter in letters:
    for letterid in range(maxlength):
        # enumerate supplies the word id alongside each word
        for wordid,word in enumerate(words):
            if (letterid < len(word)):
                letter = word[letterid]
                # compare by value; identity (is) is not reliable for strings
                if letter == startletter:
                    seq = word[letterid:letterid+seqlen]
                    if seq in seqtrie and not wordid in seqtrie[seq]:
                        seqtrie[seq].append(wordid)
Or, if that's still too much memory, we can go through for each possible starting pair (16 passes instead of 4 for DNA), or every 3 (64 passes) etc.
