Finding the most similar numbers across multiple lists in Python

In Python, I have 3 lists of floating-point numbers (angles), in the range 0-360, and the lists are not the same length. I need to find the triplet (with 1 number from each list) in which the numbers are the closest. (It's highly unlikely that any of the numbers will be identical, since this is real-world data.) I was thinking of using a simple lowest-standard-deviation method to measure agreement, but I'm not sure of a good way to implement this. I could loop through each list, comparing the standard deviation of every possible combination using nested for loops, and have a temporary variable save the indices of the triplet that agrees the best, but I was wondering if anyone had a better or more elegant way to do something like this. Thanks!

I wouldn't be surprised if there is an established algorithm for doing this, and if so, you should use it. But I don't know of one, so I'm going to speculate a little.
If I had to do it, the first thing I would try would be just to loop through all possible combinations of all the numbers and see how long it takes. If your data set is small enough, it's not worth the time to invent a clever algorithm. To demonstrate the setup, I'll include the sample code:
# setup
import collections
import itertools
import statistics

def distance(nplet):
    '''Takes a pair or triplet (an "n-plet") as a list, and returns its distance.
    A smaller return value means better agreement.'''
    # your choice of implementation here. Example: the (population) variance
    return statistics.pvariance(nplet)

# algorithm
def brute_force(*lists):
    return min(itertools.product(*lists), key=distance)
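For example, calling it on three short lists (a minimal sketch; the angle values here are made up):
angles_a = [10.2, 179.8, 355.1]
angles_b = [12.7, 200.4, 310.0]
angles_c = [11.5, 90.3, 271.6]

print(brute_force(angles_a, angles_b, angles_c))  # (10.2, 12.7, 11.5) for these values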
For a large data set, I would try something like this: first create one triplet for each number in the first list, with its first entry set to that number. Then go through this list of partially-filled triplets and for each one, pick the number from the second list that is closest to the number from the first list and set that as the second member of the triplet. Then go through the list of triplets and for each one, pick the number from the third list that is closest to the first two numbers (as measured by your agreement metric). Finally, take the best of the bunch. This sample code demonstrates how you could try to keep the runtime linear in the length of the lists.
def item_selection(listA, listB, listC):
    # make the list of partially-filled triplets
    # (assumes listA and listB are sorted, as noted below)
    triplets = [[a] for a in listA]
    iT = 0
    iB = 0
    while iT < len(triplets):
        # make iB the index of the first value in listB that is >= triplets[iT][0]
        while iB < len(listB) and listB[iB] < triplets[iT][0]:
            iB += 1
        if iB == 0:
            triplets[iT].append(listB[0])
        elif iB == len(listB):
            triplets[iT].append(listB[-1])
        else:
            # look at the values in listB just below and just above triplets[iT][0]
            # and add the closer one as the second member of the triplet
            dist_lower = distance([triplets[iT][0], listB[iB - 1]])
            dist_upper = distance([triplets[iT][0], listB[iB]])
            if dist_lower < dist_upper:
                triplets[iT].append(listB[iB - 1])
            elif dist_lower > dist_upper:
                triplets[iT].append(listB[iB])
            else:
                # if they are equidistant, add both: the lower value goes in this
                # triplet and a second partial triplet is inserted for the upper value
                triplets[iT].append(listB[iB - 1])
                iT += 1
                triplets[iT:iT] = [[triplets[iT - 1][0], listB[iB]]]
        iT += 1
    # then another loop while iT < len(triplets) to add in the numbers from listC
    return min(triplets, key=distance)
The thing is, I can imagine situations where this wouldn't actually find the best triplet, for instance if a number from the first list is close to one from the second list but not at all close to anything in the third list. So something you could try is to run this algorithm for all 6 possible orderings of the lists. I can't think of a specific situation where that would fail to find the best triplet, but one might still exist. In any case the algorithm will still be O(N) if you use a clever implementation, assuming the lists are sorted.
def symmetrized_item_selection(listA, listB, listC):
    best_results = []
    for ordering in itertools.permutations([listA, listB, listC]):
        best_results.append(item_selection(*ordering))
    return min(best_results, key=distance)
Another option might be to compute all possible pairs of numbers between list 1 and list 2, between list 1 and list 3, and between list 2 and list 3. Then sort all three lists of pairs together, from best to worst agreement between the two numbers. Starting with the closest pair, go through the list pair by pair and any time you encounter a pair which shares a number with one you've already seen, merge them into a triplet. For a suitable measure of agreement, once you find your first triplet, that will give you a maximum pair distance that you need to iterate up to, and once you get up to it, you just choose the closest triplet of the ones you've found. I think that should consistently find the best possible triplet, but it will be O(N^2 log N) because of the requirement for sorting the lists of pairs.
def pair_sorting(listA, listB, listC):
    # make all possible pairs of values from two lists
    # each pair has the structure ((number, origin_list), (number, origin_list))
    # so we know which lists the numbers came from
    all_pairs = []
    all_pairs += [((nA, 0), (nB, 1)) for (nA, nB) in itertools.product(listA, listB)]
    all_pairs += [((nA, 0), (nC, 2)) for (nA, nC) in itertools.product(listA, listC)]
    all_pairs += [((nB, 1), (nC, 2)) for (nB, nC) in itertools.product(listB, listC)]
    all_pairs.sort(key=lambda p: distance([p[0][0], p[1][0]]))
    # make a dict to track which (number, origin_list)s we've already seen
    pairs_by_number_and_list = collections.defaultdict(list)
    min_distance = float('inf')
    min_triplet = None
    # start with the closest pair
    for pair in all_pairs:
        # for the first value of the current pair, see if we've seen that particular
        # (number, origin_list) combination before
        for pair2 in pairs_by_number_and_list[pair[0]]:
            # if so, the current pair shares its first value with another pair,
            # so put the 3 unique values together to make a triplet
            this_triplet = (pair[1][0], pair2[0][0], pair2[1][0])
            # check if the triplet agrees more than the previous best triplet
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # do the same thing but checking the second element of the current pair
        for pair2 in pairs_by_number_and_list[pair[1]]:
            this_triplet = (pair[0][0], pair2[0][0], pair2[1][0])
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # finally, add the current pair to the record of pairs we've seen
        pairs_by_number_and_list[pair[0]].append(pair)
        pairs_by_number_and_list[pair[1]].append(pair)
    return min_triplet
N.B. I've written all the code samples in this answer out a little more explicitly than you'd do it in practice to help you to understand how they work. But when doing it for real, you'd use more list comprehensions and such things.
N.B.2. No guarantees that the code works :-P but it should get the rough idea across.

Related

Merge sorting algorithm in Python for two sorted lists - trouble constructing for-loop

I'm trying to create an algorithm to merge two ordered lists into a larger ordered list in Python. Essentially I began by trying to isolate the minimum element in each list and then compared them to see which was smaller, because that number would be the smallest in the larger list as well. I then appended that element to the empty larger list and deleted it from the original list it came from. I then tried to loop through the original two lists doing the same thing. Inside the "if" statements, I've essentially tried to make the function append the remainder of one list to the larger list if the other is or becomes empty, because at that point there would be no reason to ask which elements between the two lists are smaller.
def merge_cabs(cab1, cab2):
    for (i <= all(j) for j in cab1):
        for (k <= all(l) for l in cab2):
            if cab1 == []:
                newcab.append(cab2)
            if cab2 == []:
                newcab.append(cab1)
            else:
                k = min(min(cab1), min(cab2))
                newcab.append(k)
                if min(cab1) < min(cab2):
                    cab1.remove(min(cab1))
                if min(cab2) < min(cab1):
                    cab2.remove(min(cab2))
    print(newcab)
cab1 = [1,2,5,6,8,9]
cab2 = [3,4,7,10,11]
newcab = []
merge_cabs(cab1, cab2)
I've had a bit of trouble constructing the for-loop unfortunately. One way I've tried to isolate the minimum values was as I wrote in the two "for" lines. Right now, Python is returning "SyntaxError: invalid syntax," pointing to the colon in the first "for" line. Another way I've tried to construct the for-loop was like this:
def merge_cabs(cabs1, cabs2):
    for min(i) in cab1:
        for min(j) in cab2:
I've also tried to write the expression all in one line like this:
def merge_cabs(cab1, cab2):
    for min(i) in cabs1 and min(j) in cabs2:
and to loop through a copy of the original lists rather than the lists themselves, because searching the site I found that it can be difficult to remove elements from a list you're looping over. I've also tried wrapping the expressions after the "for" statements in various configurations of parentheses. If someone sees where the problem lies, it would be great if you could point it out, and any other observations that could help me better construct this function would be much appreciated too.
Here's a very simple-minded solution to this that uses only very basic Python operations:
def merge_cabs(cab1, cab2):
    len1 = len(cab1)
    len2 = len(cab2)
    i = 0
    j = 0
    newcab = []
    while i < len1 and j < len2:
        v1 = cab1[i]
        v2 = cab2[j]
        if v1 <= v2:
            newcab.append(v1)
            i += 1
        else:
            newcab.append(v2)
            j += 1
    while i < len1:
        newcab.append(cab1[i])
        i += 1
    while j < len2:
        newcab.append(cab2[j])
        j += 1
    return newcab
Things to keep in mind:
You should not have any nested loops. Merging two sorted lists is typically used to implement a merge sort, and the merge step should be linear. I.e., the algorithm should be O(n).
You need to walk both lists together, choosing the smallest value at each step, and advancing only the list that contains the smallest value. When one of the lists is consumed, the remaining elements from the unconsumed list are simply appended in order.
You should not be calling min or max etc. in your loop, since that will effectively introduce a nested loop, turning the merge into an O(n**2) algorithm, which ignores the fact that the lists are known to be sorted.
Similarly, you should not be calling any external sort function to do the merge, since that will result in an O(n*log(n)) merge (or worse, depending on the sort algorithm), and again ignores the fact that the lists are known to be sorted.
Firstly, there's a function in the (standard library) heapq module for doing exactly this, heapq.merge; if this is a real problem (rather than an exercise), you want to use that one instead.
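For reference, a minimal usage sketch of heapq.merge, using the lists from the question (it returns an iterator, so wrap it in list() if you need a list):
import heapq

cab1 = [1, 2, 5, 6, 8, 9]
cab2 = [3, 4, 7, 10, 11]
print(list(heapq.merge(cab1, cab2)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]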
If this is an exercise, there are a couple of points:
You'll need to use a while loop rather than a for loop:
while cab1 or cab2:
This will keep repeating the body while there are any items in either of your source lists.
You probably shouldn't delete items from the source lists; that's a relatively expensive operation. In addition, on balance, having a merge function destroy its arguments would be unexpected.
Within the loop you'll refer to cab1[i1] and cab2[i2] (and, in the condition, to i1 < len(cab1)).
(By the time I typed out the explanation, Tom Karzes typed out the corresponding code in another answer...)

Re-generate a random index until the indexed element meets the condition n times in Python

I know there have been a few similar questions regarding loop and random numbers, but I can't seem to find a solution for my problem.
Say I have a fixed list of numbers from my dataset and a threshold that the number has to meet:
x = (7,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41)
threshold= 25
I need to randomly pick a number from this list. Unfortunately, I cannot just loop over my original list directly; I'm forced to randomly pick an index of the list first and then look up my number. So, for example, if I randomly generate the index 1, I get x[1], which is 11.
The final result I need is to find at least 3 numbers that are greater than the threshold, put all the resulting numbers in a list, and then my loop can stop. (The indexes cannot repeat.)
As an example, a possible final result would be (27, 29, 31) (the result can be in any format). I'm thinking of something along these lines to start but really need help to proceed.
Unless you are particularly concerned about the memory usage of creating an additional filtered list, the simplest would probably be to start by doing this:
filtered = [i for i in x if i > threshold]
You can then choose three samples from this filtered list (after import random). The following will potentially choose the same item more than once:
random.choices(filtered, k=3)
or if you want to avoid choosing the same item more than once:
random.sample(filtered, k=3)
Each of the above functions will output a list. Use tuple(....) on the output if you need to convert it to a tuple.
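Putting those pieces together, a minimal end-to-end sketch using the x and threshold from the question:
import random

x = (7, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41)
threshold = 25

filtered = [i for i in x if i > threshold]    # keep only values above the threshold
result = tuple(random.sample(filtered, k=3))  # three distinct picks
print(result)                                 # e.g. (31, 27, 41)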
First, a clarification: do you need to pick a random element from the list each iteration, or do you need to pick a different random element each time? I.e., can the same index be picked twice? You're doing the latter.
Second, you want to use range(len(x)). You don't want to hardwire the length of x into your code, and you want index 0 to be a possibility. random.shuffle() may be a better choice.
Lastly, you want to do something like:
result = []
for ....
    if select >= threshold:
        result.append(select)
        if len(result) >= 3: break
If we assume the following constraints:
We are not allowed to loop over the original list (including list comprehension)
We are only allowed to access one member of the original list at a time through its index
We must pick 3 distinct members of the list that are greater or equal to the threshold
The following code should satisfy all of them:
import random

x = [7, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41]
threshold = 25
result_index = []
while len(result_index) < 3:
    index = random.randrange(len(x))
    if x[index] >= threshold and index not in result_index:
        result_index.append(index)
result = [x[a] for a in result_index]
Here is how this works:
In the loop, we store indices, not the numbers themselves.
For each index we check 2 conditions: the number there is greater than or equal to the threshold, and we haven't seen this index before.
If the conditions are satisfied, we save the index, not the number!
Repeat until we have 3 indices.
Build new list by getting numbers from those indices directly.

Keeping count of values available from among multiple sets

I have the following situation:
I am generating n combinations of size 3, made from n values. The kth combination [0...n] is pulled from a pool of values located at the kth index of a list of n sets. Each value can appear 3 times. So if I have 10 values, then I have a list of size 10, and each index holds a set of the values 0-10.
So, it seems to me that a good way to do this is to have something keeping count of all the available values from among all the sets. Then, if a value is rare (let's say there is only 1 left), a structure where I could look up the rarest value and have it tell me which index it is located at would make generating the possible combinations much easier.
How could I do this? What about one structure to keep count of elements, and a dictionary to keep track of list indices that contain the value?
edit: I guess I should add that the specific problem I am looking to solve here is how to update the set for every index of the list (or whatever other structures I end up using), so that when I use a value 3 times, it is made unavailable for every other combination.
Thank you.
Another edit
It seems that this may be a little too abstract to be asking for solutions when it's hard to understand what I am even asking for. I will come back with some code soon, please check back in 1.5-2 hours if you are interested.
how to update the set for every index of the list (or whatever other structures I end up using), so that when I use a value 3 times, it is made unavailable for every other combination.
I assume you want to sample the values truly randomly, right? What if you put 3 of each value into a list, shuffle it with random.shuffle, and then just keep popping values from the end of the list when you're building your combination? If I'm understanding your problem right, here's example code:
from random import shuffle

valid_values = [i for i in range(10)]  # the valid values are 0 through 9 in my example, update accordingly for yours
vals = 3 * valid_values  # I have 3 of each valid value
shuffle(vals)  # randomly shuffle them
while len(vals) != 0:
    combination = (vals.pop(), vals.pop(), vals.pop())  # combinations are 3 values?
    print(combination)
EDIT: Updated code based on the added information that you have sets of values (but this still assumes you can use more than one value from a given set):
from random import shuffle

my_sets_of_vals = [......]  # list of sets
valid_values = list()
for i in range(len(my_sets_of_vals)):
    for val in my_sets_of_vals[i]:
        valid_values.append((i, val))  # this can probably be done with a list comprehension
vals = 3 * valid_values  # I have 3 of each valid value
shuffle(vals)  # randomly shuffle them
while len(vals) != 0:
    combination = (vals.pop()[1], vals.pop()[1], vals.pop()[1])  # combinations are 3 values?
    print(combination)
Based on the edit, you could make an object for each value. It could hold the number of times you have used the element along with the element itself. When you find you have used an element three times, remove it from the list.
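A minimal sketch of that idea (the names here are illustrative, not from the question): keep a count of remaining uses per value and drop a value once it has been used 3 times, so it becomes unavailable for every later combination.
from collections import Counter
import random

values = range(10)                           # hypothetical pool of values
remaining = Counter({v: 3 for v in values})  # each value may be used 3 times

def draw_combination(k=3):
    """Pick k distinct values that still have uses left and decrement their counts."""
    available = [v for v, count in remaining.items() if count > 0]
    if len(available) < k:
        return None
    picked = random.sample(available, k)
    for v in picked:
        remaining[v] -= 1
        if remaining[v] == 0:
            del remaining[v]                 # now unavailable for every other combination
    return tuple(picked)

combo = draw_combination()
while combo is not None:
    print(combo)
    combo = draw_combination()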

Compare DB row values efficiently

I want to loop through a database of documents and calculate a pairwise comparison score.
A simplistic, naive method would nest a loop within another loop. This would result in the program comparing each pair of documents twice and also comparing each document to itself.
Is there a name for the algorithm for doing this task efficiently?
Is there a name for this approach?
Thanks.
Assume all items have a number ItemNumber
Simple solution -- always have the 2nd element's ItemNumber greater than the first item.
eg
for (firstitem = 1 to maxitemnumber)
    for (seconditem = firstitem + 1 to maxitemnumber)
        compare(firstitem, seconditem)
visual note: if you think of the compare as a matrix (item number of one on one axis, item number of the other on the other axis), this looks at one of the triangles.
........
x.......
xx......
xxx.....
xxxx....
xxxxx...
xxxxxx..
xxxxxxx.
I don't think it's complicated enough to qualify for a name.
You can avoid duplicate pairs just by forcing a comparison on any value which might be different between different rows - the primary key is an obvious choice, e.g.
Unique pairings:
SELECT a.item as a_item, b.item as b_item
FROM table AS a, table AS b
WHERE a.id<b.id
Potentially there are a lot of ways in which the comparison operation can be used to generate data summaries and therefore identify potentially similar items - for single words the soundex is an obvious choice - however you don't say what your comparison metric is.
C.
You can keep track of which documents you have already compared, e.g. (with numbers ;))
compared = set()
for i in [1, 2, 3]:
    for j in [1, 2, 3]:
        pair = frozenset((i, j))
        if i != j and pair not in compared:
            compared.add(pair)
            compare(i, j)
Another idea would be to create the combinations of documents first and iterate over them. But in order to generate these, you have to iterate over both lists, and then you iterate over the result list again, so I don't think it has any advantage.
Update:
If you have the documents already in a list, then Hogan's answer is indeed better. But I think it needs a better example:
docs = [1, 2, 3]
n = len(docs)
for i in range(n):
    for j in range(i + 1, n):
        compare(docs[i], docs[j])
Something like this?
src = [1, 2, 3]
for i, x in enumerate(src):
    for y in src[i + 1:]:
        compare(x, y)
Or you might wish to generate a list of pairs instead:
pairs = [(x, y) for i, x in enumerate(src) for y in src[i + 1:]]
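Incidentally, the standard library already provides this iteration pattern; a minimal sketch (compare is a placeholder for your real scoring function, as in the loops above):
import itertools

src = [1, 2, 3]
for x, y in itertools.combinations(src, 2):
    compare(x, y)  # same unique, self-free pairs as the nested loops above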

Python, Huge Iteration Performance Problem

I'm iterating through 3 words, each about 5 million characters long, and I want to find sequences of 20 characters that identify each word. That is, I want to find all sequences of length 20 in one word that are unique to that word. My problem is that the code I've written takes an extremely long time to run. I've never even completed one word running my program overnight.
The function below takes a list of dictionaries, where each dictionary contains every possible sequence of length 20 and its location, from one of the 5-million-character words.
If anybody has an idea how to optimize this I would be really thankful, I don't have a clue how to continue...
here's a sample of my code:
def findUnique(list):
    # Takes a list with dictionaries and compares each element in the dictionaries
    # with the others and puts all unique elements in new dictionaries and finally
    # puts the new dictionaries in a list.
    # The result is a list with (in this case) 3 dictionaries containing all unique
    # sequences and their locations from each string.
    dicList = []
    listlength = len(list)
    s = 0
    valuelist = []
    for i in list:
        j = i.values()
        valuelist.append(j)
    while s < listlength:
        currdic = list[s]
        dic = {}
        for key in currdic:
            currval = currdic[key]
            test = True
            n = 0
            while n < listlength:
                if n != s:
                    if currval in valuelist[n]:  # this is where it takes too much time
                        n = listlength
                        test = False
                    else:
                        n += 1
                else:
                    n += 1
            if test:
                dic[key] = currval
        dicList.append(dic)
        s += 1
    return dicList
def slices(seq, length, prefer_last=False):
    unique = {}
    if prefer_last:  # this doesn't have to be a parameter, just choose one
        for start in xrange(len(seq) - length + 1):
            unique[seq[start:start+length]] = start
    else:  # prefer first
        for start in xrange(len(seq) - length, -1, -1):
            unique[seq[start:start+length]] = start
    return unique

# or find all locations for each slice:
import collections

def slices(seq, length):
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[seq[start:start+length]].append(start)
    return unique
This function (currently in my iter_util module) is O(n) (n being the length of each word) and you would use set(slices(..)) (with set operations such as difference) to get slices unique across all words (example below). You could also write the function to return a set, if you don't want to track locations. Memory usage will be high (though still O(n), just a large factor), possibly mitigated (though not by much if length is only 20) with a special "lazy slice" class that stores the base sequence (the string) plus start and stop (or start and length).
Printing unique slices:
a = set(slices("aab", 2)) # {"aa", "ab"}
b = set(slices("abb", 2)) # {"ab", "bb"}
c = set(slices("abc", 2)) # {"ab", "bc"}
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (x for x in all if x is not a), a)
print a_unique # {"aa"}
Including locations:
a = slices("aab", 2)
b = slices("abb", 2)
c = slices("abc", 2)
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (set(x) for x in all if x is not a), set(a))
# a_unique is only the keys so far
a_unique = dict((k, a[k]) for k in a_unique)
# now it's a dict of slice -> location(s)
print a_unique # {"aa": 0} or {"aa": [0]}
# (depending on which slices function used)
In a test script closer to your conditions, using randomly generated words of 5m characters and a slice length of 20, memory usage was so high that my test script quickly hit my 1G main memory limit and started thrashing virtual memory. At that point Python spent very little time on the CPU and I killed it. Reducing either the slice length or word length (since I used completely random words that reduces duplicates and increases memory use) to fit within main memory and it ran under a minute. This situation plus O(n**2) in your original code will take forever, and is why algorithmic time and space complexity are both important.
import operator
import random
import string

def slices(seq, length):
    unique = {}
    for start in xrange(len(seq) - length, -1, -1):
        unique[seq[start:start+length]] = start
    return unique

def sample_with_repeat(population, length, choice=random.choice):
    return "".join(choice(population) for _ in xrange(length))

word_length = 5*1000*1000
words = [sample_with_repeat(string.lowercase, word_length) for _ in xrange(3)]
slice_length = 20
words_slices_sets = [set(slices(x, slice_length)) for x in words]
unique_words_slices = [reduce(operator.sub,
                              (x for x in words_slices_sets if x is not n),
                              n)
                       for n in words_slices_sets]
print [len(x) for x in unique_words_slices]
You say you have a "word" 5 million characters long, but I find it hard to believe this is a word in the usual sense.
If you can provide more information about your input data, a specific solution might be available.
For example, English text (or any other written language) might be sufficiently repetitive that a trie would be useable. In the worst case however, it would run out of memory constructing all 256^20 keys. Knowing your inputs makes all the difference.
edit
I took a look at some genome data to see how this idea stacked up, using a hardcoded [acgt]->[0123] mapping and 4 children per trie node.
adenovirus 2: 35,937bp -> 35,899 distinct 20-base sequences using 469,339 trie nodes
enterobacteria phage lambda: 48,502bp -> 40,921 distinct 20-base sequences using 529,384 trie nodes.
I didn't get any collisions, either within or between the two data sets, although maybe there is more redundancy and/or overlap in your data. You'd have to try it to see.
If you do get a useful number of collisions, you could try walking the three inputs together, building a single trie, recording the origin of each leaf and pruning collisions from the trie as you go.
If you can't find some way to prune the keys, you could try using a more compact representation. For example you only need 2 bits to store [acgt]/[0123], which might save you space at the cost of slightly more complex code.
I don't think you can just brute force this though - you need to find some way to reduce the scale of the problem, and that depends on your domain knowledge.
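To make the trie idea concrete, here is a minimal sketch of inserting 20-base sequences into a trie with 4 children per node while recording each sequence's origin (the node layout is illustrative, not the code used for the counts above):
BASE_INDEX = {"a": 0, "c": 1, "g": 2, "t": 3}

def new_node():
    return [None, None, None, None, set()]  # one child per base, plus a set of origins

def trie_insert(root, seq, origin):
    node = root
    for base in seq:
        i = BASE_INDEX[base]
        if node[i] is None:
            node[i] = new_node()
        node = node[i]
    node[4].add(origin)  # more than one origin at a leaf means a collision between words

root = new_node()
trie_insert(root, "acgtacgtacgtacgtacgt", origin=0)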
Let me build off Roger Pate's answer. If memory is an issue, I'd suggest that instead of using the strings as the keys to the dictionary, you could use a hashed value of the string. This would save the cost of storing the extra copy of the strings as the keys (at worst, 20 times the storage of an individual "word").
import collections

def hashed_slices(seq, length, hasher=None):
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[hasher(seq[start:start+length])].append(start)
    return unique
(If you really want to get fancy, you can use a rolling hash, though you'll need to change the function.)
Now, we can combine all the hashes:
unique = []  # unique slices from the first word
# create a dictionary of hash value -> word index -> start positions
hashed_starts = [hashed_slices(word, 20, hashing_fcn) for word in words]
all_hashed = collections.defaultdict(dict)
for i, hashed in enumerate(hashed_starts):
    for h, starts in hashed.iteritems():
        # we only care about hashes that occur in the first word
        if h in hashed_starts[0]:
            all_hashed[h][i] = starts
# now check all hashes
for starts_by_word in all_hashed.itervalues():
    if len(starts_by_word) == 1:
        # if there's only one word for the hash, it's obviously valid
        unique.extend(words[0][i:i+20] for starts in starts_by_word.values() for i in starts)
    else:
        # we might have a hash collision
        candidates = {}
        for word_idx, starts in starts_by_word.iteritems():
            candidates[word_idx] = set(words[word_idx][j:j+20] for j in starts)
        # now that we have the candidate slices, find the unique ones
        valid = candidates[0]
        for word_idx, candidate_set in candidates.iteritems():
            if word_idx != 0:
                valid -= candidate_set
        unique.extend(valid)
(I tried extending it to do all three. It's possible, but the complications would detract from the algorithm.)
Be warned, I haven't tested this. Also, there's probably a lot you can do to simplify the code, but the algorithm makes sense. The hard part is choosing the hash. Too many collisions and you won't gain anything. Too few and you'll hit the memory problems. If you are dealing with just DNA base codes, you can hash the 20-character string to a 40-bit number and still have no collisions. So the slices will take up nearly a fourth of the memory. That would save roughly 250 MB of memory in Roger Pate's answer.
The code is still O(N^2), but the constant should be much lower.
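As an illustration of the 40-bit idea (2 bits per base times 20 bases; a sketch, not part of the answer's code):
BASE_BITS = {"a": 0b00, "c": 0b01, "g": 0b10, "t": 0b11}

def encode_20mer(seq):
    """Pack a 20-character DNA slice into a 40-bit integer; collision-free over acgt."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_BITS[base]
    return value

print(hex(encode_20mer("acgtacgtacgtacgtacgt")))  # 0x1b1b1b1b1b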
Let's attempt to improve on Roger Pate's excellent answer.
Firstly, let's keep sets instead of dictionaries - they manage uniqueness anyway.
Secondly, since we are likely to run out of memory faster than we run out of CPU time (and patience), we can sacrifice CPU efficiency for the sake of memory efficiency. So perhaps try only the 20-character sequences starting with one particular letter. For DNA, this cuts the requirements down by 75%.
import collections

seqlen = 20
letters = "acgt"                         # e.g. for DNA
seqtrie = collections.defaultdict(list)  # sequence -> list of word ids containing it
maxlength = max([len(word) for word in words])
for startletter in letters:
    for letterid in range(maxlength):
        for wordid, word in enumerate(words):
            if letterid < len(word):
                letter = word[letterid]
                if letter == startletter:
                    seq = word[letterid:letterid+seqlen]
                    if wordid not in seqtrie[seq]:
                        seqtrie[seq].append(wordid)
Or, if that's still too much memory, we can go through for each possible starting pair (16 passes instead of 4 for DNA), or every 3 (64 passes) etc.
