Performance issues in Burrows-Wheeler in python

Performance issues in Burrows-Wheeler in python - python

I was trying to implement Burrows-Wheeler transform in python. (This is one of the assignments in online course, but I hope I have done some work to be qualified to ask for help).
The algorithm works as follows. Take a string which ends with a special character ($ in my case) and create all cyclic strings from this string. Sort all these strings alphabetically, having a special character always less then any other character. After this get the last element of each string.
This gave me a oneliner:
''.join([i[-1] for i in sorted([text[i:] + text[0:i] for i in xrange(len(text))])]
Which is correct and reasonably fast for reasonably big strings (which is enough to solve the problem):
60 000 chars - 16 secs
40 000 chars - 07 secs
25 000 chars - 02 secs
But when I tried to process a really huge string with few millions of chars, I failed (it takes too much time to process).
I assume that the problem is with storing too many strings in the memory.
Is there any way to overcome this?
P.S. just want to point out that also this might look like a homework problem, my solution already passes the grader and I am just looking for a way to make it faster. Also I am not spoiling the fun for other people, because if they would like to find solution, wiki article has one which is similar to mine. I also checked this question which sounds similar but answers a harder question, how to decode the string coded with this algorithm.

It takes a long time to make all those string slices with long strings. It's at least O(N^2) (since you create N strings of N length, and each one has to be copied into memory taking its source data from the original), which destroys the overall performance and makes the sorting irrelevant. Not to mention the memory requirement!
Instead of actually slicing the string, the next thought is to order the i values you use to create the cyclic strings, in order of how the resulting string would compare - without actually creating it. This turns out to be somewhat tricky. (Removed/edited some stuff here that was wrong; please see #TimPeters' answer.)
The approach I've taken here is to bypass the standard library - which makes it difficult (though not impossible) to compare those strings 'on demand' - and do my own sorting. The natural choice of algorithm here is radix sort, since we need to consider the strings one character at a time anyway.
Let's get set up first. I am writing code for version 3.2, so season to taste. (In particular, in 3.3 and up, we could take advantage of yield from.) I am using the following imports:
from random import choice
from timeit import timeit
from functools import partial
I wrote a general-purpose radix sort function like this:
def radix_sort(values, key, step=0):
if len(values) < 2:
for value in values:
yield value
return
bins = {}
for value in values:
bins.setdefault(key(value, step), []).append(value)
for k in sorted(bins.keys()):
for r in radix_sort(bins[k], key, step + 1):
yield r
Of course, we don't need to be general-purpose (our 'bins' can only be labelled with single characters, and presumably you really mean to apply the algorithm to a sequence of bytes ;) ), but it doesn't hurt. Might as well have something reusable, right? Anyway, the idea is simple: we handle a base case, and then we drop each element into a "bin" according to the result from the key function, and then we pull values out of the bins in sorted bin order, recursively sorting each bin's contents.
The interface requires that key(value, n) gives us the nth "radix" of value. So for simple cases, like comparing strings directly, that could be a simple as lambda v, n: return v[n]. Here, though, the idea is to compare indices into the string, according to the data in the string at that point (considered cyclically). So let's define a key:
def bw_key(text, value, step):
return text[(value + step) % len(text)]
Now the trick to getting the right results is to remember that we're conceptually joining up the last characters of the strings we aren't actually creating. If we consider the virtual string made using index n, its last character is at index n - 1, because of how we wrap around - and a moment's thought will confirm to you that this still works when n == 0 ;) . [However, when we wrap forwards, we still need to keep the string index in-bounds - hence the modulo operation in the key function.]
This is a general key function that needs to be passed in the text to which it will refer when transforming the values for comparison. That's where functools.partial comes in - you could also just mess around with lambda, but this is arguably cleaner, and I've found it's usually faster, too.
Anyway, now we can easily write the actual transform using the key:
def burroughs_wheeler_custom(text):
return ''.join(text[i - 1] for i in radix_sort(range(len(text)), partial(bw_key, text)))
# Notice I've dropped the square brackets; this means I'm passing a generator
# expression to `join` instead of a list comprehension. In general, this is
# a little slower, but uses less memory. And the underlying code uses lazy
# evaluation heavily, so :)
Nice and pretty. Let's see how it does, shall we? We need a standard to compare it against:
def burroughs_wheeler_standard(text):
return ''.join([i[-1] for i in sorted([text[i:] + text[:i] for i in range(len(text))])])
And a timing routine:
def test(n):
data = ''.join(choice('abcdefghijklmnopqrstuvwxyz') for i in range(n)) + '$'
custom = partial(burroughs_wheeler_custom, data)
standard = partial(burroughs_wheeler_standard, data)
assert custom() == standard()
trials = 1000000 // n
custom_time = timeit(custom, number=trials)
standard_time = timeit(standard, number=trials)
print("custom: {} standard: {}".format(custom_time, standard_time))
Notice the math I've done to decide on a number of trials, inversely related to the length of the test string. This should keep the total time used for testing in a reasonably narrow range - right? ;) (Wrong, of course, since we established that the standard algorithm is at least O(N^2).)
Let's see how it does (*drumroll*):
>>> imp.reload(burroughs_wheeler)
<module 'burroughs_wheeler' from 'burroughs_wheeler.py'>
>>> burroughs_wheeler.test(100)
custom: 4.7095093091438684 standard: 0.9819262643716229
>>> burroughs_wheeler.test(1000)
custom: 5.532266880287807 standard: 2.1733253807396977
>>> burroughs_wheeler.test(10000)
custom: 5.954826800612864 standard: 42.50686064849015
Whoa, that's a bit of a frightening jump. Anyway, as you can see, the new approach adds a ton of overhead on short strings, but enables the actual sorting to be the bottleneck instead of string slicing. :)

Just adding a bit to #KarlKnechtel's spot-on response.
First, the "standard way" to speed cyclic-permutation extraction is just to paste two copies together and index directly into that. After:
N = len(text)
text2 = text * 2
then the cyclic permutation starting at index i is just text2[i: i+N], and character j in that permutation is just text2[i+j]. No need for pasting together two slices, or for modulus (%) operations.
Second, the builtin sort() can be used for this, although:
It's funky ;-)
For strings with few distinct characters (compared to the length of the string) Karl's radix sort will almost certainly be faster.
As proof-of-concept, here's a drop-in replacement for that part of Karl's code (although this sticks to Python 2):
def burroughs_wheeler_custom(text):
N = len(text)
text2 = text * 2
class K:
def __init__(self, i):
self.i = i
def __lt__(a, b):
i, j = a.i, b.i
for k in xrange(N): # use `range()` in Python 3
if text2[i+k] < text2[j+k]:
return True
elif text2[i+k] > text2[j+k]:
return False
return False # they're equal
inorder = sorted(range(N), key=K)
return "".join(text2[i+N-1] for i in inorder)
Note that the builtin sort()'s implementation computes the key exactly once for each element in its input, and does save those results for the duration of the sort. In this case, the results are lazy little K instances that just remember the starting index, and whose __lt__ method compares one character pair at a time until "less than!" or "greater than!" is resolved.

I agree with the previous answer, string/list slicing in python becomes a bottleneck when performing huge algorithmic computations. The idea is not slicing.
[EDIT: not also slicing, but list indexing. If you use array.array instead of lists, the execution time reduces to a half. Indexing arrays is straightforward, indexing lists is a more complicated process) ]
Here there is a more functional solution to your problem.
The idea, is having a generator the will act as a slicer (rslice). It's a similar idea to itertools.islice but it goes to the beginning of the string when it reaches the end. And it will stop before reaching the start position you specified when creating it. With this trick you are not copying any substrings in memory, so in the end you only have pointers
moving over your string without creating copies everywhere.
So we create a list containing [rslices,lastchar of the slice]
and we sort it using as key the rslice ( as you can see in cf sort function).
When it's sorted, you will only need to collect for each element in the list the second element (last element of the slice previously stored).
from itertools import izip
def cf(i1,i2):
for i,j in izip(i1[0](),i2[0]()): # We grab the the first element (is a lambda) and execute it to get the generator
if i<j: return -1
elif i>j: return 1
return 0
def rslice(cad,pos): # Slice that rotates through the string (it's a generator)
pini=pos
lc=len(cad)
while pos<lc:
yield cad[pos]
pos+=1
pos=0
while pos<pini-1:
yield cad[pos]
pos+=1
def lambdagen(start,cad): # Closure to hold a generator
return lambda: rslice(cad,start)
def bwt(txt):
lt=len(txt)
arry=list(txt)+[None]
l=[(lambdagen(0,arry),None)]+[(lambdagen(i,arry),arry[i-1]) for i in range(1,lt+1)]
# What we keep in the list is the generator for the rotating-slice, plus the
# last character of the slice, so we save the time of going through the whole
# string to get the last character
l.sort(cmp=cf) # We sort using our cf function
return [i[1] for i in l]
print bwt('Text I want to apply BTW to :D')
# ['D', 'o', 'y', 't', 'o', 'W', 't', 'I', ' ', ' ', ':', ' ', 'B', None, 'T', 'w', ' ',
# 'T', 'p', 'a', 't', 't', 'p', 'a', 'x', 'n', ' ', ' ', ' ', 'e', 'l']
EDIT: Using arrays (execution time reduced by 2):
def bwt(txt):
lt=len(txt)
arry=array.array('h',[ord(i) for i in txt])
arry.append(-1)
l=[(lambdagen(0,arry),None)]+[(lambdagen(i,arry),arry[i-1]) for i in range(1,lt+1)]
l.sort(cmp=cf)
return [i[1] for i in l]

Related

At each iteration I am 'redefining/reallocating' or restoring a variable, does this take extra space?

I have written a solution for the following problem: https://leetcode.com/problems/remove-all-adjacent-duplicates-in-string-ii/
To summarise the question, we are given an integer k, and we need to remove duplicates of length k, from the input string s.
In my solution, I write out all the possible duplicates, which is an array of 26 elements. Then I iterate over the input string s, and remove the duplicates from it, or rather, I use slicing to redefine s:
def removeDuplicates(s: str, k: int) -> str:
dup = [k * i for i in "qwertyuiopasdfghjklzxcvbnm"]
pointer1 = 0
while pointer1+k<len(s):
if s[pointer1:pointer1+k] in dup:
s = s[:pointer1] + s[pointer1+k:]
if pointer1>1:
pointer1-=2*k
if s[-k:len(s)] in dup:
s = s[:-k]
else:
pointer1+=1
return s
My understanding is that the time complexity is O(n), where n is the length of the input string, since in the worst case we iterate over the whole string without removing any characters. Is this correct?
The space complexity is where I am more unsure, as each time I find a duplicate, I am redefining s, so will need to allocate 'new' space for this. Is it right to say that the space complexity is O(n) as well?

Actually, it depends on the language you are working on. Like java have garbage collector and similar thing is also there in python. But in c there we can say we have O(n) space complexity.

Time complexity for two different solutions

I want to understand the difference in time complexity between these two solutions.
The task is not relevant but if you're curious here's the link with the explanation.
This is my first solution. Scores a 100% in correctness but 0% in performance:
def solution(s, p ,q):
dna = dict({'A': 1, 'C': 2, 'G': 3, 'T': 4})
result = []
for i in range(len(q)):
least = 4
for c in set(s[p[i] : q[i] + 1]):
if least > dna[c]: least = dna[c]
result.append(least)
return result
This is the second solution. Scores 100% in both correctness and performance:
def solution(s, p ,q):
result = []
for i in range(len(q)):
if 'A' in s[p[i]:q[i] + 1]: result.append(1)
elif 'C' in s[p[i]:q[i] + 1]: result.append(2)
elif 'G' in s[p[i]:q[i] + 1]: result.append(3)
else: result.append(4)
return list(result)
Now this is how I see it. In both solutions I'm iterating through a range of Q length and on each iteration I'm slicing different portions of a string, with a length between 1 and 100,000.
Here's where I get confused, in my first solution on each iteration, I'm slicing once a portion of the string and create a set to remove all the duplicates. The set can have a length between 1 and 4, so iterating through it must be very quick. What I notice is that I iterate through it only once, on each iteration.
In the second solution on each iteration, I'm slicing three times a portion of the string and iterate through it, in the worst case three times with a length of 100,000.
Then why is the second solution faster? How can the first have a time complexity of O(n*m) and the second O(n+m)?
I thought it's because of the in and the for operators, but I tried the same second solution in JavaScript with the indexOf method and it still gets a 100% in performance. But why? I can understand that if in Python the in and the for operators have different implementations and work differently behind the scene, but in JS the indexOf method is just going to apply a for loop. Then isn't it the same as just doing the for loop directly inside my function? Shouldn't that be a O(n*m) time complexity?

You haven't specified how the performance rating is obtained, but anyway, the second algorithm is clearly better, mainly because it uses the in operator, that under the hood calls a function implemented in C, which is far more efficient than python. More on this topic here.
Also, I'm not sure, but I don't think that the python interpreter isn't smart enough to slice the string only once and then reuse the same portion the other times in the second algorithm.
Creating the set in the first algorithm also seems like a very costly operation.
Lastly, maybe the performance ratings aren't based on the algorithm complexity, but rather on the execution time over different test strings?

I think the difference in complexity can easily be showcased on an example.
Consider the following input:
s = 'ACGT' * 1000000
# = 'ACGTACGTACGTACGTACGTACGTACGTACGTACGT...ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT'
p = [0]
q = [3999999]
Algorithm 2 very quickly checks that 'A' is in s[0:4000000] (it's the first character - no need to iterate through the whole string to find it!).
Algorithm 1, on the other hand, must iterate through the whole string s[0:4000000] to build the set {'A','C','G','T'}, because iterating through the whole string is the only way to check that there isn't a fifth distinct character hidden somewhere in the string.
Important note
I said algorithm 2 should be fast on this example, because the test 'A' in ... doesn't need to iterate through the whole string to find 'A' if 'A' is at the beginning of the string. However, note a possible important difference in complexity between 'A' in s and 'A' in s[0:4000000]. The problem is that creating a slice of the string might cost time (and memory) if it's copying the string. Instead of slicing, you should use s.find('A', 0, 4000000), which is guaranteed not to build a copy. For more information on this:
Documentation on string.find
Stackoverflow: Time complexity of string slice

Most efficient way to get first value that startswith of large list

I have a very large list with over a 100M strings. An example of that list look as follows:
l = ['1,1,5.8067',
'1,2,4.9700',
'2,2,3.9623',
'2,3,1.9438',
'2,7,1.0645',
'3,3,8.9331',
'3,5,2.6772',
'3,7,3.8107',
'3,9,7.1008']
I would like to get the first string that starts with e.g. '3'.
To do so, I have used a lambda iterator followed by next() to get the first item:
next(filter(lambda i: i.startswith('3,'), l))
Out[1]: '3,3,8.9331'
Considering the size of the list, this strategy unfortunately still takes relatively much time for a process I have to do over and over again. I was wondering if someone could come up with an even faster, more efficient approach. I am open for alternative strategies.

I have no way of testing it myself but it is possible that if you will join all the strings with a char that is not in any of the string:
concat_list = '$'.join(l)
And now use a simple .find('$3,'), it would be faster. It might happen if all the strings are relatively short. Since now all the string is in one place in memory.
If the amount of unique letters in the text is small you can use Abrahamson-Kosaraju method and het time complexity of practically O(n)
Another approach is to use joblib, create n threads when the i'th thread is checking the i + k * n, when one is finding the pattern it stops the others. So the time complexity is O(naive algorithm / n).

Since your actual strings consist of relatively short tokens (such as 301) after splitting the the strings by tabs, you can build a dict with each possible length of the first token as the keys so that subsequent lookups take only O(1) in average time complexity.
Build the dict with values of the list in reverse order so that the first value in the list that start with each distinct character will be retained in the final dict:
d = {s[:i + 1]: s for s in reversed(l) for i in range(len(s.split('\t')[0]))}
so that given:
l = ['301\t301\t51.806763\n', '301\t302\t46.970094\n',
'301\t303\t39.962393\n', '301\t304\t18.943836\n',
'301\t305\t11.064584\n', '301\t306\t4.751911\n']
d['3'] will return '301\t301\t51.806763'.
If you only need to test each of the first tokens as a whole, rather than prefixes, you can simply make the first tokens as the keys instead:
d = {s.split('\t')[0]: s for s in reversed(l)}
so that d['301'] will return '301\t301\t51.806763'.

need to decrease the run time of my program

I had a question where I had to find contiguous substrings of a string, and the condition was the first and last letters of the substring had to be same. I tried doing it, but the runtime exceed the time-limit for the question for several test cases. I tried using map for a for loop, but I have no idea what to do for the nested for loop. Can anyone please help me to decrease the runtime of this program?
n = int(raw_input())
string = str(raw_input())
def get_substrings(string):
length = len(string)
list = []
for i in range(length):
for j in range(i,length):
list.append(string[i:j + 1])
return list
substrings = get_substrings(string)
contiguous = filter(lambda x: (x[0] == x[len(x) - 1]), substrings)
print len(contiguous)

If i understand properly the question, please let me know if thats not the case but try this:
Not sure if this will speed up runtime, but i believe this algorithm may for longer strings especially (eliminates nested loop). Iterate through the string once, storing the index (position) of each character in a data structure with constant time lookup (hashmap, or an array if setup properly). When finished you should have a datastructure storing all the different locations of every character. Using this you can easily retrieve the substrings.
Example:
codingisfun
take the letter i for example, after doing what i said above, you look it up in the hashmap and see that it occurs at index 3 and 6. Meaning you can do something like substring(3, 6) to get it.
not the best code, but it seems reasonable for a starting point...you may be able to eliminate a loop with some creative thinking:
import string
import itertools
my_string = 'helloilovetocode'
mappings = dict()
for index, char in enumerate(my_string):
if not mappings.has_key(char):
mappings[char] = list()
mappings[char].append(index)
print char
for char in mappings:
if len(mappings[char]) > 1:
for subset in itertools.combinations(mappings[char], 2):
print my_string[subset[0]:(subset[1]+1)]

The problem is that your code far too inefficient in terms of algorithmic complexity.
Here's an alternative (a cleaner but slightly slower version of soliman's I believe)
import collections
def index_str(s):
"""
returns the indices characters show up at
"""
indices = collections.defaultdict(list)
for index, char in enumerate(s):
indices[char].append(index)
return indices
def get_substrings(s):
indices = index_str(s)
for key, index_lst in indices.items():
num_indices = len(index_lst)
for i in range(num_indices):
for j in range(i, num_indices):
yield s[index_lst[i]: index_lst[j] + 1]
The algorithmic problem with your solution is that you blindly check each possible substring, when you can easily determine what actual pairs are in a single, linear time pass. If you only want the count, that can be determined easily in O(MN) time, for a string of length N and M unique characters (given the number of occurrences of a char, you can mathematically figure out how many substrings there are). Of course, in the worst case (all chars are the same), your code will have the same complexity as ours, but the in average case complexity yours is much worse since you have a nested for loop (n^2 time)

Memory Efficient Alternatives to Python Dictionaries

In one of my current side projects, I am scanning through some text looking at the frequency of word triplets. In my first go at it, I used the default dictionary three levels deep. In other words, topDict[word1][word2][word3] returns the number of times these words appear in the text, topDict[word1][word2] returns a dictionary with all the words that appeared following words 1 and 2, etc.
This functions correctly, but it is very memory intensive. In my initial tests it used something like 20 times the memory of just storing the triplets in a text file, which seems like an overly large amount of memory overhead.
My suspicion is that many of these dictionaries are being created with many more slots than are actually being used, so I want to replace the dictionaries with something else that is more memory efficient when used in this manner. I would strongly prefer a solution that allows key lookups along the lines of the dictionaries.
From what I know of data structures, a balanced binary search tree using something like red-black or AVL would probably be ideal, but I would really prefer not to implement them myself. If possible, I'd prefer to stick with standard python libraries, but I'm definitely open to other alternatives if they would work best.
So, does anyone have any suggestions for me?
Edited to add:
Thanks for the responses so far. A few of the answers so far have suggested using tuples, which didn't really do much for me when I condensed the first two words into a tuple. I am hesitant to use all three as a key since I want it to be easy to look up all third words given the first two. (i.e. I want something like the result of topDict[word1, word2].keys()).
The current dataset I am playing around with is the most recent version of Wikipedia For Schools. The results of parsing the first thousand pages, for example, is something like 11MB for a text file where each line is the three words and the count all tab separated. Storing the text in the dictionary format I am now using takes around 185MB. I know that there will be some additional overhead for pointers and whatnot, but the difference seems excessive.

Some measurements. I took 10MB of free e-book text and computed trigram frequencies, producing a 24MB file. Storing it in different simple Python data structures took this much space in kB, measured as RSS from running ps, where d is a dict, keys and freqs are lists, and a,b,c,freq are the fields of a trigram record:
295760 S. Lott's answer
237984 S. Lott's with keys interned before passing in
203172 [*] d[(a,b,c)] = int(freq)
203156 d[a][b][c] = int(freq)
189132 keys.append((a,b,c)); freqs.append(int(freq))
146132 d[intern(a),intern(b)][intern(c)] = int(freq)
145408 d[intern(a)][intern(b)][intern(c)] = int(freq)
83888 [*] d[a+' '+b+' '+c] = int(freq)
82776 [*] d[(intern(a),intern(b),intern(c))] = int(freq)
68756 keys.append((intern(a),intern(b),intern(c))); freqs.append(int(freq))
60320 keys.append(a+' '+b+' '+c); freqs.append(int(freq))
50556 pair array
48320 squeezed pair array
33024 squeezed single array
The entries marked [*] have no efficient way to look up a pair (a,b); they're listed only because others have suggested them (or variants of them). (I was sort of irked into making this because the top-voted answers were not helpful, as the table shows.)
'Pair array' is the scheme below in my original answer ("I'd start with the array with keys
being the first two words..."), where the value table for each pair is
represented as a single string. 'Squeezed pair array' is the same,
leaving out the frequency values that are equal to 1 (the most common
case). 'Squeezed single array' is like squeezed pair array, but gloms key and value together as one string (with a separator character). The squeezed single array code:
import collections
def build(file):
pairs = collections.defaultdict(list)
for line in file: # N.B. file assumed to be already sorted
a, b, c, freq = line.split()
key = ' '.join((a, b))
pairs[key].append(c + ':' + freq if freq != '1' else c)
out = open('squeezedsinglearrayfile', 'w')
for key in sorted(pairs.keys()):
out.write('%s|%s\n' % (key, ' '.join(pairs[key])))
def load():
return open('squeezedsinglearrayfile').readlines()
if __name__ == '__main__':
build(open('freqs'))
I haven't written the code to look up values from this structure (use bisect, as mentioned below), or implemented the fancier compressed structures also described below.
Original answer: A simple sorted array of strings, each string being a space-separated concatenation of words, searched using the bisect module, should be worth trying for a start. This saves space on pointers, etc. It still wastes space due to the repetition of words; there's a standard trick to strip out common prefixes, with another level of index to get them back, but that's rather more complex and slower. (The idea is to store successive chunks of the array in a compressed form that must be scanned sequentially, along with a random-access index to each chunk. Chunks are big enough to compress, but small enough for reasonable access time. The particular compression scheme applicable here: if successive entries are 'hello george' and 'hello world', make the second entry be '6world' instead. (6 being the length of the prefix in common.) Or maybe you could get away with using zlib? Anyway, you can find out more in this vein by looking up dictionary structures used in full-text search.) So specifically, I'd start with the array with keys being the first two words, with a parallel array whose entries list the possible third words and their frequencies. It might still suck, though -- I think you may be out of luck as far as batteries-included memory-efficient options.
Also, binary tree structures are not recommended for memory efficiency here. E.g., this paper tests a variety of data structures on a similar problem (unigrams instead of trigrams though) and finds a hashtable to beat all of the tree structures by that measure.
I should have mentioned, as someone else did, that the sorted array could be used just for the wordlist, not bigrams or trigrams; then for your 'real' data structure, whatever it is, you use integer keys instead of strings -- indices into the wordlist. (But this keeps you from exploiting common prefixes except in the wordlist itself. Maybe I shouldn't suggest this after all.)

Use tuples.
Tuples can be keys to dictionaries, so you don't need to nest dictionaries.
d = {}
d[ word1, word2, word3 ] = 1
Also as a plus, you could use defaultdict
so that elements that don't have entries always return 0
and so that u can say d[w1,w2,w3] += 1 without checking if the key already exists or not
example:
from collections import defaultdict
d = defaultdict(int)
d["first","word","tuple"] += 1
If you need to find all words "word3" that are tupled with (word1,word2) then search for it in dictionary.keys() using list comprehension
if you have a tuple, t, you can get the first two items using slices:
>>> a = (1,2,3)
>>> a[:2]
(1, 2)
a small example for searching tuples with list comprehensions:
>>> b = [(1,2,3),(1,2,5),(3,4,6)]
>>> search = (1,2)
>>> [a[2] for a in b if a[:2] == search]
[3, 5]
You see here, we got a list of all items that appear as the third item in the tuples that start with (1,2)

In this case, ZODB¹ BTrees might be helpful, since they are much less memory-hungry. Use a BTrees.OOBtree (Object keys to Object values) or BTrees.OIBTree (Object keys to Integer values), and use 3-word tuples as your key.
Something like:
from BTrees.OOBTree import OOBTree as BTree
The interface is, more or less, dict-like, with the added bonus (for you) that .keys, .items, .iterkeys and .iteritems have two min, max optional arguments:
>>> t=BTree()
>>> t['a', 'b', 'c']= 10
>>> t['a', 'b', 'z']= 11
>>> t['a', 'a', 'z']= 12
>>> t['a', 'd', 'z']= 13
>>> print list(t.keys(('a', 'b'), ('a', 'c')))
[('a', 'b', 'c'), ('a', 'b', 'z')]
¹ Note that if you are on Windows and work with Python >2.4, I know there are packages for more recent python versions, but I can't recollect where.
PS They exist in the CheeseShop ☺

A couple attempts:
I figure you're doing something similar to this:
from __future__ import with_statement
import time
from collections import deque, defaultdict
# Just used to generate some triples of words
def triplegen(words="/usr/share/dict/words"):
d=deque()
with open(words) as f:
for i in range(3):
d.append(f.readline().strip())
while d[-1] != '':
yield tuple(d)
d.popleft()
d.append(f.readline().strip())
if __name__ == '__main__':
class D(dict):
def __missing__(self, key):
self[key] = D()
return self[key]
h=D()
for a, b, c in triplegen():
h[a][b][c] = 1
time.sleep(60)
That gives me ~88MB.
Changing the storage to
h[a, b, c] = 1
takes ~25MB
interning a, b, and c makes it take about 31MB. My case is a bit special because my words never repeat on the input. You might try some variations yourself and see if one of these helps you.

Are you implementing Markovian text generation?
If your chains map 2 words to the probabilities of the third I'd use a dictionary mapping K-tuples to the 3rd-word histogram. A trivial (but memory-hungry) way to implement the histogram would be to use a list with repeats, and then random.choice gives you a word with the proper probability.
Here's an implementation with the K-tuple as a parameter:
import random
# can change these functions to use a dict-based histogram
# instead of a list with repeats
def default_histogram(): return []
def add_to_histogram(item, hist): hist.append(item)
def choose_from_histogram(hist): return random.choice(hist)
K=2 # look 2 words back
words = ...
d = {}
# build histograms
for i in xrange(len(words)-K-1):
key = words[i:i+K]
word = words[i+K]
d.setdefault(key, default_histogram())
add_to_histogram(word, d[key])
# generate text
start = random.randrange(len(words)-K-1)
key = words[start:start+K]
for i in NUM_WORDS_TO_GENERATE:
word = choose_from_histogram(d[key])
print word,
key = key[1:] + (word,)

You could try to use same dictionary, only one level deep.
topDictionary[word1+delimiter+word2+delimiter+word3]
delimiter could be plain " ". (or use (word1,word2,word3))
This would be easiest to implement.
I believe you will see a little improvement, if it is not enough...
...i'll think of something...

Ok, so you are basically trying to store a sparse 3D space. The kind of access patterns you want to this space is crucial for the choice of algorithm and data structure. Considering your data source, do you want to feed this to a grid? If you don't need O(1) access:
In order to get memory efficiency you want to subdivide that space into subspaces with a similar number of entries. (like a BTree). So a data structure with :
firstWordRange
secondWordRange
thirdWordRange
numberOfEntries
a sorted block of entries.
next and previous blocks in all 3 dimensions

Scipy has sparse matrices, so if you can make the first two words a tuple, you can do something like this:
import numpy as N
from scipy import sparse
word_index = {}
count = sparse.lil_matrix((word_count*word_count, word_count), dtype=N.int)
for word1, word2, word3 in triple_list:
w1 = word_index.setdefault(word1, len(word_index))
w2 = word_index.setdefault(word2, len(word_index))
w3 = word_index.setdefault(word3, len(word_index))
w1_w2 = w1 * word_count + w2
count[w1_w2,w3] += 1

If memory is simply not big enough, pybsddb can help store a disk-persistent map.

You could use a numpy multidimensional array. You'll need to use numbers rather than strings to index into the array, but that can be solved by using a single dict to map words to numbers.
import numpy
w = {'word1':1, 'word2':2, 'word3':3, 'word4':4}
a = numpy.zeros( (4,4,4) )
Then to index into your array, you'd do something like:
a[w[word1], w[word2], w[word3]] += 1
That syntax is not beautiful, but numpy arrays are about as efficient as anything you're likely to find. Note also that I haven't tried this code out, so I may be off in some of the details. Just going from memory here.

Here's a tree structure that uses the bisect library to maintain a sorted list of words. Each lookup in O(log2(n)).
import bisect
class WordList( object ):
"""Leaf-level is list of words and counts."""
def __init__( self ):
self.words= [ ('\xff-None-',0) ]
def count( self, wordTuple ):
assert len(wordTuple)==1
word= wordTuple[0]
loc= bisect.bisect_left( self.words, word )
if self.words[loc][0] != word:
self.words.insert( loc, (word,0) )
self.words[loc]= ( word, self.words[loc][1]+1 )
def getWords( self ):
return self.words[:-1]
class WordTree( object ):
"""Above non-leaf nodes are words and either trees or lists."""
def __init__( self ):
self.words= [ ('\xff-None-',None) ]
def count( self, wordTuple ):
head, tail = wordTuple[0], wordTuple[1:]
loc= bisect.bisect_left( self.words, head )
if self.words[loc][0] != head:
if len(tail) == 1:
newList= WordList()
else:
newList= WordTree()
self.words.insert( loc, (head,newList) )
self.words[loc][1].count( tail )
def getWords( self ):
return self.words[:-1]
t = WordTree()
for a in ( ('the','quick','brown'), ('the','quick','fox') ):
t.count(a)
for w1,wt1 in t.getWords():
print w1
for w2,wt2 in wt1.getWords():
print " ", w2
for w3 in wt2.getWords():
print " ", w3
For simplicity, this uses a dummy value in each tree and list. This saves endless if-statements to determine if the list was actually empty before we make a comparison. It's only empty once, so the if-statements are wasted for all n-1 other words.

You could put all words in a dictionary.
key would be word, and value is number (index).
Then you use it like this:
Word1=indexDict[word1]
Word2=indexDict[word2]
Word3=indexDict[word3]
topDictionary[Word1][Word2][Word3]
Insert in indexDict with:
if word not in indexDict:
indexDict[word]=len(indexDict)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.