Memory Efficient Alternatives to Python Dictionaries - python

In one of my current side projects, I am scanning through some text looking at the frequency of word triplets. In my first go at it, I used the default dictionary three levels deep. In other words, topDict[word1][word2][word3] returns the number of times these words appear in the text, topDict[word1][word2] returns a dictionary with all the words that appeared following words 1 and 2, etc.
This functions correctly, but it is very memory intensive. In my initial tests it used something like 20 times the memory of just storing the triplets in a text file, which seems like an overly large amount of memory overhead.
My suspicion is that many of these dictionaries are being created with many more slots than are actually being used, so I want to replace the dictionaries with something else that is more memory efficient when used in this manner. I would strongly prefer a solution that allows key lookups along the lines of the dictionaries.
From what I know of data structures, a balanced binary search tree using something like red-black or AVL would probably be ideal, but I would really prefer not to implement them myself. If possible, I'd prefer to stick with standard python libraries, but I'm definitely open to other alternatives if they would work best.
So, does anyone have any suggestions for me?
Edited to add:
Thanks for the responses so far. A few of the answers so far have suggested using tuples, which didn't really do much for me when I condensed the first two words into a tuple. I am hesitant to use all three as a key since I want it to be easy to look up all third words given the first two. (i.e. I want something like the result of topDict[word1, word2].keys()).
The current dataset I am playing around with is the most recent version of Wikipedia For Schools. The results of parsing the first thousand pages, for example, is something like 11MB for a text file where each line is the three words and the count all tab separated. Storing the text in the dictionary format I am now using takes around 185MB. I know that there will be some additional overhead for pointers and whatnot, but the difference seems excessive.
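For reference, the nested layout being described is essentially this (a minimal sketch using defaultdict; triplets stands in for whatever produces the word triples):
from collections import defaultdict

topDict = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

for word1, word2, word3 in triplets:    # triplets: your stream of (w1, w2, w3)
    topDict[word1][word2][word3] += 1

print topDict['the']['quick'].keys()    # all third words seen after ('the', 'quick')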

Some measurements. I took 10MB of free e-book text and computed trigram frequencies, producing a 24MB file. Storing it in different simple Python data structures took this much space in kB, measured as RSS from running ps, where d is a dict, keys and freqs are lists, and a,b,c,freq are the fields of a trigram record:
295760 S. Lott's answer
237984 S. Lott's with keys interned before passing in
203172 [*] d[(a,b,c)] = int(freq)
203156 d[a][b][c] = int(freq)
189132 keys.append((a,b,c)); freqs.append(int(freq))
146132 d[intern(a),intern(b)][intern(c)] = int(freq)
145408 d[intern(a)][intern(b)][intern(c)] = int(freq)
83888 [*] d[a+' '+b+' '+c] = int(freq)
82776 [*] d[(intern(a),intern(b),intern(c))] = int(freq)
68756 keys.append((intern(a),intern(b),intern(c))); freqs.append(int(freq))
60320 keys.append(a+' '+b+' '+c); freqs.append(int(freq))
50556 pair array
48320 squeezed pair array
33024 squeezed single array
The entries marked [*] have no efficient way to look up a pair (a,b); they're listed only because others have suggested them (or variants of them). (I was sort of irked into making this because the top-voted answers were not helpful, as the table shows.)
'Pair array' is the scheme below in my original answer ("I'd start with the array with keys being the first two words..."), where the value table for each pair is represented as a single string. 'Squeezed pair array' is the same, leaving out the frequency values that are equal to 1 (the most common case). 'Squeezed single array' is like squeezed pair array, but gloms key and value together as one string (with a separator character). The squeezed single array code:
import collections

def build(file):
    pairs = collections.defaultdict(list)
    for line in file:  # N.B. file assumed to be already sorted
        a, b, c, freq = line.split()
        key = ' '.join((a, b))
        pairs[key].append(c + ':' + freq if freq != '1' else c)
    out = open('squeezedsinglearrayfile', 'w')
    for key in sorted(pairs.keys()):
        out.write('%s|%s\n' % (key, ' '.join(pairs[key])))

def load():
    return open('squeezedsinglearrayfile').readlines()

if __name__ == '__main__':
    build(open('freqs'))
I haven't written the code to look up values from this structure (use bisect, as mentioned below), or implemented the fancier compressed structures also described below.
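For illustration, such a lookup might go roughly like this (an untested sketch; it assumes the 'word1 word2|third:freq third ...' line format written by build and the list returned by load above):
import bisect

def lookup(lines, word1, word2):
    # lines: sorted list from load(); each line is 'w1 w2|c1[:f1] c2[:f2] ...\n'
    key = '%s %s|' % (word1, word2)
    i = bisect.bisect_left(lines, key)
    if i < len(lines) and lines[i].startswith(key):
        entries = lines[i].rstrip('\n').split('|', 1)[1].split()
        # an entry without ':freq' had its frequency squeezed out because it was 1
        return dict((e.split(':')[0], int(e.split(':', 1)[1]) if ':' in e else 1)
                    for e in entries)
    return {}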
Original answer: A simple sorted array of strings, each string being a space-separated concatenation of words, searched using the bisect module, should be worth trying for a start. This saves space on pointers, etc. It still wastes space due to the repetition of words; there's a standard trick to strip out common prefixes, with another level of index to get them back, but that's rather more complex and slower. (The idea is to store successive chunks of the array in a compressed form that must be scanned sequentially, along with a random-access index to each chunk. Chunks are big enough to compress, but small enough for reasonable access time. The particular compression scheme applicable here: if successive entries are 'hello george' and 'hello world', make the second entry be '6world' instead, 6 being the length of the common prefix. Or maybe you could get away with using zlib? Anyway, you can find out more in this vein by looking up the dictionary structures used in full-text search.)
So specifically, I'd start with the array with keys being the first two words, with a parallel array whose entries list the possible third words and their frequencies. It might still suck, though -- I think you may be out of luck as far as batteries-included memory-efficient options go.
Also, binary tree structures are not recommended for memory efficiency here. E.g., this paper tests a variety of data structures on a similar problem (unigrams instead of trigrams though) and finds a hashtable to beat all of the tree structures by that measure.
I should have mentioned, as someone else did, that the sorted array could be used just for the wordlist, not bigrams or trigrams; then for your 'real' data structure, whatever it is, you use integer keys instead of strings -- indices into the wordlist. (But this keeps you from exploiting common prefixes except in the wordlist itself. Maybe I shouldn't suggest this after all.)

Use tuples.
Tuples can be keys to dictionaries, so you don't need to nest dictionaries.
d = {}
d[ word1, word2, word3 ] = 1
Also as a plus, you could use a defaultdict so that elements that don't have entries always return 0, and so that you can say d[w1,w2,w3] += 1 without checking whether the key already exists.
example:
from collections import defaultdict
d = defaultdict(int)
d["first","word","tuple"] += 1
If you need to find all words "word3" that follow a given (word1, word2) pair, search dictionary.keys() with a list comprehension.
If you have a tuple, t, you can get the first two items using slices:
>>> a = (1,2,3)
>>> a[:2]
(1, 2)
a small example for searching tuples with list comprehensions:
>>> b = [(1,2,3),(1,2,5),(3,4,6)]
>>> search = (1,2)
>>> [a[2] for a in b if a[:2] == search]
[3, 5]
You see here, we got a list of all items that appear as the third item in the tuples that start with (1,2)

In this case, ZODB¹ BTrees might be helpful, since they are much less memory-hungry. Use a BTrees.OOBTree (Object keys to Object values) or BTrees.OIBTree (Object keys to Integer values), and use 3-word tuples as your key.
Something like:
from BTrees.OOBTree import OOBTree as BTree
The interface is, more or less, dict-like, with the added bonus (for you) that .keys, .items, .iterkeys and .iteritems accept two optional min, max arguments:
>>> t=BTree()
>>> t['a', 'b', 'c']= 10
>>> t['a', 'b', 'z']= 11
>>> t['a', 'a', 'z']= 12
>>> t['a', 'd', 'z']= 13
>>> print list(t.keys(('a', 'b'), ('a', 'c')))
[('a', 'b', 'c'), ('a', 'b', 'z')]
¹ Note that if you are on Windows and work with Python >2.4, I know there are packages for more recent Python versions, but I can't recollect where to find them.
PS They exist in the CheeseShop ☺

A couple attempts:
I figure you're doing something similar to this:
from __future__ import with_statement
import time
from collections import deque, defaultdict

# Just used to generate some triples of words
def triplegen(words="/usr/share/dict/words"):
    d = deque()
    with open(words) as f:
        for i in range(3):
            d.append(f.readline().strip())
        while d[-1] != '':
            yield tuple(d)
            d.popleft()
            d.append(f.readline().strip())

if __name__ == '__main__':
    class D(dict):
        def __missing__(self, key):
            self[key] = D()
            return self[key]

    h = D()
    for a, b, c in triplegen():
        h[a][b][c] = 1
    time.sleep(60)
That gives me ~88MB.
Changing the storage to
h[a, b, c] = 1
takes ~25MB
Interning a, b, and c makes it take about 31MB. My case is a bit special because my words never repeat on the input. You might try some variations yourself and see if one of these helps you.

Are you implementing Markovian text generation?
If your chains map 2 words to the probabilities of the third I'd use a dictionary mapping K-tuples to the 3rd-word histogram. A trivial (but memory-hungry) way to implement the histogram would be to use a list with repeats, and then random.choice gives you a word with the proper probability.
Here's an implementation with the K-tuple as a parameter:
import random

# can change these functions to use a dict-based histogram
# instead of a list with repeats
def default_histogram():          return []
def add_to_histogram(item, hist): hist.append(item)
def choose_from_histogram(hist):  return random.choice(hist)

K = 2  # look 2 words back
words = ...  # a tuple of words from your corpus (slices must be hashable)
d = {}

# build histograms
for i in xrange(len(words) - K - 1):
    key = words[i:i+K]
    word = words[i+K]
    d.setdefault(key, default_histogram())
    add_to_histogram(word, d[key])

# generate text
start = random.randrange(len(words) - K - 1)
key = words[start:start+K]
for i in xrange(NUM_WORDS_TO_GENERATE):
    word = choose_from_histogram(d[key])
    print word,
    key = key[1:] + (word,)

You could try using the same dictionary, only one level deep.
topDictionary[word1+delimiter+word2+delimiter+word3]
The delimiter could be a plain " " (or you could use the tuple (word1, word2, word3) as the key).
This would be easiest to implement.
I believe you will see a little improvement; if it is not enough...
...I'll think of something...
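A small sketch of that flattened layout (illustrative names only; the second dict shows the pair-keyed variant that keeps the "all third words for (word1, word2)" lookup cheap):
delimiter = ' '

# one flat dict keyed by the whole triple
tripleCounts = {}
def add_triple(word1, word2, word3):
    key = delimiter.join((word1, word2, word3))
    tripleCounts[key] = tripleCounts.get(key, 0) + 1

# pair-keyed variant: the third words for a pair are just inner.keys()
pairCounts = {}
def add_pair(word1, word2, word3):
    inner = pairCounts.setdefault(word1 + delimiter + word2, {})
    inner[word3] = inner.get(word3, 0) + 1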

OK, so you are basically trying to store a sparse 3D space. The kinds of access patterns you need for this space are crucial for the choice of algorithm and data structure. Considering your data source, do you want to feed this to a grid? If you don't need O(1) access:
In order to get memory efficiency you want to subdivide that space into subspaces with a similar number of entries (like a BTree). So, a data structure with the following fields, sketched just after this list:
firstWordRange
secondWordRange
thirdWordRange
numberOfEntries
a sorted block of entries.
next and previous blocks in all 3 dimensions
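That field list, written out literally as a class just to make the shape concrete (all names are illustrative, nothing prescriptive):
class Block(object):
    """One subspace of the sparse 3D space holding a similar number of entries."""
    def __init__(self, first_word_range, second_word_range, third_word_range):
        self.first_word_range = first_word_range    # (low, high) bounds on word ids
        self.second_word_range = second_word_range
        self.third_word_range = third_word_range
        self.number_of_entries = 0
        self.entries = []                            # sorted (w1, w2, w3, count) tuples
        self.neighbours = {}                         # axis -> (previous block, next block)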

Scipy has sparse matrices, so if you can make the first two words a tuple, you can do something like this:
import numpy as N
from scipy import sparse

word_index = {}
count = sparse.lil_matrix((word_count*word_count, word_count), dtype=N.int)

for word1, word2, word3 in triple_list:
    w1 = word_index.setdefault(word1, len(word_index))
    w2 = word_index.setdefault(word2, len(word_index))
    w3 = word_index.setdefault(word3, len(word_index))
    w1_w2 = w1 * word_count + w2
    count[w1_w2, w3] += 1
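Reading the third-word counts back out for a (word1, word2) pair might then look like this (a sketch reusing the names above; it assumes both words were seen and that word_count is the known vocabulary size):
w1 = word_index[word1]
w2 = word_index[word2]
row = count.getrow(w1 * word_count + w2)     # 1 x word_count sparse row
third_word_ids = row.nonzero()[1]            # column ids with nonzero counts
# mapping ids back to words needs a reverse index, e.g.
# id_to_word = dict((i, w) for w, i in word_index.items())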

If memory is simply not big enough, pybsddb can help store a disk-persistent map.
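A rough sketch of what that might look like with the legacy bsddb interface (pybsddb/bsddb3 exposes a similar btopen; word1, word2, word3 are placeholders here):
import bsddb

db = bsddb.btopen('triples.db', 'c')        # disk-backed B-tree, dict-like access
key = ' '.join((word1, word2, word3))       # keys and values must be strings
if db.has_key(key):
    db[key] = str(int(db[key]) + 1)
else:
    db[key] = '1'
# B-tree keys stay sorted, so all third words for a pair could be pulled out with
# a cursor scan (set_location / next) over the 'word1 word2 ' prefix, without
# loading everything into memory.
db.close()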

You could use a numpy multidimensional array. You'll need to use numbers rather than strings to index into the array, but that can be solved by using a single dict to map words to numbers.
import numpy
w = {'word1':0, 'word2':1, 'word3':2, 'word4':3}
a = numpy.zeros( (4,4,4) )
Then to index into your array, you'd do something like:
a[w[word1], w[word2], w[word3]] += 1
That syntax is not beautiful, but numpy arrays are about as efficient as anything you're likely to find. Note also that I haven't tried this code out, so I may be off in some of the details. Just going from memory here.
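To read the table back out, e.g. all third-word counts for a fixed pair of words, a slice along the last axis would do (a sketch reusing the names above):
third_counts = a[w['word1'], w['word2'], :]   # 1-D array, one slot per word id
for word, idx in w.items():
    if third_counts[idx]:
        print word, int(third_counts[idx])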

Here's a tree structure that uses the bisect library to maintain a sorted list of words. Each lookup is O(log2(n)).
import bisect

class WordList( object ):
    """Leaf-level is list of words and counts."""
    def __init__( self ):
        self.words= [ ('\xff-None-',0) ]
    def count( self, wordTuple ):
        assert len(wordTuple)==1
        word= wordTuple[0]
        # bisect with a 1-tuple so tuples are compared with tuples
        loc= bisect.bisect_left( self.words, (word,) )
        if self.words[loc][0] != word:
            self.words.insert( loc, (word,0) )
        self.words[loc]= ( word, self.words[loc][1]+1 )
    def getWords( self ):
        return self.words[:-1]

class WordTree( object ):
    """Above non-leaf nodes are words and either trees or lists."""
    def __init__( self ):
        self.words= [ ('\xff-None-',None) ]
    def count( self, wordTuple ):
        head, tail = wordTuple[0], wordTuple[1:]
        loc= bisect.bisect_left( self.words, (head,) )
        if self.words[loc][0] != head:
            if len(tail) == 1:
                newList= WordList()
            else:
                newList= WordTree()
            self.words.insert( loc, (head,newList) )
        self.words[loc][1].count( tail )
    def getWords( self ):
        return self.words[:-1]

t = WordTree()
for a in ( ('the','quick','brown'), ('the','quick','fox') ):
    t.count(a)

for w1,wt1 in t.getWords():
    print w1
    for w2,wt2 in wt1.getWords():
        print " ", w2
        for w3 in wt2.getWords():
            print " ", w3
For simplicity, this uses a dummy value in each tree and list. This saves endless if-statements to determine if the list was actually empty before we make a comparison. It's only empty once, so the if-statements are wasted for all n-1 other words.

You could put all words in a dictionary.
The key would be the word, and the value a number (its index).
Then you use it like this:
Word1=indexDict[word1]
Word2=indexDict[word2]
Word3=indexDict[word3]
topDictionary[Word1][Word2][Word3]
Insert in indexDict with:
if word not in indexDict:
    indexDict[word] = len(indexDict)

Related

Python: update a frequency table (in the form of a list of lists)

I have two lists of US state abbreviations (for example):
s1=['CO','MA','IN','OH','MA','CA','OH','OH']
s2=['MA','FL','CA','GA','MA','OH']
What I want to end up with is this (basically an ordered frequency table):
S=[['CA',2],['CO',1],['FL',1],['GA',1],['IN',1],['MA',4],['OH',4]]
The way I came up with was:
s3=s1+s2
S=[[x,s3.count(x)] for x in set(s3)]
This works great - though, tbh, I don't know that this is very memory efficient.
BUT... there is a catch.
s1+s2
...is too big to hold in memory, so what I'm doing is appending to s1 until it reaches a length of 10K (yes, resources are THAT limited), then summarizing it (using the list comprehension step above), deleting the contents of s1, and re-filling s1 with the next chunk of data (only represented as 's2' above for purpose of demonstration). ...and so on through the loop until it reaches the end of the data.
So with each iteration of the loop, I want to sum the 'base' list of lists 'S' with the current iteration's list of lists 's'. My question is, essentially, how do I add these:
(the current master data):
S=[['CA',1],['CO',1],['IN',1],['MA',2],['OH',3]]
(the new data):
s=[['CA',1],['FL',1],['GA',1],['MA',2],['OH',1]]
...to get (the new master data):
S=[['CA',2],['CO',1],['FL',1],['GA',1],['IN',1],['MA',4],['OH',4]]
...in some sort of reasonably efficient way. If this is better to do with dictionaries or something else, I am fine with that. What I can't do, unfortunately, is make use of ANY remotely specialized Python module -- all I have to work with is the most stripped-down version of Python 2.6 imaginable in a closed-off, locked-down, resource-poor Linux environment (hazards of the job). Any help is greatly appreciated!!
You can use itertools.chain to chain two iterators efficiently:
import itertools
import collections
counts = collections.Counter()
for val in itertools.chain(s1, s2):  # memory efficient
    counts[val] += 1
A collections.Counter object is a dict specialized for counting; if you know how to use a dict you can use a collections.Counter. It also allows you to write the above more succinctly as:
counts = collections.Counter(itertools.chain(s1, s2))
Also note, the following construction:
S=[[x,s3.count(x)] for x in set(s3)]
happens to also be very time inefficient, since you are calling s3.count in a loop, although this might not be too bad if len(set(s3)) << len(s3).
Note, you can do the chaining "manually" by doing something like:
it1 = iter(s1)
it2 = iter(s2)
for val in it1:
    ...
for val in it2:
    ...
You can run Counter.update as many times as you like, cutting your data to fit in memory / streaming them as you like.
import collections
counter = collections.Counter()
counter.update(['foo', 'bar'])
assert counter['foo'] == counter['bar'] == 1
counter.update(['foo', 'bar', 'foo'])
assert counter['foo'] == 3
assert counter['bar'] == 2
assert sorted(counter.items(), key=lambda rec: -rec[1]) == [('foo', 3), ('bar', 2)]
The last line uses negated count as the sorting key to make the higher counts come first.
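If you then want the ordered list-of-lists format from your question, that is one more line (a sketch using the counter built above; sorted() over a dict iterates its keys in order):
S = [[abbrev, counter[abbrev]] for abbrev in sorted(counter)]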
If your count structure still does not fit in memory with that, you need a (disk-based) database such as Postgres, or likely just a machine with more memory and a more efficient key-value store such as Redis.

Python: how to check if an item is in a list efficiently?

I have a list of strings (word-like), and, while I am parsing a text, I need to check if a word belongs to the group of words in my current list.
However, my input is pretty big (about 600 million lines), and checking if an element belongs to a list is an O(n) operation according to the Python documentation.
My code is something like:
words_in_line = []
for word in line:
    if word in my_list:
        words_in_line.append(word)
As it takes too much time (days, actually), I wanted to improve the part that is taking most of the time. I had a look at Python collections and, more precisely, at deque. However, it only gives O(1) access time to the head and the tail of a list, not to the middle.
Does someone have an idea about how to do that in a better way?
You might consider a trie or a DAWG or a database. There are several Python implementations of the same.
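For illustration, a minimal dict-of-dicts trie for membership testing might look like this (a rough sketch, not a tuned implementation; the third-party trie/DAWG packages do the same thing far more compactly):
END = object()   # sentinel marking the end of a complete word

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

trie = build_trie(['cat', 'car', 'dog'])
print contains(trie, 'car'), contains(trie, 'cart')   # True False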
Here are some relative timings of a set vs a list for you to consider:
import timeit
import random

with open('/usr/share/dict/words','r') as di:   # UNIX 250k unique word list
    all_words_set = {line.strip() for line in di}

all_words_list = list(all_words_set)            # slightly faster if this list is sorted...

test_list = [random.choice(all_words_list) for i in range(10000)]
test_set = set(test_list)

def set_f():
    count = 0
    for word in test_set:
        if word in all_words_set:
            count += 1
    return count

def list_f():
    count = 0
    for word in test_list:
        if word in all_words_list:
            count += 1
    return count

def mix_f():
    # use list for source, set for membership testing
    count = 0
    for word in test_list:
        if word in all_words_set:
            count += 1
    return count

print "list:", timeit.Timer(list_f).timeit(1), "secs"
print "set:", timeit.Timer(set_f).timeit(1), "secs"
print "mixed:", timeit.Timer(mix_f).timeit(1), "secs"
Prints:
list: 47.4126560688 secs
set: 0.00277495384216 secs
mixed: 0.00166988372803 secs
I.e., matching a set of 10,000 words against a set of 250,000 words is 17,085x faster than matching a list of the same 10,000 words against a list of the same 250,000 words. Using a list for the source and a set for membership testing is 28,392x faster than an unsorted list alone.
For membership testing, a list is O(n) and sets and dicts are O(1) for lookups.
Conclusion: Use better data structures for 600 million lines of text!
I'm not clear on why you chose a list in the first place, but here are some alternatives:
Using a set() is likely a good idea. This is very fast, though unordered, but sometimes that's exactly what's needed.
If you need things ordered and to have arbitrary lookups as well, you could use a tree of some sort:
http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/
If set membership testing with a small number of false positives here or there is acceptable, you might check into a bloom filter:
http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/
Depending on what you're doing, a trie might also be very good.
This uses a list comprehension:
words_in_line = [word for word in line if word in my_list]
which would be more efficient than the code you posted, though how much more for your huge data set is hard to know.
There are two improvements you can make here.
Back your word list with a hashtable. This will afford you O(1) performance when you are checking whether a word is present in your word list. There are a number of ways to do this; the most fitting in this scenario is to convert your list to a set.
Use a more appropriate structure for your matching-word collection.
If you need to store all of the matches in memory at the same time, use a deque, since its append performance is superior to that of lists.
If you don't need all the matches in memory at once, consider using a generator. A generator is used to iterate over matched values according to the logic you specify, but it only stores part of the resulting list in memory at a time. It may offer improved performance if you are experiencing I/O bottlenecks.
Below is an example implementation based on my suggestions (opting for a generator, since I can't imagine you need all those words in memory at once).
from itertools import chain
d = set(['a','b','c']) # Load our dictionary
f = open('c:\\input.txt','r')
# Build a generator to get the words in the file
all_words_generator = chain.from_iterable(line.split() for line in f)
# Build a generator to filter out the non-dictionary words
matching_words_generator = (word for word in all_words_generator if word in d)
for matched_word in matching_words_generator:
    # Do something with matched_word
    print matched_word
# We're reading the file during the above loop, so don't close it too early
f.close()
input.txt
a b dog cat
c dog poop
maybe b cat
dog
Output
a
b
c
b

Python Puzzle code review (spoiler)

I have been working on the problems presented in Python Challenge. One of the problems asks to sift through a mess of characters and pick out the rarest character/s.
My methodology was to read the characters from a text file, store the characters/occurrence as a key/value pair in a dictionary. Sort the dictionary by value and invert the dictionary where the occurrence is the key and the string of characters is the value. Assuming that the rarest character occurs only once, I return the values where the key of this inverted dictionary equals one.
The input(funkymess.txt) is like this:
%%$#$^_#)^)&!_+]!*#&^}##%%+$&[(_#%+%$*^#$^!+]!&#)*}{}}!}]$[%}#[{##_^{*......
The code is as follows:
from operator import itemgetter

characterDict = dict()

#put the characters in a dictionary
def putEncounteredCharactersInDictionary(lineStr):
    for character in lineStr:
        if character in characterDict:
            characterDict[character] = characterDict[character]+1
        else:
            characterDict[character] = 1

#Sort the character dictionary
def sortCharacterDictionary(characterDict):
    sortCharDict = dict()
    sortsortedDictionaryItems = sorted(characterDict.iteritems(),key = itemgetter(1))
    for key, value in sortsortedDictionaryItems:
        sortCharDict[key] = value
    return sortCharDict

#invert the sorted character dictionary
def inverseSortedCharacterDictionary(sortedCharDict):
    inv_map = dict()
    for k, v in sortedCharDict.iteritems():
        inv_map[v] = inv_map.get(v, [])
        inv_map[v].append(k)
    return inv_map

f = open('/Users/Developer/funkymess.txt','r')
for line in f:
    #print line
    processline = line.rstrip('\n')
    putEncounteredCharactersInDictionary(processline)
f.close()

sortedCharachterDictionary = sortCharacterDictionary(characterDict)
#print sortedCharachterDictionary
inversedSortedCharacterDictionary = inverseSortedCharacterDictionary(sortedCharachterDictionary)
print inversedSortedCharacterDictionary[1]
Can somebody take a look and provide me with some pointers on whether I am on the right track here and if possible provide some feedback on possible optimizations/best-practices and potential refactorings both from the language as well as from an algorithmic standpoint.
Thanks
Refactoring: A Walkthrough
I want to walk you through the process of refactoring. Learning to program is not just about knowing the end result, which is what you usually get when you ask a question on Stack Overflow. It's about how to get to that answer yourself. When people post short, dense answers to a question like this it's not always obvious how they arrived at their solutions.
So let's do some refactoring and see what we can do to simplify your code. We'll rewrite, delete, rename, and rearrange code until no more improvements can be made.
Simplify your algorithms
Python need not be so verbose. It is usually a code smell when you have explicit loops operating over lists and dicts in Python, rather than using list comprehensions and functions that operate on containers as a whole.
Use defaultdict to store character counts
A defaultdict(int) will generate entries when they are accessed if they do not exist. This lets us eliminate the if/else branch when counting characters.
from collections import defaultdict

characterDict = defaultdict(int)

def putEncounteredCharactersInDictionary(lineStr):
    for character in lineStr:
        characterDict[character] += 1
Sorting dicts
Dictionaries don't guarantee any ordering on their keys. You cannot assume that the items are stored in the same order that you insert them. So sorting the dict entries and then putting them right back into another dict just scrambles them right back up.
This means that your function is basically a no-op. After you sort the items you will need to keep them as a list of tuples to retain their sorting order. Removing that code we can then reduce this method down to a single line.
def sortCharacterDictionary(characterDict):
    return sorted(characterDict.iteritems(), key=itemgetter(1))
Inverting dicts
Given the previous comment you won't actually have a dict any more after sorting. But assuming you did, this function is one of those cases where explicit looping is discouraged. In Python, always be thinking how you can operate over collections all at once rather than one item at a time.
def inverseSortedCharacterDictionary(sortedCharDict):
    return dict((v, k) for k, v in sortedCharDict.iteritems())
All in one line we (1) iterate over the key/value pairs in the dict; (2) switch them and create inverted value/key tuples; (3) create a dict out of these inverted tuples.
Comment and name wisely
Your method names are long and descriptive. There's no need to repeat the same information in comments. Use comments only when your code isn't self-descriptive, such as when you have a complex algorithm or an unusual construct that isn't immediately obvious.
On the naming front, your names are unnecessarily long. I would stick with far less descriptive names, and also make them more generic. Instead of inverseSortedCharacterDictionary, try just invertedDict. That's all that method does, it inverts a dict. It doesn't actually matter if it's passed a sorted character dict or any other type of dict.
As a rule of thumb, try to use the most generic names possible so that your methods and variables can be as generic as possible. More generic means more reusable.
characters = defaultdict(int)

def countCharacters(string):
    for ch in string:
        characters[ch] += 1

def sortedCharacters(characters):
    return sorted(characters.iteritems(), key=itemgetter(1))

def invertedDict(d):
    return dict((v, k) for k, v in d.iteritems())
Reduce volume
Using temporary variables and helper methods is a good programming practice, and I applaud you for doing so in your program. However, now that we have them simple enough that each one is only one or two lines we probably don't even need them any more.
Here's your program body after changing the functions as above:
f = open('funkymess.txt', 'r')
for line in f:
    countCharacters(line.rstrip('\n'))
f.close()

print sortedCharacters(characters)[0]
And then let's just go ahead and inline those helper methods since they're so simple. Here's the final program after all the refactoring:
Final program
#!/usr/bin/env python
from operator import itemgetter
from collections import defaultdict

characters = defaultdict(int)

f = open('funkymess.txt','r')
for line in f:
    for ch in line.rstrip('\n'):
        characters[ch] += 1
f.close()

print sorted(characters.iteritems(), key=itemgetter(1))[0]
You don't even need as much code as that, because Python already has a class that counts elements in an iterable for you! The following does all of what you asked for.
from collections import Counter
counter = Counter(open(<...>).read())
print min(counter, key=counter.get)
Explanation:
collections is a standard module in Python containing some commonly-used data structures. In particular, it contains Counter, which is a subclass of dict designed to count the frequency of stuff. It takes an iterable and counts all the characters in it.
Now as you may know, in Python strings are iterables and their elements are the single characters. So we can open the file, read all its contents at once, and feed that large string into a Counter. This makes a dict-like object which maps characters to their frequencies.
Finally, we want to find the least frequent character, given this dictionary of their frequencies. In other words, we want the minimum element of counter, ranked by its value in the dictionary. Python has a built-in function for taking the minimum of things, naturally called min. If you want to compare the data by something else, you can pass it an optional key argument and it will take the minimum with respect to that key. In this case, we ask min to find the minimum element as ranked by counter.get; in other words, we sort by frequency!
That's way too much code.
import operator
[k for k, v in characterdict.iteritems()
 if v == min(characterdict.items(), key=operator.itemgetter(1))[1]]
Optimize as desired (e.g. store the minimum in another variable first).
Here's the code that I used to solve this puzzle:
comment = open('comment.txt').read()
for c in sorted(set(comment)):
    print ' %-3s %6d' % (repr(c)[1:-1], comment.count(c))
It sorts characters alphabetically rather than by frequency, but the rarest characters are very easy to pick up from the output.
If I wanted frequency sorting, I'd use collections.Counter like katrielalex suggested (if I remembered about its existence), or
from collections import defaultdict

comment = open('comment.txt').read()
counts = defaultdict(int)
for c in comment:
    counts[c] += 1
for c in sorted(counts, key=counts.get):
    print ' %-3s %6d' % (repr(c)[1:-1], counts[c])
Another way (not very compact) to accomplish your task:
text = """%$#$^_#)^)&!_+]!*#&^}##%%+$&[(_#%+%$*^#$^!+]!&#)*}{}}!}"""
chars = set(text)
L = [[c, text.count(c)] for c in chars]
L.sort(key=lambda sublist: sublist[1])
>>> L
[('(', 1),
('[', 1),
('{', 1),
('#', 2),
(']', 2),
(')', 3),
('*', 3),
('_', 3),
('&', 4),
('+', 4),
('!', 5),
('%', 5),
('$', 5),
('}', 5),
('^', 5),
('#', 6)]
>>>

data structure that can do a "select distinct X where W=w and Y=y and Z=z and ..." type lookup

I have a set of unique vectors (10k's worth). And I need to, for any chosen column, extract the set of values that are seen in that column, in rows where all the other columns have given values.
I'm hoping for a solution that is sub linear (wrt the item count) in time and at most linear (wrt the total size of all the items) in space, preferably sub linear extra space over just storing the items.
Can I get that or better?
BTW: it's going to be accessed from python and needs to simple to program or be part of an existing commonly used library.
edit: the costs are for the lookup, and do not include the time to build the structures. All the data that will ever be indexed is available before the first query will be made.
It seems I'm doing a bad job of describing what I'm looking for, so here is a solution that gets close:
class Index:
    def __init__(self, stuff):  # don't care about this O() time
        self.all = set(stuff)
        self.index = {}
        for item in stuff:
            for i, v in enumerate(item):
                self.index.setdefault(i, set()).add(v)
    def Get(self, col, have):   # this O() matters
        ret = []
        t = list(have)  # make a copy.
        for i in self.index[col]:
            t[col] = i
            if tuple(t) in self.all:
                ret.append(i)
        return ret
The problem is that this gives really bad (O(n)) worst-case performance.
Since you are asking for a SQL-like query, how about using a relational database? SQLite is part of the standard library, and can be used either on-disk or fully in memory.
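A minimal sketch of that with the standard-library sqlite3 module (table and column names are made up; an in-memory database is used here, but a filename gives on-disk storage):
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE vectors (x INTEGER, y INTEGER, z INTEGER)')
conn.execute('CREATE INDEX idx_yz ON vectors (y, z)')   # index the columns you fix
conn.executemany('INSERT INTO vectors VALUES (?, ?, ?)',
                 [(1, 2, 3), (1, 2, 4), (1, 5, 4)])
rows = conn.execute('SELECT DISTINCT x FROM vectors WHERE y = ? AND z = ?',
                    (2, 3)).fetchall()
print rows   # [(1,)]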
If you have a Python set (no ordering) there is no way to select all relevant items without at least looking at all items -- so it's impossible for any solution to be "sub linear" (wrt the number of items) as you require.
If you have a list, instead of a set, then it can be ordered -- but that cannot be achieved in linear time in the general case (O(N log N) is provably the best you can do for a general-case sorting -- and building sorted indices would be similar -- unless there are constraints that let you use "bucket-sort-like" approaches). You can spread around the time it takes to keep indices over all insertions in the data structure -- but that won't reduce the total time needed to build such indices, just, as I said, spread them around.
Hash-based (not sorted) indices can be faster for your special case (where you only need to select by equality, not by "less than" &c) -- but the time to construct such indices is linear in the number of items (obviously you can't construct such an index without at least looking once at each item -- sublinear time requires some magic that lets you completely ignore certain items, and that can't happen without supporting "structure" (such as sortedness), which in turn requires time to achieve (though it can be achieved "incrementally" ahead of time, such an approach doesn't reduce the total time required)).
So, taken to the letter, your requirements appear overconstrained: neither Python, nor any other language, nor any database engine, etc, can possibly achieve them -- if interpreted literally exactly as you state them. If "incremental work done ahead of time" doesn't count (as breaching your demands of linearity and sublinearity), if you talk about expected/typical rather than worst-case behavior (and your items have friendly probability distributions), etc, then it might be possible to come close to achieving your very demanding requests.
For example, consider maintaining for each of the vectors' D dimensions a dictionary mapping the value an item has in that dimension, to a set of indices of such items; then, selecting the items that meet the D-1 requirements of equality for every dimension but the ith one can be done by set intersections. Does this meet your requirements? Not by taking the latter strictly to the letter, as I've explained above -- but maybe, depending on how much each requirement can perhaps be taken in a more "relaxed" sense.
BTW, I don't understand what a "group by" implies here since all the vectors in each group would be identical (since you said all dimensions but one are specified by equality), so it may be that you've skipped a COUNT(*) in your SQL-equivalent -- i.e., you need a count of how many such vectors have a given value in the i-th dimension. In that case, it would be achievable by the above approach.
Edit: as the OP has clarified somewhat in comments and an edit to his Q I can propose an approach in more details:
import collections

class Searchable(object):

    def __init__(self, toindex=None):
        self.toindex = toindex
        self.data = []
        self.indices = None

    def makeindices(self):
        if self.indices is not None:
            return
        self.indices = dict((i, collections.defaultdict(set))
                            for i in self.toindex)

    def add(self, record):
        if self.toindex is None:
            self.toindex = range(len(record))
        self.makeindices()
        where = len(self.data)
        self.data.append(record)
        for i in self.toindex:
            self.indices[i][record[i]].add(where)

    def get(self, indices_and_values, indices_to_get):
        ok = set(range(len(self.data)))
        for i, v in indices_and_values:
            ok.intersection_update(self.indices[i][v])
        result = set()
        for rec in (self.data[i] for i in ok):
            t = tuple(rec[i] for i in indices_to_get)
            result.add(t)
        return result

def main():
    c = Searchable()
    for r in ((1,2,3), (1,2,4), (1,5,4)):
        c.add(r)
    print c.get([(0,1),(1,2)], [2])

main()
This prints
set([(3,), (4,)])
and of course could be easily specialized to return results in other formats, accept indices (to index and/or to return) in different ways, etc. I believe it meets the requirements as edited / clarified since the extra storage is, for each indexed dimension/value, a set of the indices at which said value occurs on that dimension, and the search time is one set intersection per indexed dimension plus a loop on the number of items to be returned.
I'm assuming that you've tried the dictionary and you need something more flexible. Basically, what you need to do is index values of x, y and z
def build_index(vectors):
    index = {'x': {}, 'y': {}, 'z': {}}
    for position, vector in enumerate(vectors):
        if vector.x in index['x']:
            index['x'][vector.x].append(position)
        else:
            index['x'][vector.x] = [position]
        if vector.y in index['y']:
            index['y'][vector.y].append(position)
        else:
            index['y'][vector.y] = [position]
        if vector.z in index['z']:
            index['z'][vector.z].append(position)
        else:
            index['z'][vector.z] = [position]
    return index
What you have in index is a lookup table. You can say, for example, select x,y,z from vectors where x=42 by doing this:
def query_by(vectors, index, property, value):
    results = []
    for i in index[property][value]:
        results.append(vectors[i])
    return results

vecs_x_42 = query_by(vectors, index, 'x', 42)
# now vecs_x_42 is a list of all vectors where x is 42
Now to do a logical conjunction, say select x,y,z from vectors where x=42 and y=3 you can use Python's sets to accomplish this:
def query_by(vectors, index, criteria):
    sets = []
    for k, v in criteria.iteritems():
        if v not in index[k]:
            return []
        sets.append(set(index[k][v]))
    results = []
    for i in set.intersection(*sets):
        results.append(vectors[i])
    return results

vecs_x_42_y_3 = query_by(vectors, index, {'x': 42, 'y': 3})
The intersection operation on sets produces values which only appear in both sets, so you are only iterating the positions which satisfy both conditions.
Now for the last part of your question, to group by x:
def group_by(vectors, property):
    result = {}
    for v in vectors:
        value = getattr(v, property)
        if value in result:
            result[value].append(v)
        else:
            result[value] = [v]
    return result
So let's bring it all together:
vectors = [...] # your vectors, as objects such that v.x, v.y produces the x and y values
index = build_index(vectors)
my_vectors = group_by(query_by(vectors, index, {'y':42, 'z': 3}), 'x')
# now you have in my_vectors a dictionary of vectors grouped by x value, where y=42 and z=3
Update
I updated the code above and fixed a few obvious errors. It works now and it does what it claims to do. On my laptop, a 2GHz core 2 duo with 4GB RAM, it takes less than 1s to build_index. Lookups are very quick, even when the dataset has 100k vectors. If I have some time I'll do some formal comparisons against MySQL.
You can see the full code at this Codepad, if you time it or improve it, let me know.
Suppose you have a 'tuple' class with fields x, y, and z and you have a bunch of such tuples saved in an enumerable var named myTuples. Then:
A) Pre-population:
dct = {}
for tpl in myTuples:
    tmp = (tpl.y, tpl.z)
    if tmp in dct:
        dct[tmp].append(tpl.x)
    else:
        dct[tmp] = [tpl.x]
B) Query:
def findAll(y, z):
    tmp = (y, z)
    if tmp not in dct: return ()
    return [(x, y, z) for x in dct[tmp]]
I am sure that there is a way to optimize the code for readability, save a few cycles, etc. But essentially you want to pre-populate a dict, using a 2-tuple as a key. If I had not seen a request for sub-linear then I would not have thought of this :)
A) The pre-population is linear, sorry.
B) Query should be as slow as the number of items returned - most of the time sub-linear, except for weird edge cases.
So you have 3 coordinates and one value for start and end of vector (x,y,z)?
How is it possible to know the seven known values? Are there many coordinate triples multiple times?
You must be running a very tight loop with this function to be so concerned about lookup time, considering the small size of the data (10K).
Could you give an example of real input for the class you posted?

Python, Huge Iteration Performance Problem

I'm iterating through 3 words, each about 5 million characters long, and I want to find sequences of 20 characters that identify each word. That is, I want to find all sequences of length 20 in one word that are unique to that word. My problem is that the code I've written takes an extremely long time to run. I've never even completed one word running my program overnight.
The function below takes a list containing dictionaries where each dictionary contains each possible 20-character sequence and its location, from one of the 5-million-character words.
If anybody has an idea how to optimize this I would be really thankful, I don't have a clue how to continue...
here's a sample of my code:
def findUnique(list):
    # Takes a list of dictionaries and compares each element in the dictionaries
    # with the others, puts all unique elements in new dictionaries and finally
    # puts the new dictionaries in a list.
    # The result is a list with (in this case) 3 dictionaries containing all unique
    # sequences and their locations from each string.
    dicList=[]
    listlength=len(list)
    s=0
    valuelist=[]
    for i in list:
        j=i.values()
        valuelist.append(j)
    while s<listlength:
        currdic=list[s]
        dic={}
        for key in currdic:
            currval=currdic[key]
            test=True
            n=0
            while n<listlength:
                if n!=s:
                    if currval in valuelist[n]: #this is where it takes too much time
                        n=listlength
                        test=False
                    else:
                        n+=1
                else:
                    n+=1
            if test:
                dic[key]=currval
        dicList.append(dic)
        s+=1
    return dicList
def slices(seq, length, prefer_last=False):
    unique = {}
    if prefer_last:  # this doesn't have to be a parameter, just choose one
        for start in xrange(len(seq) - length + 1):
            unique[seq[start:start+length]] = start
    else:  # prefer first
        for start in xrange(len(seq) - length, -1, -1):
            unique[seq[start:start+length]] = start
    return unique
# or find all locations for each slice:
import collections

def slices(seq, length):
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[seq[start:start+length]].append(start)
    return unique
This function (currently in my iter_util module) is O(n) (n being the length of each word) and you would use set(slices(..)) (with set operations such as difference) to get slices unique across all words (example below). You could also write the function to return a set, if you don't want to track locations. Memory usage will be high (though still O(n), just a large factor), possibly mitigated (though not by much if length is only 20) with a special "lazy slice" class that stores the base sequence (the string) plus start and stop (or start and length).
Printing unique slices:
a = set(slices("aab", 2)) # {"aa", "ab"}
b = set(slices("abb", 2)) # {"ab", "bb"}
c = set(slices("abc", 2)) # {"ab", "bc"}
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (x for x in all if x is not a), a)
print a_unique # {"aa"}
Including locations:
a = slices("aab", 2)
b = slices("abb", 2)
c = slices("abc", 2)
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (set(x) for x in all if x is not a), set(a))
# a_unique is only the keys so far
a_unique = dict((k, a[k]) for k in a_unique)
# now it's a dict of slice -> location(s)
print a_unique # {"aa": 0} or {"aa": [0]}
# (depending on which slices function used)
In a test script closer to your conditions, using randomly generated words of 5M characters and a slice length of 20, memory usage was so high that my test script quickly hit my 1G main memory limit and started thrashing virtual memory. At that point Python spent very little time on the CPU and I killed it. Reducing either the slice length or the word length (since I used completely random words, which reduces duplicates and increases memory use) so that it fit within main memory, it ran in under a minute. This situation plus the O(n**2) in your original code will take forever, which is why algorithmic time and space complexity are both important.
import operator
import random
import string

def slices(seq, length):
    unique = {}
    for start in xrange(len(seq) - length, -1, -1):
        unique[seq[start:start+length]] = start
    return unique

def sample_with_repeat(population, length, choice=random.choice):
    return "".join(choice(population) for _ in xrange(length))

word_length = 5*1000*1000
words = [sample_with_repeat(string.lowercase, word_length) for _ in xrange(3)]
slice_length = 20
words_slices_sets = [set(slices(x, slice_length)) for x in words]
unique_words_slices = [reduce(operator.sub,
                              (x for x in words_slices_sets if x is not n),
                              n)
                       for n in words_slices_sets]
print [len(x) for x in unique_words_slices]
You say you have a "word" 5 million characters long, but I find it hard to believe this is a word in the usual sense.
If you can provide more information about your input data, a specific solution might be available.
For example, English text (or any other written language) might be sufficiently repetitive that a trie would be useable. In the worst case however, it would run out of memory constructing all 256^20 keys. Knowing your inputs makes all the difference.
edit
I took a look at some genome data to see how this idea stacked up, using a hardcoded [acgt]->[0123] mapping and 4 children per trie node.
adenovirus 2: 35,937bp -> 35,899 distinct 20-base sequences using 469,339 trie nodes
enterobacteria phage lambda: 48,502bp -> 40,921 distinct 20-base sequences using 529,384 trie nodes.
I didn't get any collisions, either within or between the two data sets, although maybe there is more redundancy and/or overlap in your data. You'd have to try it to see.
If you do get a useful number of collisions, you could try walking the three inputs together, building a single trie, recording the origin of each leaf and pruning collisions from the trie as you go.
If you can't find some way to prune the keys, you could try using a more compact representation. For example you only need 2 bits to store [acgt]/[0123], which might save you space at the cost of slightly more complex code.
I don't think you can just brute force this though - you need to find some way to reduce the scale of the problem, and that depends on your domain knowledge.
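As a concrete illustration of the 2-bit idea (a sketch; BASE_CODE and pack are made-up names), a 20-base slice fits into a single 40-bit integer:
BASE_CODE = {'a': 0, 'c': 1, 'g': 2, 't': 3}   # 2 bits per base

def pack(seq):
    value = 0
    for base in seq:
        value = (value << 2) | BASE_CODE[base]
    return value

print pack('acgtacgtacgtacgtacgt')   # one small int instead of a 20-byte string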
Let me build off Roger Pate's answer. If memory is an issue, I'd suggest that instead of using the strings as the keys to the dictionary, you could use a hashed value of the string. This would save the cost of storing the extra copy of the strings as the keys (at worst, 20 times the storage of an individual "word").
import collections

def hashed_slices(seq, length, hasher=None):
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[hasher(seq[start:start+length])].append(start)
    return unique
(If you really want to get fancy, you can use a rolling hash, though you'll need to change the function.)
Now, we can combine all the hashes :
unique = []  # Unique words in first string

# create a dictionary of hash values -> word index -> start positions
hashed_starts = [hashed_slices(word, 20, hashing_fcn) for word in words]
all_hashed = collections.defaultdict(dict)
for i, hashed in enumerate(hashed_starts):
    for h, starts in hashed.iteritems():
        # We only care about the first word
        if h in hashed_starts[0]:
            all_hashed[h][i] = starts

# Now check all hashes
for starts_by_word in all_hashed.itervalues():
    if len(starts_by_word) == 1:
        # if there's only one word for the hash, it's obviously valid
        # (and, given the filter above, that word is the first one)
        unique.extend(words[0][j:j+20] for j in starts_by_word.values()[0])
    else:
        # we might have a hash collision
        candidates = {}
        for word_idx, starts in starts_by_word.iteritems():
            candidates[word_idx] = set(words[word_idx][j:j+20] for j in starts)
        # Now that we have the candidate slices, find the unique ones
        valid = candidates[0]
        for word_idx, candidate_set in candidates.iteritems():
            if word_idx != 0:
                valid -= candidate_set
        unique.extend(valid)
(I tried extending it to do all three. It's possible, but the complications would detract from the algorithm.)
Be warned, I haven't tested this. Also, there's probably a lot you can do to simplify the code, but the algorithm makes sense. The hard part is choosing the hash. Too many collisions and you won't gain anything. Too few and you'll hit the memory problems. If you are dealing with just DNA base codes, you can hash the 20-character string to a 40-bit number and still have no collisions. So the slices will take up nearly a fourth of the memory. That would save roughly 250 MB of memory in Roger Pate's answer.
The code is still O(N^2), but the constant should be much lower.
Let's attempt to improve on Roger Pate's excellent answer.
Firstly, let's keep sets instead of dictionaries - they manage uniqueness anyway.
Secondly, since we are likely to run out of memory faster than we run out of CPU time (and patience), we can sacrifice CPU efficiency for the sake of memory efficiency. So perhaps try only the 20-character slices starting with one particular letter. For DNA, this cuts the requirements down by 75%.
seqlen = 20
seqtrie = {}  # sequence -> list of word ids that contain it
maxlength = max([len(word) for word in words])
for startletter in letters:  # e.g. letters = 'acgt'
    for letterid in range(maxlength):
        for wordid, word in enumerate(words):
            if letterid < len(word):
                letter = word[letterid]
                if letter == startletter:
                    seq = word[letterid:letterid+seqlen]
                    if seq not in seqtrie:
                        seqtrie[seq] = []
                    if wordid not in seqtrie[seq]:
                        seqtrie[seq].append(wordid)
Or, if that's still too much memory, we can go through for each possible starting pair (16 passes instead of 4 for DNA), or every 3 (64 passes) etc.
