Python Puzzle code review(spoiler) - python

I have been working on the problems presented in Python Challenge. One of the problems asks you to sift through a mess of characters and pick out the rarest character(s).
My approach was to read the characters from a text file and store each character and its occurrence count as a key/value pair in a dictionary. I then sort the dictionary by value and invert it, so that the occurrence count is the key and the list of characters is the value. Assuming that the rarest character occurs only once, I return the values where the key of this inverted dictionary equals one.
The input (funkymess.txt) looks like this:
%%$#$^_#)^)&!_+]!*#&^}##%%+$&[(_#%+%$*^#$^!+]!&#)*}{}}!}]$[%}#[{##_^{*......
The code is as follows:
from operator import itemgetter
characterDict = dict()
#put the characters in a dictionary
def putEncounteredCharactersInDictionary(lineStr):
    for character in lineStr:
        if character in characterDict:
            characterDict[character] = characterDict[character]+1
        else:
            characterDict[character] = 1
#Sort the character dictionary
def sortCharacterDictionary(characterDict):
    sortCharDict = dict()
    sortsortedDictionaryItems = sorted(characterDict.iteritems(),key = itemgetter(1))
    for key, value in sortsortedDictionaryItems:
        sortCharDict[key] = value
    return sortCharDict
#invert the sorted character dictionary
def inverseSortedCharacterDictionary(sortedCharDict):
    inv_map = dict()
    for k, v in sortedCharDict.iteritems():
        inv_map[v] = inv_map.get(v, [])
        inv_map[v].append(k)
    return inv_map
f = open('/Users/Developer/funkymess.txt','r')
for line in f:
    #print line
    processline = line.rstrip('\n')
    putEncounteredCharactersInDictionary(processline)
f.close()
sortedCharachterDictionary = sortCharacterDictionary(characterDict)
#print sortedCharachterDictionary
inversedSortedCharacterDictionary = inverseSortedCharacterDictionary(sortedCharachterDictionary)
print inversedSortedCharacterDictionary[1]
Can somebody take a look and tell me whether I am on the right track here, and if possible provide some feedback on possible optimizations, best practices, and potential refactorings, from both a language and an algorithmic standpoint?
Thanks

Refactoring: A Walkthrough
I want to walk you through the process of refactoring. Learning to program is not just about knowing the end result, which is what you usually get when you ask a question on Stack Overflow. It's about how to get to that answer yourself. When people post short, dense answers to a question like this it's not always obvious how they arrived at their solutions.
So let's do some refactoring and see what we can do to simplify your code. We'll rewrite, delete, rename, and rearrange code until no more improvements can be made.
Simplify your algorithms
Python need not be so verbose. It is usually a code smell when you have explicit loops operating over lists and dicts in Python, rather than using list comprehensions and functions that operate on containers as a whole.
Use defaultdict to store character counts
A defaultdict(int) will generate entries when they are accessed if they do not exist. This lets us eliminate the if/else branch when counting characters.
from collections import defaultdict
characterDict = defaultdict(int)
def putEncounteredCharactersInDictionary(lineStr):
    for character in lineStr:
        characterDict[character] += 1
Sorting dicts
Dictionaries don't guarantee any ordering on their keys. You cannot assume that the items are stored in the same order that you insert them. So sorting the dict entries and then putting them right back into another dict just scrambles them right back up.
This means that your function is basically a no-op. After you sort the items you will need to keep them as a list of tuples to retain their sorting order. Removing that code we can then reduce this method down to a single line.
def sortCharacterDictionary(characterDict):
    return sorted(characterDict.iteritems(), key=itemgetter(1))
Inverting dicts
Given the previous comment you won't actually have a dict any more after sorting. But assuming you did, this function is one of those cases where explicit looping is discouraged. In Python, always be thinking how you can operate over collections all at once rather than one item at a time.
def inverseSortedCharacterDictionary(sortedCharDict):
    return dict((v, k) for k, v in sortedCharDict.iteritems())
All in one line we (1) iterate over the key/value pairs in the dict; (2) switch them and create inverted value/key tuples; (3) create a dict out of these inverted tuples.
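One caveat with the one-liner: dict keys are unique, so if several characters share the same count only one of them survives the inversion, whereas the original loop collected them in a list. A minimal sketch of a grouping version, written for Python 3 (items() rather than iteritems()); the name invert_grouped and the sample counts are made up for illustration:

```python
from collections import defaultdict

def invert_grouped(counts):
    """Map each count to the list of keys that have that count."""
    inverted = defaultdict(list)
    for key, count in counts.items():
        inverted[count].append(key)
    return dict(inverted)

# Hypothetical sample data: two characters occur exactly once.
sample = {'a': 1, 'b': 1, '#': 5}
grouped = invert_grouped(sample)
print(grouped[1])
```

This keeps every character for a given count, at the cost of the explicit loop that the one-liner avoids.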
Comment and name wisely
Your method names are long and descriptive. There's no need to repeat the same information in comments. Use comments only when your code isn't self-descriptive, such as when you have a complex algorithm or an unusual construct that isn't immediately obvious.
On the naming front, your names are unnecessarily long. I would stick with far less descriptive names, and also make them more generic. Instead of inverseSortedCharacterDictionary, try just invertedDict. That's all that method does, it inverts a dict. It doesn't actually matter if it's passed a sorted character dict or any other type of dict.
As a rule of thumb, try to use the most generic names possible so that your methods and variables can be as generic as possible. More generic means more reusable.
characters = defaultdict(int)
def countCharacters(string):
    for ch in string:
        characters[ch] += 1
def sortedCharacters(characters):
    return sorted(characters.iteritems(), key=itemgetter(1))
def invertedDict(d):
    return dict((v, k) for k, v in d.iteritems())
Reduce volume
Using temporary variables and helper methods is a good programming practice, and I applaud you for doing so in your program. However, now that we have them simple enough that each one is only one or two lines we probably don't even need them any more.
Here's your program body after changing the functions as above:
f = open('funkymess.txt', 'r')
for line in f:
    countCharacters(line.rstrip('\n'))
f.close()
print sortedCharacters(characters)[0]
And then let's just go ahead and inline those helper methods since they're so simple. Here's the final program after all the refactoring:
Final program
#!/usr/bin/env python
from operator import itemgetter
from collections import defaultdict
characters = defaultdict(int)
f = open('funkymess.txt','r')
for line in f:
    for ch in line.rstrip('\n'):
        characters[ch] += 1
f.close()
print sorted(characters.iteritems(), key=itemgetter(1))[0]

You don't even need as much code as that, because Python already has a class that counts elements in an iterable for you! The following does all of what you asked for.
from collections import Counter
counter = Counter(open(<...>).read())
print min(counter, key=counter.get)
Explanation:
collections is a standard module in Python containing some commonly-used data structures. In particular, it contains Counter, which is a subclass of dict designed to count the frequency of stuff. It takes an iterable and counts all the elements in it.
Now as you may know, in Python strings are iterables and their elements are the single characters. So we can open the file, read all its contents at once, and feed that large string into a Counter. This makes a dict-like object which maps characters to their frequencies.
Finally, we want to find the least frequent character, given this dictionary of frequencies. In other words, we want the minimum element of counter, ordered by its value in the dictionary. Python has a built-in function for taking the minimum of things, naturally called min. If you want to compare the elements by something other than their own value, you can pass it an optional key argument, and it will compare the results of calling key on each element. In this case, we ask min to find the element with the smallest counter.get result; in other words, the one with the lowest frequency! If several characters tie for the lowest count, min returns only the first one it encounters.
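If you would rather see the whole ranking, Counter also has a most_common() method, which returns (element, count) pairs from most to least frequent. A small Python 3 sketch; the text string is an invented stand-in for the file contents:

```python
from collections import Counter

# Illustrative text standing in for the contents of the puzzle file.
text = "%%$$##a%%$$##b"
counter = Counter(text)

ranking = counter.most_common()      # pairs sorted from most to least frequent
least_char, least_count = ranking[-1]
rarest = min(counter, key=counter.get)
print(ranking)
print(least_char, least_count, rarest)
```

Both approaches pick a single character when there is a tie, and not necessarily the same one.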

That's way too much code.
[k for k, v in characterdict.iteritems()
 if v == min(characterdict.items(), key=operator.itemgetter(1))[1]]
Optimize as desired (e.g. store the minimum in another variable first).
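A sketch of that optimization, in Python 3 syntax and with invented counts standing in for characterdict: computing the minimum once avoids rescanning the whole dict for every item.

```python
# Hypothetical counts standing in for characterdict.
characterdict = {'%': 5, '$': 5, '(': 1, '[': 1}

lowest = min(characterdict.values())   # computed once, not once per item
rarest = [k for k, v in characterdict.items() if v == lowest]
print(rarest)
```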

Here's the code that I used to solve this puzzle:
comment = open('comment.txt').read()
for c in sorted(set(comment)):
    print ' %-3s %6d' % (repr(c)[1:-1], comment.count(c))
It sorts characters alphabetically rather than by frequency, but the rarest characters are very easy to pick up from the output.
If I wanted frequency sorting, I'd use collections.Counter like katrielalex suggested (if I remembered about its existence), or
from collections import defaultdict
comment = open('comment.txt').read()
counts = defaultdict(int)
for c in comment:
    counts[c] += 1
for c in sorted(counts, key=counts.get):
    print ' %-3s %6d' % (repr(c)[1:-1], counts[c])

Another way (not very compact) to accomplish your task:
text = """%$#$^_#)^)&!_+]!*#&^}##%%+$&[(_#%+%$*^#$^!+]!&#)*}{}}!}"""
chars = set(text)
L = [(c, text.count(c)) for c in chars]
L.sort(key=lambda pair: pair[1])
>>> L
[('(', 1),
('[', 1),
('{', 1),
(']', 2),
(')', 3),
('*', 3),
('_', 3),
('&', 4),
('+', 4),
('!', 5),
('%', 5),
('$', 5),
('}', 5),
('^', 5),
('#', 8)]
>>>

Related

Python: Create sorted list of keys moving one key to the head

Is there a more pythonic way of obtaining a sorted list of dictionary keys with one key moved to the head? So far I have this:
# create a unique list of keys headed by 'event' and followed by a sorted list.
# dfs is a dict of dataframes.
for k in (dict.fromkeys(['event']+sorted(dfs))):
    display(k,dfs[k]) # ideally this should be (k,v)
I suppose you would be able to do
for k, v in list(dfs.items()) + [('event', None)]:
.items() casts a dictionary to a list of tuples (or technically a dict_items, which is why I have to cast it to list explicitly to append), to which you can append a second list. Iterating through a list of tuples allows for automatic unpacking (so you can do k,v in list instead of tup in list)
What we really want is an iterable, but that's not possible with sorted, because it must see all the keys before it knows what the first item should be.
Using dict.fromkeys to create a blank dictionary by insertion order was pretty clever, but it relies on dict preserving insertion order. That is only guaranteed from Python 3.7 onward; in CPython 3.6 it was an implementation detail, and in earlier versions dict is unordered. I admit, it took me a while to figure out that line.
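To make the trick easier to read, here is what that expression produces on a toy input. This is only a sketch assuming Python 3.7+; the dfs contents are made up and stand in for the dict of dataframes:

```python
# Hypothetical stand-in for the dict of dataframes.
dfs = {'zeta': 1, 'event': 2, 'alpha': 3}

# 'event' is inserted first; its later duplicate in the sorted list is ignored.
ordered_keys = list(dict.fromkeys(['event'] + sorted(dfs)))
print(ordered_keys)
```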
Since the code you posted is just working with the keys, I suggest you focus on that. Taking up a few more lines for readability is a good thing, especially if we can hide it in a testable function:
def display_by_keys(dfs, priority_items=None):
    if not priority_items:
        priority_items = ['event']
    # a list, not a set, so the priority keys keep their given order
    featured = [k for k in priority_items if k in dfs]
    others = {k for k in dfs.keys() if k not in featured}
    for key in featured + sorted(others):
        display(key, dfs[key])
The potential downside is you must sort the keys every time. If you do this much more often than the data store changes, on a large data set, that's a potential concern.
Of course you wouldn't be displaying a really large result, but if it becomes a problem, then you'll want to store them in a collections.OrderedDict (https://stackoverflow.com/a/13062357/1766544) or find a sorteddict module.
from collections import OrderedDict
# sort once
ordered_dfs = OrderedDict.fromkeys(sorted(dfs.keys()))
ordered_dfs.move_to_end('event', last=False)
ordered_dfs.update(dfs)
# display as often as you need
for k, v in ordered_dfs.items():
    print(k, v)
If you display different fields first in different views, that's not a problem. Just sort all the fields normally, and use a function like the one above, without the sort.

What is the difference between the solution that uses defaultdict and the one that uses setdefault?

In Think Python the author introduces defaultdict. The following is an excerpt from the book regarding defaultdict:
If you are making a dictionary of lists, you can often write simpler
code using defaultdict. In my solution to Exercise 12-2, which you can
get from http://thinkpython2.com/code/anagram_sets.py, I make a
dictionary that maps from a sorted string of letters to the list of
words that can be spelled with those letters. For example, 'opst' maps
to the list ['opts', 'post', 'pots', 'spot', 'stop', 'tops']. Here’s
the original code:
def all_anagrams(filename):
    d = {}
    for line in open(filename):
        word = line.strip().lower()
        t = signature(word)
        if t not in d:
            d[t] = [word]
        else:
            d[t].append(word)
    return d
This can be simplified using setdefault, which you might have used in Exercise 11-2:
def all_anagrams(filename):
    d = {}
    for line in open(filename):
        word = line.strip().lower()
        t = signature(word)
        d.setdefault(t, []).append(word)
    return d
This solution has the drawback that it makes a new list every time, regardless of whether it is needed. For lists, that’s no big deal, but if the factory function is complicated, it might be. We can avoid this problem and simplify the code using a defaultdict:
def all_anagrams(filename):
    d = defaultdict(list)
    for line in open(filename):
        word = line.strip().lower()
        t = signature(word)
        d[t].append(word)
    return d
Here's the definition of signature function:
def signature(s):
    """Returns the signature of this string.
    Signature is a string that contains all of the letters in order.
    s: string
    """
    # TODO: rewrite using sorted()
    t = list(s)
    t.sort()
    t = ''.join(t)
    return t
What I understand regarding the second solution is that setdefault checks whether t (the signature of the word) exists as a key, if not, it sets it as a key and sets an empty list as its value, then append appends the word to it. If t exists, setdefault returns its value (a list with at least one item, which is a string representing a word), and append appends the word to this list.
What I understand regarding the third solution is that d, which represents a defaultdict, makes t a key and sets an empty list as its value (if t doesn't already exist as a key), then the word is appended to the list. If t does already exist, its value (the list) is returned, and to which the word is appended.
What is the difference between the second and third solutions? What does it mean that the code in the second solution makes a new list every time, regardless of whether it's needed? How is setdefault responsible for that? How does using defaultdict let us avoid this problem? How are the second and third solutions different?
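The "makes a new list every time" means that every time setdefault(t, []) is called, a new empty list (the [] argument) is created to be the default value just in case it's needed, because Python evaluates a call's arguments before making the call. If t is already a key, that list is simply thrown away. Using a defaultdict avoids this, since it only calls list() when the key is missing.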
Although both solutions return a dictionary, the one using defaultdict is actually returning a defaultdict(list) which is a subclass of the built-in dict class. This normally is not a problem. The most notable effect will likely be if you print() the returned object, as the output from the two looks quite different.
If you don't want that for whatever reason, you can change the last statement of the function to:
return dict(d)
to convert the defaultdict(list) created into a regular dict.
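The "new list every time" point can be made visible by swapping the list for a factory that counts its own calls. This is only an illustrative sketch; make_list, calls and the word list are invented names, not part of the book's code:

```python
from collections import defaultdict

calls = {'setdefault': 0, 'defaultdict': 0}

def make_list(label):
    """Hypothetical factory that records how often it runs."""
    calls[label] += 1
    return []

words = ['post', 'stop', 'spot']   # all share the signature 'opst'

d1 = {}
for w in words:
    # The default argument is evaluated before setdefault is even called.
    d1.setdefault('opst', make_list('setdefault')).append(w)

d2 = defaultdict(lambda: make_list('defaultdict'))
for w in words:
    # The factory only runs when the key is missing.
    d2['opst'].append(w)

print(calls)
```

Both dictionaries end up with the same contents; only the number of factory calls differs (once per word for setdefault, once per new key for defaultdict).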

Right way to initialize an OrderedDict using its constructor such that it retains order of initial data?

What's the correct way to initialize an ordered dictionary (OD) so that it retains the order of initial data?
from collections import OrderedDict
# Obviously wrong because regular dict loses order
d = OrderedDict({'b':2, 'a':1})
# An OD is represented by a list of tuples, so would this work?
d = OrderedDict([('b',2), ('a', 1)])
# What about using a list comprehension, will 'd' preserve the order of 'l'
l = ['b', 'a', 'c', 'aa']
d = OrderedDict([(i,i) for i in l])
Question:
Will an OrderedDict preserve the order of a list of tuples, or tuple of tuples or tuple of lists or list of lists etc. passed at the time of initialization (2nd & 3rd example above)?
How does one go about verifying if OrderedDict actually maintains an order? Since a dict has an unpredictable order, what if my test vectors luckily have the same initial order as the unpredictable order of a dict? For example, if instead of d = OrderedDict({'b':2, 'a':1}) I write d = OrderedDict({'a':1, 'b':2}), I can wrongly conclude that the order is preserved. In this case, I found out that a dict is ordered alphabetically, but that may not be always true. What's a reliable way to use a counterexample to verify whether a data structure preserves order or not, short of trying test vectors repeatedly until one breaks?
P.S. I'll just leave this here for reference: "The OrderedDict constructor and update() method both accept keyword arguments, but their order is lost because Python’s function call semantics pass-in keyword arguments using a regular unordered dictionary"
P.P.S : Hopefully, in future, OrderedDict will preserve the order of kwargs also (example 1): http://bugs.python.org/issue16991
The OrderedDict will preserve any order that it has access to. The only way to pass ordered data to it to initialize is to pass a list (or, more generally, an iterable) of key-value pairs, as in your last two examples. As the documentation you linked to says, the OrderedDict does not have access to any order when you pass in keyword arguments or a dict argument, since any order there is removed before the OrderedDict constructor sees it.
Note that using a list comprehension in your last example doesn't change anything. There's no difference between OrderedDict([(i,i) for i in l]) and OrderedDict([('b', 'b'), ('a', 'a'), ('c', 'c'), ('aa', 'aa')]). The list comprehension is evaluated and creates the list and it is passed in; OrderedDict knows nothing about how it was created.
# An OD is represented by a list of tuples, so would this work?
d = OrderedDict([('b', 2), ('a', 1)])
Yes, that will work. By definition, a list is always ordered the way it is represented. This goes for list-comprehension too, the list generated is in the same way the data was provided (i.e. source from a list it will be deterministic, sourced from a set or dict not so much).
How does one go about verifying if OrderedDict actually maintains an order. Since a dict has an unpredictable order, what if my test vectors luckily has the same initial order as the unpredictable order of a dict?. For example, if instead of d = OrderedDict({'b':2, 'a':1}) I write d = OrderedDict({'a':1, 'b':2}), I can wrongly conclude that the order is preserved. In this case, I found out that a dict is order alphabetically, but that may not be always true. i.e. what's a reliable way to use a counter example to verify if a data structure preserves order or not short of trying test vectors repeatedly until one breaks.
You keep your source list of 2-tuple around for reference, and use that as your test data for your test cases when you do unit tests. Iterate through them and ensure the order is maintained.
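A minimal sketch of such a check. The pairs are deliberately chosen so that the expected order is neither alphabetical nor numeric, and the reversed input is checked as well, so a structure that merely sorted its keys could not pass both assertions:

```python
from collections import OrderedDict

# Keys deliberately not in alphabetical or numeric order.
pairs = [('b', 2), ('a', 1), ('c', 3), ('aa', 0)]
d = OrderedDict(pairs)

assert list(d.items()) == pairs
# Reversing the input must reverse the iteration order as well.
assert list(OrderedDict(reversed(pairs)).items()) == pairs[::-1]
```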
It is also possible (and a little more efficient) to use a generator expression:
d = OrderedDict((i, i) for i in l)
Obviously, the benefit is negligible in this trivial case for l, but if l corresponds to an iterator or was yielding results from a generator, e.g. used to parse and iterate through a large file, then the difference could be very substantial (e.g. it avoids loading the entire contents into memory). For example:
def mygen(filepath):
    with open(filepath, 'r') as f:
        for line in f:
            yield [int(field) for field in line.split()]
d = OrderedDict((i, sum(numbers)) for i, numbers in enumerate(mygen(filepath)))

Syntax of Lists in Python

I am learning Python, and I came across a code snippet which looks like this:
my_name={'sujit','amit','ajit','arijit'}
for i, names in enumerate(my_name):
    print "%s" %(names[i])
OUTPUT
s
m
i
t
But when I modify the code as:
my_name=['sujit','amit','ajit','arijit']
for i, names in enumerate(my_name):
    print "%s" %(names[i])
OUTPUT
s
m
i
j
What is the difference between {} and []? The [] version gives me the desired result, printing the ith character of the current name from the list. But the {} version does not.
{} creates a set, whereas [] creates a list. The key differences are:
the list preserves the order, whereas the set does not;
the list preserves duplicates, whereas the set does not;
the list can be accessed through indexing (i.e. l[5]), whereas the set can not.
The first point holds the key to your puzzle. When you use a list, the loop iterates over the names in order. When you're using a set, the loop iterates over the elements in an unspecified order, which in my Python interpreter happens to be sujit, amit, arijit, ajit.
P.S. {} can also be used to create a dictionary: {'a':1, 'b':2, 'c':3}.
The {} notation is set notation rather than list notation. That is basically the same as a list, but the items are stored in a jumbled up order, and duplicate elements are removed. (To make things even more confusing, {} is also dictionary syntax, but only when you use colons to separate keys and values -- the way you are using it, is a set.)
Secondly, you aren't using enumerate properly. (Or maybe you are, but I'm not sure...)
enumerate gives you corresponding index and value pairs. So enumerate(['sujit','amit','ajit','arijit']) gives you:
[(0, 'sujit'), (1, 'amit'), (2, 'ajit'), (3, 'arijit')]
So this will get you the first letter of "sujit", the second letter of "amit", and so on. Is that what you wanted?
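Spelled out for the list version (Python 3 syntax, same data as in the question):

```python
my_name = ['sujit', 'amit', 'ajit', 'arijit']

pairs = list(enumerate(my_name))
letters = [name[i] for i, name in pairs]   # i-th letter of the i-th name
print(pairs)
print(letters)
```

With a set instead of a list, the indices are the same but the names arrive in an unspecified order, so the letters picked can differ.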
{} do not enclose a list. They do not enclose any kind of sequence; they enclose (when used this way) a set (in the mathematical sense). The elements of a set do not have a specified order, so you get them enumerated in whatever order Python put them in. (It does this so that it can efficiently ensure the other important constraint on sets: they cannot contain a duplicate value).
Set literals like this require Python 2.7 or 3.x. In earlier 2.x versions, {} cannot be used to create a set, but only to create a dict. (Dict literals work in every version.) To create a dict, you specify the key-value pairs separated by colons, thus: {'sujit': 'amit', 'ajit': 'arijit'}.
(Also, a general note: if you say "question" instead everywhere that you currently say "doubt", you will be wrong much less often, at least per the standards of English as spoken outside of India. I don't particularly understand how the overuse of 'doubt' has become so common in English as spoken by those from India, but I've seen it in many places across the Internet...)
sets do not preserve order:
[] is a list:
>>> print ['sujit','amit','ajit','arijit']
['sujit', 'amit', 'ajit', 'arijit']
{} is a set:
>>> print {'sujit','amit','ajit','arijit'}
set(['sujit', 'amit', 'arijit', 'ajit'])
So you get s,m,i,j in the first case; s,m,i,t in the second.

Memory Efficient Alternatives to Python Dictionaries

In one of my current side projects, I am scanning through some text looking at the frequency of word triplets. In my first go at it, I used the default dictionary three levels deep. In other words, topDict[word1][word2][word3] returns the number of times these words appear in the text, topDict[word1][word2] returns a dictionary with all the words that appeared following words 1 and 2, etc.
This functions correctly, but it is very memory intensive. In my initial tests it used something like 20 times the memory of just storing the triplets in a text file, which seems like an overly large amount of memory overhead.
My suspicion is that many of these dictionaries are being created with many more slots than are actually being used, so I want to replace the dictionaries with something else that is more memory efficient when used in this manner. I would strongly prefer a solution that allows key lookups along the lines of the dictionaries.
From what I know of data structures, a balanced binary search tree using something like red-black or AVL would probably be ideal, but I would really prefer not to implement them myself. If possible, I'd prefer to stick with standard python libraries, but I'm definitely open to other alternatives if they would work best.
So, does anyone have any suggestions for me?
Edited to add:
Thanks for the responses so far. A few of the answers so far have suggested using tuples, which didn't really do much for me when I condensed the first two words into a tuple. I am hesitant to use all three as a key since I want it to be easy to look up all third words given the first two. (i.e. I want something like the result of topDict[word1, word2].keys()).
The current dataset I am playing around with is the most recent version of Wikipedia For Schools. The results of parsing the first thousand pages, for example, is something like 11MB for a text file where each line is the three words and the count all tab separated. Storing the text in the dictionary format I am now using takes around 185MB. I know that there will be some additional overhead for pointers and whatnot, but the difference seems excessive.
Some measurements. I took 10MB of free e-book text and computed trigram frequencies, producing a 24MB file. Storing it in different simple Python data structures took this much space in kB, measured as RSS from running ps, where d is a dict, keys and freqs are lists, and a,b,c,freq are the fields of a trigram record:
295760 S. Lott's answer
237984 S. Lott's with keys interned before passing in
203172 [*] d[(a,b,c)] = int(freq)
203156 d[a][b][c] = int(freq)
189132 keys.append((a,b,c)); freqs.append(int(freq))
146132 d[intern(a),intern(b)][intern(c)] = int(freq)
145408 d[intern(a)][intern(b)][intern(c)] = int(freq)
83888 [*] d[a+' '+b+' '+c] = int(freq)
82776 [*] d[(intern(a),intern(b),intern(c))] = int(freq)
68756 keys.append((intern(a),intern(b),intern(c))); freqs.append(int(freq))
60320 keys.append(a+' '+b+' '+c); freqs.append(int(freq))
50556 pair array
48320 squeezed pair array
33024 squeezed single array
The entries marked [*] have no efficient way to look up a pair (a,b); they're listed only because others have suggested them (or variants of them). (I was sort of irked into making this because the top-voted answers were not helpful, as the table shows.)
'Pair array' is the scheme below in my original answer ("I'd start with the array with keys being the first two words..."), where the value table for each pair is represented as a single string. 'Squeezed pair array' is the same, leaving out the frequency values that are equal to 1 (the most common case). 'Squeezed single array' is like squeezed pair array, but gloms key and value together as one string (with a separator character). The squeezed single array code:
import collections
def build(file):
    pairs = collections.defaultdict(list)
    for line in file: # N.B. file assumed to be already sorted
        a, b, c, freq = line.split()
        key = ' '.join((a, b))
        pairs[key].append(c + ':' + freq if freq != '1' else c)
    out = open('squeezedsinglearrayfile', 'w')
    for key in sorted(pairs.keys()):
        out.write('%s|%s\n' % (key, ' '.join(pairs[key])))
def load():
    return open('squeezedsinglearrayfile').readlines()
if __name__ == '__main__':
    build(open('freqs'))
I haven't written the code to look up values from this structure (use bisect, as mentioned below), or implemented the fancier compressed structures also described below.
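For what it's worth, a lookup could look roughly like the following. This is a sketch, not the code used for the measurements: it assumes the lines have been sorted as whole strings (hence the sorted() call), that no word contains ':' or '|', and the lookup name and sample records are invented.

```python
import bisect

def lookup(lines, a, b):
    """Return {third_word: freq} for the pair (a, b), or {} if absent.

    `lines` must be sorted as plain strings, each like 'a b|c:2 d'.
    """
    probe = '%s %s|' % (a, b)
    i = bisect.bisect_left(lines, probe)
    if i == len(lines) or not lines[i].startswith(probe):
        return {}
    result = {}
    for item in lines[i][len(probe):].split():
        word, sep, freq = item.partition(':')
        result[word] = int(freq) if sep else 1   # omitted freq means 1
    return result

# Hypothetical records in the format written by build().
lines = sorted(['hello world|again:3 there\n', 'hello george|said\n'])
print(lookup(lines, 'hello', 'world'))
```

Parsing the value string on every lookup trades speed for the memory saved by not keeping a dict per pair.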
Original answer: A simple sorted array of strings, each string being a space-separated concatenation of words, searched using the bisect module, should be worth trying for a start. This saves space on pointers, etc. It still wastes space due to the repetition of words; there's a standard trick to strip out common prefixes, with another level of index to get them back, but that's rather more complex and slower. (The idea is to store successive chunks of the array in a compressed form that must be scanned sequentially, along with a random-access index to each chunk. Chunks are big enough to compress, but small enough for reasonable access time. The particular compression scheme applicable here: if successive entries are 'hello george' and 'hello world', make the second entry be '6world' instead. (6 being the length of the prefix in common.) Or maybe you could get away with using zlib? Anyway, you can find out more in this vein by looking up dictionary structures used in full-text search.) So specifically, I'd start with the array with keys being the first two words, with a parallel array whose entries list the possible third words and their frequencies. It might still suck, though -- I think you may be out of luck as far as batteries-included memory-efficient options.
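The prefix-stripping trick can be sketched as follows. This is only an illustration of the idea, without the chunking and random-access index: it stores (prefix length, suffix) tuples rather than strings like '6world', to sidestep words that start with digits, and the function names are made up.

```python
def front_encode(entries):
    """Prefix-compress a sorted list of strings."""
    out, prev = [], ''
    for s in entries:
        n = 0
        while n < min(len(s), len(prev)) and s[n] == prev[n]:
            n += 1
        out.append((n, s[n:]))   # (shared prefix length, remaining suffix)
        prev = s
    return out

def front_decode(encoded):
    """Rebuild the original strings by scanning sequentially."""
    out, prev = [], ''
    for n, suffix in encoded:
        prev = prev[:n] + suffix
        out.append(prev)
    return out

entries = ['hello george', 'hello world']
encoded = front_encode(entries)
print(encoded)
```

Decoding has to start from the beginning of a chunk, which is why the scheme needs chunks small enough for reasonable access time.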
Also, binary tree structures are not recommended for memory efficiency here. E.g., this paper tests a variety of data structures on a similar problem (unigrams instead of trigrams though) and finds a hashtable to beat all of the tree structures by that measure.
I should have mentioned, as someone else did, that the sorted array could be used just for the wordlist, not bigrams or trigrams; then for your 'real' data structure, whatever it is, you use integer keys instead of strings -- indices into the wordlist. (But this keeps you from exploiting common prefixes except in the wordlist itself. Maybe I shouldn't suggest this after all.)
Use tuples.
Tuples can be keys to dictionaries, so you don't need to nest dictionaries.
d = {}
d[ word1, word2, word3 ] = 1
Also, as a plus, you could use defaultdict, so that elements that don't have entries always return 0, and so that you can say d[w1,w2,w3] += 1 without checking whether the key already exists.
example:
from collections import defaultdict
d = defaultdict(int)
d["first","word","tuple"] += 1
If you need to find all words "word3" that are tupled with (word1,word2) then search for it in dictionary.keys() using list comprehension
if you have a tuple, t, you can get the first two items using slices:
>>> a = (1,2,3)
>>> a[:2]
(1, 2)
a small example for searching tuples with list comprehensions:
>>> b = [(1,2,3),(1,2,5),(3,4,6)]
>>> search = (1,2)
>>> [a[2] for a in b if a[:2] == search]
[3, 5]
You see here, we got a list of all items that appear as the third item in the tuples that start with (1,2)
In this case, ZODB¹ BTrees might be helpful, since they are much less memory-hungry. Use a BTrees.OOBTree (Object keys to Object values) or BTrees.OIBTree (Object keys to Integer values), and use 3-word tuples as your key.
Something like:
from BTrees.OOBTree import OOBTree as BTree
The interface is, more or less, dict-like, with the added bonus (for you) that .keys, .items, .iterkeys and .iteritems have two min, max optional arguments:
>>> t=BTree()
>>> t['a', 'b', 'c']= 10
>>> t['a', 'b', 'z']= 11
>>> t['a', 'a', 'z']= 12
>>> t['a', 'd', 'z']= 13
>>> print list(t.keys(('a', 'b'), ('a', 'c')))
[('a', 'b', 'c'), ('a', 'b', 'z')]
¹ Note that if you are on Windows and work with Python >2.4, I know there are packages for more recent python versions, but I can't recollect where.
PS They exist in the CheeseShop ☺
A couple attempts:
I figure you're doing something similar to this:
from __future__ import with_statement
import time
from collections import deque, defaultdict
# Just used to generate some triples of words
def triplegen(words="/usr/share/dict/words"):
    d=deque()
    with open(words) as f:
        for i in range(3):
            d.append(f.readline().strip())
        while d[-1] != '':
            yield tuple(d)
            d.popleft()
            d.append(f.readline().strip())
if __name__ == '__main__':
    class D(dict):
        def __missing__(self, key):
            self[key] = D()
            return self[key]
    h=D()
    for a, b, c in triplegen():
        h[a][b][c] = 1
    time.sleep(60)
That gives me ~88MB.
Changing the storage to
h[a, b, c] = 1
takes ~25MB
interning a, b, and c makes it take about 31MB. My case is a bit special because my words never repeat on the input. You might try some variations yourself and see if one of these helps you.
Are you implementing Markovian text generation?
If your chains map 2 words to the probabilities of the third I'd use a dictionary mapping K-tuples to the 3rd-word histogram. A trivial (but memory-hungry) way to implement the histogram would be to use a list with repeats, and then random.choice gives you a word with the proper probability.
Here's an implementation with the K-tuple as a parameter:
import random

# can change these functions to use a dict-based histogram
# instead of a list with repeats
def default_histogram(): return []
def add_to_histogram(item, hist): hist.append(item)
def choose_from_histogram(hist): return random.choice(hist)

K=2 # look 2 words back
words = ...
d = {}

# build histograms
for i in xrange(len(words)-K):
    # tuple() keeps the key hashable even if words is a list
    key = tuple(words[i:i+K])
    word = words[i+K]
    d.setdefault(key, default_histogram())
    add_to_histogram(word, d[key])

# generate text
start = random.randrange(len(words)-K)
key = tuple(words[start:start+K])
# NUM_WORDS_TO_GENERATE is the desired output length, set by you
for i in xrange(NUM_WORDS_TO_GENERATE):
    word = choose_from_histogram(d[key])
    print word,
    key = key[1:] + (word,)
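The comment in the code mentions a dict-based histogram as the less memory-hungry alternative. Here is one possible sketch of that variant in Python 3, using collections.Counter and random.choices (3.6+). The names build_chain and generate and the sample sentence are mine, purely for illustration, and unlike the loop above this version stops quietly when it reaches a key with no recorded successor.

```python
import random
from collections import Counter, defaultdict

def build_chain(words, k=2):
    # Map each k-word tuple to a Counter of the words that follow it
    chain = defaultdict(Counter)
    for i in range(len(words) - k):
        chain[tuple(words[i:i + k])][words[i + k]] += 1
    return chain

def choose_from_histogram(hist, rng=random):
    # Weighted pick: each word is drawn in proportion to its count
    candidates = list(hist)
    weights = [hist[w] for w in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def generate(chain, start_key, n, rng=random):
    # Walk the chain for up to n words, stopping at a key with no successors
    key, out = tuple(start_key), []
    for _ in range(n):
        hist = chain.get(key)
        if not hist:
            break
        word = choose_from_histogram(hist, rng)
        out.append(word)
        key = key[1:] + (word,)
    return out

words = "the quick brown fox jumps over the quick red fox".split()
chain = build_chain(words)
print(chain[("the", "quick")])
print(" ".join(generate(chain, ("the", "quick"), 5)))
```

Each distinct successor is stored once with a count, so a frequent triple costs one dict entry instead of many list slots.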
You could try using the same dictionary, but only one level deep.
topDictionary[word1+delimiter+word2+delimiter+word3]
The delimiter could be a plain " " (or use the tuple (word1,word2,word3) as the key).
This would be the easiest to implement.
I believe you will see a little improvement. If it is not enough...
...I'll think of something...
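A minimal sketch of the one-level idea, using the tuple form of the key (a tuple cannot collide the way a joined string can if a word happens to contain the delimiter). The function names count_triples and followers and the toy word list are assumptions of mine, not part of the original question.

```python
from collections import Counter

def count_triples(words):
    # Flat, one-level mapping: (word1, word2, word3) -> occurrences
    return Counter(zip(words, words[1:], words[2:]))

def followers(counts, word1, word2):
    # Without nesting, finding the third words for a pair means scanning all keys
    return {c: n for (a, b, c), n in counts.items() if (a, b) == (word1, word2)}

words = "a b c a b d a b c".split()
counts = count_triples(words)
print(counts["a", "b", "c"])
print(followers(counts, "a", "b"))
```

The trade-off is visible in followers: the flat layout saves the inner dictionaries but gives up the cheap "everything after word1, word2" lookup.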
Ok, so you are basically trying to store a sparse 3D space. The kind of access pattern you need for this space is crucial for the choice of algorithm and data structure. Considering your data source, do you want to feed this to a grid? If you don't need O(1) access:
To get memory efficiency, you want to subdivide that space into subspaces with a similar number of entries (like a BTree). So, a data structure with:
firstWordRange
secondWordRange
thirdWordRange
numberOfEntries
a sorted block of entries.
next and previous blocks in all 3 dimensions
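As a much simpler stand-in for the block structure listed above, here is a sketch that keeps every triple in one sorted list and uses bisect to find a range. It is not the linked-block design itself (no splitting, no neighbour links), and the class name SortedTriples and method with_prefix are hypothetical; it only illustrates trading O(1) access for a compact sorted layout.

```python
import bisect

class SortedTriples:
    """Hypothetical stand-in: one sorted list instead of linked blocks."""

    def __init__(self, triples):
        self.entries = sorted(set(triples))

    def with_prefix(self, *prefix):
        # Binary search for the first candidate, then scan over the matches
        lo = bisect.bisect_left(self.entries, prefix)
        hi = lo
        while hi < len(self.entries) and self.entries[hi][:len(prefix)] == prefix:
            hi += 1
        return self.entries[lo:hi]

s = SortedTriples([("a", "b", "c"), ("a", "b", "z"), ("a", "a", "z"), ("a", "d", "z")])
print(s.with_prefix("a", "b"))
```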
Scipy has sparse matrices, so if you can make the first two words a tuple, you can do something like this:
import numpy as N
from scipy import sparse

# word_count (vocabulary size) and triple_list (iterable of word triples)
# are expected to be defined elsewhere
word_index = {}
count = sparse.lil_matrix((word_count*word_count, word_count), dtype=N.int64)
for word1, word2, word3 in triple_list:
    w1 = word_index.setdefault(word1, len(word_index))
    w2 = word_index.setdefault(word2, len(word_index))
    w3 = word_index.setdefault(word3, len(word_index))
    w1_w2 = w1 * word_count + w2
    count[w1_w2,w3] += 1
If memory is simply not big enough, pybsddb can help store a disk-persistent map.
You could use a numpy multidimensional array. You'll need to use numbers rather than strings to index into the array, but that can be solved by using a single dict to map words to numbers.
import numpy
w = {'word1':0, 'word2':1, 'word3':2, 'word4':3}
a = numpy.zeros( (4,4,4) )
Then to index into your array, you'd do something like:
a[w[word1], w[word2], w[word3]] += 1
That syntax is not beautiful, but numpy arrays are about as efficient as anything you're likely to find. Note also that I haven't tried this code out, so I may be off in some of the details. Just going from memory here.
Here's a tree structure that uses the bisect library to maintain a sorted list of words. Each lookup is O(log2(n)).
import bisect

class WordList( object ):
    """Leaf-level is list of words and counts."""
    def __init__( self ):
        self.words= [ ('\xff-None-',0) ]
    def count( self, wordTuple ):
        assert len(wordTuple)==1
        word= wordTuple[0]
        # search with a 1-tuple so it compares against the (word, count) entries
        loc= bisect.bisect_left( self.words, (word,) )
        if self.words[loc][0] != word:
            self.words.insert( loc, (word,0) )
        self.words[loc]= ( word, self.words[loc][1]+1 )
    def getWords( self ):
        return self.words[:-1]

class WordTree( object ):
    """Above non-leaf nodes are words and either trees or lists."""
    def __init__( self ):
        self.words= [ ('\xff-None-',None) ]
    def count( self, wordTuple ):
        head, tail = wordTuple[0], wordTuple[1:]
        # search with a 1-tuple so it compares against the (word, subtree) entries
        loc= bisect.bisect_left( self.words, (head,) )
        if self.words[loc][0] != head:
            if len(tail) == 1:
                newList= WordList()
            else:
                newList= WordTree()
            self.words.insert( loc, (head,newList) )
        self.words[loc][1].count( tail )
    def getWords( self ):
        return self.words[:-1]

t = WordTree()
for a in ( ('the','quick','brown'), ('the','quick','fox') ):
    t.count(a)

for w1,wt1 in t.getWords():
    print w1
    for w2,wt2 in wt1.getWords():
        print "  ", w2
        for w3 in wt2.getWords():
            print "    ", w3
For simplicity, this uses a dummy value in each tree and list. This saves endless if-statements to determine if the list was actually empty before we make a comparison. It's only empty once, so the if-statements are wasted for all n-1 other words.
You could put all the words in a dictionary.
The key would be the word, and the value a number (its index).
Then you use it like this:
Word1=indexDict[word1]
Word2=indexDict[word2]
Word3=indexDict[word3]
topDictionary[Word1][Word2][Word3]
Insert in indexDict with:
if word not in indexDict:
    indexDict[word]=len(indexDict)
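Putting those pieces together, a small sketch might look like the following. It is written for Python 3, the names index_words, word_index and count are my own additions, and the reverse list is only there so that results can be turned back into words.

```python
from collections import defaultdict

index_dict = {}   # word -> integer index
index_words = []  # integer index -> word, for reading results back

def word_index(word):
    # Assign the next free index the first time a word is seen
    if word not in index_dict:
        index_dict[word] = len(index_dict)
        index_words.append(word)
    return index_dict[word]

# Three-level dictionary keyed by small integers instead of strings
top_dictionary = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

def count(word1, word2, word3):
    top_dictionary[word_index(word1)][word_index(word2)][word_index(word3)] += 1

for triple in [("the", "quick", "brown"), ("the", "quick", "fox"), ("the", "quick", "fox")]:
    count(*triple)

i_the, i_quick = index_dict["the"], index_dict["quick"]
result = {index_words[i]: n for i, n in top_dictionary[i_the][i_quick].items()}
print(result)
```

Each word string is stored once in index_dict, and the nested dictionaries hold only integers.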
