Custom Sort Complicated Strings in Python

I have a list of filenames conforming to the pattern: s[num][alpha1][alpha2].ext
I need to sort, first by the number, then by alpha1, then by alpha2. The last two aren't alphabetical, however, but rather should reflect a custom ordering.
I've created two lists representing the ordering for alpha1 and alpha2, like so:
alpha1Order = ["Fizz", "Buzz", "Ipsum", "Dolor", "Lorem"]
alpha2Order = ["Sit", "Amet", "Test"]
What's the best way to proceed? My first thought was to tokenize (somehow) such that I split each filename into its component parts (s, num, alpha1, alpha2), then sort, but I wasn't quite sure how to perform such a complicated sort. Using a key function seemed clunky, as this sort didn't seem to lend itself to a simple ordering.

Once tokenized, your data is perfectly orderable with a key function. Just return each value's index in the alpha1Order and alpha2Order lists. Replace the lists with dictionaries to make the lookup easier:
alpha1Order = {token: i for i, token in enumerate(alpha1Order)}
alpha2Order = {token: i for i, token in enumerate(alpha2Order)}
def keyfunction(filename):
    num, alpha1, alpha2 = tokenize(filename)
    return int(num), alpha1Order[alpha1], alpha2Order[alpha2]
This returns a tuple to sort on; Python sorts by the first value, orders anything with the same int(num) value by the second entry, and uses the third to break ties on the first two.
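For completeness, here is one way the tokenize step could look; the regex and function name are illustrative assumptions based on the s[num][alpha1][alpha2].ext pattern and the two ordering lists above, not part of the original answer:

import re

pattern = re.compile(r's(\d+)(Fizz|Buzz|Ipsum|Dolor|Lorem)(Sit|Amet|Test)\.\w+$')

def tokenize(filename):
    # groups are (num, alpha1, alpha2); the leading "s" and the extension are discarded
    return pattern.match(filename).groups()

filenames = ['s10BuzzAmet.ext', 's2FizzTest.ext', 's2FizzSit.ext']
print(sorted(filenames, key=keyfunction))
# ['s2FizzSit.ext', 's2FizzTest.ext', 's10BuzzAmet.ext']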

Related

Tuple-key dictionary in python: Accessing a whole block of entries

I am looking for an efficient python method to utilise a hash table that has two keys:
E.g.:
(1,5) --> {a}
(2,3) --> {b,c}
(2,4) --> {d}
Further I need to be able to retrieve whole blocks of entries, for example all entries that have "2" at the 0-th position (here: (2,3) as well as (2,4)).
In another post it was suggested to use list comprehension, i.e.:
sum(val for key, val in dict.items() if key[0] == 'B')
I learned that dictionaries are (probably?) the most efficient way to retrieve a value by key from a collection of key:value pairs. However, querying with an incomplete tuple key is a bit different from querying the whole key, where I either get a value or nothing. Can Python still return the matching values in time proportional to the number of key:value pairs that match? Or, alternatively, is the tuple dictionary (plus list comprehension) better than using pandas.df.groupby() (which would occupy rather more memory)?
The "standard" way would be something like
from random import randint

d = {(randint(1, 10), i): "something" for i, x in enumerate(range(200))}

def byfilter(n, d):
    return list(filter(lambda x: x[0] == n, d.keys()))

byfilter(5, d)  ## returns a list of tuples where x[0] == 5
Although in similar situations I often used next() to iterate manually, when I didn't need the full list.
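For instance, grabbing just the first matching key with next(), using the d built above (a small sketch, not from the original answer):

first_match = next((k for k in d.keys() if k[0] == 5), None)  # first key with x[0] == 5, or None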
However, there may be some use cases where we can optimize that. Suppose you need to do a couple of accesses or more by the first element of the key, and you know the dict keys are not changing in the meantime. Then you can extract the keys into a list, sort it, and make use of some itertools functions, namely dropwhile() and takewhile():
from itertools import dropwhile, takewhile

ls = [x for x in d.keys()]
ls.sort()  ## I do not know why, but this seems faster than ls = sorted(d.keys())

def bysorted(n, ls):
    return list(takewhile(lambda x: x[0] == n, dropwhile(lambda x: x[0] != n, ls)))

bysorted(5, ls)  ## returns the same list as above
This can be up to 10x faster in the best case (i=1 in my example) and more or less take the same time in the worst case (i=10) because we are trimming the number of iterations needed.
Of course you can do the same for accessing keys by x[1]; you just need to add a key parameter to the sort() call, as sketched below.
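A sketch of that variant (the name bysorted2 is made up), sorting and grouping the same ls by the second element instead:

ls.sort(key=lambda t: t[1])  # group the keys by their second element

def bysorted2(n, ls):
    return list(takewhile(lambda t: t[1] == n, dropwhile(lambda t: t[1] != n, ls)))

bysorted2(3, ls)  ## all keys where x[1] == 3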

Django filter by condition

I have a queryset that I want to paginate through alphabetically.
employees = Employee.nodes.order_by('name')
I want to compare the first letter of the employee's name, name[0], to the letter that I am iterating on, but I don't know how to filter based on conditions applied to my attribute.
employees_by_letter = []
for letter in alphabet:
    employees_by_this_letter = employees.filter(name[0].lower()=letter)
    employees_by_letter.append(employees_by_this_letter)
"""error -- SyntaxError: keyword can't be an expression"""
I suppose I could iterate through each employee object and append a value for their first letter... but there has to be a better way.
Well, this is Python, and in Python parameter names are identifiers; you cannot pass an arbitrary expression where a keyword argument name is expected.
Django has however some ways to do advanced filtering. In your case, you want to use the __istartswith filter:
employees_by_letter = [
    employees.filter(name__istartswith=letter)
    for letter in alphabet
]
This is a list comprehension that generates, for every letter in alphabet, the corresponding queryset and collects them in a list.
Note however that since you eventually will fetch every element, I would actually recommend iterating (or performing a groupby for example).
Like:
from itertools import groupby

from django.db.models.functions import Lower

employees_by_letter = {
    k: list(v)
    for k, v in groupby(
        Employee.nodes.annotate(lowname=Lower('name')).order_by('lowname'),
        lambda x: x.lowname[:1]
    )
}
This will construct a dictionary with the lowercase first letter as key (or an empty string, if there are employees with an empty name); each key maps to a list of Employee instances whose names start with that letter. That way we perform only a single query on the database. Django querysets are lazy, however, so if you only plan to fetch a few of the querysets, the former approach can be more efficient.

What is the difference between the solution that uses defaultdict and the one that uses setdefault?

In Think Python the author introduces defaultdict. The following is an excerpt from the book regarding defaultdict:
If you are making a dictionary of lists, you can often write simpler
code using defaultdict. In my solution to Exercise 12-2, which you can
get from http://thinkpython2.com/code/anagram_sets.py, I make a
dictionary that maps from a sorted string of letters to the list of
words that can be spelled with those letters. For example, 'opst' maps
to the list ['opts', 'post', 'pots', 'spot', 'stop', 'tops']. Here’s
the original code:
def all_anagrams(filename):
    d = {}
    for line in open(filename):
        word = line.strip().lower()
        t = signature(word)
        if t not in d:
            d[t] = [word]
        else:
            d[t].append(word)
    return d
This can be simplified using setdefault, which you might have used in Exercise 11-2:
def all_anagrams(filename):
    d = {}
    for line in open(filename):
        word = line.strip().lower()
        t = signature(word)
        d.setdefault(t, []).append(word)
    return d
This solution has the drawback that it makes a new list every time, regardless of whether it is needed. For lists, that's no big deal, but if the factory function is complicated, it might be. We can avoid this problem and simplify the code using a defaultdict:
def all_anagrams(filename):
    d = defaultdict(list)
    for line in open(filename):
        word = line.strip().lower()
        t = signature(word)
        d[t].append(word)
    return d
Here's the definition of signature function:
def signature(s):
    """Returns the signature of this string.

    Signature is a string that contains all of the letters in order.

    s: string
    """
    # TODO: rewrite using sorted()
    t = list(s)
    t.sort()
    t = ''.join(t)
    return t
What I understand regarding the second solution is that setdefault checks whether t (the signature of the word) exists as a key, if not, it sets it as a key and sets an empty list as its value, then append appends the word to it. If t exists, setdefault returns its value (a list with at least one item, which is a string representing a word), and append appends the word to this list.
What I understand regarding the third solution is that d, which represents a defaultdict, makes t a key and sets an empty list as its value (if t doesn't already exist as a key), then the word is appended to the list. If t does already exist, its value (the list) is returned, and to which the word is appended.
What is the difference between the second and third solutions? What does it mean that the code in the second solution makes a new list every time, regardless of whether it's needed? How is setdefault responsible for that? How does using defaultdict avoid this problem? How are the second and third solutions different?
The "makes a new list every time" means everytime setdefault(t, []) is called, a new empty list (the [] argument) is created to be the default value just in case it's needed. Using a defaultdict avoids the need for doing that.
Although both solutions return a dictionary, the one using defaultdict is actually returning a defaultdict(list) which is a subclass of the built-in dict class. This normally is not a problem. The most notable effect will likely be if you print() the returned object, as the output from the two looks quite different.
If you don't want that for whatever reason, you can change the last statement of the function to:
return dict(d)
to convert the defaultdict(list) created into a regular dict.

data structure that can do a "select distinct X where W=w and Y=y and Z=z and ..." type lookup

I have a set of unique vectors (10k's worth), and I need to, for any chosen column, extract the set of values that are seen in that column, in rows where all the other columns are given values.
I'm hoping for a solution that is sub linear (wrt the item count) in time and at most linear (wrt the total size of all the items) in space, preferably sub linear extra space over just storing the items.
Can I get that or better?
BTW: it's going to be accessed from Python and needs to be simple to program or be part of an existing, commonly used library.
edit: the costs are for the lookup, and do not include the time to build the structures. All the data that will ever be indexed is available before the first query will be made.
It seems I'm doing a bad job of describing what I'm looking for, so here is a solution that gets close:
class Index:
    def __init__(self, stuff):  # don't care about this O() time
        self.all = set(stuff)
        self.index = {}
        for item in stuff:
            for i, v in enumerate(item):
                self.index.setdefault(i, set()).add(v)

    def Get(self, col, have):  # this O() matters
        ret = []
        t = list(have)  # make a copy
        for i in self.index[col]:
            t[col] = i
            if tuple(t) in self.all:
                ret.append(i)
        return ret
The problem is that this gives really bad (O(n)) worst-case performance.
Since you are asking for a SQL-like query, how about using a relational database? SQLite is part of the standard library, and can be used either on-disk or fully in memory.
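To make that concrete, a minimal sketch of the SQLite route, fully in memory; the table and column names are made up for illustration, and the x/y/z layout is assumed from the examples later in this question:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE vectors (x INTEGER, y INTEGER, z INTEGER)')
conn.executemany('INSERT INTO vectors VALUES (?, ?, ?)',
                 [(1, 2, 3), (1, 2, 4), (1, 5, 4)])
# index the columns you filter on so lookups don't scan the whole table
conn.execute('CREATE INDEX idx_yz ON vectors (y, z)')

rows = conn.execute(
    'SELECT DISTINCT x FROM vectors WHERE y = ? AND z = ?', (2, 4)).fetchall()
print(rows)   # [(1,)]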
If you have a Python set (no ordering) there is no way to select all relevant items without at least looking at all items -- so it's impossible for any solution to be "sub linear" (wrt the number of items) as you require.
If you have a list, instead of a set, then it can be ordered -- but that cannot be achieved in linear time in the general case (O(N log N) is provably the best you can do for a general-case sorting -- and building sorted indices would be similar -- unless there are constraints that let you use "bucket-sort-like" approaches). You can spread around the time it takes to keep indices over all insertions in the data structure -- but that won't reduce the total time needed to build such indices, just, as I said, spread them around.
Hash-based (not sorted) indices can be faster for your special case (where you only need to select by equality, not by "less than" &c) -- but the time to construct such indices is linear in the number of items (obviously you can't construct such an index without at least looking once at each item -- sublinear time requires some magic that lets you completely ignore certain items, and that can't happen without supporting "structure" (such as sortedness), which in turn requires time to achieve (though it can be achieved "incrementally" ahead of time, such an approach doesn't reduce the total time required)).
So, taken to the letter, your requirements appear overconstrained: neither Python, nor any other language, nor any database engine, etc, can possibly achieve them -- if interpreted literally exactly as you state them. If "incremental work done ahead of time" doesn't count (as breaching your demands of linearity and sublinearity), if you talk about expected/typical rather than worst-case behavior (and your items have friendly probability distributions), etc, then it might be possible to come close to achieving your very demanding requests.
For example, consider maintaining for each of the vectors' D dimensions a dictionary mapping the value an item has in that dimension, to a set of indices of such items; then, selecting the items that meet the D-1 requirements of equality for every dimension but the ith one can be done by set intersections. Does this meet your requirements? Not by taking the latter strictly to the letter, as I've explained above -- but maybe, depending on how much each requirement can perhaps be taken in a more "relaxed" sense.
BTW, I don't understand what a "group by" implies here since all the vectors in each group would be identical (since you said all dimensions but one are specified by equality), so it may be that you've skipped a COUNT(*) in your SQL-equivalent -- i.e., you need a count of how many such vectors have a given value in the i-th dimension. In that case, it would be achievable by the above approach.
Edit: as the OP has clarified somewhat in comments and an edit to his Q, I can propose an approach in more detail:
import collections

class Searchable(object):

    def __init__(self, toindex=None):
        self.toindex = toindex
        self.data = []
        self.indices = None

    def makeindices(self):
        if self.indices is not None:
            return
        self.indices = dict((i, collections.defaultdict(set))
                            for i in self.toindex)

    def add(self, record):
        if self.toindex is None:
            self.toindex = range(len(record))
        self.makeindices()
        where = len(self.data)
        self.data.append(record)
        for i in self.toindex:
            self.indices[i][record[i]].add(where)

    def get(self, indices_and_values, indices_to_get):
        ok = set(range(len(self.data)))
        for i, v in indices_and_values:
            ok.intersection_update(self.indices[i][v])
        result = set()
        for rec in (self.data[i] for i in ok):
            t = tuple(rec[i] for i in indices_to_get)
            result.add(t)
        return result

def main():
    c = Searchable()
    for r in ((1,2,3), (1,2,4), (1,5,4)):
        c.add(r)
    print c.get([(0,1),(1,2)], [2])

main()
This prints
set([(3,), (4,)])
and of course could be easily specialized to return results in other formats, accept indices (to index and/or to return) in different ways, etc. I believe it meets the requirements as edited / clarified since the extra storage is, for each indexed dimension/value, a set of the indices at which said value occurs on that dimension, and the search time is one set intersection per indexed dimension plus a loop on the number of items to be returned.
I'm assuming that you've tried the dictionary and you need something more flexible. Basically, what you need to do is index values of x, y and z
def build_index(vectors):
    index = {'x': {}, 'y': {}, 'z': {}}
    for position, vector in enumerate(vectors):
        if vector.x in index['x']:
            index['x'][vector.x].append(position)
        else:
            index['x'][vector.x] = [position]
        if vector.y in index['y']:
            index['y'][vector.y].append(position)
        else:
            index['y'][vector.y] = [position]
        if vector.z in index['z']:
            index['z'][vector.z].append(position)
        else:
            index['z'][vector.z] = [position]
    return index
What you have in index is a lookup table. You can say, for example, select x,y,z from vectors where x=42 by doing this:
def query_by(vectors, index, property, value):
    results = []
    for i in index[property][value]:
        results.append(vectors[i])
    return results

vecs_x_42 = query_by(vectors, index, 'x', 42)
# now vecs_x_42 is a list of all vectors where x is 42
Now to do a logical conjunction, say select x,y,z from vectors where x=42 and y=3 you can use Python's sets to accomplish this:
def query_by(vectors, index, criteria):
    sets = []
    for k, v in criteria.iteritems():
        if v not in index[k]:
            return []
        sets.append(set(index[k][v]))
    results = []
    for i in set.intersection(*sets):
        results.append(vectors[i])
    return results

vecs_x_42_y_3 = query_by(vectors, index, {'x': 42, 'y': 3})
The intersection operation on sets produces values which only appear in both sets, so you are only iterating the positions which satisfy both conditions.
Now for the last part of your question, to group by x:
def group_by(vectors, property):
    result = {}
    for v in vectors:
        value = getattr(v, property)
        if value in result:
            result[value].append(v)
        else:
            result[value] = [v]
    return result
So let's bring it all together:
vectors = [...] # your vectors, as objects such that v.x, v.y produces the x and y values
index = build_index(vectors)
my_vectors = group_by(query_by(vectors, index, {'y':42, 'z': 3}), 'x')
# now you have in my_vectors a dictionary of vectors grouped by x value, where y=42 and z=3
Update
I updated the code above and fixed a few obvious errors. It works now and it does what it claims to do. On my laptop, a 2GHz core 2 duo with 4GB RAM, it takes less than 1s to build_index. Lookups are very quick, even when the dataset has 100k vectors. If I have some time I'll do some formal comparisons against MySQL.
You can see the full code at this Codepad, if you time it or improve it, let me know.
Suppose you have a 'tuple' class with fields x, y, and z and you have a bunch of such tuples saved in an enumerable var named myTuples. Then:
A) Pre-population:
dct = {}
for tpl in myTuples:
    tmp = (tpl.y, tpl.z)
    if tmp in dct:
        dct[tmp].append(tpl.x)
    else:
        dct[tmp] = [tpl.x]
B) Query:
def findAll(y, z):
    tmp = (y, z)
    if tmp not in dct: return ()
    return [(x, y, z) for x in dct[tmp]]
I am sure that there is a way to optimize the code for readability, save a few cycles, etc. But essentially you want to pre-populate a dict, using a 2-tuple as a key. If I had not seen a request for sub-linear, I would not have thought of this :)
A) The pre-population is linear, sorry.
B) Query should be as slow as the number of items returned - most of the time sub-linear, except for weird edge cases.
So you have three coordinates and one value for the start and end of a vector (x,y,z)?
How do you come to know the seven known values? Do many coordinate triples occur multiple times?
You must be running a very tight loop with this function to be so concerned about lookup time, considering the small size of the data (10K).
Could you give an example of real input for the class you posted?

Memory Efficient Alternatives to Python Dictionaries

In one of my current side projects, I am scanning through some text looking at the frequency of word triplets. In my first go at it, I used the default dictionary three levels deep. In other words, topDict[word1][word2][word3] returns the number of times these words appear in the text, topDict[word1][word2] returns a dictionary with all the words that appeared following words 1 and 2, etc.
This functions correctly, but it is very memory intensive. In my initial tests it used something like 20 times the memory of just storing the triplets in a text file, which seems like an overly large amount of memory overhead.
My suspicion is that many of these dictionaries are being created with many more slots than are actually being used, so I want to replace the dictionaries with something else that is more memory efficient when used in this manner. I would strongly prefer a solution that allows key lookups along the lines of the dictionaries.
From what I know of data structures, a balanced binary search tree using something like red-black or AVL would probably be ideal, but I would really prefer not to implement them myself. If possible, I'd prefer to stick with standard python libraries, but I'm definitely open to other alternatives if they would work best.
So, does anyone have any suggestions for me?
Edited to add:
Thanks for the responses so far. A few of the answers so far have suggested using tuples, which didn't really do much for me when I condensed the first two words into a tuple. I am hesitant to use all three as a key since I want it to be easy to look up all third words given the first two. (i.e. I want something like the result of topDict[word1, word2].keys()).
The current dataset I am playing around with is the most recent version of Wikipedia For Schools. The results of parsing the first thousand pages, for example, is something like 11MB for a text file where each line is the three words and the count all tab separated. Storing the text in the dictionary format I am now using takes around 185MB. I know that there will be some additional overhead for pointers and whatnot, but the difference seems excessive.
Some measurements. I took 10MB of free e-book text and computed trigram frequencies, producing a 24MB file. Storing it in different simple Python data structures took this much space in kB, measured as RSS from running ps, where d is a dict, keys and freqs are lists, and a,b,c,freq are the fields of a trigram record:
295760 S. Lott's answer
237984 S. Lott's with keys interned before passing in
203172 [*] d[(a,b,c)] = int(freq)
203156 d[a][b][c] = int(freq)
189132 keys.append((a,b,c)); freqs.append(int(freq))
146132 d[intern(a),intern(b)][intern(c)] = int(freq)
145408 d[intern(a)][intern(b)][intern(c)] = int(freq)
83888 [*] d[a+' '+b+' '+c] = int(freq)
82776 [*] d[(intern(a),intern(b),intern(c))] = int(freq)
68756 keys.append((intern(a),intern(b),intern(c))); freqs.append(int(freq))
60320 keys.append(a+' '+b+' '+c); freqs.append(int(freq))
50556 pair array
48320 squeezed pair array
33024 squeezed single array
The entries marked [*] have no efficient way to look up a pair (a,b); they're listed only because others have suggested them (or variants of them). (I was sort of irked into making this because the top-voted answers were not helpful, as the table shows.)
'Pair array' is the scheme below in my original answer ("I'd start with the array with keys being the first two words..."), where the value table for each pair is represented as a single string. 'Squeezed pair array' is the same, leaving out the frequency values that are equal to 1 (the most common case). 'Squeezed single array' is like squeezed pair array, but gloms key and value together as one string (with a separator character). The squeezed single array code:
import collections

def build(file):
    pairs = collections.defaultdict(list)
    for line in file:  # N.B. file assumed to be already sorted
        a, b, c, freq = line.split()
        key = ' '.join((a, b))
        pairs[key].append(c + ':' + freq if freq != '1' else c)
    out = open('squeezedsinglearrayfile', 'w')
    for key in sorted(pairs.keys()):
        out.write('%s|%s\n' % (key, ' '.join(pairs[key])))

def load():
    return open('squeezedsinglearrayfile').readlines()

if __name__ == '__main__':
    build(open('freqs'))
I haven't written the code to look up values from this structure (use bisect, as mentioned below), or implemented the fancier compressed structures also described below.
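For what it's worth, a rough sketch of what that bisect lookup could look like over the lines returned by load() above; this is my illustration, not the author's code:

import bisect

def lookup(lines, a, b):
    # lines are "word1 word2|third:freq third ..." entries, already sorted by key
    key = a + ' ' + b + '|'
    i = bisect.bisect_left(lines, key)
    if i < len(lines) and lines[i].startswith(key):
        return lines[i].rstrip('\n').split('|', 1)[1].split()
    return []

lines = load()
print(lookup(lines, 'the', 'quick'))   # e.g. ['brown:2', 'fox']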
Original answer: A simple sorted array of strings, each string being a space-separated concatenation of words, searched using the bisect module, should be worth trying for a start. This saves space on pointers, etc. It still wastes space due to the repetition of words; there's a standard trick to strip out common prefixes, with another level of index to get them back, but that's rather more complex and slower. (The idea is to store successive chunks of the array in a compressed form that must be scanned sequentially, along with a random-access index to each chunk. Chunks are big enough to compress, but small enough for reasonable access time. The particular compression scheme applicable here: if successive entries are 'hello george' and 'hello world', make the second entry be '6world' instead. (6 being the length of the prefix in common.) Or maybe you could get away with using zlib? Anyway, you can find out more in this vein by looking up dictionary structures used in full-text search.) So specifically, I'd start with the array with keys being the first two words, with a parallel array whose entries list the possible third words and their frequencies. It might still suck, though -- I think you may be out of luck as far as batteries-included memory-efficient options.
Also, binary tree structures are not recommended for memory efficiency here. E.g., this paper tests a variety of data structures on a similar problem (unigrams instead of trigrams though) and finds a hashtable to beat all of the tree structures by that measure.
I should have mentioned, as someone else did, that the sorted array could be used just for the wordlist, not bigrams or trigrams; then for your 'real' data structure, whatever it is, you use integer keys instead of strings -- indices into the wordlist. (But this keeps you from exploiting common prefixes except in the wordlist itself. Maybe I shouldn't suggest this after all.)
Use tuples.
Tuples can be keys to dictionaries, so you don't need to nest dictionaries.
d = {}
d[ word1, word2, word3 ] = 1
Also, as a plus, you could use defaultdict so that elements that don't have entries always return 0, and so that you can say d[w1,w2,w3] += 1 without checking whether the key already exists.
example:
from collections import defaultdict
d = defaultdict(int)
d["first","word","tuple"] += 1
If you need to find all words "word3" that are tupled with (word1,word2) then search for it in dictionary.keys() using list comprehension
if you have a tuple, t, you can get the first two items using slices:
>>> a = (1,2,3)
>>> a[:2]
(1, 2)
a small example for searching tuples with list comprehensions:
>>> b = [(1,2,3),(1,2,5),(3,4,6)]
>>> search = (1,2)
>>> [a[2] for a in b if a[:2] == search]
[3, 5]
You see here, we got a list of all items that appear as the third item in the tuples that start with (1,2)
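The same idea applied directly to the trigram dictionary's keys (a small sketch; the sample words are only for illustration):

from collections import defaultdict

d = defaultdict(int)
d["the", "quick", "brown"] += 1
d["the", "quick", "fox"] += 1
d["the", "slow", "fox"] += 1

third_words = [k[2] for k in d.keys() if k[:2] == ("the", "quick")]
# third_words contains 'brown' and 'fox'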
In this case, ZODB¹ BTrees might be helpful, since they are much less memory-hungry. Use a BTrees.OOBTree (Object keys to Object values) or BTrees.OIBTree (Object keys to Integer values), and use 3-word tuples as your key.
Something like:
from BTrees.OOBTree import OOBTree as BTree
The interface is, more or less, dict-like, with the added bonus (for you) that .keys, .items, .iterkeys and .iteritems accept optional min and max arguments:
>>> t=BTree()
>>> t['a', 'b', 'c']= 10
>>> t['a', 'b', 'z']= 11
>>> t['a', 'a', 'z']= 12
>>> t['a', 'd', 'z']= 13
>>> print list(t.keys(('a', 'b'), ('a', 'c')))
[('a', 'b', 'c'), ('a', 'b', 'z')]
¹ Note that if you are on Windows and work with Python >2.4, I know there are packages for more recent python versions, but I can't recollect where.
PS They exist in the CheeseShop ☺
A couple attempts:
I figure you're doing something similar to this:
from __future__ import with_statement

import time
from collections import deque, defaultdict

# Just used to generate some triples of words
def triplegen(words="/usr/share/dict/words"):
    d = deque()
    with open(words) as f:
        for i in range(3):
            d.append(f.readline().strip())
        while d[-1] != '':
            yield tuple(d)
            d.popleft()
            d.append(f.readline().strip())

if __name__ == '__main__':
    class D(dict):
        def __missing__(self, key):
            self[key] = D()
            return self[key]

    h = D()
    for a, b, c in triplegen():
        h[a][b][c] = 1
    time.sleep(60)
That gives me ~88MB.
Changing the storage to
h[a, b, c] = 1
takes ~25MB
interning a, b, and c makes it take about 31MB. My case is a bit special because my words never repeat on the input. You might try some variations yourself and see if one of these helps you.
Are you implementing Markovian text generation?
If your chains map 2 words to the probabilities of the third I'd use a dictionary mapping K-tuples to the 3rd-word histogram. A trivial (but memory-hungry) way to implement the histogram would be to use a list with repeats, and then random.choice gives you a word with the proper probability.
Here's an implementation with the K-tuple as a parameter:
import random

# can change these functions to use a dict-based histogram
# instead of a list with repeats
def default_histogram():          return []
def add_to_histogram(item, hist): hist.append(item)
def choose_from_histogram(hist):  return random.choice(hist)

K = 2  # look 2 words back
words = ...
d = {}

# build histograms
for i in xrange(len(words)-K-1):
    key = tuple(words[i:i+K])
    word = words[i+K]
    d.setdefault(key, default_histogram())
    add_to_histogram(word, d[key])

# generate text
start = random.randrange(len(words)-K-1)
key = tuple(words[start:start+K])
for i in xrange(NUM_WORDS_TO_GENERATE):
    word = choose_from_histogram(d[key])
    print word,
    key = key[1:] + (word,)
You could try to use the same dictionary, only one level deep.
topDictionary[word1+delimiter+word2+delimiter+word3]
delimiter could be plain " ". (or use (word1,word2,word3))
This would be easiest to implement.
I believe you will see a little improvement, if it is not enough...
...i'll think of something...
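A minimal sketch of that scheme, assuming " " as the delimiter (the helper function names are made up):

topDictionary = {}

def add_triple(word1, word2, word3):
    key = word1 + ' ' + word2 + ' ' + word3
    topDictionary[key] = topDictionary.get(key, 0) + 1

def third_words(word1, word2):
    # all third words seen after (word1, word2), with their counts
    prefix = word1 + ' ' + word2 + ' '
    return [(k.split(' ')[2], v) for k, v in topDictionary.items() if k.startswith(prefix)]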
OK, so you are basically trying to store a sparse 3D space. The kind of access pattern you want for this space is crucial for the choice of algorithm and data structure. Considering your data source, do you want to feed this to a grid? If you don't need O(1) access:
In order to get memory efficiency, you want to subdivide that space into subspaces with a similar number of entries (like a BTree). So a data structure with:
firstWordRange
secondWordRange
thirdWordRange
numberOfEntries
a sorted block of entries.
next and previous blocks in all 3 dimensions
Scipy has sparse matrices, so if you can make the first two words a tuple, you can do something like this:
import numpy as N
from scipy import sparse

word_index = {}
# word_count is the number of distinct words, assumed to be known in advance
count = sparse.lil_matrix((word_count*word_count, word_count), dtype=N.int)

for word1, word2, word3 in triple_list:
    w1 = word_index.setdefault(word1, len(word_index))
    w2 = word_index.setdefault(word2, len(word_index))
    w3 = word_index.setdefault(word3, len(word_index))
    w1_w2 = w1 * word_count + w2
    count[w1_w2, w3] += 1
If memory is simply not big enough, pybsddb can help store a disk-persistent map.
You could use a numpy multidimensional array. You'll need to use numbers rather than strings to index into the array, but that can be solved by using a single dict to map words to numbers.
import numpy
w = {'word1':1, 'word2':2, 'word3':3, 'word4':4}
a = numpy.zeros( (4,4,4) )
Then to index into your array, you'd do something like:
a[w[word1], w[word2], w[word3]] += 1
That syntax is not beautiful, but numpy arrays are about as efficient as anything you're likely to find. Note also that I haven't tried this code out, so I may be off in some of the details. Just going from memory here.
Here's a tree structure that uses the bisect library to maintain a sorted list of words. Each lookup is O(log2(n)).
import bisect

class WordList( object ):
    """Leaf-level is list of words and counts."""
    def __init__( self ):
        self.words = [ ('\xff-None-', 0) ]
    def count( self, wordTuple ):
        assert len(wordTuple) == 1
        word = wordTuple[0]
        loc = bisect.bisect_left( self.words, (word,) )
        if self.words[loc][0] != word:
            self.words.insert( loc, (word, 0) )
        self.words[loc] = ( word, self.words[loc][1]+1 )
    def getWords( self ):
        return self.words[:-1]

class WordTree( object ):
    """Above non-leaf nodes are words and either trees or lists."""
    def __init__( self ):
        self.words = [ ('\xff-None-', None) ]
    def count( self, wordTuple ):
        head, tail = wordTuple[0], wordTuple[1:]
        loc = bisect.bisect_left( self.words, (head,) )
        if self.words[loc][0] != head:
            if len(tail) == 1:
                newList = WordList()
            else:
                newList = WordTree()
            self.words.insert( loc, (head, newList) )
        self.words[loc][1].count( tail )
    def getWords( self ):
        return self.words[:-1]

t = WordTree()
for a in ( ('the','quick','brown'), ('the','quick','fox') ):
    t.count(a)

for w1, wt1 in t.getWords():
    print w1
    for w2, wt2 in wt1.getWords():
        print " ", w2
        for w3 in wt2.getWords():
            print " ", w3
For simplicity, this uses a dummy value in each tree and list. This saves endless if-statements to determine if the list was actually empty before we make a comparison. It's only empty once, so the if-statements are wasted for all n-1 other words.
You could put all words in a dictionary.
The key would be the word, and the value a number (an index).
Then you use it like this:
Word1=indexDict[word1]
Word2=indexDict[word2]
Word3=indexDict[word3]
topDictionary[Word1][Word2][Word3]
Insert in indexDict with:
if word not in indexDict:
    indexDict[word] = len(indexDict)
