How to create a trie in Python
I'm interested in tries and DAWGs (directed acyclic word graphs) and I've been reading a lot about them, but I don't understand what the output trie or DAWG file should look like.
Should a trie be an object of nested dictionaries, where each letter is divided into letters and so on?
Would a lookup performed on such a dictionary be fast if there are 100k or 500k entries?
How to implement word-blocks consisting of more than one word separated with - or space?
How to link prefix or suffix of a word to another part in the structure? (for DAWG)
I want to understand the best output structure in order to figure out how to create and use one.
I would also appreciate knowing what the output of a DAWG should be, along with that of a trie.
I do not want to see graphical representations with bubbles linked to each other, I want to know the output object once a set of words are turned into tries or DAWGs.
Unwind is essentially correct that there are many different ways to implement a trie; and for a large, scalable trie, nested dictionaries might become cumbersome -- or at least space inefficient. But since you're just getting started, I think that's the easiest approach; you could code up a simple trie in just a few lines. First, a function to construct the trie:
>>> _end = '_end_'
>>>
>>> def make_trie(*words):
...     root = dict()
...     for word in words:
...         current_dict = root
...         for letter in word:
...             current_dict = current_dict.setdefault(letter, {})
...         current_dict[_end] = _end
...     return root
...
>>> make_trie('foo', 'bar', 'baz', 'barz')
{'b': {'a': {'r': {'_end_': '_end_', 'z': {'_end_': '_end_'}},
             'z': {'_end_': '_end_'}}},
 'f': {'o': {'o': {'_end_': '_end_'}}}}
If you're not familiar with setdefault, it simply looks up a key in the dictionary (here, letter). If the key is present, it returns the associated value; if not, it assigns a default value to that key (here, {}) and returns that value. (It's like a version of get that also updates the dictionary.)
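To make the difference from get concrete, here is a minimal sketch; the dictionary d and the other variable names are only illustrative.

```python
d = {}

# get returns the fallback but leaves the dictionary untouched.
missing = d.get('a', {})
assert d == {}

# setdefault stores the default under the key and returns it.
child = d.setdefault('a', {})
assert d == {'a': {}}

# On a second call the key already exists, so the stored value is
# returned and the new default is ignored.
same = d.setdefault('a', {'ignored': True})
assert same is child
```

This is why the loop in make_trie can walk down and create missing levels in a single expression.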
Next, a function to test whether the word is in the trie:
>>> def in_trie(trie, word):
...     current_dict = trie
...     for letter in word:
...         if letter not in current_dict:
...             return False
...         current_dict = current_dict[letter]
...     return _end in current_dict
...
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'baz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barzz')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'bart')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'ba')
False
I'll leave insertion and removal to you as an exercise.
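For readers who want a starting point for that exercise, here is one possible sketch, assuming the same nested-dict layout and _end marker as above. The function names insert and remove are made up for this example, and pruning empty branches on removal is a design choice rather than a requirement.

```python
_end = '_end_'

def insert(trie, word):
    """Add word to an existing nested-dict trie, in place."""
    node = trie
    for letter in word:
        node = node.setdefault(letter, {})
    node[_end] = _end

def remove(trie, word):
    """Remove word if present; return True if something was removed."""
    # Record the nodes along the path so empty branches can be pruned.
    path = [trie]
    for letter in word:
        if letter not in path[-1]:
            return False
        path.append(path[-1][letter])
    if _end not in path[-1]:
        return False          # word is only a prefix of other words
    del path[-1][_end]
    # Walk back up, deleting nodes that no longer lead anywhere.
    steps = zip(reversed(word), reversed(path[:-1]), reversed(path[1:]))
    for letter, parent, child in steps:
        if child:
            break             # still holds another word or branch
        del parent[letter]
    return True

trie = {}
for w in ('bar', 'barz', 'baz'):
    insert(trie, w)
removed = remove(trie, 'barz')    # True; the empty 'z' node under 'r' is pruned
not_a_word = remove(trie, 'ba')   # False; 'ba' is only a prefix
```

After these calls the trie still contains 'bar' and 'baz' and nothing else.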
Of course, Unwind's suggestion wouldn't be much harder. There might be a slight speed disadvantage in that finding the correct sub-node would require a linear search. But the search would be limited to the number of possible characters -- 27 if we include _end. Also, there's nothing to be gained by creating a massive list of nodes and accessing them by index as he suggests; you might as well just nest the lists.
Finally, I'll add that creating a directed acyclic word graph (DAWG) would be a bit more complex, because you have to detect situations in which your current word shares a suffix with another word in the structure. In fact, this can get rather complex, depending on how you want to structure the DAWG! You may have to learn some stuff about Levenshtein distance to get it right.
Here is a list of Python packages that implement a trie:
marisa-trie - a C++ based implementation.
python-trie - a simple pure Python implementation.
PyTrie - a more advanced pure Python implementation.
pygtrie - a pure Python implementation by Google.
datrie - a double-array trie implementation based on libdatrie.
Have a look at this:
https://github.com/kmike/marisa-trie
Static memory-efficient Trie structures for Python (2.x and 3.x).
String data in a MARISA-trie may take up to 50x-100x less memory than
in a standard Python dict; the raw lookup speed is comparable; trie
also provides fast advanced methods like prefix search.
Based on marisa-trie C++ library.
Here's a blog post from a company using marisa trie successfully:
https://www.repustate.com/blog/sharing-large-data-structure-across-processes-python/
At Repustate, much of our data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our particular case, our dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact for a given HTTP request, 4 or 5 models might be accessed, each doing 20-30 lookups. So the problem we face is how do we keep things fast for the client as well as light as possible for the server.
...
I found this package, marisa tries, which is a Python wrapper around a C++ implementation of a marisa trie. “Marisa” is an acronym for Matching Algorithm with Recursively Implemented StorAge. What’s great about marisa tries is the storage mechanism really shrinks how much memory you need. The author of the Python plugin claimed 50-100X reduction in size – our experience is similar.
What’s great about the marisa trie package is that the underlying trie structure can be written to disk and then read in via a memory mapped object. With a memory mapped marisa trie, all of our requirements are now met. Our server’s memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python’s dictionary implementation.
There are also a couple of pure-python implementations, though unless you're on a restricted platform you'd want to use the C++ backed implementation above for best performance:
https://github.com/bdimmick/python-trie
https://pypi.python.org/pypi/PyTrie
Modified from senderle's method (above). I found that Python's defaultdict is ideal for creating a trie or a prefix tree.
from collections import defaultdict

class Trie:
    """
    Implement a trie with insert, search, and startsWith methods.
    """
    def __init__(self):
        self.root = defaultdict()

    # @param {string} word
    # @return {void}
    # Inserts a word into the trie.
    def insert(self, word):
        current = self.root
        for letter in word:
            current = current.setdefault(letter, {})
        current.setdefault("_end")

    # @param {string} word
    # @return {boolean}
    # Returns if the word is in the trie.
    def search(self, word):
        current = self.root
        for letter in word:
            if letter not in current:
                return False
            current = current[letter]
        if "_end" in current:
            return True
        return False

    # @param {string} prefix
    # @return {boolean}
    # Returns if there is any word in the trie
    # that starts with the given prefix.
    def startsWith(self, prefix):
        current = self.root
        for letter in prefix:
            if letter not in current:
                return False
            current = current[letter]
        return True

# Now test the class
test = Trie()
test.insert('helloworld')
test.insert('ilikeapple')
test.insert('helloz')

print(test.search('hello'))        # False: 'hello' is only a prefix
print(test.startsWith('hello'))    # True
print(test.search('ilikeapple'))   # True
There's no "should"; it's up to you. Various implementations will have different performance characteristics, take various amounts of time to implement, understand, and get right. This is typical for software development as a whole, in my opinion.
I would probably first try having a global list of all trie nodes so far created, and representing the child-pointers in each node as a list of indices into the global list. Having a dictionary just to represent the child linking feels too heavy-weight, to me.
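A minimal sketch of that layout, under the assumption that each node is stored at an index in shared lists and that children are kept as (letter, index) pairs. The class name FlatTrie and its attribute names are invented for this example.

```python
class FlatTrie:
    """Trie whose nodes live in shared lists and refer to each other by index."""

    def __init__(self):
        # Node 0 is the root. For node i, children[i] is a list of
        # (letter, child_index) pairs and is_end[i] marks the end of a word.
        self.children = [[]]
        self.is_end = [False]

    def _child(self, node, letter):
        # Linear scan; bounded by the size of the alphabet.
        for candidate, index in self.children[node]:
            if candidate == letter:
                return index
        return None

    def insert(self, word):
        node = 0
        for letter in word:
            nxt = self._child(node, letter)
            if nxt is None:
                # Append a fresh node and link it from the current one.
                nxt = len(self.children)
                self.children.append([])
                self.is_end.append(False)
                self.children[node].append((letter, nxt))
            node = nxt
        self.is_end[node] = True

    def __contains__(self, word):
        node = 0
        for letter in word:
            node = self._child(node, letter)
            if node is None:
                return False
        return self.is_end[node]

flat = FlatTrie()
for w in ('foo', 'bar', 'baz'):
    flat.insert(w)
has_baz = 'baz' in flat    # True
has_ba = 'ba' in flat      # False: 'ba' is only a prefix
```

Because the whole structure is a pair of flat lists of plain values, it is also straightforward to serialise.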
Using defaultdict and the reduce function.
Create the trie:
from functools import reduce
from collections import defaultdict

T = lambda: defaultdict(T)
trie = T()
reduce(dict.__getitem__, 'how', trie)['isEnd'] = True
The resulting trie:
defaultdict(<function __main__.<lambda>()>,
            {'h': defaultdict(<function __main__.<lambda>()>,
                              {'o': defaultdict(<function __main__.<lambda>()>,
                                                {'w': defaultdict(<function __main__.<lambda>()>,
                                                                  {'isEnd': True})})})})
Search in the trie:
curr = trie
for w in 'how':
    if w in curr:
        curr = curr[w]
    else:
        print("Not Found")
        break
else:
    # Reached only when the loop did not break. Using .get avoids
    # creating an 'isEnd' key as a side effect of the lookup.
    print('Found' if curr.get('isEnd') else 'Not Found')
from collections import defaultdict
Define Trie:
_trie = lambda: defaultdict(_trie)
Create Trie:
trie = _trie()
for s in ["cat", "bat", "rat", "cam"]:
    curr = trie
    for c in s:
        curr = curr[c]
    curr.setdefault("_end")
Lookup:
def word_exist(trie, word):
    curr = trie
    for w in word:
        if w not in curr:
            return False
        curr = curr[w]
    return '_end' in curr
Test:
print(word_exist(trie, 'cam'))
Here is the full code using a TrieNode class. It also implements an auto_complete method that returns the words matching a prefix.
Since we use a dictionary to store the children, there is no need to convert characters to integers and back, and no need to allocate array memory in advance.
class TrieNode:
    def __init__(self):
        # Dict: key = letter, value = TrieNode
        self.children = {}
        self.end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def build_trie(self, words):
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.end = True

    def search(self, word):
        node = self.root
        for char in word:
            if char in node.children:
                node = node.children[char]
            else:
                return False
        return node.end

    def _walk_trie(self, node, word, word_list):
        if node.children:
            for char in node.children:
                word_new = word + char
                if node.children[char].end:
                    word_list.append(word_new)
                self._walk_trie(node.children[char], word_new, word_list)

    def auto_complete(self, partial_word):
        node = self.root
        word_list = []
        # find the node for the last char of partial_word
        for char in partial_word:
            if char in node.children:
                node = node.children[char]
            else:
                # partial_word not found
                return word_list
        if node.end:
            word_list.append(partial_word)
        # collect every word below this node into word_list
        self._walk_trie(node, partial_word, word_list)
        return word_list
Create a Trie:
t = Trie()
words = ['hi', 'hieght', 'rat', 'ram', 'rattle', 'hill']
t.build_trie(words)
Search for words:
words = ['hi', 'hello']
for word in words:
    print(word, t.search(word))
hi True
hello False
Search for words using a prefix:
partial_word = 'ra'
t.auto_complete(partial_word)
['rat', 'rattle', 'ram']
If you want a TRIE implemented as a Python class, here is something I wrote after reading about them:
class Trie:

    def __init__(self):
        self.__final = False
        self.__nodes = {}

    def __repr__(self):
        return 'Trie<len={}, final={}>'.format(len(self), self.__final)

    def __getstate__(self):
        return self.__final, self.__nodes

    def __setstate__(self, state):
        self.__final, self.__nodes = state

    def __len__(self):
        return len(self.__nodes)

    def __bool__(self):
        return self.__final

    def __contains__(self, array):
        try:
            return self[array]
        except KeyError:
            return False

    def __iter__(self):
        yield self
        for node in self.__nodes.values():
            yield from node

    def __getitem__(self, array):
        return self.__get(array, False)

    def create(self, array):
        self.__get(array, True).__final = True

    def read(self):
        yield from self.__read([])

    def update(self, array):
        self[array].__final = True

    def delete(self, array):
        self[array].__final = False

    def prune(self):
        for key, value in tuple(self.__nodes.items()):
            if not value.prune():
                del self.__nodes[key]
        if not len(self):
            self.delete([])
        return self

    def __get(self, array, create):
        if array:
            head, *tail = array
            if create and head not in self.__nodes:
                self.__nodes[head] = Trie()
            return self.__nodes[head].__get(tail, create)
        return self

    def __read(self, name):
        if self.__final:
            yield name
        for key, value in self.__nodes.items():
            yield from value.__read(name + [key])
This version uses recursion:
import pprint
from collections import deque

pp = pprint.PrettyPrinter(indent=4)

inp = input("Enter a sentence to show as trie\n")
# split() without arguments skips empty strings caused by repeated spaces
words = inp.split()
trie = {}

def trie_recursion(trie_ds, word):
    try:
        letter = word.popleft()
        out = trie_recursion(trie_ds.get(letter, {}), word)
    except IndexError:
        # End of the word
        return {}

    # Don't update if the letter is already present
    if letter not in trie_ds:
        trie_ds[letter] = out

    return trie_ds

for word in words:
    # Go through each word
    trie = trie_recursion(trie, deque(word))

pp.pprint(trie)
Output:
$ python trie.py
Enter a sentence to show as trie
foo bar baz fun
{
    'b': {
        'a': {
            'r': {},
            'z': {}
        }
    },
    'f': {
        'o': {
            'o': {}
        },
        'u': {
            'n': {}
        }
    }
}
This is much like a previous answer but simpler to read:
def make_trie(words):
    trie = {}
    for word in words:
        head = trie
        for char in word:
            if char not in head:
                head[char] = {}
            head = head[char]
        head["_end_"] = "_end_"
    return trie
class TrieNode:
    def __init__(self):
        self.keys = {}
        self.end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str, node=None) -> None:
        if node is None:
            node = self.root
        # insertion is a recursive operation
        # this is the base case to exit the recursion
        if len(word) == 0:
            node.end = True
            return
        # if this key does not exist, create a new node
        elif word[0] not in node.keys:
            node.keys[word[0]] = TrieNode()
            self.insert(word[1:], node.keys[word[0]])
        # that means the key exists
        else:
            self.insert(word[1:], node.keys[word[0]])

    def search(self, word: str, node=None) -> bool:
        if node is None:
            node = self.root
        # this is the positive base case to exit the recursion
        if len(word) == 0 and node.end == True:
            return True
        elif len(word) == 0:
            return False
        elif word[0] not in node.keys:
            return False
        else:
            return self.search(word[1:], node.keys[word[0]])

    def startsWith(self, prefix: str, node=None) -> bool:
        if node is None:
            node = self.root
        if len(prefix) == 0:
            return True
        elif prefix[0] not in node.keys:
            return False
        else:
            return self.startsWith(prefix[1:], node.keys[prefix[0]])
class Trie:
    def __init__(self):
        # Per-instance root; a class-level dict would be shared by every Trie.
        self.head = {}

    def add(self, word):
        cur = self.head
        for ch in word:
            if ch not in cur:
                cur[ch] = {}
            cur = cur[ch]
        cur['*'] = True

    def search(self, word):
        cur = self.head
        for ch in word:
            if ch not in cur:
                return False
            cur = cur[ch]
        if '*' in cur:
            return True
        else:
            return False

    def printf(self):
        print(self.head)

dictionary = Trie()
dictionary.add("hi")
#dictionary.add("hello")
#dictionary.add("eye")
#dictionary.add("hey")

print(dictionary.search("hi"))
print(dictionary.search("hello"))
print(dictionary.search("hel"))
print(dictionary.search("he"))
dictionary.printf()
Out
True
False
False
False
{'h': {'i': {'*': True}}}
Python Class for Trie
A trie stores a string in O(L) time, where L is the length of the string, so inserting N strings takes O(NL). A string can be searched for in O(L), and the same goes for deletion.
It can be cloned from https://github.com/Parikshit22/pytrie.git
class Node:
    def __init__(self):
        self.children = [None] * 26
        self.isend = False

class trie:
    def __init__(self):
        self.__root = Node()

    def __len__(self):
        return len(self.search_byprefix(''))

    def __str__(self):
        ll = self.search_byprefix('')
        string = ''
        for i in ll:
            string += i
            string += '\n'
        return string

    def chartoint(self, character):
        # Only lowercase 'a'..'z' map to a valid index.
        return ord(character) - ord('a')

    def remove(self, string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                raise ValueError("Keyword doesn't exist in trie")
        if ptr.isend is not True:
            raise ValueError("Keyword doesn't exist in trie")
        ptr.isend = False
        return

    def insert(self, string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                ptr.children[i] = Node()
                ptr = ptr.children[i]
        ptr.isend = True

    def search(self, string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                return False
        if ptr.isend is not True:
            return False
        return True

    def __getall(self, ptr, key, key_list):
        if ptr.isend == True:
            key_list.append(key)
        for i in range(26):
            if ptr.children[i] is not None:
                self.__getall(ptr.children[i], key + chr(ord('a') + i), key_list)

    def search_byprefix(self, key):
        ptr = self.__root
        key_list = []
        length = len(key)
        for idx in range(length):
            i = self.chartoint(key[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                # No word starts with this prefix.
                return key_list
        self.__getall(ptr, key, key_list)
        return key_list

t = trie()
t.insert("shubham")
t.insert("shubhi")
t.insert("minhaj")
t.insert("parikshit")
t.insert("pari")
t.insert("shubh")
t.insert("minakshi")
print(t.search("minhaj"))
print(t.search("shubhk"))
print(t.search_byprefix('m'))
print(len(t))
t.remove("minhaj")
print(t)
Code output
True
False
['minakshi', 'minhaj']
7
minakshi
pari
parikshit
shubh
shubham
shubhi
With prefix search
Here is @senderle's answer, slightly modified to accept prefix search (and not only whole-word matching). It returns True as soon as some word in the trie is a prefix of the input:
_end = '_end_'

def make_trie(words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict[_end] = _end
    return root

def in_trie(trie, word):
    current_dict = trie
    for letter in word:
        if _end in current_dict:
            return True
        if letter not in current_dict:
            return False
        current_dict = current_dict[letter]
    # Input exhausted: it matches only if a word ends exactly here.
    return _end in current_dict

t = make_trie(['hello', 'hi', 'foo', 'bar'])
print(in_trie(t, 'hello world'))
# True
In response to @basj
The following code will capture \b (end of word) letters.
_end = '_end_'

def make_trie(words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict[_end] = _end
    return root

def in_trie(trie, word):
    current_dict = trie
    for letter in word:
        if letter not in current_dict:      # Adjusted the
            return False                    # order of letter
        if _end in current_dict[letter]:    # checks to capture
            return True                     # the last letter.
        current_dict = current_dict[letter]
    return False    # ran out of input before reaching the end of a word

t = make_trie(['hello', 'hi', 'foo', 'bar'])
>>> print(in_trie(t, 'hi'))
True
>>> print(in_trie(t, 'hola'))
False
>>> print(in_trie(t, 'hello friend'))
True
>>> print(in_trie(t, 'hel'))
False
Related
Cannot write a function to retrieve all words in a trie
I have a following Trie implementation:
class TrieNode:
    def __init__(self):
        self.nodes = defaultdict(TrieNode)
        self.is_fullpath = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        curr = self.root
        for char in word:
            curr = curr.nodes[char]
        curr.is_fullpath = True
I'm trying to write a method to retrieve a list of all words in my trie.
t = Trie()
t.insert('a')
t.insert('ab')
print(t.paths())  # ---> ['a', 'ab']
My current implementation looks like this:
def paths(self, node=None):
    if node is None:
        node = self.root
    result = []
    for k, v in node.nodes.items():
        if not node.is_fullpath:
            for el in self.paths(v):
                result.append(str(k) + el)
        else:
            result.append('')
    return result
But it does not seem to return full list of words.
Here are the issues in your code:
It doesn't look further when is_fullpath is True. But you should also look deeper (for longer words) in that case.
It should not check node.is_fullpath but v.is_fullpath.
result.append('') is not correct. It should be result.append(str(k))
So your for loop body could look like this:
if v.is_fullpath:
    result.append(str(k))
for el in self.paths(v):
    result.append(str(k) + el)
I would however do it like this:
Define this recursive generator method on your TrieNode class:
def paths(self, prefix=""):
    if self.is_fullpath:
        yield prefix
    for chr, node in self.nodes.items():
        yield from node.paths(prefix + chr)
Note how this passes the collected characters on the path to the recursive call. If at any time the is_fullpath boolean is True, we yield that path. Always we continue the search recursively via child nodes.
The method on the Trie class is then quite simple:
def paths(self):
    return list(self.root.paths())
Passing a list of strings to be put into trie
I have the code that can build a trie data structure when it is given one string. When I am trying to pass a list of strings, it combines the words into one
class TrieNode:
    def __init__(self):
        self.end = False
        self.children = {}

    def all_words(self, prefix):
        if self.end:
            yield prefix
        for letter, child in self.children.items():
            yield from child.all_words(prefix + letter)

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, words):
        curr = self.root
        #the line I added to read the words from a list is below
        for word in words:
            for letter in word:
                node = curr.children.get(letter)
                if not node:
                    node = TrieNode()
                    curr.children[letter] = node
                curr = node
            curr.end = True

    def all_words_beginning_with_prefix(self, prefix):
        cur = self.root
        for c in prefix:
            cur = cur.children.get(c)
            if cur is None:
                return  # No words with given prefix
        yield from cur.all_words(prefix)
This is the code I use to insert everything into the tree:
lst = ['foo', 'foob', 'foobar', 'foof']
trie = Trie()
trie.insert(lst)
The output I get is
['foo', 'foofoob', 'foofoobfoobar', 'foofoobfoobarfoof']
The output I would like to get is
['foo', 'foob', 'foobar', 'foof']
This is the line I used to get the output (for reproducibility, in case you will need to run the code) - it returns all the words that start with a particular prefix:
print(list(trie.all_words_beginning_with_prefix('foo')))
How do I fix it?
You aren't resetting curr back to the root after each insert, so you're inserting the next word where the last one left off. You'd want something like:
def insert(self, words):
    curr = self.root
    for word in words:
        for letter in word:
            node = curr.children.get(letter)
            if not node:
                node = TrieNode()
                curr.children[letter] = node
            curr = node
        curr.end = True
        curr = self.root  # Reset back to the root
I'd break this up though. I think your insert function is doing too much, and shouldn't be dealing with multiple strings. I'd change it to something like:
def insert(self, word):
    curr = self.root
    for letter in word:
        node = curr.children.get(letter)
        if not node:
            node = TrieNode()
            curr.children[letter] = node
        curr = node
    curr.end = True

def insert_many(self, words):
    for word in words:
        self.insert(word)  # Just loop over self.insert
Now that's a non-problem since each insert is an independent call, and you can't forget to reset curr.
Storing word count in the python trie
I took a list of words and put it into a trie. I would also like to store word count inside for further analysis. What would be the best way to do it? This is the class where I think the frequency would be collected and stored, but I am not sure how to go about it. You can see my attempt, last line in insert is where I try to store the count.
class TrieNode:
    def __init__(self,k):
        self.v = 0
        self.k = k
        self.children = {}

    def all_words(self, prefix):
        if self.end:
            yield prefix
        for letter, child in self.children.items():
            yield from child.all_words(prefix + letter)

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        curr = self.root
        for letter in word:
            node = curr.children.get(letter)
            if not node:
                node = TrieNode()
                curr.children[letter] = node
            curr.v += 1

    def insert_many(self, words):
        for word in words:
            self.insert(word)

    def all_words_beginning_with_prefix(self, prefix):
        cur = self.root
        for c in prefix:
            cur = cur.children.get(c)
            if cur is None:
                return  # No words with given prefix
        yield from cur.all_words(prefix)
I want to store the count so that when I use print(list(trie.all_words_beginning_with_prefix('prefix'))) I would get a result like so:
[(word, count), (word, count)]
While inserting, on seeing any node, it means there's a new word going to be added in that path. Therefore increment your word_count of that node.
class TrieNode:
    def __init__(self, char):
        self.char = char
        self.word_count = 0
        self.children = {}

    def all_words(self, prefix, path):
        if len(self.children) == 0:
            yield prefix + path
        for letter, child in self.children.items():
            yield from child.all_words(prefix, path + letter)

class Trie:
    def __init__(self):
        self.root = TrieNode('')

    def insert(self, word):
        curr = self.root
        for letter in word:
            node = curr.children.get(letter)
            if node is None:
                node = TrieNode(letter)
                curr.children[letter] = node
            curr.word_count += 1  # increment it everytime the node is seen at particular level.
            curr = node

    def insert_many(self, words):
        for word in words:
            self.insert(word)

    def all_words_beginning_with_prefix(self, prefix):
        cur = self.root
        for c in prefix:
            cur = cur.children.get(c)
            if cur is None:
                return  # No words with given prefix
        yield from cur.all_words(prefix, path="")

    def word_count(self, prefix):
        cur = self.root
        for c in prefix:
            cur = cur.children.get(c)
            if cur is None:
                return 0
        return cur.word_count

trie = Trie()
trie.insert_many(["hello", "hi", "random", "heap"])
prefix = "he"
words = [w for w in trie.all_words_beginning_with_prefix(prefix)]
print("Lazy method:\n Prefix: %s, Words: %s, Count: %d" % (prefix, words, len(words)))
print("Proactive method:\n Word count for '%s': %d" % (prefix, trie.word_count(prefix)))
Output:
Lazy method:
 Prefix: he, Words: ['hello', 'heap'], Count: 2
Proactive method:
 Word count for 'he': 2
I would add a field called is_word to the trie node, where is_word is true only for the node holding the last letter of a word. For example, for the word AND, is_word would be true for the trie node holding the letter D. I would update the frequency only for nodes whose is_word is true, not for every letter in the word. So when you iterate from a letter, check whether the node is a word; if it is, return the word and its count. I'm assuming that in your iteration you keep track of the letters and keep adding them to the prefix. Your trie is a multi-way trie.
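The is_word idea could be sketched roughly as follows. CountingTrie, CountingTrieNode and words_with_prefix are names invented for this example, and unlike the description above this sketch keeps walking past a complete word so that longer words with the same prefix are also reported.

```python
class CountingTrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False   # True only on the node of a word's last letter
        self.count = 0         # how many times that word was inserted

class CountingTrie:
    def __init__(self):
        self.root = CountingTrieNode()

    def insert(self, word):
        node = self.root
        for letter in word:
            if letter not in node.children:
                node.children[letter] = CountingTrieNode()
            node = node.children[letter]
        # Only the final node records the frequency.
        node.is_word = True
        node.count += 1

    def words_with_prefix(self, prefix):
        node = self.root
        for letter in prefix:
            node = node.children.get(letter)
            if node is None:
                return   # no words with the given prefix
        yield from self._walk(node, prefix)

    def _walk(self, node, word):
        if node.is_word:
            yield (word, node.count)
        for letter, child in node.children.items():
            yield from self._walk(child, word + letter)

counts = CountingTrie()
for w in ['the', 'then', 'the', 'they', 'a']:
    counts.insert(w)
result = list(counts.words_with_prefix('the'))
# result == [('the', 2), ('then', 1), ('they', 1)]
```

The result has the (word, count) shape asked for in the question.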
Trie Implementation in Python -- Print Keys
I Implemented a Trie data structure using python, now the problem is it doesn't display the keys that Trie is stored in its data structure.
class Node:
    def __init__(self):
        self.children = [None] * 26
        self.endOfTheWord = False

class Trie:
    def __init__(self):
        self.root = self.getNode()

    def getNode(self):
        return Node()

    def charToIndex(self ,ch):
        return ord(ch) - ord('a')

    def insert(self ,word):
        current = self.root
        for i in range(len(word)):
            index = self.charToIndex(word[i])
            if current.children[index] is None:
                current.children[index] = self.getNode()
            current = current.children[index]
        current.endOfTheWord = True

    def printKeys(self):
        str = []
        self.printKeysUtil(self.root ,str)

    def printKeysUtil(self ,root ,str):
        if root.endOfTheWord == True:
            print(''.join(str))
            return
        for i in range(26):
            if root.children[i] is not None:
                ch = chr(97) + chr(i)
                str.append(ch)
                self.printKeysUtil(root.children[i] ,str)
                str.pop()
You could perform a pre-order traversal of the nodes, and wherever you find an end-of-word marker, you zoom up to the root, capturing the letters as you go, in order to get the full word... except that to accomplish this, you would need to store the parent node in each node.
def printKeysUtil(self ,root ,str):
    if root.endOfTheWord == True:
        print(''.join(str))
        # no return here, so that longer words sharing this prefix are printed too
    for i in range(26):
        if root.children[i] is not None:
            ch = chr(97+i)
            str.append(ch)
            self.printKeysUtil(root.children[i] ,str)
            str.pop()
Traversing over dictionary using variable number of keys [duplicate]
I'm interested in tries and DAWGs (direct acyclic word graph) and I've been reading a lot about them but I don't understand what should the output trie or DAWG file look like. Should a trie be an object of nested dictionaries? Where each letter is divided in to letters and so on? Would a lookup performed on such a dictionary be fast if there are 100k or 500k entries? How to implement word-blocks consisting of more than one word separated with - or space? How to link prefix or suffix of a word to another part in the structure? (for DAWG) I want to understand the best output structure in order to figure out how to create and use one. I would also appreciate what should be the output of a DAWG along with trie. I do not want to see graphical representations with bubbles linked to each other, I want to know the output object once a set of words are turned into tries or DAWGs.
Unwind is essentially correct that there are many different ways to implement a trie; and for a large, scalable trie, nested dictionaries might become cumbersome -- or at least space inefficient. But since you're just getting started, I think that's the easiest approach; you could code up a simple trie in just a few lines. First, a function to construct the trie: >>> _end = '_end_' >>> >>> def make_trie(*words): ... root = dict() ... for word in words: ... current_dict = root ... for letter in word: ... current_dict = current_dict.setdefault(letter, {}) ... current_dict[_end] = _end ... return root ... >>> make_trie('foo', 'bar', 'baz', 'barz') {'b': {'a': {'r': {'_end_': '_end_', 'z': {'_end_': '_end_'}}, 'z': {'_end_': '_end_'}}}, 'f': {'o': {'o': {'_end_': '_end_'}}}} If you're not familiar with setdefault, it simply looks up a key in the dictionary (here, letter or _end). If the key is present, it returns the associated value; if not, it assigns a default value to that key and returns the value ({} or _end). (It's like a version of get that also updates the dictionary.) Next, a function to test whether the word is in the trie: >>> def in_trie(trie, word): ... current_dict = trie ... for letter in word: ... if letter not in current_dict: ... return False ... current_dict = current_dict[letter] ... return _end in current_dict ... >>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'baz') True >>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barz') True >>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barzz') False >>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'bart') False >>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'ba') False I'll leave insertion and removal to you as an exercise. Of course, Unwind's suggestion wouldn't be much harder. There might be a slight speed disadvantage in that finding the correct sub-node would require a linear search. But the search would be limited to the number of possible characters -- 27 if we include _end. 
Also, there's nothing to be gained by creating a massive list of nodes and accessing them by index as he suggests; you might as well just nest the lists. Finally, I'll add that creating a directed acyclic word graph (DAWG) would be a bit more complex, because you have to detect situations in which your current word shares a suffix with another word in the structure. In fact, this can get rather complex, depending on how you want to structure the DAWG! You may have to learn some stuff about Levenshtein distance to get it right.
Here is a list of python packages that implement Trie: marisa-trie - a C++ based implementation. python-trie - a simple pure python implementation. PyTrie - a more advanced pure python implementation. pygtrie - a pure python implementation by Google. datrie - a double array trie implementation based on libdatrie.
Have a look at this: https://github.com/kmike/marisa-trie Static memory-efficient Trie structures for Python (2.x and 3.x). String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python dict; the raw lookup speed is comparable; trie also provides fast advanced methods like prefix search. Based on marisa-trie C++ library. Here's a blog post from a company using marisa trie successfully: https://www.repustate.com/blog/sharing-large-data-structure-across-processes-python/ At Repustate, much of our data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our particular case, our dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact for a given HTTP request, 4 or 5 models might be accessed, each doing 20-30 lookups. So the problem we face is how do we keep things fast for the client as well as light as possible for the server. ... I found this package, marisa tries, which is a Python wrapper around a C++ implementation of a marisa trie. “Marisa” is an acronym for Matching Algorithm with Recursively Implemented StorAge. What’s great about marisa tries is the storage mechanism really shrinks how much memory you need. The author of the Python plugin claimed 50-100X reduction in size – our experience is similar. What’s great about the marisa trie package is that the underlying trie structure can be written to disk and then read in via a memory mapped object. With a memory mapped marisa trie, all of our requirements are now met. Our server’s memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python’s dictionary implementation. There are also a couple of pure-python implementations, though unless you're on a restricted platform you'd want to use the C++ backed implementation above for best performance: https://github.com/bdimmick/python-trie https://pypi.python.org/pypi/PyTrie
Modified from senderle's method (above). I found that Python's defaultdict is ideal for creating a trie or a prefix tree.
from collections import defaultdict

class Trie:
    """
    Implement a trie with insert, search, and startsWith methods.
    """
    def __init__(self):
        # defaultdict() with no factory behaves like a plain dict;
        # setdefault() below does the work of creating child nodes.
        self.root = defaultdict()

    # @param {string} word
    # @return {void}
    # Inserts a word into the trie.
    def insert(self, word):
        current = self.root
        for letter in word:
            current = current.setdefault(letter, {})
        current.setdefault("_end")

    # @param {string} word
    # @return {boolean}
    # Returns if the word is in the trie.
    def search(self, word):
        current = self.root
        for letter in word:
            if letter not in current:
                return False
            current = current[letter]
        if "_end" in current:
            return True
        return False

    # @param {string} prefix
    # @return {boolean}
    # Returns if there is any word in the trie
    # that starts with the given prefix.
    def startsWith(self, prefix):
        current = self.root
        for letter in prefix:
            if letter not in current:
                return False
            current = current[letter]
        return True

# Now test the class
test = Trie()
test.insert('helloworld')
test.insert('ilikeapple')
test.insert('helloz')
print(test.search('hello'))       # False: 'hello' is only a prefix
print(test.startsWith('hello'))   # True
print(test.search('ilikeapple'))  # True
There's no "should"; it's up to you. Various implementations will have different performance characteristics, take various amounts of time to implement, understand, and get right. This is typical for software development as a whole, in my opinion. I would probably first try having a global list of all trie nodes so far created, and representing the child-pointers in each node as a list of indices into the global list. Having a dictionary just to represent the child linking feels too heavy-weight, to me.
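The "global list of nodes with index-based child pointers" idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name FlatTrie, the fixed 26-letter lowercase alphabet, and the use of -1 for "no child" are all choices made here, not something the answer specifies.

```python
class FlatTrie:
    """Trie kept in one flat list; children are indices into that list."""

    ALPHABET_SIZE = 26  # assumption: words contain only lowercase a-z

    def __init__(self):
        # Each node is [is_end, child_indices]; index 0 is the root.
        # A child index of -1 means "no child for this letter".
        self.nodes = [self._new_node()]

    def _new_node(self):
        return [False, [-1] * self.ALPHABET_SIZE]

    def _slot(self, ch):
        slot = ord(ch) - ord("a")
        if not 0 <= slot < self.ALPHABET_SIZE:
            raise ValueError("only lowercase a-z is supported in this sketch")
        return slot

    def insert(self, word):
        idx = 0
        for ch in word:
            slot = self._slot(ch)
            child = self.nodes[idx][1][slot]
            if child == -1:
                # Append a new node and store its index in the parent.
                self.nodes.append(self._new_node())
                child = len(self.nodes) - 1
                self.nodes[idx][1][slot] = child
            idx = child
        self.nodes[idx][0] = True

    def search(self, word):
        idx = 0
        for ch in word:
            idx = self.nodes[idx][1][self._slot(ch)]
            if idx == -1:
                return False
        return self.nodes[idx][0]


flat = FlatTrie()
for w in ["foo", "bar", "baz"]:
    flat.insert(w)
print(flat.search("bar"), flat.search("ba"))
```

Because nodes refer to each other by integer index rather than by object reference, the whole structure is a single list that is easy to serialize, at the cost of a fixed-size child array per node.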
Using defaultdict and the reduce function.
Create Trie
from functools import reduce
from collections import defaultdict
T = lambda: defaultdict(T)
trie = T()
reduce(dict.__getitem__, 'how', trie)['isEnd'] = True
Trie:
defaultdict(<function __main__.<lambda>()>, {'h': defaultdict(<function __main__.<lambda>()>, {'o': defaultdict(<function __main__.<lambda>()>, {'w': defaultdict(<function __main__.<lambda>()>, {'isEnd': True})})})})
Search In Trie:
curr = trie
for w in 'how':
    if w in curr:
        curr = curr[w]
    else:
        print("Not Found")
        break
else:
    # Reached only when the loop did not break.
    # .get() avoids creating an 'isEnd' entry as a side effect of the lookup.
    if curr.get('isEnd'):
        print('Found')
    else:
        print("Not Found")
from collections import defaultdict
Define Trie:
_trie = lambda: defaultdict(_trie)
Create Trie:
trie = _trie()
for s in ["cat", "bat", "rat", "cam"]:
    curr = trie
    for c in s:
        curr = curr[c]
    curr.setdefault("_end")
Lookup:
def word_exist(trie, word):
    curr = trie
    for w in word:
        if w not in curr:
            return False
        curr = curr[w]
    return '_end' in curr
Test:
print(word_exist(trie, 'cam'))
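One thing worth knowing about a recursive defaultdict trie: indexing with `[]` creates missing nodes, so a lookup should test membership with `in` first. The small sketch below illustrates the difference; the helper names count_nodes and word_exists are made up for this example.

```python
from collections import defaultdict


def _trie():
    return defaultdict(_trie)


trie = _trie()
for word in ["cat", "cam"]:
    curr = trie
    for c in word:
        curr = curr[c]          # indexing creates the child when missing
    curr["_end"] = True


def count_nodes(node):
    # Count this node plus all descendants, ignoring the end marker.
    return 1 + sum(count_nodes(child)
                   for key, child in node.items() if key != "_end")


def word_exists(trie, word):
    curr = trie
    for c in word:
        if c not in curr:       # `in` never creates a key
            return False
        curr = curr[c]
    return "_end" in curr


nodes_before = count_nodes(trie)

found = word_exists(trie, "dog")
nodes_after_safe_lookup = count_nodes(trie)

# A careless lookup that indexes directly grows the trie as a side effect.
curr = trie
for c in "dog":
    curr = curr[c]
nodes_after_indexing = count_nodes(trie)

print(nodes_before, nodes_after_safe_lookup, nodes_after_indexing)
```

The safe lookup leaves the node count unchanged, while the direct indexing adds one node per letter of the missing word.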
Here is the full code using a TrieNode class. It also implements an auto_complete method that returns the words matching a prefix. Since we use a dictionary to store the children, there is no need to convert characters to integers and back, and no need to allocate array memory in advance.
class TrieNode:
    def __init__(self):
        # Dict: key = letter, value = TrieNode
        self.children = {}
        self.end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def build_trie(self, words):
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.end = True

    def search(self, word):
        node = self.root
        for char in word:
            if char in node.children:
                node = node.children[char]
            else:
                return False
        return node.end

    def _walk_trie(self, node, word, word_list):
        if node.children:
            for char in node.children:
                word_new = word + char
                if node.children[char].end:
                    word_list.append(word_new)
                self._walk_trie(node.children[char], word_new, word_list)

    def auto_complete(self, partial_word):
        node = self.root
        word_list = []
        # find the node for the last char of partial_word
        for char in partial_word:
            if char in node.children:
                node = node.children[char]
            else:
                # partial_word not found: return the empty list
                return word_list
        if node.end:
            word_list.append(partial_word)
        # word_list is filled with the suggestions that start with partial_word
        self._walk_trie(node, partial_word, word_list)
        return word_list

Create a Trie
t = Trie()
words = ['hi', 'hieght', 'rat', 'ram', 'rattle', 'hill']
t.build_trie(words)
Search for a word
words = ['hi', 'hello']
for word in words:
    print(word, t.search(word))
hi True
hello False
Search for words using a prefix
partial_word = 'ra'
t.auto_complete(partial_word)
['rat', 'rattle', 'ram']
If you want a trie implemented as a Python class, here is something I wrote after reading about them:
class Trie:
    def __init__(self):
        self.__final = False
        self.__nodes = {}

    def __repr__(self):
        return 'Trie<len={}, final={}>'.format(len(self), self.__final)

    def __getstate__(self):
        return self.__final, self.__nodes

    def __setstate__(self, state):
        self.__final, self.__nodes = state

    def __len__(self):
        return len(self.__nodes)

    def __bool__(self):
        return self.__final

    def __contains__(self, array):
        try:
            return self[array]
        except KeyError:
            return False

    def __iter__(self):
        yield self
        for node in self.__nodes.values():
            yield from node

    def __getitem__(self, array):
        return self.__get(array, False)

    def create(self, array):
        self.__get(array, True).__final = True

    def read(self):
        yield from self.__read([])

    def update(self, array):
        self[array].__final = True

    def delete(self, array):
        self[array].__final = False

    def prune(self):
        for key, value in tuple(self.__nodes.items()):
            if not value.prune():
                del self.__nodes[key]
        if not len(self):
            self.delete([])
        return self

    def __get(self, array, create):
        if array:
            head, *tail = array
            if create and head not in self.__nodes:
                self.__nodes[head] = Trie()
            return self.__nodes[head].__get(tail, create)
        return self

    def __read(self, name):
        if self.__final:
            yield name
        for key, value in self.__nodes.items():
            yield from value.__read(name + [key])
This version uses recursion (updated for Python 3: input instead of raw_input, and "in" instead of has_key):
import pprint
from collections import deque

pp = pprint.PrettyPrinter(indent=4)
inp = input("Enter a sentence to show as trie\n")
# split() without arguments ignores repeated spaces, so no empty words are produced
words = inp.split()
trie = {}

def trie_recursion(trie_ds, word):
    try:
        letter = word.popleft()
        out = trie_recursion(trie_ds.get(letter, {}), word)
    except IndexError:
        # End of the word
        return {}
    # Don't update if the letter is already present
    if letter not in trie_ds:
        trie_ds[letter] = out
    return trie_ds

for word in words:
    # Go through each word
    trie = trie_recursion(trie, deque(word))
pprint.pprint(trie)

Output:
$ python trie.py
Enter a sentence to show as trie
foo bar baz fun
{
    'b': {
        'a': {
            'r': {},
            'z': {}
        }
    },
    'f': {
        'o': {
            'o': {}
        },
        'u': {
            'n': {}
        }
    }
}
This is much like a previous answer but simpler to read:
def make_trie(words):
    trie = {}
    for word in words:
        head = trie
        for char in word:
            if char not in head:
                head[char] = {}
            head = head[char]
        head["_end_"] = "_end_"
    return trie
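To show how such a nested-dict trie might be used, here is a rough sketch of listing every stored word that starts with a given prefix. The function words_with_prefix is a hypothetical helper added for illustration; make_trie is repeated so the snippet runs on its own.

```python
_END = "_end_"


def make_trie(words):
    trie = {}
    for word in words:
        head = trie
        for char in word:
            if char not in head:
                head[char] = {}
            head = head[char]
        head[_END] = _END
    return trie


def words_with_prefix(trie, prefix):
    # Walk down to the node that represents the prefix.
    node = trie
    for char in prefix:
        if char not in node:
            return []
        node = node[char]
    # Collect every word below that node with an explicit stack.
    results = []
    stack = [(node, prefix)]
    while stack:
        current, text = stack.pop()
        for key, child in current.items():
            if key == _END:
                results.append(text)
            else:
                stack.append((child, text + key))
    return sorted(results)


trie = make_trie(["foo", "bar", "baz", "barz"])
print(words_with_prefix(trie, "ba"))
```

The explicit stack avoids hitting the recursion limit on very long keys; the final sort is only there to make the result order predictable.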
class TrieNode:
    def __init__(self):
        self.keys = {}
        self.end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str, node=None) -> None:
        if node is None:
            node = self.root
        # insertion is a recursive operation
        # this is the base case to exit the recursion
        if len(word) == 0:
            node.end = True
            return
        # if this key does not exist, create a new node
        elif word[0] not in node.keys:
            node.keys[word[0]] = TrieNode()
            self.insert(word[1:], node.keys[word[0]])
        # that means the key exists
        else:
            self.insert(word[1:], node.keys[word[0]])

    def search(self, word: str, node=None) -> bool:
        if node is None:
            node = self.root
        # this is the positive base case to exit the recursion
        if len(word) == 0 and node.end:
            return True
        elif len(word) == 0:
            return False
        elif word[0] not in node.keys:
            return False
        else:
            return self.search(word[1:], node.keys[word[0]])

    def startsWith(self, prefix: str, node=None) -> bool:
        if node is None:
            node = self.root
        if len(prefix) == 0:
            return True
        elif prefix[0] not in node.keys:
            return False
        else:
            return self.startsWith(prefix[1:], node.keys[prefix[0]])
class Trie:
    def __init__(self):
        # Per-instance root; a class-level "head = {}" would be shared by all Trie objects
        self.head = {}

    def add(self, word):
        cur = self.head
        for ch in word:
            if ch not in cur:
                cur[ch] = {}
            cur = cur[ch]
        cur['*'] = True

    def search(self, word):
        cur = self.head
        for ch in word:
            if ch not in cur:
                return False
            cur = cur[ch]
        if '*' in cur:
            return True
        else:
            return False

    def printf(self):
        print(self.head)

dictionary = Trie()
dictionary.add("hi")
#dictionary.add("hello")
#dictionary.add("eye")
#dictionary.add("hey")
print(dictionary.search("hi"))
print(dictionary.search("hello"))
print(dictionary.search("hel"))
print(dictionary.search("he"))
dictionary.printf()

Out:
True
False
False
False
{'h': {'i': {'*': True}}}
Python class for a trie
A trie can store a string in O(L) time, where L is the length of the string, so inserting N strings takes O(NL). A string can be searched for in O(L) as well, and the same goes for deletion.
It can be cloned from https://github.com/Parikshit22/pytrie.git
class Node:
    def __init__(self):
        # one slot per lowercase letter a-z
        self.children = [None] * 26
        self.isend = False

class trie:
    def __init__(self):
        self.__root = Node()

    def __len__(self):
        return len(self.search_byprefix(''))

    def __str__(self):
        ll = self.search_byprefix('')
        string = ''
        for i in ll:
            string += i
            string += '\n'
        return string

    def chartoint(self, character):
        # assumes character is a lowercase letter a-z
        return ord(character) - ord('a')

    def remove(self, string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                raise ValueError("Keyword doesn't exist in trie")
        if ptr.isend is not True:
            raise ValueError("Keyword doesn't exist in trie")
        # only the end flag is cleared; the nodes themselves are kept
        ptr.isend = False
        return

    def insert(self, string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                ptr.children[i] = Node()
                ptr = ptr.children[i]
        ptr.isend = True

    def search(self, string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                return False
        if ptr.isend is not True:
            return False
        return True

    def __getall(self, ptr, key, key_list):
        if ptr is None:
            key_list.append(key)
            return
        if ptr.isend == True:
            key_list.append(key)
        for i in range(26):
            if ptr.children[i] is not None:
                self.__getall(ptr.children[i], key + chr(ord('a') + i), key_list)

    def search_byprefix(self, key):
        ptr = self.__root
        key_list = []
        length = len(key)
        for idx in range(length):
            i = self.chartoint(key[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                return None
        self.__getall(ptr, key, key_list)
        return key_list

t = trie()
t.insert("shubham")
t.insert("shubhi")
t.insert("minhaj")
t.insert("parikshit")
t.insert("pari")
t.insert("shubh")
t.insert("minakshi")
print(t.search("minhaj"))
print(t.search("shubhk"))
print(t.search_byprefix('m'))
print(len(t))
print(t.remove("minhaj"))
print(t)

Code output (remove returns None, and "minhaj" no longer appears once it has been removed):
True
False
['minakshi', 'minhaj']
7
None
minakshi
pari
parikshit
shubh
shubham
shubhi
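The remove method above clears the end-of-word flag but leaves the nodes in place. If reclaiming memory matters, deletion could also prune nodes that no longer lead to any word, still in O(L). The following is a minimal sketch of that idea, written against a dict-based node for brevity; the names Node, insert, remove and count_nodes here are illustrative and are not part of the pytrie repository.

```python
class Node:
    def __init__(self):
        self.children = {}   # letter -> Node
        self.is_end = False


def insert(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            node.children[ch] = Node()
        node = node.children[ch]
    node.is_end = True


def remove(root, word):
    # Record the path from the root to the node that ends the word.
    path = [root]
    for ch in word:
        nxt = path[-1].children.get(ch)
        if nxt is None:
            raise ValueError("Keyword doesn't exist in trie")
        path.append(nxt)
    if not path[-1].is_end:
        raise ValueError("Keyword doesn't exist in trie")
    path[-1].is_end = False
    # Walk back up, deleting nodes that have no children and end no word.
    for depth in range(len(word), 0, -1):
        node = path[depth]
        if node.children or node.is_end:
            break
        del path[depth - 1].children[word[depth - 1]]


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())


root = Node()
insert(root, "pari")
insert(root, "parikshit")
nodes_before = count_nodes(root)
remove(root, "parikshit")
nodes_after = count_nodes(root)
print(nodes_before, nodes_after)
```

The backward walk stops at the first node that is still needed, so removing "parikshit" keeps the path for "pari" intact.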
With prefix search
Here is @senderle's answer, slightly modified to accept prefix search (and not only whole-word matching):
_end = '_end_'

def make_trie(words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict[_end] = _end
    return root

def in_trie(trie, word):
    current_dict = trie
    for letter in word:
        # a stored word is a prefix of the input
        if _end in current_dict:
            return True
        if letter not in current_dict:
            return False
        current_dict = current_dict[letter]
    # the input was consumed completely: it matches only if a word ends here
    return _end in current_dict

t = make_trie(['hello', 'hi', 'foo', 'bar'])
print(in_trie(t, 'hello world'))  # True
In response to @basj
The following code will capture \b (end of word) letters.
_end = '_end_'

def make_trie(words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict[_end] = _end
    return root

def in_trie(trie, word):
    current_dict = trie
    for letter in word:
        if letter not in current_dict:
            return False
        # Adjusted the order of the letter checks to capture the last letter.
        if _end in current_dict[letter]:
            return True
        current_dict = current_dict[letter]
    # No stored word is a prefix of the input (explicit False instead of an implicit None)
    return False

t = make_trie(['hello', 'hi', 'foo', 'bar'])
>>> print(in_trie(t, 'hi'))
True
>>> print(in_trie(t, 'hola'))
False
>>> print(in_trie(t, 'hello friend'))
True
>>> print(in_trie(t, 'hel'))
False