I was trying to see whether we can use a constructor inside a defaultdict, but the code does not run: I get a recursion error. Is this possible?
from collections import defaultdict

class TrieNode:
    def __init__(self, char):
        self.children = defaultdict(TrieNode(char))
        self.is_word = False

a = TrieNode('b')
There is nothing wrong with using a defaultdict in your constructor. The problem is that you need to pass it a function that it will call when you add new keys. You are currently calling the function when you make the dictionary. As a result, TrieNode('b') keeps calling itself until Python raises a recursion error.
You need to call it with something like:
self.children = defaultdict(TrieNode)
Then when you reference an unknown key in children it will call TrieNode() for you. This means, however, that you don't want to take an additional argument in the constructor.
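To make the "pass the function, don't call it" point concrete, here is a minimal sketch. The factory make_value and the calls list are made up for illustration; the only real API used is defaultdict itself.

```python
from collections import defaultdict

calls = []

def make_value():
    # Record each invocation so we can see when the factory actually runs.
    calls.append("called")
    return []

d = defaultdict(make_value)   # pass the function itself, not make_value()
calls_before_lookup = list(calls)

d["x"].append(1)              # missing key: the factory runs once
d["x"].append(2)              # existing key: the factory is not called again

print(calls_before_lookup, calls, dict(d))
```

The factory is only invoked on a missing key, which is why defaultdict(TrieNode) is safe while defaultdict(TrieNode(char)) recurses straight away.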
That's probably ok because you generally add words to a trie and will need to add many words through the same node. One option would be to do something like:
from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.is_word = False
        self.val = ''

    def add(self, word):
        self.val = word[0]
        if len(word) == 1:
            self.is_word = True
        else:
            self.children[word[0]].add(word[1:])

    def words(self):
        if self.is_word:
            yield self.val
        for letter, node in self.children.items():
            yield from (letter + child for child in node.words())
You can then add words to it and it will make TrieNodes in the default dictionary as it goes:
node = TrieNode()
node.add("dog")
node.add("catnip")
node.add("cats")
node.add("cat")
node.add("crunch")
node.children['c'].children
> defaultdict(__main__.TrieNode,
{'a': <__main__.TrieNode at 0x179c70048>,
'r': <__main__.TrieNode at 0x179c70eb8>})
You can see that the root's children has a 'c' key pointing to a TrieNode, whose own children is a defaultdict with 'a' and 'r' keys, each pointing to the next node.
This allows you to easily pull out the words with a generator:
list(node.words())
> ['dog', 'cat', 'cats', 'catnip', 'crunch']
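One caveat with the layout above, as far as I can tell: a word's last letter is kept in val on the node before it, so two words that differ only in their final letter (say "cat" and "car") overwrite each other's val and only one of them comes back from words(). A minimal sketch of a variant that marks the node reached after the last letter instead; the class name EndMarkedTrieNode is hypothetical.

```python
from collections import defaultdict

class EndMarkedTrieNode:
    """Hypothetical variant: is_word lives on the node reached after the last letter."""

    def __init__(self):
        self.children = defaultdict(EndMarkedTrieNode)
        self.is_word = False

    def add(self, word):
        if not word:
            # Every letter has been consumed, so this node ends a word.
            self.is_word = True
        else:
            self.children[word[0]].add(word[1:])

    def words(self):
        if self.is_word:
            yield ''
        for letter, node in self.children.items():
            yield from (letter + rest for rest in node.words())

root = EndMarkedTrieNode()
for w in ["cat", "car", "cats"]:
    root.add(w)

found = sorted(root.words())
print(found)
```

With this layout no val attribute is needed, since each word is rebuilt purely from the keys along its path.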
I have a list of fruit names that I need to keep as unique string identifiers. Here's an example:
fruit_names = ['banana_001','apple_001','banana_002']
There is also a function that acts on fruit_names and adds a new fruit to the list. If it finds the fruit on this list, it increments the ID after the underscore by 1. If it doesn't find the fruit at all, it starts the naming at 1:
>>> fruits.add(fruit_names, 'apple')
>>> fruits.add(fruit_names, 'orange')
>>> print(fruit_names)
['banana_001', 'apple_001', 'banana_002', 'apple_002', 'orange_001']
I have a hacky implementation I am not happy with for fruits.add() and was wondering if there's a super simple way of accomplishing the above that I may be missing.
You can create a function _add:
def _add(_l: list, _item: str) -> list:
    return _l + [_item + '_' + str(sum(c.split('_')[0] == _item for c in _l) + 1).zfill(3)]

fruit_names = ['banana_001', 'apple_001', 'banana_002']
for i in ['banana', 'apple', 'pear']:
    fruit_names = _add(fruit_names, i)
Output:
['banana_001', 'apple_001', 'banana_002', 'banana_003', 'apple_002', 'pear_001']
Edit: if you wish to treat add as a method, you can create a linked-list:
class Fruits:
    def __init__(self, _val=None, _c=None):
        self.head, self._next = _val if _val is None else f'{_val}_{str(_c).zfill(3)}', None

    def __str__(self):
        return self.head if self._next is None else f'{self.head}, {str(self._next)}'

    def __repr__(self):
        return f'[{str(self)}]'

    def add(self, fruit, _count=1):
        if self.head is None:
            self.head = f'{fruit}_{str(_count).zfill(3)}'
        else:
            getattr(self._next, 'add', lambda x, y: setattr(self, '_next', Fruits(x, y)))(fruit, _count + (self.head.split('_')[0] == fruit))

fruits = Fruits()
for i in ['banana', 'apple', 'banana', 'peach', 'orange', 'banana', 'apple', 'banana']:
    fruits.add(i)
Output:
[banana_001, apple_001, banana_002, peach_001, orange_001, banana_003, apple_002, banana_004]
The solutions presented by Ajax1234 are pretty good, but they would be slow if you have a very large list to go through. That's because each addition takes O(N) time, since it needs to recount the number of times that the new fruit has appeared so far in the list. If instead you keep track of the counts as you go using a dictionary, it can be much more efficient (amortized O(1) time per addition).
class FruitList:
    def __init__(self):
        self.fruits = []
        self.counts = {}

    def add(self, fruit):
        self.counts[fruit] = count = self.counts.get(fruit, 0) + 1
        self.fruits.append(f"{fruit}_{count:03d}")

    def __repr__(self):
        return repr(self.fruits)
Here's an example of it in action using one-character fruit names (just because it's easy to generate them):
>>> fl = FruitList()
>>> for f in 'abcabbaabc':
...     fl.add(f)
...
>>> fl
['a_001', 'b_001', 'c_001', 'a_002', 'b_002', 'b_003', 'a_003', 'a_004', 'b_004', 'c_002']
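If you already have a list such as fruit_names and want to keep using it, the same counting idea can be bootstrapped from the existing entries. This is only a sketch: count_existing and add_fruit are hypothetical helper names, and it assumes every entry has the form name_NNN with a numeric suffix.

```python
from collections import Counter

def count_existing(names):
    """Rebuild per-fruit counters from entries like 'banana_002' (assumed format)."""
    counts = Counter()
    for name in names:
        fruit, _, suffix = name.rpartition('_')
        # Keep the highest suffix seen so far for each fruit.
        counts[fruit] = max(counts[fruit], int(suffix))
    return counts

def add_fruit(names, counts, fruit):
    """Append the next identifier for fruit to names, in place."""
    counts[fruit] += 1
    names.append(f"{fruit}_{counts[fruit]:03d}")

fruit_names = ['banana_001', 'apple_001', 'banana_002']
counts = count_existing(fruit_names)   # one O(N) pass up front
add_fruit(fruit_names, counts, 'apple')
add_fruit(fruit_names, counts, 'orange')
print(fruit_names)
```

After the initial pass each addition is a dictionary update plus an append, so the list itself never has to be rescanned.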
I am looking for tips on how to decrease memory usage in Python. I am using this piece of code as the main structure to hold my data:
http://stevehanov.ca/blog/index.php?id=114
I need it for proximity word matching served by a Flask server. I need to store well over 20 million distinct strings (and the number will grow). Currently I get a MemoryError when trying to put around 14 million into the trie.
I just add a dictionary to hold some value with quick access (I need it, but it can be considered as a kind of ID of appearance, it is not directly related to the word)
class TrieNode:
    values = {}

    def __init__(self):
        self.word = None
        self.children = {}
        global NodeCount
        NodeCount += 1

    def insert(self, word, value):
        node = self
        for letter in word:
            if letter not in node.children:
                node.children[letter] = TrieNode()
            node = node.children[letter]
        TrieNode.values[word] = value
        node.word = word
I am not familiar with Python optimization. Is there any way to make the "letter" object smaller to save some memory?
Please note that my difficulty comes from the fact that a letter is not limited to [a-z]: I need to handle the whole Unicode range (accented characters, among others). It is a single character, though, so its memory footprint should be small. How can I use the code point instead of the string object, and would that be more memory efficient?
EDIT: adding some more information following the reply from #juanpa-arrivillaga
First, I see no difference when using __slots__: on my computer the memory usage is the same with or without it.
With __slots__:
>>> class TrieNode:
...     NodeCount = 0
...     __slots__ = "word", "children"
...     def __init__(self):
...         self.word = None
...         self.children = {}
...         #global NodeCount # my goal is to encapsulate NodeCount in the class itself
...         TrieNode.NodeCount += 1
...
>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
176
Without __slots__:
>>> class TrieNode:
...     NodeCount = 0
...     def __init__(self):
...         self.word = None
...         self.children = {}
...         #global NodeCount
...         TrieNode.NodeCount += 1
...
>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
176
So I do not understand why. Where am I going wrong?
Here is something else I tried, using the intern built-in, because this value is a string holding an "id" (so, unlike letter, it is not related to Unicode):
(By the way, my goal with values and NodeCount was to get the equivalent of class/static variables, so that each is shared by all instances of the small objects created. I thought this would save memory and avoid duplicates, but I may have misunderstood the "static-like" concept in Python.)
class TrieNode:
    values = {}  # shared among all instances, so only one structure?
    NodeCount = 0
    __slots__ = "word", "children"

    def __init__(self):
        self.word = None
        self.children = {}
        #global NodeCount
        TrieNode.NodeCount += 1

    def insert(self, word, value=None):
        # value is a string id like "XYZ999999999"
        node = self
        for letter in word:
            codepoint = ord(letter)
            if codepoint not in node.children:
                node.children[codepoint] = TrieNode()
            node = node.children[codepoint]
        node.word = word
        if value is not None:
            lost = TrieNode.values.setdefault(word, [])
            TrieNode.values[word].append(intern(str(value)))
ADDED:
Finally, I should have specified that I am using the Python 2.7.x family.
I was wondering whether fixed-length data types from a library like numpy could help me save some memory; being new to this, I do not know where to look. By the way, the "words" are not real natural-language words but arbitrary-length sequences of characters, and they can be very long.
From your reply, I agree that not storing the word in each node would be efficient, but you need to have a look at the linked article and code. The main goal is not to reconstruct the word but to do very fast approximate string matching against it, and then to get the "value" related to each of the closest matches. I am not sure I understood the purpose of the path down the tree (not reaching the complete tree?); when a match is found, we just need to get the original word that matched (but my understanding may be wrong at this point).
So I need to have this huge dict somewhere, and I wanted to encapsulate it in the class for convenience. But maybe it is too costly in terms of memory?
I also noticed that I already get lower memory usage than in your sample (I do not know why yet). Here is an example value of "letter" contained in the structure:
>>> s = u"\u266f"
>>> ord(s)
9839
>>> sys.getsizeof(s)
28
>>> sys.getsizeof(ord(s))
12
>>> print s
♯
>>> repr(s)
"u'\\u266f'"
Low-hanging fruit: use __slots__ in your node class; otherwise, each TrieNode object carries around a dict.
class TrieNode:
    __slots__ = "word", "children"

    def __init__(self):
        self.word = None
        self.children = {}
Now, each TrieNode object will not carry around an attribute dict. Compare the sizes:
>>> class TrieNode:
...     def __init__(self):
...         self.word = None
...         self.children = {}
...
>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
168
Vs:
>>> class TrieNode:
...     __slots__ = "is_word", "children"
...     def __init__(self):
...         self.is_word = False
...         self.children = {}
...
>>> tn = TrieNode()
>>> sys.getsizeof(tn)
56
>>> tn.__dict__
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'TrieNode' object has no attribute '__dict__'
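One caveat, given that the question mentions Python 2.7: to my knowledge, __slots__ only takes effect on new-style classes there, so a class declared as "class TrieNode:" without inheriting from object keeps its per-instance __dict__. That would explain seeing the same size with and without __slots__. A minimal sketch that should behave the same on Python 2 and 3; the class names SlottedTrieNode and PlainTrieNode are hypothetical.

```python
import sys

class SlottedTrieNode(object):  # inheriting from object is what matters on Python 2
    __slots__ = ("word", "children")

    def __init__(self):
        self.word = None
        self.children = {}

class PlainTrieNode(object):
    def __init__(self):
        self.word = None
        self.children = {}

slotted = SlottedTrieNode()
plain = PlainTrieNode()

has_dict_slotted = hasattr(slotted, "__dict__")
has_dict_plain = hasattr(plain, "__dict__")
print(has_dict_slotted)
print(has_dict_plain)

# Per-instance cost: the object itself plus its attribute dict, if it has one.
# The exact numbers depend on the interpreter version and platform.
print(sys.getsizeof(slotted))
print(sys.getsizeof(plain) + sys.getsizeof(plain.__dict__))
```

If hasattr still reports a __dict__ on the slotted class in your interpreter, the slots are not being applied.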
Another optimization, use int objects. Small int objects are cached, it is probable most of your characters will be in that range anyway, but even if they aren't, an int, while still beefy in Python, is smaller than even a single character string:
>>> 'ñ'
'ñ'
>>> ord('ñ')
241
>>> sys.getsizeof('ñ')
74
>>> sys.getsizeof(ord('ñ'))
28
So you can do something like:
def insert(self, word, value):
    node = self
    for letter in word:
        code_point = ord(letter)
        if code_point not in node.children:
            node.children[code_point] = TrieNode()
        node = node.children[code_point]
    node.is_word = True  # Don't save the word, simply a reference to a singleton
Also, you are keeping around a class variable values dict that is growing enormously, but that information is redundant. You say:
I just add a dictionary to hold some value with quick access (I need
it)
You can reconstruct the words from the path. It should be relatively fast, so I would seriously consider dropping this dict. Check out how much memory it requires (in GB) simply to hold a million short strings:
>>> from sys import getsizeof as sizeof
>>> d = {str(i):i for i in range(1000000)}
>>> (sum(sizeof(k)+sizeof(v) for k,v in d.items()) + sizeof(d)) * 1e-9
0.12483203000000001
You could do something like:
class TrieNode:
    __slots__ = "value", "children"

    def __init__(self):
        self.value = None
        self.children = {}

    def insert(self, word, value):
        node = self
        for letter in word:
            code_point = ord(letter)
            if code_point not in node.children:
                node.children[code_point] = TrieNode()
            node = node.children[code_point]
        node.value = value  # this serves as a signal that it is a word

    def get(self, word, default=None):
        val = self._get_value(word)
        if val is None:
            return default
        else:
            return val

    def _get_value(self, word):
        node = self
        for letter in word:
            code_point = ord(letter)
            try:
                node = node.children[code_point]
            except KeyError:
                return None
        return node.value
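To illustrate what "reconstruct the words from the path" could look like, here is a rough sketch rather than a drop-in replacement. The class name PathTrieNode and its items method are hypothetical, and it uses chr(), which is the Python 3 spelling (Python 2 would need unichr()).

```python
class PathTrieNode(object):
    """Hypothetical node that stores only a value, never the word itself."""
    __slots__ = ("value", "children")

    def __init__(self):
        self.value = None
        self.children = {}

    def insert(self, word, value):
        node = self
        for letter in word:
            code_point = ord(letter)
            if code_point not in node.children:
                node.children[code_point] = PathTrieNode()
            node = node.children[code_point]
        node.value = value

    def items(self, prefix=u""):
        """Yield (word, value) pairs, rebuilding each word from its path."""
        if self.value is not None:
            yield prefix, self.value
        for code_point, child in self.children.items():
            for pair in child.items(prefix + chr(code_point)):
                yield pair

root = PathTrieNode()
root.insert(u"caf\u00e9", "XYZ1")
root.insert(u"cat", "XYZ2")
pairs = sorted(root.items())
print(pairs)
```

The recursion depth equals the length of the longest key, so for very long keys an explicit stack would be the safer choice.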
As part of a beginners' university Python project, I am currently creating a database of words, be it Nouns, Verbs, Determiners, Adjectives.
Now the problem I am having is that the words being read into the program via the lexicon.readfromfile method are being put into the dictionary via an instance of a class ( be it noun, verb or adjective ). This created the problem that I have absolutely no idea how to call these objects from the dictionary since they do not have variables as keys, but rather memory locations (see the following):
{<__main__.Verb object at 0x02F4F110>, <__main__.Noun object at 0x02F4F130>, <__main__.Adjective object at 0x02F4F1D0>, <__main__.Noun object at 0x02F4F170>}
Does anyone have any idea how I can call these keys in such a way that I can make them usable in my code?
Here is the part I'm stuck on:
Add a method getPast() to the Verb class, which returns the past tense of the Verb. Your getPast() method can simply work by retrieving the value of ‘past’ from the attributes.
Here is the majority of the code, leaving out the Noun and Adjective classes:
class Lexicon(object):
    'A container class for word objects'

    def __init__(self):
        self.words = {}

    def addword(self, word):
        self.words[word.stringrep] = word

    def removeword(self, word):
        if word in self.words:
            del self.words[word]  # del(word) would only unbind the local name
            print('Word has been deleted from the Lexicon')
        else:
            print('That word is not in the Lexicon')

    def getword(self, wordstring):
        if wordstring in self.words:
            return self.words[wordstring]
        else:
            return None

    def containsword(self, string):
        if string in self.words:
            return True
        else:
            return False

    def getallwords(self):
        allwordslist = []
        for w in self.words:
            allwordslist.append(self.words[w])
        return set(allwordslist)

    def readfromfile(self, x):
        filehandle = open(x, 'r')
        while True:
            line = filehandle.readline()
            if line == '':
                break
            line = line.strip()
            info = line.split(',')
            if info[1] == 'CN' or info[1] == 'PN':
                noun = Noun(info[0], info[1])
                noun.setattribute('regular', bool(info[2]))
                self.addword(noun)
            elif info[1] == 'A':
                adjective = Adjective(info[0], info[1])
                adjective.setattribute('comparative', bool(info[2]))
                self.addword(adjective)
            elif info[1] == 'V':
                verb = Verb(info[0], info[1])
                verb.setattribute('transitive', bool(info[2]))
                verb.setattribute('past', info[3])
                self.addword(verb)
        filehandle.close()

    def writetofile(self, x):
        filehandle = open(x, 'w')
        for t in self.words.values():
            filehandle.write(t.getFormattedString() + '\n')
        filehandle.close()
#---------------------------------------------------------------------------#
class Word(object):
    'A word of any category'

    def __init__(self, stringrep, category):
        self.wordattribute = {}
        self.stringrep = stringrep
        self.category = category

    def setattribute(self, attributename, attributevalue):
        self.wordattribute[attributename] = attributevalue

    def getvalue(self, name):
        if name in self.wordattribute:
            return self.wordattribute[name]
        else:
            return None  # lowercase `none` would raise a NameError

    def __str__(self):
        return self.stringrep + ':' + self.category

    def __lt__(self, otherword):
        return self.stringrep < otherword.stringrep


class Verb(Word):
    'Represents a Verb.'

    def __init__(self, stringrep, category):
        super().__init__(stringrep, category)

    def istransitive(self):
        # attributes are stored in wordattribute, not directly on self
        return self.getvalue('transitive')

    def getFormattedString(self):
        n = '{stringrep},{category}'
        n = n.format(stringrep=self.stringrep, category=self.category)
        for i in range(1, 2):
            for v, b in self.wordattribute.items():
                n = n + ',' + str(b)
        return n
You have a set there, not a dictionary. A set will let you check to see whether a given instance is in the set quickly and easily, but, as you have found, you can't easily get a specific value back out unless you already know what it is. That's OK because that's not what the set is for.
With a dictionary, you associate a key with a value when you add it to the dictionary. Then you use the key to get the value back out. So make a dictionary rather than a set, and use meaningful keys so you can easily get the value back.
Or, since I see you are already making a list before converting it to a set, just return that; you can easily access the items in the list by index. In other words, don't create the problem in the first place, and you won't have it.
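For the getPast() part specifically, a minimal sketch of the "use meaningful keys" idea. The trimmed-down Word and Verb below are stand-ins for the classes in the question, and the sample verb 'go' / 'went' is made up for illustration.

```python
class Word(object):
    def __init__(self, stringrep, category):
        self.stringrep = stringrep
        self.category = category
        self.wordattribute = {}

    def setattribute(self, name, value):
        self.wordattribute[name] = value

    def getvalue(self, name):
        # dict.get returns None when the attribute was never set.
        return self.wordattribute.get(name)


class Verb(Word):
    def getPast(self):
        # Assumes 'past' was stored via setattribute, as readfromfile does.
        return self.getvalue('past')


# A plain dict keyed by the word's string form, like Lexicon.words.
words = {}
go = Verb('go', 'V')
go.setattribute('past', 'went')
words[go.stringrep] = go

past = words['go'].getPast()
print(past)
```

Because the dictionary is keyed by the string form, you never need to know the object's memory address: look the word up by name and call the method on what comes back.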
I'm trying to implement an iterator class for not-necessarily-binary trees in Python. After the iterator is constructed with a tree's root node, its next() function can be called repeatedly to traverse the tree in depth-first order (e.g., this order), finally returning None when there are no nodes left.
Here is the basic Node class for a tree:
class Node(object):
    def __init__(self, title, children=None):
        self.title = title
        self.children = children or []
        self.visited = False

    def __str__(self):
        return self.title
As you can see above, I introduced a visited property to the nodes for my first approach, since I didn't see a way around it. With that extra measure of state, the Iterator class looks like this:
class Iterator(object):
    def __init__(self, root):
        self.stack = []
        self.current = root

    def next(self):
        if self.current is None:
            return None
        self.stack.append(self.current)
        self.current.visited = True
        # Root case
        if len(self.stack) == 1:
            return self.current
        while self.stack:
            self.current = self.stack[-1]
            for child in self.current.children:
                if not child.visited:
                    self.current = child
                    return child
            self.stack.pop()
This is all well and good, but I want to get rid of the need for the visited property, without resorting to recursion or any other alterations to the Node class.
All the state I need should be taken care of in the iterator, but I'm at a loss about how that can be done. Keeping a visited list for the whole tree is non-scalable and out of the question, so there must be a clever way to use the stack.
What especially confuses me is this--since the next() function, of course, returns, how can I remember where I've been without marking anything or using excess storage? Intuitively, I think of looping over children, but that logic is broken/forgotten when the next() function returns!
UPDATE - Here is a small test:
tree = Node(
    'A', [
        Node('B', [
            Node('C', [
                Node('D')
            ]),
            Node('E'),
        ]),
        Node('F'),
        Node('G'),
    ])

iter = Iterator(tree)
out = object()
while out:
    out = iter.next()
    print out
If you really must avoid recursion, this iterator works:
from collections import deque

def node_depth_first_iter(node):
    stack = deque([node])
    while stack:
        # Pop out the first element in the stack
        node = stack.popleft()
        yield node
        # Push children onto the front of the stack.
        # Note that with deque.extendleft, the first one in is the last
        # one out, so we need to push them in reverse order.
        stack.extendleft(reversed(node.children))
With that said, I think that you're thinking about this too hard. A good-ole' (recursive) generator also does the trick:
class Node(object):
    def __init__(self, title, children=None):
        self.title = title
        self.children = children or []

    def __str__(self):
        return self.title

    def __iter__(self):
        yield self
        for child in self.children:
            for node in child:
                yield node
Both of these pass your tests:
expected = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
# Test recursive generator using Node.__iter__
assert [str(n) for n in tree] == expected
# test non-recursive Iterator
assert [str(n) for n in node_depth_first_iter(tree)] == expected
and you can easily make Node.__iter__ use the non-recursive form if you prefer:
def __iter__(self):
    return node_depth_first_iter(self)
That could still potentially hold every label, though. I want the
iterator to keep only a subset of the tree at a time.
But you already are holding everything. Remember that an object is essentially a dictionary with an entry for each attribute. Having self.visited = False in the __init__ of Node means you are storing a redundant "visited" key and False value for every single Node object no matter what. A set, at least, also has the potential of not holding every single node ID. Try this:
class Iterator(object):
    def __init__(self, root):
        self.visited_ids = set()
        ...

    def next(self):
        ...
        #self.current.visited = True
        self.visited_ids.add(id(self.current))
        ...
        #if not child.visited:
        if id(child) not in self.visited_ids:
Looking up the ID in the set should be just as fast as accessing a node's attribute. The only way this can be more wasteful than your solution is the overhead of the set object itself (not its elements), which is only a concern if you have multiple concurrent iterators (which you obviously don't, otherwise the node visited attribute couldn't be useful to you).
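If the goal is to keep only part of the tree in the iterator at any time, another option (not taken from the answers above, just a sketch) is to let the stack hold one child iterator per level of the current path. Each iterator remembers where it stopped, so no visited flag or visited set is needed and the extra storage is proportional to the depth. The class name PathIterator is hypothetical; Node is the question's class without the visited attribute.

```python
class Node(object):
    def __init__(self, title, children=None):
        self.title = title
        self.children = children or []

    def __str__(self):
        return self.title


class PathIterator(object):
    """Hypothetical iterator: the stack holds one child iterator per level."""

    def __init__(self, root):
        self.root = root
        self.stack = None  # None means the root has not been returned yet

    def next(self):
        if self.stack is None:
            # First call: return the root and start iterating its children.
            self.stack = [iter(self.root.children)]
            return self.root
        while self.stack:
            # The built-in next() resumes the iterator where it left off.
            child = next(self.stack[-1], None)
            if child is None:
                self.stack.pop()  # this level is exhausted
            else:
                self.stack.append(iter(child.children))
                return child
        return None  # traversal finished


tree = Node('A', [
    Node('B', [Node('C', [Node('D')]), Node('E')]),
    Node('F'),
    Node('G'),
])

it = PathIterator(tree)
titles = []
node = it.next()
while node is not None:
    titles.append(str(node))
    node = it.next()
print(titles)
```

This assumes a children list never contains None, since None is used as the "exhausted" signal.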
I have created a tree object in python using the following code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re,sys,codecs

neg_markers_en=[u'not',u"napt",u'no',u'nobody',u'none',u'never']

class Node:
    def __init__(self,name=None,parent=None,sentence_number=0):
        self.name=name
        self.next=list()
        self.parent=parent
        self.depth=0
        self.n_of_neg=0
        self.subordinate=None
        self.foo=None

    def print_node(self):
        print self.name,'contains',[(x.name,x.depth,x.foo) for x in self.next]
        for x in self.next:
            x.print_node()

    def get_negation(self):
        for x in self.next:
            if x.n_of_neg!=0:
                print unicode(x.depth)+u' |||',
                try:
                    x.look_for_parent_vp()
                except: print 'not in a VP',
                try:
                    x.look_for_parent_sent()
                except: print '***'
            x.get_negation()

    def look_for_parent_vp(self):
        if self.parent.name=='VP':
            self.parent.print_nont()
        else:
            self.parent.look_for_parent_vp()

    def look_for_parent_sent(self):
        if self.parent.name=='S' or self.parent.name=='SBAR':
            #This is to send out to a text file, along with what it covers
            print '||| '+ self.parent.name,
            try:
                self.parent.check_subordinate()
                self.parent.print_nont()
                print '\n'
            except:
                print u'no sub |||',
                self.parent.print_nont()
                print '\n'
        elif self.parent=='None': print 'root |||'
        else:
            self.parent.look_for_parent_sent()

    def print_nont(self):
        for x in self.next:
            if x.next==[]:
                print unicode(x.name),
            else: x.print_nont()

    def mark_subordinate(self):
        for x in self.next:
            if x.name=='SBAR':
                x.subordinate='sub'
            else: x.subordinate='main'
            x.mark_subordinate()

    def check_subordinate(self):
        if self.subordinate=='sub':
            print u'sub |||',
        else:
            self.parent.check_subordinate()

def create_tree(tree):
    #replace "n't" with 'napt' so to avoid errors in splitting
    tree=tree.replace("n't",'napt')
    lista=filter(lambda x: x!=' ',re.findall(r"\w+|\W",tree))
    start_node=Node(name='*NULL*')
    current_node=start_node
    for i in range(len(lista)-1):
        if lista[i]=='(':
            next_node=Node()
            next_node.parent=current_node
            next_node.depth=current_node.depth+1
            current_node.next.append(next_node)
            current_node=next_node
        elif lista[i]==')':
            current_node=current_node.parent
        else:
            if lista[i-1]=='(' or lista[i-1]==')':
                current_node.name=lista[i]
            else:
                next_node=Node()
                next_node.name=lista[i]
                next_node.parent=current_node
                #marks the depth of the node
                next_node.depth=current_node.depth+1
                if lista[i] in neg_markers_en:
                    current_node.n_of_neg+=1
                current_node.next.append(next_node)
    return start_node
Now all the nodes are linked: the child nodes of a parent node are appended to its list, and each child node refers back to its parent through the parent attribute.
I have the following problem:
For each node whose name is 'S' or 'SBAR' (let's call it node_to_check), I have to check whether any of its child nodes is named 'S' or 'SBAR'; if this is NOT the case, I want to set the .foo attribute of node_to_check to 'atom'.
I was thinking of something like this:
def find_node_to_check(self):
    for next in self.next:
        if next.name == 'S' or next.name == 'SBAR':
            is_present = check_children(next)
            if is_present == 'no':
                find_node_to_check(next)
            else:
                self.foo = 'atom'

def check_children(self):
    for next in self.next:
        # is this way of returning correct?
        if next.name == 'S' or next.name == 'SBAR':
            return 'no'
        else:
            check_sents(next)
    return 'yes'
I included in my question also the code that I have written so far. A tree structure is created in the function create_tree(tree); the input tree is a bracketed notation from the Stanford Parser.
When trying to design a novel class, knowing what you need it to do informs how you construct it. Stubbing works well here, for example:
class Node:
    """A vertex of an n-adic tree"""

    def __init__(self, name):
        """since you used sentence, I assumed n-adic
        but that may be wrong and then you might want
        left and right children instead of a list or dictionary
        of children"""
        pass

    def append_children(self, children):
        """adds a sequence of child Nodes to self"""
        pass

    def create_child(self, name):
        """creates a new Named node and adds it as a child"""
        pass

    def delete_child(self, name):
        """deletes a named child from self or throws exception"""
        pass
And so on. Do children need to be ordered? Do you ever need to delete a node (and its descendants)? Would you be able to pre-build a list of children, or would you have to add them one at a time? Do you really want to store the fact that a Node is terminal (that's redundant), or do you want is_terminal() to return children is None?
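Coming back to the 'atom' marking itself, here is one possible sketch. It assumes "children" means any descendant, which is how the recursive check_children attempt reads. The Node below is a minimal stand-in for the question's class, and contains_clause and mark_atoms are hypothetical names. Returning True/False instead of 'yes'/'no' strings keeps the recursion easier to follow.

```python
class Node(object):
    """Minimal stand-in for the question's Node (only name, next, foo)."""

    def __init__(self, name, children=None):
        self.name = name
        self.next = list(children or [])
        self.foo = None


CLAUSE_LABELS = ('S', 'SBAR')


def contains_clause(node):
    """Return True if any descendant of node is named 'S' or 'SBAR'."""
    for child in node.next:
        if child.name in CLAUSE_LABELS or contains_clause(child):
            return True
    return False


def mark_atoms(node):
    """Set foo = 'atom' on every clause node that has no clause below it."""
    if node.name in CLAUSE_LABELS and not contains_clause(node):
        node.foo = 'atom'
    for child in node.next:
        mark_atoms(child)


# (S (NP he) (VP said (SBAR (S (NP it) (VP works)))))
inner_s = Node('S', [Node('NP', [Node('it')]), Node('VP', [Node('works')])])
sbar = Node('SBAR', [inner_s])
root = Node('S', [Node('NP', [Node('he')]), Node('VP', [Node('said'), sbar])])

mark_atoms(root)
print(root.foo, sbar.foo, inner_s.foo)
```

Only the innermost S ends up with foo == 'atom'. The check rescans each subtree, which is fine for parse trees of sentence size; a single bottom-up pass would avoid the repeated work on larger trees.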