Decreasing memory usage with python objects trie served using flask - python

I am looking for tips on reducing memory usage in Python. I am using this piece of code as the main structure to hold my data:
http://stevehanov.ca/blog/index.php?id=114
I need it to serve proximity word matching from a Flask server. I need to store well over 20 million distinct strings (and the number will grow). Currently I get a MemoryError when I try to put around 14 million of them in the trie.
I only added a dictionary to hold a value with quick access (I need it; it can be thought of as a kind of ID of appearance, and it is not directly related to the word):
class TrieNode:
    values = {}
    def __init__(self):
        self.word = None
        self.children = {}
        global NodeCount
        NodeCount += 1
    def insert( self, word, value):
        node = self
        for letter in word:
            if letter not in node.children:
                node.children[letter] = TrieNode()
            node = node.children[letter]
        TrieNode.values[word] = value
        node.word = word
I am not familiar with Python optimization. Is there any way to make the "letter" object smaller to save some memory?
Please note that my difficulty comes from the fact that this letter is not only [a-z] but must cover the whole Unicode range (accented characters, but not only those). It is a single character, so its memory footprint should be small. How can I use the code point instead of the string object, and would that be more memory efficient?
EDIT: adding some more information following the reply from @juanpa-arrivillaga.
First, I see no difference using the slots construct: on my computer I see the same memory usage with or without __slots__.
With __slots__:
>>> class TrieNode:
...     NodeCount = 0
...     __slots__ = "word", "children"
...     def __init__(self):
...         self.word = None
...         self.children = {}
...         #global NodeCount # my goal is to encapsulate NodeCount in the class itself
...         TrieNode.NodeCount += 1
...
>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
176
Without __slots__:
>>> class TrieNode:
...     NodeCount = 0
...     def __init__(self):
...         self.word = None
...         self.children = {}
...         #global NodeCount
...         TrieNode.NodeCount += 1
...
>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
176
So I do not understand why. Where am I wrong?
Here is something else I tried, using the "intern" built-in, because this value is a string holding an "id" (and so is not related to Unicode, unlike the letter).
By the way, my goal with values and NodeCount was to have the equivalent of class/static variables, so that each is shared by all instances of the small objects created. I thought this would save memory and avoid duplication, but I may be misunderstanding the "static-like" concept in Python.
class TrieNode:
    values = {} # shared among all instances, so only one structure?
    NodeCount = 0
    __slots__ = "word", "children"
    def __init__(self):
        self.word = None
        self.children = {}
        #global NodeCount
        TrieNode.NodeCount += 1
    def insert( self, word, value = None):
        # value is a string id like "XYZ999999999"
        node = self
        for letter in word:
            codepoint = ord(letter)
            if codepoint not in node.children:
                node.children[codepoint] = TrieNode()
            node = node.children[codepoint]
        node.word = word
        if value is not None:
            lost = TrieNode.values.setdefault(word, [])
            TrieNode.values[word].append(intern(str(value)))
ADDED:
Last, I should have specified that I am using the Python 2.7.x family.
I was wondering whether fixed-length data types from a library like numpy could help me save some memory; again, being new to this, I do not know where to look. By the way, the "words" are not real natural-language words but arbitrary-length sequences of characters, and they can be very long.
Regarding your reply, I agree that not storing the word in each node would be efficient, but please have a look at the linked article and code. The main goal is not to reconstruct the word but to do efficient, very fast approximate string matching with it, and then get the "value" related to each of the closest matches. I am not sure I understood the purpose of following the path down the tree (is it to avoid traversing the complete tree?). When a match is found we just need the original matched word (but my understanding may be wrong here).
So I need this huge dict somewhere, and I wanted to encapsulate it in the class for convenience. But maybe that is too costly in terms of memory?
I also noticed that I already get lower memory usage than in your sample (I do not know why yet). Here is an example value of a "letter" contained in the structure:
>>> s = u"\u266f"
>>> ord(s)
9839
>>> sys.getsizeof(s)
28
>>> sys.getsizeof(ord(s))
12
>>> print s
♯
>>> repr(s)
"u'\\u266f'"

Low-hanging fruit: use __slots__ in your node class; otherwise, each TrieNode object carries around a dict.
class TrieNode:
    __slots__ = "word", "children"
    def __init__(self):
        self.word = None
        self.children = {}
Now, each TrieNode object will not carry around an attribute dict. Compare the sizes:
>>> class TrieNode:
...     def __init__(self):
...         self.word = None
...         self.children = {}
...
>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
168
Vs:
>>> class TrieNode:
...     __slots__ = "word", "children"
...     def __init__(self):
...         self.word = None
...         self.children = {}
...
>>> tn = TrieNode()
>>> sys.getsizeof(tn)
56
>>> tn.__dict__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'TrieNode' object has no attribute '__dict__'
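One caveat, given that the question mentions Python 2.7: there, __slots__ only takes effect on new-style classes, that is, classes that inherit from object. A plain `class TrieNode:` is an old-style class in Python 2 and still carries a per-instance dict, which would likely explain seeing the same size with and without __slots__. A minimal sketch follows; the name SlottedTrieNode is made up for illustration, and in Python 3 every class is new-style, so the explicit object base is harmless there.

```python
import sys

# Inheriting from object makes __slots__ effective on Python 2 as well.
class SlottedTrieNode(object):
    __slots__ = ("word", "children")

    def __init__(self):
        self.word = None
        self.children = {}

node = SlottedTrieNode()
has_dict = hasattr(node, "__dict__")   # no per-instance attribute dict
node_size = sys.getsizeof(node)        # exact value depends on the interpreter build

# Attributes outside __slots__ are rejected, which confirms slots are active.
try:
    node.extra = 1
    rejected = False
except AttributeError:
    rejected = True
```

If `has_dict` comes out true on your interpreter, the class is still old-style and the slots declaration is being ignored.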
Another optimization: use int objects as keys. Small int objects are cached, and most of your characters will probably fall in that range anyway; even when they don't, an int, while still beefy in Python, is smaller than even a single-character string:
>>> 'ñ'
'ñ'
>>> ord('ñ')
241
>>> sys.getsizeof('ñ')
74
>>> sys.getsizeof(ord('ñ'))
28
So you can do something like:
def insert( self, word, value):
    node = self
    for letter in word:
        code_point = ord(letter)
        if code_point not in node.children:
            node.children[code_point] = TrieNode()
        node = node.children[code_point]
    # Assumes "is_word" replaces "word" in __slots__
    node.is_word = True #Don't save the word, simply a reference to a singleton
Also, you are keeping a class-level values dict that grows enormously, yet that information is redundant. You say:
I just add a dictionary to hold some value with quick access (I need it)
You can reconstruct the words from the path. That should be relatively fast, so I would seriously consider dropping this dict. Check how much memory it takes just to hold a million short strings (the result below is in gigabytes):
>>> from sys import getsizeof as sizeof
>>> d = {str(i):i for i in range(1000000)}
>>> (sum(sizeof(k)+sizeof(v) for k,v in d.items()) + sizeof(d)) * 1e-9
0.12483203000000001
You could do something like:
class TrieNode:
    __slots__ = "value", "children"
    def __init__(self):
        self.value = None
        self.children = {}
    def insert( self, word, value):
        node = self
        for letter in word:
            code_point = ord(letter)
            if code_point not in node.children:
                node.children[code_point] = TrieNode()
            node = node.children[code_point]
        node.value = value #this serves as a signal that it is a word
    def get(self, word, default=None):
        val = self._get_value(word)
        if val is None:
            return default
        else:
            return val
    def _get_value(self, word):
        node = self
        for letter in word:
            code_point = ord(letter)
            try:
                node = node.children[code_point]
            except KeyError:
                return None
        return node.value
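A short usage sketch of the value-in-node idea, to show that lookups work without the separate values dict. The class name ValueTrie, the object base class, and the sample words and ids are assumptions made for this illustration; it condenses the class above rather than reproducing it exactly.

```python
class ValueTrie(object):
    __slots__ = ("value", "children")

    def __init__(self):
        self.value = None     # None means "no word ends at this node"
        self.children = {}    # maps code point (int) -> ValueTrie

    def insert(self, word, value):
        node = self
        for letter in word:
            code_point = ord(letter)
            if code_point not in node.children:
                node.children[code_point] = ValueTrie()
            node = node.children[code_point]
        node.value = value

    def get(self, word, default=None):
        node = self
        for letter in word:
            node = node.children.get(ord(letter))
            if node is None:
                return default
        return default if node.value is None else node.value

trie = ValueTrie()
trie.insert(u"caf\u00e9", "XYZ1")   # non-ASCII letters are just larger ints
trie.insert(u"cafe", "XYZ2")

exact = trie.get(u"caf\u00e9")
prefix_only = trie.get(u"caf")
missing = trie.get(u"zzz", "missing")
```

A prefix such as "caf" exists as a path but has no value, so it is reported as absent. One limitation of this design is that None cannot itself be stored as a value.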

Related

is it valid to use a constructor inside defaultdict

I was trying to see whether a constructor can be used inside defaultdict, but the code does not run and I get a recursion error. Is this possible?
from collections import defaultdict
class TrieNode:
    def __init__(self, char):
        self.children = defaultdict(TrieNode(char))
        self.is_word = False
a = TrieNode('b')
There is nothing wrong with using a defaultdict in your constructor. The problem is that you need to pass it a function that it will call when new keys are added. You are currently calling the function when you create the dictionary, so you keep calling TrieNode('b') infinitely.
You need to call it with something like:
self.children = defaultdict(TrieNode)
Then when you reference an unknown key in children it will call TrieNode() for you. This means, however, that you don't want to take an additional argument in the constructor.
That's probably ok because you generally add words to a trie and will need to add many words through the same node. One option would be to do something like:
from collections import defaultdict
class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.is_word = False
        self.val = ''
    def add(self, word):
        self.val= word[0]
        if (len(word) == 1):
            self.is_word = True
        else:
            self.children[word[0]].add(word[1:])
    def words(self):
        if self.is_word:
            yield self.val
        for letter, node in self.children.items():
            yield from (letter + child for child in node.words())
You can then add words to it and it will make TrieNodes in the default dictionary as it goes:
node = TrieNode()
node.add("dog")
node.add("catnip")
node.add("cats")
node.add("cat")
node.add("crunch")
node.children['c'].children
> defaultdict(__main__.TrieNode,
{'a': <__main__.TrieNode at 0x179c70048>,
'r': <__main__.TrieNode at 0x179c70eb8>})
You can see that your children has a c key which points to a TrieNode whose children is the defaultdict with a and r pointing to the next.
This allows you to easily pull out the words with a generator:
list(node.words())
> ['dog', 'cat', 'cats', 'catnip', 'crunch']
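Note that add above overwrites self.val on every call, so what words() yields can depend on insertion order: adding "cats" before "catnip" would produce "catn" instead of "cats". A sketch of an order-independent variant follows. It is a swapped-in design rather than the answer's own code: the node reached after the last letter is flagged and no val is stored. The class name Trie is hypothetical.

```python
from collections import defaultdict

class Trie:
    def __init__(self):
        # Pass the class itself as the factory; it is called on missing keys.
        self.children = defaultdict(Trie)
        self.is_word = False

    def add(self, word):
        if not word:
            # All letters consumed: this node marks the end of a word.
            self.is_word = True
        else:
            self.children[word[0]].add(word[1:])

    def words(self):
        if self.is_word:
            yield ""
        for letter, child in self.children.items():
            for rest in child.words():
                yield letter + rest

root = Trie()
for w in ["cats", "catnip", "cat", "dog"]:
    root.add(w)
found = sorted(root.words())
```

Because each letter lives only in the key that leads to a node, inserting the same words in any order gives the same set of results.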

Dictionary to "inherit" from class of objects that I use it to store; e.g. my_dict['item1'].method1() or mydict['item1'].property1 = 'new'

I've searched for a couple of days now but couldn't find anything, so maybe it's not possible.
As in the title, is there a way for a dict to "inherit" from a class so that all the methods and properties are visible via "IntelliSense" (code completion)?
Example
class Word(object):
    def __init__(self, word):
        self.word = word
        self.base_word = ''
        self.derived_words = set()
        self.sub_words = set()
        self.frequency = 0
    def method_1(self):
        pass  # do something here
    def method_2(self):
        pass  # do something else here
myDict = {}
myDict['computer'] = Word('computer')
myDict['notebook'] = Word('notebook')
and then I can obviously do this and it will work:
myDict['computer'].method_2()
myDict['notebook'].frequency = 12
but I would like to know whether there is a way to make the myDict object know that these methods and properties are available, so that they show up in "IntelliSense" (code completion).
I'm using PyCharm.
Best regards
Bartek
PyCharm does support type hinting. I'm assuming you are using Python 2.7 since you are inheriting object. I have 3 ways you could do this in Python 2.7.
1
myDict['computer'] = Word('computer')
computer1 = myDict['computer'] # type: Word
2
computer2 = myDict['computer']
""":type : Word """
Now when you use computer1 or computer2 you should get the drop down that you like.
3
If your dictionary always returns a Word object you could create a dict object which inherits from dict, and make a function which returns Word type using rtype in the method docstring.
class Word(object):
    def __init__(self, word):
        self.word = word
    def foo(self):
        return self.word
class FooDict(dict):
    def __init__(self, *arg, **kw):
        super(FooDict, self).__init__(*arg, **kw)
    def get_item(self, item):
        """
        My docstring.
        :return: a class of the type I want.
        :rtype: Word
        """
        return self[item]
a = Word("foo")
d = FooDict()
d['computer'] = a
computer3 = d.get_item('computer')
Now computer3 should behave how you like.
Reposting user2235698's answer from a comment on my first post, for future googlers:
Just add a type hint to the definition of d.
It will be like d = {} # type: typing.Dict[str, Word] or d: typing.Dict[str, Word] = {} (since Python 3.6)
So now the code looks like this:
import typing
class Word():
    def __init__(self, word):
        self.word = word
        self.base_word = ''
        self.derived_words = set()
        self.sub_words = set()
        self.meaning = ''
        self.reading = ''
        self.frequency = ''
        self.sub_chars = set()
    def analyze_sub_words(self):
        print('working')
myComputer = Word('computer')
d: typing.Dict[str, Word] = {}
d['computer'] = myComputer
# now hints are being displayed
d['computer'].base_word

Retrieving a class object from a dictionary

As part of a beginners' university Python project, I am currently creating a database of words, be it Nouns, Verbs, Determiners, Adjectives.
Now the problem I am having is that the words being read into the program via the lexicon.readfromfile method are being put into the dictionary via an instance of a class ( be it noun, verb or adjective ). This created the problem that I have absolutely no idea how to call these objects from the dictionary since they do not have variables as keys, but rather memory locations (see the following):
{<__main__.Verb object at 0x02F4F110>, <__main__.Noun object at 0x02F4F130>, <__main__.Adjective object at 0x02F4F1D0>, <__main__.Noun object at 0x02F4F170>}
Does anyone have any idea how I can call these keys in such a way that I can make them usable in my code?
Here is the part I'm stuck on:
Add a method getPast() to the Verb class, which returns the past tense of the Verb. Your getPast() method can simple work by retrieving the value of ‘past’ from the attributes.
Here is the majority of the code, leaving out the Noun and Adjective classes:
class Lexicon(object):
    'A container class for word objects'
    def __init__(self):
        self.words = {}
    def addword(self, word):
        self.words[word.stringrep] = word
    def removeword(self, word):
        if word in self.words:
            del(word)
            print('Word has been deleted from the Lexicon' )
        else:
            print('That word is not in the Lexicon')
    def getword(self,wordstring):
        if wordstring in self.words:
            return self.words[wordstring]
        else:
            return None
    def containsword(self,string):
        if string in self.words:
            return True
        else:
            return False
    def getallwords(self):
        allwordslist = []
        for w in self.words:
            allwordslist.append(self.words[w])
        return set(allwordslist)
    def readfromfile(self, x):
        filehandle = open(x, 'r')
        while True:
            line = filehandle.readline()
            if line == '':
                break
            line = line.strip()
            info = line.split(',')
            if info[1] == 'CN' or info[1] == 'PN':
                noun=Noun(info[0],info[1])
                noun.setattribute('regular',bool(info[2]))
                self.addword(noun)
            elif info[1] == 'A':
                adjective=Adjective(info[0],info[1])
                adjective.setattribute('comparative', bool(info[2]))
                self.addword(adjective)
            elif info[1] == 'V':
                verb=Verb(info[0],info[1])
                verb.setattribute('transitive', bool(info[2]))
                verb.setattribute('past', info[3])
                self.addword(verb)
    def writetofile(self, x):
        filehandle = open(x, 'w')
        for t in self.words.values():
            filehandle.write(t.getFormattedString() + '\n')
        filehandle.close()
#---------------------------------------------------------------------------#
class Word(object):
    'A word of any category'
    def __init__(self,stringrep,category):
        self.wordattribute = {}
        self.stringrep = stringrep
        self.category = category
    def setattribute(self, attributename, attributevalue):
        self.wordattribute[attributename] = attributevalue
    def getvalue(self,name):
        if name in self.wordattribute:
            return self.wordattribute[name]
        else:
            return None
    def __str__(self):
        return self.stringrep + ':' + self.category
    def __lt__(self,otherword):
        return self.stringrep < otherword.stringrep
class Verb(Word):
    '"Represents a Verb."'
    def __init__(self, stringrep, category):
        super().__init__(stringrep,category)
    def istransitive(self):
        # Attributes live in the wordattribute dict, not on the instance
        return self.getvalue('transitive')
    def getFormattedString(self):
        n = '{stringrep},{category}'
        n = n.format(stringrep=self.stringrep, category=self.category)
        for i in range(1,2):
            for v,b in self.wordattribute.items():
                n = n+','+str(b)
        return n
You have a set there, not a dictionary. A set will let you check to see whether a given instance is in the set quickly and easily, but, as you have found, you can't easily get a specific value back out unless you already know what it is. That's OK because that's not what the set is for.
With a dictionary, you associate a key with a value when you add it to the dictionary. Then you use the key to get the value back out. So make a dictionary rather than a set, and use meaningful keys so you can easily get the value back.
Or, since I see you are already making a list before converting it to a set, just return that; you can easily access the items in the list by index. In other words, don't create the problem in the first place, and you won't have it.
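A minimal sketch of the dictionary approach, using simplified stand-ins for the question's classes (the stripped-down Word and Verb, and the sample verb, are assumptions for illustration). The lexicon is keyed by the word's string, so a Verb can be fetched by name and asked for its past tense:

```python
class Word(object):
    def __init__(self, stringrep, category):
        self.stringrep = stringrep
        self.category = category
        self.wordattribute = {}

    def setattribute(self, name, value):
        self.wordattribute[name] = value

    def getvalue(self, name):
        # dict.get returns None when the attribute was never set
        return self.wordattribute.get(name)

class Verb(Word):
    def getPast(self):
        # Look the past tense up in the attribute dict filled by setattribute
        return self.getvalue('past')

# Key the lexicon by the string form, so lookups use a meaningful key
words = {}
verb = Verb('run', 'V')
verb.setattribute('past', 'ran')
words[verb.stringrep] = verb

past = words['run'].getPast()
```

The same lookup is what the question's getword method already does, so `lexicon.getword('run').getPast()` would be the equivalent call there.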

Interfering objects (or something less sinister?)

I have created a Word class, which consists of just two methods and takes just two parameters. In spite of this apparent simplicity it is behaving in a way that's beyond my comprehension: if I create two instances of the same class with the same first argument ("dissemble" in this case), the second instance somehow interferes with the first. Printing the instances reveals that they are indeed separate, so why are they interacting in this way?
# Example tested with Python 2.7.3
from collections import namedtuple
DefinitionTuple = namedtuple("Definition", "word word_id text pos")
class Word(object):
    def __init__(self, word, defs=None):
        """"""
        self.definitions = []
        self.word = word
        if defs != None:
            for each in defs:
                try:
                    each.pos
                    if each.word.lower() == self.word.lower():
                        self.definitions.append(each)
                except AttributeError:
                    raise AttributeError("Definitions must be named tuples")
        self.orderDefinitions()
    def orderDefinitions(self):
        """"""
        ordered = sorted(self.definitions, key=lambda definition: definition.pos)
        for i,each in enumerate(ordered):
            each.pos = (i+1)
        self.definitions = ordered
class Definition(object):
    """"""
    def __init__(self, definition):
        """Incoming arg is a single namedtuple"""
        self.word = definition.word
        self.word_id = definition.word_id
        self.text = definition.text
        self.pos = definition.pos
if __name__ == "__main__":
    nt1 = DefinitionTuple("dissemble", 5, "text_string_a", 1)
    nt2 = DefinitionTuple("dissemble", 5, "text_string_b)", 2)
    nt3 = DefinitionTuple("dissemble", 5, "text_string_c", 3)
    # Definiton objects
    def_1 = Definition(nt1)
    def_2 = Definition(nt2)
    def_3 = Definition(nt3)
    dissemble = Word("dissemble", [def_1, def_2, def_3])
    print "first printing: "
    for each in dissemble.definitions:
        print each.pos, each.text
    # create a new instance of Word ...
    a_separate_instance = Word("dissemble", [def_3])
    # ... and now the 'pos' ordering of my first instance is messed up!
    print "\nnow note how numbers differ compared with first printing:"
    for each in dissemble.definitions:
        print each.pos, each.text
You create a new instance of Word, but you reuse the same instance of def_3:
a_separate_instance = Word("dissemble", [def_3])
which is stateful. If we look inside using vars:
print vars(def_3)
# create a new instance of Word ...
a_separate_instance = Word("dissemble", [def_3])
print vars(def_3)
We see
{'text': 'text_string_c', 'word': 'dissemble', 'pos': 3, 'word_id': 5}
{'text': 'text_string_c', 'word': 'dissemble', 'pos': 1, 'word_id': 5}
due to orderDefinitions.
In your orderDefinitions method, you are modifying the pos attribute of your Definition objects:
each.pos = (i+1)
So when you call orderDefinitions a second time, you will be doing def_3.pos = 1.
But, dissemble holds a reference to this def_3 object, whose pos attribute has now changed, hence your issue.
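One possible way to avoid the interference, sketched under the assumption that each Word should own its own ordering: renumber copies of the definitions instead of the shared objects. copy.copy is from the standard library; the helper name ordered_copies is made up for this example, and Definition is a simplified stand-in for the class in the question.

```python
import copy

class Definition(object):
    def __init__(self, text, pos):
        self.text = text
        self.pos = pos

def ordered_copies(definitions):
    """Return renumbered shallow copies, leaving the originals untouched."""
    ordered = sorted(definitions, key=lambda d: d.pos)
    result = []
    for i, each in enumerate(ordered):
        clone = copy.copy(each)   # new object, so the caller's instance is not mutated
        clone.pos = i + 1
        result.append(clone)
    return result

def_3 = Definition("text_string_c", 3)
renumbered = ordered_copies([def_3])
```

Here def_3.pos stays at 3 while the copy gets position 1, so a second Word built from def_3 no longer changes what the first one sees. The trade-off is extra memory for the copies.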

Python; Linked list and traversing!

Starting some programming with python at school now, and I don't know how to proceed with this problem. Any thoughts?
Input consists of integer separated by line breaks. Your program should submit them in a linked list, traverse the linked list and print the highest number.
Something to take the first number, and do an action which says "if the next number is bigger, take that one, else, keep the current number, and head down the list and repeat"
Then when it gets to the end of the list, it prints the value it has.
from sys import stdin
class Kubbe:
    vekt = None
    neste = None
    def __init__(self, vekt):
        self.vekt = vekt
        self.neste = None
def spor(kubbe):
    # WRITE YOUR CODE HERE
# Creates linked list
forste = None
siste = None
for linje in stdin:
    forrige_siste = siste
    siste = Kubbe(int(linje))
    if forste == None:
        forste = siste
    else:
        forrige_siste.neste = siste
# Calls the solution function and prints the result
print spor(forste)
Input: example
54
37
100
123
1
54
Required output
123
"Linked lists" are rarely used in Python -- normally, one uses just list, the Python built-in list, which is actually more of a "dynamic vector". So, it's peculiar to see a linked list specified as part of the exercise's constraints.
But the main point is, the code you're showing is already creating a linked list -- the head is at forste, and, for each node, the next-node pointer at .neste, the payload at .vekt. So, presumably, that's not what you're asking about, no matter the text of your question.
The simple way to loop through your linked list once you have fully constructed it (i.e., at the end of the current code for spor) is
current = forste
while current is not None:
    ...process current.vekt...
    current = current.neste
In your case, the logic for the "process" part is of course, as your Q's text already says:
if current.vekt > themax:
    themax = current.vekt
The only subtlety is, you need to initially set themax, before this while loop to "the lowest possible number"; in recent versions of Python, "minus infinity" is reliably recorded and compared (though only as a float, it still compares correctly to ints), so
themax = float('-inf')
would work. More elegant might be to initially set the maximum to the first payload, avoiding messing with infinity.
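Putting the pieces together, here is a sketch of the traversal that seeds the maximum with the first payload. It reuses the question's Kubbe class and spor name; returning None for an empty list is an assumption, and the list is built from a literal here rather than from stdin so the example is self-contained.

```python
class Kubbe:
    def __init__(self, vekt):
        self.vekt = vekt
        self.neste = None

def spor(kubbe):
    # Assumption: an empty list has no maximum, so return None.
    if kubbe is None:
        return None
    themax = kubbe.vekt          # seed with the first payload
    current = kubbe.neste
    while current is not None:
        if current.vekt > themax:
            themax = current.vekt
        current = current.neste
    return themax

# Build the linked list the same way the exercise does, from sample values.
forste = siste = None
for value in [54, 37, 100, 123, 1, 54]:
    node = Kubbe(value)
    if forste is None:
        forste = node
    else:
        siste.neste = node
    siste = node

result = spor(forste)
```

Seeding with the first payload avoids mixing a float infinity with int values and also works for payloads that are not numbers, as long as they can be compared.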
Here's an answer based on your own code and language. Sorry if the new variable and function names do not translate well, as I don't speak Norwegian (Google Language Tools is my friend).
Comment: Like airplane Air Traffic Control the default language of most international programming forums such as StackOverflow is English. If you use it, you are likely to get quicker, better, and more answers -- and it probably makes the question and related answers useful to the largest number of other folks. Just my 2 øre... ;-)
from sys import stdin
class Kubbe:
    vekt = None
    neste = None
    def __init__(self, vekt):
        self.vekt = vekt
        self.neste = None
def spor():
    # WRITE YOUR CODE HERE
    # Creates linked list
    forste = None
    siste = None
    while True:
        try:
            linje = raw_input()
        except EOFError:
            break
        forrige_siste = siste
        siste = Kubbe(int(linje))
        if forste == None:
            forste = siste
        else:
            forrige_siste.neste = siste
    return forste
def finne_maksimal(lenketliste):
    storste = None
    if lenketliste is not None:
        storste = lenketliste.vekt
        gjeldende = lenketliste.neste
        while gjeldende is not None:
            if gjeldende.vekt > storste:
                storste = gjeldende.vekt
            gjeldende = gjeldende.neste
    return storste
lenketliste = spor()
storste = finne_maksimal(lenketliste)
if lenketliste is None:
    print "tom liste"
else:
    print "storste er", storste
Python has a function called reduce (a builtin in Python 2; in Python 3 it lives in functools), which traverses a sequence and "compresses" it with a given function. That is, if you have a list of five elements [a,b,c,d,e] and a function f, it will effectively do
temp = f(a,b)
temp = f( temp, c )
...
You should be able to use this to write a very neat solution.
If you want to be less abstract, you will need to iterate over each element of the list in turn, storing the greatest number so far in a variable. Change the variable only if the element you have reached is greater than the value of said variable.
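A sketch of the reduce approach described above. reduce works on iterables, so the linked list is first exposed through a small generator; the iter_vekt helper and the optional neste constructor argument are assumptions added for this example, not part of the exercise. Importing reduce from functools works on Python 3 and on Python 2.6 or later.

```python
from functools import reduce

class Kubbe:
    def __init__(self, vekt, neste=None):
        self.vekt = vekt
        self.neste = neste

def iter_vekt(node):
    # Walk the list and yield each payload in order.
    while node is not None:
        yield node.vekt
        node = node.neste

def spor(kubbe):
    # f(a, b) keeps the larger of the running result and the next element.
    return reduce(lambda a, b: a if a > b else b, iter_vekt(kubbe))

head = Kubbe(54, Kubbe(37, Kubbe(100, Kubbe(123, Kubbe(1)))))
largest = spor(head)
```

Note that reduce raises TypeError on an empty sequence when no initial value is given, so an empty list would need separate handling.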
This seems to work with your input, in both Python 2 and 3. Notice how max works thanks to Python's duck typing: it accepts any iterable, including the generator below.
This version also works with Python 3 when the input comes from a file.
import sys
class Kubbe:
    vekt = None
    neste = None
    def __init__(self, vekt):
        self.vekt = vekt
        self.neste = None
def spor():
    # WRITE YOUR CODE HERE
    # Creates linked list
    forste = None
    siste = None
    while True:
        linje = sys.stdin.readline().rstrip()
        if not linje:
            break
        forrige_siste, siste = siste, Kubbe(int(linje))
        if forste is None:
            forste = siste
        else:
            forrige_siste.neste = siste
    return forste
def traverse(linkedlist):
    while linkedlist is not None:
        yield linkedlist.vekt
        linkedlist=linkedlist.neste
# Calls the solution function and prints the result
linkedlist=spor()
for item in traverse(linkedlist):
    print(item)
# use builtin max:
print('Maximum is %i' % max(traverse(linkedlist)))
# if not allowed:
m = linkedlist.vekt
for item in traverse(linkedlist.neste):
    if item > m:
        m = item
print(m)
The code below would work. The Node class represents a linked-list node. The LinkedList class defines append, which adds a node at the end of the list, and find_max, which traverses the list and returns the largest key (or None for an empty list).
class Node(object):
    def __init__(self, key, next_node):
        self.key = key
        self.next_node = next_node
class LinkedList(object):
    def __init__(self):
        self.head = None
    def append(self, key):
        # Create a new Node
        new_node = Node(key, None)
        if self.head is None:
            self.head = new_node
        else:
            # Walk to the last node and link the new one after it
            tmp = self.head
            while tmp.next_node is not None:
                tmp = tmp.next_node
            tmp.next_node = new_node
    def find_max(self):
        if self.head is None:
            return None
        # Start from the first key so negative numbers are handled too
        max_num = self.head.key
        tmp = self.head.next_node
        while tmp is not None:
            if tmp.key > max_num:
                max_num = tmp.key
            tmp = tmp.next_node
        return max_num
