How to decrease the memory footprint of a dictionary? - python

In my application, I need a fast look up of attributes. Attributes are in this case a composition of a string and a list of dictionaries. These attributes are stored in a wrapper class. Let's call this wrapper class Plane:
class Plane(object):
    def __init__(self, name, properties):
        self.name = name
        self.properties = properties

    @classmethod
    def from_idx(cls, idx):
        if idx == 0:
            return cls("PaperPlane", [{"canFly": True}, {"isWaterProof": False}])
        if idx == 1:
            return cls("AirbusA380", [{"canFly": True}, {"isWaterProof": True}, {"hasPassengers": True}])
To better play with this class, I added a simple classmethod to construct instances by providing an integer.
So now in my application I have many planes, on the order of 10,000,000. Each of these planes can be accessed by a universally unique id (uuid). What I need is a fast lookup: given a uuid, what is the Plane? The natural solution is a dict. A simple class to generate planes with uuids in a dict and to store this dict in a file may look like this:
import gzip
import pickle
import uuid

import numpy as np

class PlaneLookup(object):
    def __init__(self):
        self.plane_dict = {}

    def generate(self, n_planes):
        for i in range(n_planes):
            plane_id = uuid.uuid4().hex
            self.plane_dict[plane_id] = Plane.from_idx(np.random.randint(0, 2))

    def save(self, filename):
        with gzip.open(filename, 'wb') as f:
            pickle.dump(self.plane_dict, f, pickle.HIGHEST_PROTOCOL)

    @classmethod
    def from_disk(cls, filename):
        pl = cls()
        with gzip.open(filename, 'rb') as f:
            pl.plane_dict = pickle.load(f)
        return pl
So what happens if I generate some planes?
pl = PlaneLookup()
pl.generate(1000000)
What happens is that lots of memory gets consumed! If I check the size of my pl object with the getsize() method from this question, I get a value of 1,087,286,831 bytes on my 64-bit machine. Looking at htop, the memory demand seems to be even higher (around 2 GB).
In this question, it is explained quite well why Python dictionaries need so much memory.
However, I think this does not have to be the case in my application. The Plane objects created in the PlaneLookup.generate() method very often contain the same attributes (i.e. the same name and the same properties). So it should be possible to save such an object once in the dict and, whenever the same object (same name, same properties) is created again, store only a reference to the already existing entry. As a simple Plane object has a size of 1147 bytes (according to the getsize() method), just saving references may save a lot of memory!
The question now is: how do I do this? In the end I need a function that takes a uuid as input and returns the corresponding Plane object, as fast as possible and with as little memory as possible.
Maybe lru_cache can help?
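A minimal sketch of that idea: because from_idx only ever builds two distinct planes, wrapping it in functools.lru_cache makes every call with the same idx return one shared instance, so the values in plane_dict become references to just two objects (the ten million uuid key strings themselves still cost memory, though):

import functools

class Plane(object):
    def __init__(self, name, properties):
        self.name = name
        self.properties = properties

    @classmethod
    @functools.lru_cache(maxsize=None)  # same idx -> same cached instance
    def from_idx(cls, idx):
        if idx == 0:
            return cls("PaperPlane", [{"canFly": True}, {"isWaterProof": False}])
        if idx == 1:
            return cls("AirbusA380", [{"canFly": True}, {"isWaterProof": True}, {"hasPassengers": True}])

# Every call with the same idx now returns the very same object:
assert Plane.from_idx(0) is Plane.from_idx(0)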
Here is again the full code to play with:
https://pastebin.com/iTZyQQAU

Did you think about having another dictionary with idx -> plane? Then in self.plane_dict[plane_uuid] you would just store idx instead of the object. This will save memory and speed up your app, though you'd need to modify the lookup method.
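A sketch of that suggestion, assuming the Plane class from the question; lookup is a hypothetical accessor name:

import uuid
import numpy as np

class PlaneLookup(object):
    def __init__(self):
        self.plane_dict = {}    # uuid -> small int idx
        self.plane_types = {}   # idx -> one shared Plane instance

    def generate(self, n_planes):
        for i in range(n_planes):
            idx = np.random.randint(0, 2)
            if idx not in self.plane_types:
                self.plane_types[idx] = Plane.from_idx(idx)
            self.plane_dict[uuid.uuid4().hex] = idx

    def lookup(self, plane_id):
        # Two dict hops, but the stored values are plain ints instead of objects
        return self.plane_types[self.plane_dict[plane_id]]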


Organize functions that create or expand a text file using a class?

I'm brand new to classes and I don't really know when to use them. I want to write a program for simulating EPR/NMR spectra, which requires information about the simulated system. The relevant thing is this: I have a function called rel_inty(I_n,N) that calculates this relevant information from two values. The problem is that it becomes very slow when either of these values becomes large (I_n,N >= 10). That's why I opted for calculating rel_inty(I_n,N) beforehand for the most relevant combinations of (I_n,N) and saving them in a dictionary. I write that dictionary to a file and can import it using eval(), since calculating rel_inty(I_n,N) dynamically on each execution would be way too slow.
Now I had the following idea: what if I create a class manage_Dict(), whose methods can either recreate a basic dictionary with a basic() method, in case the old file somehow gets deleted, or expand the existing one with an expand() method, if the basic one doesn't contain a user-specified combination of (I_n,N)?
This would be the outline of that class:
class manage_Dict(args):
    def rel_inty(I_n, N):
        '''calculates relative intensities for a combination (I_n,N)'''

    def basic():
        '''creates a dict for preset combinations of I_n,N'''
        with open('SpinSys.txt', 'w') as outf:
            Dict = {}
            I_n_List = [somevalues]
            N_List = [somevalues]
            for I_n in I_n_List:
                Dict[I_n] = {}
                for N in N_List:
                    Dict[I_n][N] = rel_inty(I_n, N)
            outf.write(str(Dict))

    def expand(*args):
        '''expands the existing dict for all tuples (I_n,N) in *args'''
        with open('SpinSys.txt', 'r') as outf:
            Dict = eval(outf.read())
        for tup in args:
            I_n = tup[0]
            N = tup[1]
            Dict[I_n][N] = rel_inty(I_n, N)
        os.remove('SpinSys.txt')
        with open('SpinSys.txt', 'w') as outf:
            outf.write(str(Dict))
Usage:
'''Recreate SpinSys.txt if lost'''
manage_Dict.basic()
'''Expand SpinSys.txt in case of missing (I_n,N)'''
manage_Dict.expand((10,5),(11,3),(2,30))
Would this be a sensible solution? I was wondering because I usually see classes with self and __init__ creating an object instance, instead of just managing function calls.
If we are going to make use of an object, let's make sure it's doing some useful work for us and that the interface is nicer than just using functions. I'm going to suggest a few big tweaks that will make life easier:
We can subclass dict itself, so our object is a dict, as well as having all our custom fancy stuff
Use JSON instead of text files, so we can quickly, naturally and safely serialise and deserialise
import json

class SpectraDict(dict):
    PRE_CALC_I_N = ["...somevalues..."]
    PRE_CALC_N = ["...somevalues..."]

    def rel_inty(self, i_n, n):
        # Calculate and store results from the main function
        if i_n not in self:
            self[i_n] = {}
        if n not in self[i_n]:
            self[i_n][n] = self._calculate_rel_inty(i_n, n)
        return self[i_n][n]

    def _calculate_rel_inty(self, i_n, n):
        # Some exciting calculation here instead...
        return 0

    def pre_calculate(self):
        s_dict = SpectraDict()
        for i_n in self.PRE_CALC_I_N:
            for n in self.PRE_CALC_N:
                # Force the dict to calculate and store the values
                s_dict.rel_inty(i_n, n)
        return s_dict

    @classmethod
    def load(cls, json_file):
        with open(json_file) as fh:
            return SpectraDict(json.load(fh))

    def save(self, json_file):
        with open(json_file, 'w') as fh:
            json.dump(self, fh)
        return self
Now when we ask for values using the rel_inty() method, we immediately store the answer in ourselves before giving it back. This is called memoization / caching. Therefore, to pre-fill our object with the pre-calculated values, we just need to ask it for lots of answers and it will store them.
After that we can either load or save quite naturally using JSON:
# Bootstrapping from scratch:
s_dict = SpectraDict().pre_calculate().save('spin_sys.json')
# Loading and updating with new values
s_dict = SpectraDict.load('spin_sys.json')
s_dict.rel_inty(10, 45) # All your new calculations here...
s_dict.save('spin_sys.json')

Storing data for recalling functions in Python

I have a project in which I run multiple data through a specific function that "cleans" them.
The cleaning function looks like this:
Misc.py
import sys

def clean(my_data):
    sys.stdout.write("Cleaning genes...\n")
    synonyms = FileIO("raw_data/input_data", 3, header=False).openSynonyms()
    clean_data = {}
    for g in list(my_data):  # copy the keys, since we delete while iterating
        if g in synonyms:
            # Found a data point which appears in the synonym list.
            #print synonyms[g]
            for synonym in synonyms[g]:
                if synonym in my_data:
                    del my_data[synonym]
                    clean_data[g] = synonym
                    sys.stdout.write("\t%s is also known as %s\n" % (g, clean_data[g]))
    return my_data
FileIO is a custom class I made to open files.
My question is: this function will be called many times throughout the program's life cycle. What I want to achieve is to not have to read input_data every time, since it's going to be the same every time. I know that I can just return it and pass it as an argument, this way:
def clean(my_data, synonyms=None):
    if synonyms is None:
        ...
    else:
        ...
But is there another, better looking way of doing this?
My file structure is the following:
lib/
    Misc.py
    FileIO.py
    __init__.py
    ...
raw_data/
runme.py
From runme.py, I do from lib import * and call all the functions I made.
Is there a pythonic way to go about this? Like a 'memory' for the function.
Edit:
this line: synonyms = FileIO("raw_data/input_data", 3, header=False).openSynonyms() returns a collections.OrderedDict() built from input_data, using the 3rd column as the key of the dictionary.
The dictionary for the following dataset:
column1 column2 key data
... ... A B|E|Z
... ... B F|W
... ... C G|P
...
Will look like this:
OrderedDict([('A',['B','E','Z']), ('B',['F','W']), ('C',['G','P'])])
This tells my script that A is also known as B,E,Z. B as F,W. etc...
So these are the synonyms. Since the synonyms list will never change throughout the life of the code, I want to just read it once and re-use it.
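For reference, one minimal way to give a plain function that kind of 'memory' is functools.lru_cache on a zero-argument helper that does the file read (a sketch, assuming the FileIO class above; get_synonyms is a made-up helper name):

import functools

@functools.lru_cache(maxsize=None)
def get_synonyms():
    # The expensive read runs on the first call only;
    # every later call returns the cached OrderedDict.
    return FileIO("raw_data/input_data", 3, header=False).openSynonyms()

def clean(my_data):
    synonyms = get_synonyms()
    # ... cleaning logic as before ...
    return my_data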
Use a class with a __call__ operator. You can call objects of this class and store data between calls in the object. Some data can probably best be set up by the constructor. What you've made this way is known as a 'functor' or 'callable object'.
Example:
class Incrementer:
    def __init__(self, increment):
        self.increment = increment

    def __call__(self, number):
        return self.increment + number

incrementerBy1 = Incrementer(1)
incrementerBy2 = Incrementer(2)

print(incrementerBy1(3))
print(incrementerBy2(3))
Output:
4
5
[EDIT]
Note that you can combine the answer of @Tagc with my answer to create exactly what you're looking for: a 'function' with built-in memory.
Name your class Clean rather than DataCleaner, and name the instance clean. Name the method __call__ rather than clean.
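A sketch of that combination (assuming the FileIO class from the question; the cleaning body is abbreviated):

class Clean:
    def __init__(self, synonym_file):
        # The file is read once, when the instance is created
        self.synonyms = FileIO(synonym_file, 3, header=False).openSynonyms()

    def __call__(self, data):
        # Same cleaning logic as the original clean() function
        for g in list(data):
            if g in self.synonyms:
                for synonym in self.synonyms[g]:
                    if synonym in data:
                        del data[synonym]
        return data

clean = Clean('raw_data/input_data')  # reads the file once
clean(my_data)                        # afterwards, used like the original function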
Like a 'memory' for the function
Half-way to rediscovering object-oriented programming.
Encapsulate the data cleaning logic in a class, such as DataCleaner. Make it so that instances read synonym data once when instantiated and then retain that information as part of their state. Have the class expose a clean method that operates on the data:
class FileIO(object):
    def __init__(self, file_path, some_num, header):
        pass

    def openSynonyms(self):
        return []

class DataCleaner(object):
    def __init__(self, synonym_file):
        self.synonyms = FileIO(synonym_file, 3, header=False).openSynonyms()

    def clean(self, data):
        for g in data:
            if g in self.synonyms:
                # ...
                pass

if __name__ == '__main__':
    dataCleaner = DataCleaner('raw_data/input_file')
    dataCleaner.clean('some data here')
    dataCleaner.clean('some more data here')
As a possible future optimisation, you can expand on this approach to use a factory method to create instances of DataCleaner which can cache instances based on the synonym file provided (so you don't need to do expensive recomputation every time for the same file).
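A sketch of such a factory; the _instances cache and the for_file name are assumptions, not part of the answer above:

class DataCleaner(object):
    _instances = {}  # synonym_file -> cached DataCleaner

    def __init__(self, synonym_file):
        self.synonyms = FileIO(synonym_file, 3, header=False).openSynonyms()

    @classmethod
    def for_file(cls, synonym_file):
        # Build a DataCleaner only the first time a file is seen;
        # hand back the cached instance on every later call
        if synonym_file not in cls._instances:
            cls._instances[synonym_file] = cls(synonym_file)
        return cls._instances[synonym_file]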
I think the cleanest way to do this would be to decorate your "clean" (pun intended) function with another function that provides the synonyms local to the function. This is IMO cleaner and more concise than creating another custom class, yet still allows you to easily change the "input_data" file if you need to (factory function):
def defineSynonyms(datafile):
    def wrap(func):
        # Read the synonym file once, at decoration time,
        # instead of on every call to the wrapped function
        synonyms = FileIO(datafile, 3, header=False).openSynonyms()
        def wrapped(*args, **kwargs):
            kwargs['synonyms'] = synonyms
            return func(*args, **kwargs)
        return wrapped
    return wrap

@defineSynonyms("raw_data/input_data")
def clean(my_data, synonyms={}):
    # do stuff with synonyms and my_data...
    pass
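Usage then looks like a normal function call; the synonyms keyword is filled in by the decorator, so callers never pass it themselves:

clean({'A': 1, 'B': 2})  # synonyms is injected by defineSynonyms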

How to overwrite an existing dictionary in Python 3

Sorry if this is worded badly; I hope you can understand/edit my question to make it easier to understand.
I've been using Python's pickle to pickle/unpickle the state of the objects in a game (I do understand this is probably very inefficient and lazy, but it's only while I'm learning more Python). However, I encounter errors when doing this with the classes for presenting information.
I believe the root issue is that when I unpickle the save data to load it, it overwrites the existing dictionaries, but the unpickled objects are new objects at new memory locations, so the information class is trying to detect a room that the player can no longer enter, since the data was overwritten.
I've made a snippet to reproduce the issue I have:
import pickle

class A(object):
    def __init__(self):
        pass

obj_dict = {
    'a': A(),
    'b': A()
    ## etc.
}

d = obj_dict['a']

f = open('save', 'wb')
pickle.Pickler(f, 2).dump(obj_dict)
f.close()

f = open('save', 'rb')
obj_dict = pickle.load(f)
f.close()

if d == obj_dict['a']:
    print('success')
else:
    print(str(d) + '\n' + str(obj_dict['a']))
I understand this is probably to be expected when rewriting variables like this, but is there a way around it? Many thanks.
Is your issue that you want d == obj_dict['a'] to evaluate to true?
By default, the == equality check above compares the references of the two objects, i.e. do d and obj_dict['a'] point to the same chunk of memory?
When you un-pickle your object, it will be created as a new object, in a new chunk of memory and thus your equality check will fail.
You need to override how your equality check behaves to get the behavior you want. The methods you need to override are: __eq__ and __hash__.
In order to track your object through repeated pickling and un-pickling, you'll need to assign a unique id to the object on creation:
import uuid

class A:
    def __init__(self):
        self.id = uuid.uuid4()  # assign a unique, random id
Now you must override the methods mentioned above:
    def __eq__(self, other):
        # is the other object also an A, and does it have the same id?
        return isinstance(other, A) and self.id == other.id

    def __hash__(self):
        return hash(self.id)
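Putting it together, the round trip from the question now compares equal (a condensed sketch using pickle.dumps/pickle.loads in place of the save file):

import pickle
import uuid

class A:
    def __init__(self):
        self.id = uuid.uuid4()

    def __eq__(self, other):
        return isinstance(other, A) and self.id == other.id

    def __hash__(self):
        return hash(self.id)

d = A()
restored = pickle.loads(pickle.dumps(d, 2))
print(d == restored)  # True: the ids match, even though...
print(d is restored)  # False: ...it is a brand new object in memory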

Reading binary file to a list of structs, but deepcopy overwrites first structs

I am reading a binary file into a list of class instances. I have a loop that reads data from the file into an instance. When the instance is filled, I append the instance to a list and start reading again.
This works fine except that one of the elements of the instance is a Rect (i.e. rectangle), which is a user-defined type. Even with deepcopy, the attributes are overwritten.
There are work-arounds, like not having Rect be a user-defined type. However, I can see that this is a situation that I will encounter a lot and was hoping there was a straightforward solution that allows me to read nested types in a loop.
Here is some code:
from struct import unpack
import copy

class Rect:
    def __init__(self):
        self.L = 0

class groundtruthfile:
    def __init__(self):
        self.rect = Rect
        self.ht = int
        self.wt = int
        self.text = ''

...

data = []
g = groundtruthfile()

f = open("datafile.dtf", "rb")
length = unpack('i', f.read(4))
for i in range(1, length[0] + 1):  # length is a tuple
    g.rect.L = unpack('i', f.read(4))[0]
    ...
    data.append(copy.deepcopy(g))
The results of this are exactly what I want, except that all of the data[i].rect.L have the value of the last data read.
You have two problems here:
The rect attribute of a groundtruthfile instance (I'll just put this here...) is the Rect class itself, not an instance of that class - you should be doing:
self.rect = Rect() # note parentheses
to create an instance, instead (similarly e.g. self.ht = int sets that attribute to the integer class, not an instance); and
The line:
g.rect.L = unpack('i',f.read(4))[0]
explicitly modifies the attribute of the same groundtruthfile instance you've been using all along. You should move the line:
g = groundtruthfile()
inside the loop, so that you create a separate instance each time, rather than trying to create copies.
This is just a minimal fix - it would make sense to actually provide arguments to the various __init__ methods, for example, such that you can create instances in a more intuitive way.
Also, if you're not actually using i in the loop:
for _ in range(length[0]):
is neater than:
for i in range(1,length[0]+1):
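Putting both fixes together, the loop might look like this (a sketch; it assumes __init__ was changed to self.rect = Rect() as described, and the elided fields are left as a comment):

from struct import unpack

data = []
with open("datafile.dtf", "rb") as f:
    length = unpack('i', f.read(4))
    for _ in range(length[0]):
        g = groundtruthfile()  # a fresh instance on every iteration
        g.rect.L = unpack('i', f.read(4))[0]
        # ... read the remaining fields the same way ...
        data.append(g)         # no deepcopy needed any more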

In Python, is the method to load objects part of the class?

I want to read a text file, manipulate the fields a bit, and load them into instance variables for an object. Each row of the text would be stored in one object, so reading the whole file should return a list of objects.
Here's an example of the file:
L26 [coords]704:271[/coords] (1500)
L23 [coords]681:241[/coords] (400)
L20 [coords]709:229[/coords] (100)
And here's part of the current class definition:
class Poi(object):
    '''Points of Interest have a location, level and points'''
    def __init__(self, level, coords, points):
        self.level = level
        self.coordinates = coords
        self.points = points
I'm new to this, and probably overthinking it by a lot, but it seems like the method to read and write the list of Pois should be part of the Poi class. Is there a correct way to do that, or is the right answer to have a separate function like this one?
def load_poi_txt(source_file, source_dir):
    poi_list = []
    # ... fill poi_list here ...
    return poi_list
Both are correct, depending on what you want. Here's the method skeleton:
class Poi(object):
    ...

    @classmethod
    def load_from_txt(cls, source_file, source_dir):
        res = []
        while (still more to find):
            # find level, coords, and points
            res.append(cls(level, coords, points))
        return res
Note how it uses cls, which is the class the method is defined on. In this case it is Poi, but it could just as easily be a subclass of Poi defined later without needing to change the method itself.
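To make the skeleton concrete, here is one way the body could parse the sample lines from the question (a sketch; the regex and the single source_path parameter are my assumptions):

import re

class Poi(object):
    '''Points of Interest have a location, level and points'''
    def __init__(self, level, coords, points):
        self.level = level
        self.coordinates = coords
        self.points = points

    @classmethod
    def load_from_txt(cls, source_path):
        # Matches lines like: L26 [coords]704:271[/coords] (1500)
        pattern = re.compile(r'L(\d+) \[coords\](\d+):(\d+)\[/coords\] \((\d+)\)')
        res = []
        with open(source_path) as f:
            for line in f:
                m = pattern.match(line.strip())
                if m:
                    level, x, y, points = (int(v) for v in m.groups())
                    res.append(cls(level, (x, y), points))
        return res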
