Loading a pickled Python object uses an enormous amount of memory - python

I have a pickled Python object that produces a 180 MB file. When I unpickle it, memory usage explodes to 2 or 3 GB. Do you have similar experience? Is it normal?
The object is a tree built from dictionaries: each edge is a letter, and each node is a potential word. So to store a word you need as many edges as the word has letters; the first level has at most 26 nodes, the second 26^2, the third 26^3, and so on. For each node that is a word, I keep an attribute pointing to the information about the word (verb, noun, definition, etc.).
My words are at most about 40 characters long, and I have around half a million entries. Everything goes fine until I pickle it (using a simple cPickle dump): it gives a 180 MB file.
I am on Mac OS, and when I unpickle those 180 MB, the OS gives 2 or 3 GB of "memory / virtual memory" to the Python process :(
I don't see any recursion in this tree: the edges point to nodes that themselves hold their own children. No cycles are involved.
I am a bit stuck: loading those 180 MB takes around 20 seconds (not even speaking about the memory issue). I have to say my CPU is not that fast: a Core i5 at 1.3 GHz. But my hard drive is an SSD, and I only have 4 GB of memory.
To add these 500,000 words to my tree, I read about 7,000 files, each containing about 100 words. Reading them makes the memory allocated by Mac OS go up to 15 GB, mainly virtual memory :( I have been using the "with" statement to ensure each file gets closed, but it doesn't really help. Reading a file takes around 0.2 seconds for 40 KB, which seems quite long to me. Adding a word to the tree is much faster (0.002 seconds).
Finally, I wanted to build an object database, but I guess Python is not suitable for that. Maybe I will go for MongoDB :(
class Trie():
    """
    Class to store known entities / word / verbs...
    """
    longest_word = -1
    nb_entree = 0

    def __init__(self):
        self.children = {}
        self.isWord = False
        self.infos = []

    def add(self, orthographe, entree):
        """
        Store a string with the given type and definition in the Trie structure.
        """
        if len(orthographe) > Trie.longest_word:
            Trie.longest_word = len(orthographe)

        if len(orthographe) == 0:
            self.isWord = True
            self.infos.append(entree)
            Trie.nb_entree += 1
            return True

        car = orthographe[0]
        if car not in self.children.keys():
            self.children[car] = Trie()
        self.children[car].add(orthographe[1:], entree)
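For reference, here is a minimal usage sketch matching the description above; the word and the attached information are made up for illustration:
trie = Trie()
trie.add("chat", {"type": "noun", "definition": "cat (fr)"})

# Looking a word up means walking one child per letter:
node = trie
for letter in "chat":
    node = node.children[letter]
print(node.infos)   # [{'type': 'noun', 'definition': 'cat (fr)'}]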

Python objects, especially on a 64-bit machine, are very big. When pickled, an object gets a very compact representation that is suitable for a disk file. Here's an example of a disassembled pickle:
>>> pickle.dumps({'x':'y','z':{'x':'y'}},-1)
'\x80\x02}q\x00(U\x01xq\x01U\x01yq\x02U\x01zq\x03}q\x04h\x01h\x02su.'
>>> pickletools.dis(_)
0: \x80 PROTO 2
2: } EMPTY_DICT
3: q BINPUT 0
5: ( MARK
6: U SHORT_BINSTRING 'x'
9: q BINPUT 1
11: U SHORT_BINSTRING 'y'
14: q BINPUT 2
16: U SHORT_BINSTRING 'z'
19: q BINPUT 3
21: } EMPTY_DICT
22: q BINPUT 4
24: h BINGET 1
26: h BINGET 2
28: s SETITEM
29: u SETITEMS (MARK at 5)
30: . STOP
As you can see, it is very compact; nothing is repeated when it can be shared.
When in memory, however, an object consists of a fairly sizable number of pointers. Let's ask Python how big an empty dictionary is (64-bit machine):
>>> {}.__sizeof__()
248
Wow! 248 bytes for an empty dictionary! Note that the dictionary comes pre-allocated with room for up to eight elements, so you pay the same memory cost whether it holds zero, one, or eight entries.
A class instance contains one dictionary to hold the instance variables. Your Trie nodes each have an additional dictionary for the children. So, each instance costs you nearly 500 bytes. With an estimated 2-4 million Trie objects, you can easily see where your memory usage comes from.
You can mitigate this a bit by adding __slots__ to your Trie class to eliminate the instance dictionary. You'll probably save about 750 MB by doing this (my guess). It will prevent you from adding more instance variables to the Trie, but that is probably not a huge problem.
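A minimal sketch of that change, keeping the rest of the class from the question as-is (the attribute names are the ones already used there):
class Trie(object):
    # Only these three attributes can exist on an instance; no per-instance __dict__.
    __slots__ = ('children', 'isWord', 'infos')

    longest_word = -1   # class-level attributes are unaffected by __slots__
    nb_entree = 0

    def __init__(self):
        self.children = {}
        self.isWord = False
        self.infos = []

    # add() stays exactly as in the question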

Do you really need to load or dump all of it in memory at once? If you don't need all of it in memory, but only the parts you want at any given time, you may want to map your dictionary to a set of files on disk instead of a single file… or map the dict to a database table. So, if you are looking for something that saves large dictionaries of data to disk or to a database, and can utilize pickling and encoding (codecs and hashmaps), then you might want to look at klepto.
klepto provides a dictionary abstraction for writing to a database, including treating your filesystem as a database (i.e. writing the entire dictionary to a single file, or writing each entry to its own file). For large data, I often choose to represent the dictionary as a directory on my filesystem, and have each entry be a file. klepto also offers caching algorithms, so if you are using a filesystem backend for the dictionary you can avoid some speed penalty by utilizing memory caching.
>>> from klepto.archives import dir_archive
>>> d = {'a':1, 'b':2, 'c':map, 'd':None}
>>> # map a dict to a filesystem directory
>>> demo = dir_archive('demo', d, serialized=True)
>>> demo['a']
1
>>> demo['c']
<built-in function map>
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> # is set to cache to memory, so use 'dump' to dump to the filesystem
>>> demo.dump()
>>> del demo
>>>
>>> demo = dir_archive('demo', {}, serialized=True)
>>> demo
dir_archive('demo', {}, cached=True)
>>> # demo is empty, load from disk
>>> demo.load()
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> demo['c']
<built-in function map>
>>>
klepto also has other flags such as compression and memmode that can be used to customize how your data is stored (e.g. compression level, memory map mode, etc).
It's equally easy (the same exact interface) to use a (MySQL, etc) database as a backend instead of your filesystem. You can also turn off memory caching, so every read/write goes directly to the archive, simply by setting cached=False.
klepto also provides a number of caching algorithms (like mru, lru, lfu, etc.) to help you manage your in-memory cache, and it will use the chosen algorithm to do the dump and load to the archive backend for you.
You can use the flag cached=False to turn off memory caching completely and read and write directly to and from disk or database. If your entries are large enough, you might choose to write to disk and put each entry in its own file. Here's an example that does both.
>>> from klepto.archives import dir_archive
>>> # does not hold entries in memory, each entry will be stored on disk
>>> demo = dir_archive('demo', {}, serialized=True, cached=False)
>>> demo['a'] = 10
>>> demo['b'] = 20
>>> demo['c'] = min
>>> demo['d'] = [1,2,3]
However, while this should greatly reduce load time, it might slow overall execution down a bit… it's usually better to specify a maximum amount to hold in the memory cache and pick a good caching algorithm. You have to play with it to find the right balance for your needs.
Get klepto here: https://github.com/uqfoundation

Related

Python Record Linkage Multiple Cores

Does anyone have experience using the Record Linkage Toolkit with extremely large datasets? I have a few questions. Ultimately, I need to deploy it to an EC2 instance, but for now I'm trying to figure out how to take advantage of parallel processing - I'll want to do the same on EC2.
When specifying the number of cores (n_jobs), the code actually runs significantly SLOWER than if I don't specify multiple cores.
compare_dupes = rl.Compare(n_jobs=12)
Related to this - I am working on a record set with 12 million customer records that need to be deduped. Currently I'm blocking on first name, last name, and zip code.
However, the index of potential record pairs is still so large that it causes memory failures. I have tried Dask - no luck. I'm not sure what else to try. Anyone have suggestions? My code looks like this:
# this section creates a huge multi-index, which causes memory failures
dupe_indexer = rl.Index()
dupe_indexer.block(['first_name_clean', 'last_name_clean', 'zip_clean'])
dupe_candidate_links = dupe_indexer.index(df_c)

# I can put n_jobs=12 (the number of cores) in the Compare function below,
# but for some reason it actually performs worse
compare_dupes = rl.Compare()
compare_dupes.string('first_name_clean', 'first_name_clean', method='jarowinkler', threshold=0.85, label='first_name_cl')
compare_dupes.string('last_name_clean', 'last_name_clean', method='jarowinkler', threshold=0.85, label='last_name_cl')
compare_dupes.string('email', 'email', method='jarowinkler', threshold=0.90, label='email_cl')
compare_dupes.string('address_clean', 'address_clean', method='damerau_levenshtein', threshold=0.6, label='address_cl')
compare_dupes.string('zip_clean', 'zip_clean', method='jarowinkler', threshold=0.90, label='zip_cl')

dupe_features = compare_dupes.compute(dupe_candidate_links, df_c).reset_index()
I have also tried the "index_split" method:
s = rl.index_split(dupe_candidate_links, 100)
for chunk in s:
    ...  # compute features for this chunk (see the sketch at the end of this question)
This works for reasonably sized data sets (under 2 million records), but once the dataset grows to 5, 8, 10, 15, 20 million records, even the index itself can't fit into memory.
Thanks for any support!
indexer = recordlinkage.Index()
#Create indexing object
indexer = rl.SortedNeighbourhoodIndex(on='X')
# Create pandas MultiIndex containing candidate links
candidate_links = indexer.index(A, B)
%%time
comp = recordlinkage.Compare()
comp.string('X', 'X', method='jarowinkler', threshold=0.60)
mymatchestwonew = comp.compute(candidate_links, A, B)
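For what it's worth, a hedged sketch of the chunked approach mentioned above: compute the features per slice of the candidate links and concatenate the pieces (df_c, compare_dupes, rl and dupe_candidate_links are as defined earlier in this question; the chunk count of 100 is arbitrary):
import pandas as pd

chunk_features = []
for chunk in rl.index_split(dupe_candidate_links, 100):
    # only this slice of candidate pairs is compared at a time
    chunk_features.append(compare_dupes.compute(chunk, df_c))
dupe_features = pd.concat(chunk_features).reset_index()
This keeps only one slice of comparisons in flight at a time, although the concatenated result still has to fit in memory; it could also be appended to disk chunk by chunk instead.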

What is the fastest way to sort and unpack a large bytearray?

I have a large binary file that needs to be converted into the HDF5 file format.
I am using Python 3.6. My idea is to read in the file, sort the relevant information, unpack it and store it away. My information is stored as follows: 8 bytes of time are followed by 2 bytes of energy and then 2 bytes of extra information, then again time, ... My current way of doing it is the following (my information is read as a bytearray named byte_array):
for i in range(0, len(byte_array)+1, 12):
    if i == 0:
        timestamp_bytes = byte_array[i:i+8]
        energy_bytes = byte_array[i+8:i+10]
        extras_bytes = byte_array[i+10:i+12]
    else:
        timestamp_bytes += byte_array[i:i+8]
        energy_bytes += byte_array[i+8:i+10]
        extras_bytes += byte_array[i+10:i+12]

timestamp_array = np.ndarray((len(timestamp_bytes) // 8,), '<Q', timestamp_bytes)
energy_array = np.ndarray((len(energy_bytes) // 2,), '<h', energy_bytes)
extras_array = np.ndarray((len(timestamp_bytes) // 8,), '<H', extras_bytes)
I assume there is a much faster way of doing this, maybe by avoiding looping over the whole thing. My files are up to 15 GB in size, so every bit of improvement would help a lot.
You should be able to just tell NumPy to interpret the data as a structured array and extract fields:
as_structured = numpy.ndarray(shape=(len(byte_array)//12,),
                              dtype='<Q, <h, <H',
                              buffer=byte_array)

timestamps = as_structured['f0']
energies = as_structured['f1']
extras = as_structured['f2']
This will produce three arrays backed by the input bytearray. Creating these arrays should be effectively instant, but I can't guarantee that working with them will be fast - I think NumPy may need to do some implicit copying to handle alignment issues with these arrays. It's possible (I don't know) that explicitly copying them yourself with .copy() first might speed things up.
You can use numpy.frombuffer with a custom datatype:
import struct
import random
import numpy as np

data = [
    (random.randint(0, 255**8), random.randint(0, 255*255), random.randint(0, 255*255))
    for _ in range(20)
]
Bytes = b''.join(struct.pack('<Q2H', *row) for row in data)

dtype = np.dtype([('time', np.uint64),
                  ('energy', np.uint16),  # you may need to change that to `np.int16`, if energy can be negative
                  ('extras', np.uint16)])

original = np.array(data, dtype=np.uint64)
result = np.frombuffer(Bytes, dtype)

print((result['time'] == original[:, 0]).all())
print((result['energy'] == original[:, 1]).all())
print((result['extras'] == original[:, 2]).all())
print(result)
Example output:
True
True
True
[(6048800706604665320, 52635, 291) (8427097887613035313, 15520, 4976)
(3250665110135380002, 44078, 63748) (17867295175506485743, 53323, 293)
(7840430102298790024, 38161, 27601) (15927595121394361471, 47152, 40296)
(8882783920163363834, 3480, 46666) (15102082728995819558, 25348, 3492)
(14964201209703818097, 60557, 4445) (11285466269736808083, 64496, 52086)
(6776526382025956941, 63096, 57267) (5265981349217761773, 19503, 32500)
(16839331389597634577, 49067, 46000) (16893396755393998689, 31922, 14228)
(15428810261434211689, 32003, 61458) (5502680334984414629, 59013, 42330)
(6325789410021178213, 25515, 49850) (6328332306678721373, 59019, 64106)
(3222979511295721944, 26445, 37703) (4490370317582410310, 52413, 25364)]
I'm not an expert on numpy, but here's my 5 cents:
You have lots of data, and probably it's more than your RAM.
This points to the simplest solution - don't try to fit all data in your program.
When you read a file into a variable, those X GB get read into RAM. If that is more than the available RAM, your OS starts swapping. Swapping slows you down: on top of the disk reads from the source file, you now also get disk writes to dump RAM contents into the swap file.
Instead, write the script so that it reads only the parts of the input file it needs at the moment (in your case you read the file sequentially anyway and don't go back or jump far ahead).
Try opening the input file as a memory-mapped data structure (note the differences in usage between Unix and Windows environments).
Then you can do a simple read([n]) of a few bytes at a time and append that to your arrays.
Behind the scenes, data is read into RAM page by page as needed and will not exceed the available memory, leaving more room for your arrays to grow.
Also consider that your resulting arrays can themselves outgrow RAM, which will cause a similar slowdown to reading a big file.
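A minimal sketch of that idea using numpy.memmap with the 12-byte record layout from the question (the filename is a placeholder):
import numpy as np

# 8-byte timestamp, 2-byte signed energy, 2-byte extras = 12 bytes per record
record_dtype = np.dtype([('time', '<u8'), ('energy', '<i2'), ('extras', '<u2')])

# The file is mapped, not loaded: pages are pulled into RAM only when touched.
records = np.memmap('data.bin', dtype=record_dtype, mode='r')

timestamps = records['time']   # these are views into the mapped file
energies = records['energy']
extras = records['extras']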

Fill up a dictionary in parallel with multiprocessing

Yesterday I asked a question: Reading data in parallel with multiprocess
I got very good answers, and I implemented the solution mentioned in the answer I marked as correct.
def read_energies(motif):
    os.chdir("blabla/working_directory")
    complx_ener = pd.DataFrame()
    # complex function to fill that dataframe
    lig_ener = pd.DataFrame()
    # complex function to fill that dataframe
    return motif, complx_ener, lig_ener

COMPLEX_ENERGIS = {}
LIGAND_ENERGIES = {}

p = multiprocessing.Pool(processes=CPU)
for x in p.imap_unordered(read_energies, peptide_kd.keys()):
    COMPLEX_ENERGIS[x[0]] = x[1]
    LIGAND_ENERGIES[x[0]] = x[2]
However, this solution takes the same amount of time as if I just iterated over peptide_kd.keys() and filled up the DataFrames one by one. Why is that? Is there a way to fill up the desired dicts in parallel and actually get a speedup? I am running it on a 48-core HPC.
You are incurring a good amount of overhead in (1) starting up each process, and (2) having to copy the pandas.DataFrame (etc.) across several processes. If you just need to have a dict filled in parallel, I'd suggest using a shared memory dict. If no key will be overwritten, then it's easy and you don't have to worry about locks.
(Note I'm using multiprocess below, which is a fork of multiprocessing -- but only so I can demonstrate from the interpreter, otherwise, you'd have to do the below from __main__).
>>> from multiprocess import Process, Manager
>>>
>>> def f(d, x):
...     d[x] = x**2
...
>>> manager = Manager()
>>> d = manager.dict()
>>> job = [Process(target=f, args=(d, i)) for i in range(5)]
>>> _ = [p.start() for p in job]
>>> _ = [p.join() for p in job]
>>> print d
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
This solution doesn't make copies of the dict to share across processes, so that part of the overhead is reduced. For large objects like a pandas.DataFrame, it can be significant compared to the cost of a simple operation like x**2. Similarly, spawning a Process can take time, and you may be able to do the above even faster (for lightweight objects) by using threads (i.e. from multiprocess.dummy instead of multiprocess, for either your originally posted solution or mine above).
If you do need to share DataFrames (as your code suggests instead of as the question asks), you might be able to do it by creating a shared memory numpy.ndarray.
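A minimal sketch of that last idea using the standard library's multiprocessing.Array wrapped as a NumPy view; the shape and the per-task "computation" are placeholders, and it assumes your results can be expressed as plain numeric arrays:
import multiprocessing as mp
import numpy as np

N_TASKS, N_COLS = 8, 4

def worker(shared, i):
    # Re-wrap the shared buffer as a NumPy array inside the child process.
    arr = np.frombuffer(shared.get_obj(), dtype='d').reshape(N_TASKS, N_COLS)
    arr[i, :] = i  # placeholder for the real computation; each task owns one row

if __name__ == '__main__':
    shared = mp.Array('d', N_TASKS * N_COLS)  # one flat shared buffer
    procs = [mp.Process(target=worker, args=(shared, i)) for i in range(N_TASKS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    result = np.frombuffer(shared.get_obj(), dtype='d').reshape(N_TASKS, N_COLS)
    print(result)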

Python memoryerror creating large dictionary

I am trying to process a 3 GB XML file, and am getting a MemoryError in the middle of a loop that reads the file and stores some data in a dictionary.
class Node(object):
    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0

context = cElementTree.iterparse(raw_osm_file, events=("start", "end"))
context = iter(context)
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "node":
        lat = float(elem.get('lat'))
        lon = float(elem.get('lon'))
        osm_id = int(elem.get('id'))
        nodes[osm_id] = Node(osm_id, lat, lon)
        root.clear()
I'm using an iterative parsing method so the issue isn't with reading the file. I just want to store the data in a dictionary for later processing, but it seems the dictionary is getting too large. Later in the program I read in links and need to check if the nodes referenced by the links were in the initial batch of nodes, which is why I am storing them in a dictionary.
How can I either greatly reduce the memory footprint (the script isn't even getting close to finishing, so shaving bits and pieces off won't help much) or greatly increase the amount of memory available to Python? Monitoring the memory usage, it looks like Python is pooping out at about 1950 MB, while my computer still has about 6 GB of RAM available.
Assuming you have tons of Nodes being created, you might consider using __slots__ to predefine a fixed set of attributes for each Node. This removes the overhead of storing a per-instance __dict__ (in exchange for preventing the creation of undeclared attributes) and can easily cut memory usage per Node by a factor of ~5x (less on Python 3.3+ where shared key __dict__ reduces the per-instance memory cost for free).
It's easy to do, just change the declaration of Node to:
class Node(object):
    __slots__ = 'osmid', 'latitude', 'longitude', 'count'

    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0
For example, on Python 3.5 (where shared key dictionaries already save you something), the difference in object overhead can be seen with:
>>> import sys
>>> ... define Node without __slots__ ...
>>> n = Node(1,2,3)
>>> sys.getsizeof(n) + sys.getsizeof(n.__dict__)
248
>>> ... define Node with __slots__ ...
>>> n = Node(1,2,3)
>>> sys.getsizeof(n)  # It has no __dict__ now
72
And remember, this is Python 3.5 with shared key dictionaries; in Python 2, the per-instance cost with __slots__ would be similar (one pointer sized variable larger IIRC), while the cost without __slots__ would go up by a few hundred bytes.
Also, assuming you're on a 64 bit OS, make sure you've installed the 64 bit version of Python to match the 64 bit OS; otherwise, Python will be limited to ~2 GB of virtual address space, and your 6 GB of RAM counts for very little.
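A quick way to check which build of Python you are actually running:
import sys
print("64-bit Python" if sys.maxsize > 2**32 else "32-bit Python")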

Flyweight pattern - Memory footprint

I'm learning Python and thought it would be a nice excuse to refresh my knowledge of design patterns, in this case the Flyweight pattern.
I created two small programs, one that is not optimized and one that implements the Flyweight pattern. For my tests, I'm creating an army of 1,000,000 Enemy objects. Each enemy can be one of three types (Soldier, Ninja or Chief), and I assign a motto to each type.
What I would like to verify is that with my unoptimized program I get 1,000,000 enemies, each of them carrying a type and a "long" string containing the motto.
With the optimized code, I'd like to create only three EnemyType objects, one per type, so that the motto strings exist only three times. Each Enemy then holds a member pointing to the desired EnemyType.
Now the code (excerpts only):
Un-optimized program
In this version, each enemy stores its type and motto.
enemyList = []
enemyTypes = {'Soldier': 'Sir, yes sir!', 'Ninja': 'Always behind you !', 'Chief': 'Always behind ... lot of lines '}

for i in range(1000000):
    randomPosX = 0  # random.choice(range(1000))
    randomPosY = 0  # random.choice(range(1000))
    randomTypeIndex = 0  # random.choice(range(0, len(enemyTypes)))
    enemyType = enemyTypes.keys()[randomTypeIndex]
    # Here, the type and motto are parameters of EACH enemy object.
    enemyList.append(Enemy(randomPosX, randomPosY, enemyType, enemyTypes[enemyType]))
Optimized program
In this version, each enemy holds a reference to an EnemyType object that stores its type and motto. Only three instances of EnemyType are created, and I should see the impact in my memory footprint.
enemyList = []
soldierEnemy = EnemyType('Soldier', 'Sir, yes sir!')
ninjaEnemy = EnemyType('Ninja', 'Always behind you !')
chiefEnemy = EnemyType('Chief', 'Always behind ... lot of lines.')
enemyTypes = {'Soldier': soldierEnemy, 'Ninja': ninjaEnemy, 'Chief': chiefEnemy}
enemyCount = {}

for i in range(1000000):
    randomPosX = 0  # random.choice(range(1000))
    randomPosY = 0  # random.choice(range(1000))
    randomTypeIndex = 0  # random.choice(range(0, len(enemyTypes)))
    enemyType = enemyTypes.values()[randomTypeIndex]
    # Here, each enemy only holds a reference to its type.
    enemyList.append(Enemy(randomPosX, randomPosY, enemyType))
Now I'm using this to get my memory footprint (at the very last lines before my application closes itself):
import os
import psutil
...
# return the memory usage in MB
process = psutil.Process(os.getpid())
print process.get_memory_info()[0] / float(2 ** 20)
My problem is that I don't see any difference between the outputs of my two programs:
Optimized = 384.0859375 MB
Un-optimized = 383.40234375 MB
Is this the proper tool to measure the memory footprint? I'm new to Python, so it could be a problem with my code, but I checked my EnemyType objects in the second solution and I indeed have only three instances. I should therefore have 3 motto strings instead of 1,000,000.
I've read about a tool called Heapy for Python; would it be more accurate here?
As far as I could tell from the code in the question, in both cases you're just using references for the same small number of instances anyway.
Take the "unoptimized" version's:
enemyList.append(Enemy(randomPosX, randomPosY, enemyType, enemyTypes[enemyType]))
Indeed, enemyTypes[enemyType] is a string, which might have made you think you have many string instances. But in reality, each of your objects holds a reference to one of the same three string objects.
You can check this by comparing the ids of the members. Make a set of the ids, and see if it is larger than 3.
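A minimal sketch of that check, assuming the Enemy instances expose the motto under an attribute named motto (the real attribute name isn't shown in the question):
distinct_mottos = {id(enemy.motto) for enemy in enemyList}
print(len(distinct_mottos))  # expect 3 distinct string objects, not 1,000,000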
