I am trying to process a 3 GB XML file, and I am getting a MemoryError in the middle of a loop that reads the file and stores some data in a dictionary.
import xml.etree.cElementTree as cElementTree

class Node(object):
    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0

nodes = {}

context = cElementTree.iterparse(raw_osm_file, events=("start", "end"))
context = iter(context)
event, root = context.next()
for event, elem in context:
    if event == "end" and elem.tag == "node":
        lat = float(elem.get('lat'))
        lon = float(elem.get('lon'))
        osm_id = int(elem.get('id'))
        nodes[osm_id] = Node(osm_id, lat, lon)
        root.clear()
I'm using an iterative parsing method so the issue isn't with reading the file. I just want to store the data in a dictionary for later processing, but it seems the dictionary is getting too large. Later in the program I read in links and need to check if the nodes referenced by the links were in the initial batch of nodes, which is why I am storing them in a dictionary.
How can I either greatly reduce the memory footprint (the script isn't even getting close to finishing, so shaving off bits and pieces won't help much) or greatly increase the amount of memory available to Python? Monitoring the memory usage, it looks like Python tops out at about 1950 MB, while my computer still has about 6 GB of RAM available.
Assuming you have tons of Node instances being created, you might consider using __slots__ to predefine a fixed set of attributes for each Node. This removes the overhead of storing a per-instance __dict__ (in exchange for preventing the creation of undeclared attributes) and can easily cut memory usage per Node by a factor of ~5x (less on Python 3.3+, where shared-key dictionaries reduce the per-instance memory cost for free).
It's easy to do, just change the declaration of Node to:
class Node(object):
    __slots__ = 'osmid', 'latitude', 'longitude', 'count'

    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0
For example, on Python 3.5 (where shared key dictionaries already save you something), the difference in object overhead can be seen with:
>>> import sys
>>> ... define Node without __slots__
>>> n = Node(1,2,3)
>>> sys.getsizeof(n) + sys.getsizeof(n.__dict__)
248
>>> ... define Node with __slots__
>>> n = Node(1,2,3)
>>> sys.getsizeof(n) # It has no __dict__ now
72
And remember, this is Python 3.5 with shared key dictionaries; in Python 2, the per-instance cost with __slots__ would be similar (one pointer sized variable larger IIRC), while the cost without __slots__ would go up by a few hundred bytes.
Also, assuming you're on a 64 bit OS, make sure you've installed the 64 bit version of Python to match the 64 bit OS; otherwise, Python will be limited to ~2 GB of virtual address space, and your 6 GB of RAM counts for very little.
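As a quick sanity check, you can ask the running interpreter itself whether it is a 64-bit build; this is a minimal sketch using only the standard library:

import struct
import sys

# True on a 64-bit build of Python, False on a 32-bit build
print(sys.maxsize > 2 ** 32)

# pointer size in bits: 64 on a 64-bit interpreter, 32 on a 32-bit one
print(struct.calcsize("P") * 8)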
Related
Does anyone have experience using the Record Linkage Toolkit with extremely large datasets? I have a few questions. Ultimately, I need to deploy it to an EC2 instance, but for now I'm trying to figure out how to take advantage of parallel processing - I'll want to do the same on EC2.
When I specify the number of cores (n_jobs), the code actually runs significantly slower than if I don't specify multiple cores.
compare_dupes = rl.Compare(n_jobs=12)
Related to this - I am working on a record set with 12 million customer records that need to be deduped. Currently I'm blocking on first name, last name, and zip code.
However, the index of potential record pairs is still so large that it causes memory failures. I have tried Dask - no luck. I'm not sure what else to try. Anyone have suggestions? My code looks like this:
# this section creates a huge multi-index, which causes memory failures
dupe_indexer = rl.Index()
dupe_indexer.block(['first_name_clean', 'last_name_clean', 'zip_clean'])
dupe_candidate_links = dupe_indexer.index(df_c)

# I can put n_jobs=12 (the number of cores) in the Compare function below,
# but for some reason it actually performs worse
compare_dupes = rl.Compare()
compare_dupes.string('first_name_clean', 'first_name_clean', method='jarowinkler', threshold=0.85, label='first_name_cl')
compare_dupes.string('last_name_clean', 'last_name_clean', method='jarowinkler', threshold=0.85, label='last_name_cl')
compare_dupes.string('email', 'email', method='jarowinkler', threshold=0.90, label='email_cl')
compare_dupes.string('address_clean', 'address_clean', method='damerau_levenshtein', threshold=0.6, label='address_cl')
compare_dupes.string('zip_clean', 'zip_clean', method='jarowinkler', threshold=0.90, label='zip_cl')

dupe_features = compare_dupes.compute(dupe_candidate_links, df_c).reset_index()
I have also tried the "index_split" method:
s = rl.index_split(dupe_candidate_links, 100)
for chunk in s:
    ...  # body omitted in the question
which works for reasonably sized data sets (fewer than about 2 million records), but when the size of the dataset goes beyond that - 5, 8, 10, 15, 20 million - even the index can't fit into memory.
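For illustration, here is a sketch of how chunk-wise comparison with index_split could look, so that only one slice of candidate pairs is scored at a time (the chunk count of 100 and the concatenation step are assumptions; this only bounds memory during the compare step and does not help if building dupe_candidate_links itself already overflows memory):

import pandas as pd
import recordlinkage as rl

results = []
for chunk in rl.index_split(dupe_candidate_links, 100):
    # score only this slice of candidate pairs, then keep the result
    results.append(compare_dupes.compute(chunk, df_c))

# combine the per-chunk feature frames at the end
dupe_features = pd.concat(results).reset_index()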
Thanks for any support!
indexer = recordlinkage.Index()

# Create indexing object
indexer = rl.SortedNeighbourhoodIndex(on='X')

# Create pandas MultiIndex containing candidate links
candidate_links = indexer.index(A, B)
%%time
comp = recordlinkage.Compare()
comp.string('X', 'X', method='jarowinkler', threshold=0.60)
mymatchestwonew = comp.compute(candidate_links, A, B)
I am working with a lot of objects that have some attributes as well as numpy arrays (images, masks, etc.). I want to dump them onto disk during program execution to save memory, and I want to append more data as it becomes available (during the same program execution) without loading the dumped object back into memory.
The problem is that appending data to a serialized/pickled file cannot be done without first loading it into memory. How can I save/update these objects during program execution without loading the whole object? Any idea is welcome.
The below is pseudocode.
class StoredObject():
    def __init__(self, centroid, _image, _color, _bbox, _type, _mask):
        self.centroids = [centroid]
        self.bboxes = [_bbox]
        self.track_color = random_color()
        self.color = _color
        self.images = [_image]
        self.type = _type
        self.last_appear = time.time()
        self.masks = [_mask]

store = []

track_objects(obj, obj_image, obj_mask):
    if obj already belongs to store:
        find where it is stored earlier
        and add obj_image and obj_mask to its obj_image list
        and obj_mask list respectively
    else:
        add obj(obj_image, obj_mask) in store
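One common way to get append-without-reload at the file level, offered as a hedged sketch rather than a definitive answer (the file name and record layout are assumptions for illustration), is to append each update as its own pickle record to the same file and stream the records back one at a time when they are finally needed:

import pickle

def append_record(path, record):
    # each call appends one independent pickle frame to the file;
    # nothing previously written needs to be loaded
    with open(path, 'ab') as f:
        pickle.dump(record, f, protocol=pickle.HIGHEST_PROTOCOL)

def iter_records(path):
    # stream the records back one at a time instead of loading them all
    with open(path, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                break

# usage: append as data arrives, read back lazily later
append_record('objects.pkl', {'id': 1, 'mask': None})
append_record('objects.pkl', {'id': 1, 'image': None})
for rec in iter_records('objects.pkl'):
    print(rec)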
My object looks like this:
class Note(object):
    def __init__(self, note, vel, t, tOff=0):
        self.note = note    # ubyte
        self.vel = vel      # ubyte
        self.t = t          # float
        self.tOff = tOff    # float
(The type indications show the precision required for each field, rather than how Python is actually storing these!)
My program constructs a list of possibly several thousand Note objects.
I need to convert this list into a string, so that I can AJAX it to the server for storage (and subsequently retrieve it and convert it back to the original data structure).
I'm using Brython, which implements Python's json library (I've tested: import json works), so I suspect JSON is my best bet.
But Brython is not a full CPython implementation, so I probably can't import non-core libraries. And it looks as though I can't do anything fancy like use __slots__ to make a storage-efficient class. (Brython maps Python constructs onto appropriate JavaScript constructs.)
In theory I should be able to get each note down to 10 bytes, but I am aiming for lean code offering reasonably compact storage rather than ultimate compactness. I would, however, like to avoid massive inefficiency such as storing each note as a set of key-value pairs -- i.e. the keys would be duplicated for every note.
If I could see the range of solutions available to me, I could choose an appropriate complexity vs compactness trade-off. That is to say, I would be grateful for an answer anywhere on the continuum.
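One point on that continuum, sketched under the assumption that plain json is available in Brython, is to serialize each Note as a fixed-order list rather than a dict, so the field names are never repeated:

import json

def notes_to_json(notes):
    # each note becomes [note, vel, t, tOff]; field names appear nowhere
    return json.dumps([[n.note, n.vel, n.t, n.tOff] for n in notes])

def notes_from_json(text):
    return [Note(note, vel, t, tOff) for note, vel, t, tOff in json.loads(text)]

notes = [Note(60, 100, 0.0, 0.5), Note(64, 90, 0.5, 1.0)]
payload = notes_to_json(notes)      # a plain string, ready to AJAX to the server
restored = notes_from_json(payload)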
A quick test using struct seems to give you a possible length of 12 bytes as follows:
import struct

class Note(object):
    def __init__(self, note, vel, t, tOff=0):
        self.note = note    # ubyte
        self.vel = vel      # ubyte
        self.t = t          # float
        self.tOff = tOff    # float

    def pack(self):
        return struct.pack("BBff", self.note, self.vel, self.t, self.tOff)

    def unpack(self, packed):
        self.note, self.vel, self.t, self.tOff = struct.unpack("BBff", packed)

note = Note(10, 250, 2.9394286605624826e+32, 1.46971433028e+32)
packed = note.pack()
print "Packed length:", len(packed)

note2 = Note(0, 0, 0)
note2.unpack(packed)
print note2.note, note2.vel, note2.t, note2.tOff
This displays:
Packed length: 12
10 250 2.93942866056e+32 1.46971433028e+32
You might be able to compact it further depending on the kind of floats you need, i.e. is some kind of fixed-point representation possible?
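For example, if the times fit a known range at a fixed resolution, you could store them as scaled unsigned shorts instead of 4-byte floats. A sketch, where the 1 ms tick size and the resulting 16-bit range are assumptions:

import struct

TICKS_PER_SECOND = 1000  # assumed resolution: 1 ms, which limits times to ~65.5 s

def pack_fixed(note):
    # B, B, H, H -> 1 + 1 + 2 + 2 = 6 bytes per note
    return struct.pack("BBHH",
                       note.note,
                       note.vel,
                       int(round(note.t * TICKS_PER_SECOND)),
                       int(round(note.tOff * TICKS_PER_SECOND)))

def unpack_fixed(packed):
    n, v, t, tOff = struct.unpack("BBHH", packed)
    return Note(n, v, t / float(TICKS_PER_SECOND), tOff / float(TICKS_PER_SECOND))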
To pack a list of notes, something like the following could be used:
notes = [1,2,3,4,5,6,7,8,9,10]
print struct.pack("{}B".format(len(notes)), *notes)
But the unpack format then needs to specify the same length. Alternatively, you could add a length byte to aid with unpacking:
notes = [1,2,3,4,5,6,7,8,9,10]
packed = struct.pack("B{}B".format(len(notes)), len(notes), *notes)
length = struct.unpack("B", packed[0])[0]
print struct.unpack("{}B".format(length), packed[1:])
This would display the correctly unpacked data:
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
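Putting the pieces together, one possible approach (a sketch in the same Python 2 style as the answer, assuming the 12-byte Note.pack from above and that base64 is acceptable for shipping binary data as a string over AJAX) is to concatenate the packed notes behind a count and encode the result:

import base64
import struct

def pack_notes(notes):
    # a 2-byte count followed by len(notes) fixed-size 12-byte records
    blob = struct.pack("H", len(notes)) + "".join(n.pack() for n in notes)
    return base64.b64encode(blob)  # plain ASCII string, safe to send via AJAX

def unpack_notes(text):
    blob = base64.b64decode(text)
    count = struct.unpack("H", blob[:2])[0]
    notes = []
    for i in range(count):
        n = Note(0, 0, 0)
        n.unpack(blob[2 + i * 12 : 2 + (i + 1) * 12])
        notes.append(n)
    return notes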
I'm learning Python and thought it would be a nice excuse to refresh my knowledge of design patterns, in this case the Flyweight pattern.
I created two small programs, one that is not optimized and one that implements the Flyweight pattern. For my tests, I create an army of 1'000'000 Enemy objects. Each enemy can be one of three types (Soldier, Ninja or Chief), and I assign a motto to each type.
What I would like to check is that, with my un-optimized program, I get 1'000'000 enemies with, for each and every one of them, a type and a "long" string containing the motto.
With the optimized code, I'd like to create only three EnemyType objects, one per type, so the motto strings exist only three times. Each Enemy then just holds a member pointing to the desired EnemyType.
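For context, here is a minimal sketch of the Enemy and EnemyType classes implied by the excerpts below; the actual definitions are not shown in the question, so the attribute names are assumptions:

class EnemyType(object):
    # shared (flyweight) state: one instance per enemy type
    def __init__(self, name, motto):
        self.name = name
        self.motto = motto

class Enemy(object):
    # per-enemy state; the optimized version only keeps a reference to an EnemyType
    def __init__(self, posX, posY, enemyType, motto=None):
        self.posX = posX
        self.posY = posY
        self.enemyType = enemyType  # a type-name string in the un-optimized version, an EnemyType in the optimized one
        self.motto = motto          # only used by the un-optimized version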
Now the code (excerpts only) :
Un-optimized program
In this version, each enemy stores its type and motto.
enemyList = []
enemyTypes = {'Soldier': 'Sir, yes sir!', 'Ninja': 'Always behind you !', 'Chief': 'Always behind ... lot of lines '}

for i in range(1000000):
    randomPosX = 0  # random.choice(range(1000))
    randomPosY = 0  # random.choice(range(1000))
    randomTypeIndex = 0  # random.choice(range(0, len(enemyTypes)))
    enemyType = enemyTypes.keys()[randomTypeIndex]
    # Here, the type and motto are parameters of EACH enemy object.
    enemyList.append(Enemy(randomPosX, randomPosY, enemyType, enemyTypes[enemyType]))
Optimized program
In this version, each enemy holds a reference to an EnemyType object that stores its type and motto. Only three instances of EnemyType are created, and I should see the impact in my memory footprint.
enemyList = []
soldierEnemy = EnemyType('Soldier', 'Sir, yes sir!')
ninjaEnemy = EnemyType('Ninja', 'Always behind you !')
chiefEnemy = EnemyType('Chief', 'Always behind ... lot of lines.')
enemyTypes = {'Soldier': soldierEnemy, 'Ninja': ninjaEnemy, 'Chief': chiefEnemy}
enemyCount = {}

for i in range(1000000):
    randomPosX = 0  # random.choice(range(1000))
    randomPosY = 0  # random.choice(range(1000))
    randomTypeIndex = 0  # random.choice(range(0, len(enemyTypes)))
    enemyType = enemyTypes.values()[randomTypeIndex]
    # Here, each enemy only holds a reference to its type.
    enemyList.append(Enemy(randomPosX, randomPosY, enemyType))
Now I'm using this to get my memory footprint (at the very last lines before my application closes itself) :
import os
import psutil
...
# return the memory usage in MB
process = psutil.Process(os.getpid())
print process.get_memory_info()[0] / float(2 ** 20)
My problem is that I don't see any difference between the outputs of my two programs:
Optimized = 384.0859375 MB
Un-optimized = 383.40234375 MB
Is this the proper tool to measure the memory footprint? I'm new to Python, so it could be a problem with my code, but I checked my EnemyType objects in the second solution and I indeed have only three instances. I should therefore have 3 motto strings instead of 1'000'000.
I've read about a tool called Heapy for Python, would it be more accurate here ?
As far as I could tell from the code in the question, in both cases you're just using references for the same small number of instances anyway.
Take this line from the "unoptimized" version:
enemyList.append(Enemy(randomPosX, randomPosY, enemyType, enemyTypes[enemyType]))
Indeed, enemyTypes[enemyType] is a string, which might make you think you have many string instances. But in reality, each of your objects holds a reference to one of the same three string objects.
You can check this by comparing the ids of the members: make a set of the ids and see whether it is larger than 3.
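A quick way to run that check (a small sketch assuming the un-optimized Enemy stores the motto in an attribute named motto, and that enemyList from the excerpts is in scope):

# collect the identity of the motto string held by every enemy;
# if the strings were really duplicated, this set would be huge
motto_ids = set(id(enemy.motto) for enemy in enemyList)
print(len(motto_ids))  # expected: at most 3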
I have a pickled Python object that produces a 180 MB file. When I unpickle it, the memory usage explodes to 2 or 3 GB. Do you have similar experience? Is this normal?
The object is a tree built from dictionaries: each edge is a letter, and each node is a potential word. So to store a word you need as many edges as the word has letters. The first level has at most 26 nodes, the second 26^2, the third 26^3, and so on. For each node that is a word, I have an attribute pointing to the information about the word (verb, noun, definition, etc.).
I have words of about 40 characters maximum and around half a million entries. Everything goes fine until I pickle (using a simple cPickle dump): it gives a 180 MB file.
I am on Mac OS, and when I unpickle those 180 MB, the OS gives 2 or 3 GB of "memory / virtual memory" to the Python process :(
I don't see any recursion in this tree: the edges hold nodes, which themselves hold arrays of arrays. No recursion involved.
I am a bit stuck: loading these 180 MB takes around 20 s (not even speaking about the memory issue). I have to say my CPU is not that fast: a Core i5 at 1.3 GHz. But my hard drive is an SSD. I only have 4 GB of memory.
To add these 500,000 words to my tree, I read about 7,000 files, each containing about 100 words. This reading makes the memory allocated by Mac OS go up to 15 GB, mainly virtual memory :( I have been using the "with" statement to ensure each file is closed, but it doesn't really help. Reading a file takes around 0.2 s for 40 KB, which seems quite long to me. Adding it to the tree is much faster (0.002 s).
Finally I wanted to make an object database, but I guess Python is not suitable for that. Maybe I will go with MongoDB :(
class Trie():
    """
    Class to store known entities / word / verbs...
    """
    longest_word = -1
    nb_entree = 0

    def __init__(self):
        self.children = {}
        self.isWord = False
        self.infos = []

    def add(self, orthographe, entree):
        """
        Store a string with the given type and definition in the Trie structure.
        """
        if len(orthographe) > Trie.longest_word:
            Trie.longest_word = len(orthographe)

        if len(orthographe) == 0:
            self.isWord = True
            self.infos.append(entree)
            Trie.nb_entree += 1
            return True

        car = orthographe[0]
        if car not in self.children:
            self.children[car] = Trie()
        self.children[car].add(orthographe[1:], entree)
Python objects, especially on a 64-bit machine, are very big. When pickled, an object gets a very compact representation that is suitable for a disk file. Here's an example of a disassembled pickle:
>>> pickle.dumps({'x':'y','z':{'x':'y'}},-1)
'\x80\x02}q\x00(U\x01xq\x01U\x01yq\x02U\x01zq\x03}q\x04h\x01h\x02su.'
>>> pickletools.dis(_)
0: \x80 PROTO 2
2: } EMPTY_DICT
3: q BINPUT 0
5: ( MARK
6: U SHORT_BINSTRING 'x'
9: q BINPUT 1
11: U SHORT_BINSTRING 'y'
14: q BINPUT 2
16: U SHORT_BINSTRING 'z'
19: q BINPUT 3
21: } EMPTY_DICT
22: q BINPUT 4
24: h BINGET 1
26: h BINGET 2
28: s SETITEM
29: u SETITEMS (MARK at 5)
30: . STOP
As you can see, it is very compact. Nothing is repeated when it can be avoided.
When in memory, however, an object consists of a fairly sizable number of pointers. Let's ask Python how big an empty dictionary is (64-bit machine):
>>> {}.__sizeof__()
248
Wow! 248 bytes for an empty dictionary! Note that the dictionary comes pre-allocated with room for up to eight elements. However, you pay the same memory cost even if you have one element in the dictionary.
A class instance contains one dictionary to hold the instance variables. Your tries have an additional dictionary for the children. So, each instance costs you nearly 500 bytes. With an estimated 2-4 million Trie objects, you can easily see where your memory usage comes from.
You can mitigate this a bit by adding __slots__ to your Trie to eliminate the instance dictionary. You'll probably save about 750 MB by doing this (my guess). It will prevent you from adding more attributes to a Trie, but that is probably not a huge problem.
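A minimal sketch of that change, matching the attribute names in the code from the question (note that the class must inherit from object for __slots__ to have any effect on Python 2):

class Trie(object):
    # instances no longer carry a per-instance __dict__, only these three attributes
    __slots__ = 'children', 'isWord', 'infos'

    longest_word = -1
    nb_entree = 0

    def __init__(self):
        self.children = {}
        self.isWord = False
        self.infos = []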
Do you really need to load or dump all of it into memory at once? If you don't need all of it in memory, but only the selected parts you want at any given time, you may want to map your dictionary to a set of files on disk instead of a single file… or map the dict to a database table. So, if you are looking for something that saves large dictionaries of data to disk or to a database, and can utilize pickling and encoding (codecs and hashmaps), then you might want to look at klepto.
klepto provides a dictionary abstraction for writing to a database, including treating your filesystem as a database (i.e. writing the entire dictionary to a single file, or writing each entry to its own file). For large data, I often choose to represent the dictionary as a directory on my filesystem and have each entry be a file. klepto also offers caching algorithms, so if you are using a filesystem backend for the dictionary you can avoid some speed penalty by utilizing memory caching.
>>> from klepto.archives import dir_archive
>>> d = {'a':1, 'b':2, 'c':map, 'd':None}
>>> # map a dict to a filesystem directory
>>> demo = dir_archive('demo', d, serialized=True)
>>> demo['a']
1
>>> demo['c']
<built-in function map>
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> # is set to cache to memory, so use 'dump' to dump to the filesystem
>>> demo.dump()
>>> del demo
>>>
>>> demo = dir_archive('demo', {}, serialized=True)
>>> demo
dir_archive('demo', {}, cached=True)
>>> # demo is empty, load from disk
>>> demo.load()
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> demo['c']
<built-in function map>
>>>
klepto also has other flags such as compression and memmode that can be used to customize how your data is stored (e.g. compression level, memory map mode, etc).
It's equally easy (the same exact interface) to use a (MySQL, etc) database as a backend instead of your filesystem. You can also turn off memory caching, so every read/write goes directly to the archive, simply by setting cached=False.
klepto also provides a lot of caching algorithms (like mru, lru, lfu, etc.) to help you manage your in-memory cache, and will use the algorithm to do the dump and load to the archive backend for you.
You can use the flag cached=False to turn off memory caching completely and read and write directly to and from disk or database. If your entries are large enough, you might choose to write to disk, putting each entry in its own file. Here's an example that does both.
>>> from klepto.archives import dir_archive
>>> # does not hold entries in memory, each entry will be stored on disk
>>> demo = dir_archive('demo', {}, serialized=True, cached=False)
>>> demo['a'] = 10
>>> demo['b'] = 20
>>> demo['c'] = min
>>> demo['d'] = [1,2,3]
However, while this should greatly reduce load time, it might slow overall execution down a bit… it's usually better to specify the maximum amount to hold in the memory cache and pick a good caching algorithm. You have to play with it to find the right balance for your needs.
Get klepto here: https://github.com/uqfoundation