Space-efficient (de)serialise array of MyClass instances - python

My object looks like this:
class Note(object):
    def __init__(self, note, vel, t, tOff=0):
        self.note = note  # ubyte
        self.vel = vel    # ubyte
        self.t = t        # float
        self.tOff = tOff  # float
(The type indications show the precision required for each field, rather than how Python is actually storing these!)
My program constructs an array of possibly several thousand Notes.
I need to convert this array into a string, so that I can AJAX it to the server for storage (and subsequently retrieve and convert back to the original data structure).
I'm using Brython, which implements Python's JSON library (I've tested: import json works), so I suspect JSON is my best bet.
But Brython is not a full CPython implementation, so I probably can't import non-core libraries. And it looks as though I can't do anything fancy like use slots to make for a storage-efficient class. (Brython maps Python constructs onto appropriate JavaScript constructs).
In theory I should be able to get each note down to 10 bytes, but I am aiming for lean code offering reasonably compact storage rather than ultimate compactness. I would however like to avoid massive inefficiency such as storing each note as a set of key-value pairs, i.e. the keys would be duplicated for every note.
If I could see the range of solutions available to me, I could choose an appropriate complexity vs compactness trade-off. That is to say, I would be grateful for an answer anywhere on the continuum.
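(As a baseline at the simple end of that continuum, a sketch using only the standard json module: serialise each note as a positional list rather than a dict, so the field names are never repeated. The helper names notes_to_json/notes_from_json are illustrative, not from the question.)

import json

def notes_to_json(notes):
    # Each note becomes [note, vel, t, tOff]; keys are not repeated per note.
    return json.dumps([[n.note, n.vel, n.t, n.tOff] for n in notes])

def notes_from_json(s):
    return [Note(*fields) for fields in json.loads(s)]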

A quick test using struct seems to give you a possible length of 12 bytes as follows:
import struct
class Note(object):
    def __init__(self, note, vel, t, tOff=0):
        self.note = note  # ubyte
        self.vel = vel    # ubyte
        self.t = t        # float
        self.tOff = tOff  # float

    def pack(self):
        return struct.pack("BBff", self.note, self.vel, self.t, self.tOff)

    def unpack(self, packed):
        self.note, self.vel, self.t, self.tOff = struct.unpack("BBff", packed)
note = Note(10, 250, 2.9394286605624826e+32, 1.46971433028e+32)
packed = note.pack()
print "Packed length:", len(packed)
note2 = Note(0,0,0)
note2.unpack(packed)
print note2.note, note2.vel, note2.t, note2.tOff
This displays:
Packed length: 12
10 250 2.93942866056e+32 1.46971433028e+32
You might be able to further compact it depending on the type of floats you need, i.e. is some kind of fixed point possible?
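(To illustrate the fixed-point idea, a sketch that stores both times as unsigned 32-bit millisecond counts, giving exactly 10 bytes per note with the "<" unaligned format. The millisecond scale, and the assumption that the times fit in roughly 49 days of milliseconds, are mine, not the question's.)

import struct

SCALE = 1000  # assumed fixed-point scale: times stored as integer milliseconds

def pack_fixed(note):
    # "<BBII" packs without padding: 1 + 1 + 4 + 4 = 10 bytes per note
    return struct.pack("<BBII", note.note, note.vel,
                       int(note.t * SCALE), int(note.tOff * SCALE))

def unpack_fixed(packed):
    n, v, t, tOff = struct.unpack("<BBII", packed)
    return Note(n, v, t / float(SCALE), tOff / float(SCALE))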
To pack a list of notes, something like the following could be used:
notes = [1,2,3,4,5,6,7,8,9,10]
print struct.pack("{}B".format(len(notes)), *notes)
But the unpack then needs to use the same length. Or you could add a length byte to aid with unpacking:
notes = [1,2,3,4,5,6,7,8,9,10]
packed = struct.pack("B{}B".format(len(notes)), len(notes), *notes)
length = struct.unpack("B", packed[0])[0]
print struct.unpack("{}B".format(length), packed[1:])
This would display the correctly unpacked data:
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
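(Putting the two together, a sketch of how a whole list of Note objects might be framed with a 4-byte count prefix; pack_notes/unpack_notes are hypothetical helpers built on the pack/unpack methods shown above.)

import struct

def pack_notes(notes):
    # 4-byte count followed by one "BBff" record per note
    return struct.pack("I", len(notes)) + b"".join(n.pack() for n in notes)

def unpack_notes(data):
    count = struct.unpack("I", data[:4])[0]
    record = struct.calcsize("BBff")  # 12 bytes with native alignment
    notes = []
    for i in range(count):
        chunk = data[4 + i * record : 4 + (i + 1) * record]
        n = Note(0, 0, 0)
        n.unpack(chunk)
        notes.append(n)
    return notes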

Related

How to store 25M 3-D int tuple with python?

As a hobby, I am trying to code a basic game in Python, and I need to store a map of the game world. It can be viewed as a 2-D array storing heights. The point is, for the moment, my map dimensions are 5000x5000.
I store that in a sqlite db (schema: CREATE TABLE map (x SMALLINT, y SMALLINT, h SMALLINT); plus a VACUUM at the end of the creation), but it takes up to 500MB on the disk.
I can compress (lzma, for example) the sqlite file, and it then only takes ~35-40MB, but in order to use it in Python I need to decompress it first, so it always ends up taking that much space.
How would you store that kind of data in Python?
A 2-D array of ints, or a list of 3-int tuples, of those dimensions (or bigger), and it should still run on a Raspberry Pi? Speed is not important, but RAM and file size are.
You need 10 bits to store each height, so 10 bytes can store 8 heights, and thus 31.25 MB can store all 25,000,000 of them. You can figure out which block of 10 bytes stores a desired height (how depends on how you arrange them), and a little bit-shifting can isolate the specific height you want (since every height will be split across 2 adjacent bytes).
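(A sketch of that extraction, assuming the 10-bit heights are packed LSB-first into a bytearray in row-major order; the exact layout is an assumption, since the answer leaves the arrangement up to you.)

BITS = 10

def get_height(packed, index):
    # Locate the bit offset of the 10-bit field, then pull it out of the
    # two bytes it straddles (a 16-bit window is always enough here).
    bit = index * BITS
    byte, shift = divmod(bit, 8)
    window = packed[byte] | (packed[byte + 1] << 8)
    return (window >> shift) & ((1 << BITS) - 1)

def set_height(packed, index, value):
    bit = index * BITS
    byte, shift = divmod(bit, 8)
    window = packed[byte] | (packed[byte + 1] << 8)
    window &= ~(((1 << BITS) - 1) << shift)          # clear the old field
    window |= (value & ((1 << BITS) - 1)) << shift   # write the new one
    packed[byte] = window & 0xFF
    packed[byte + 1] = (window >> 8) & 0xFF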
I finally used the HDF5 file format, with pyTables.
The outcome is a ~20MB file for the exact same data, directly usable by the application.
Here is how I create it:
import tables

db_struct = {
    'x': tables.Int16Col(),
    'y': tables.Int16Col(),
    'h': tables.Int16Col()
}

h5file = tables.open_file("my_file.h5", mode="w", title='Map')
filters = tables.Filters(complevel=9, complib='lzo')
group = h5file.create_group('/', 'group', 'Group')
table = h5file.create_table(group, 'map', db_struct, filters=filters)
heights = table.row

for y in range(0, int(MAP_HEIGHT)):
    for x in range(0, int(MAP_WIDTH)):
        heights['x'] = x
        heights['y'] = y
        heights['h'] = h  # h is the height computed for (x, y) elsewhere in the program
        heights.append()
    table.flush()

table.flush()
h5file.close()
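(For completeness, a sketch of how the table might be read back later; the node path follows the group and table names used above.)

import tables

h5file = tables.open_file("my_file.h5", mode="r")
table = h5file.root.group.map   # group 'group', table 'map', as created above
heights = table.read()          # structured NumPy array with fields x, y, h
# e.g. rebuild a dict keyed by coordinate:
height_at = {(row['x'], row['y']): row['h'] for row in heights}
h5file.close()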

Python turn array of booleans to binary

I am preparing a new driver for one of our new hardware devices.
One of the options to set it up is a single byte that holds 8 settings in it. Every bit turns something on or off.
So, basically, what I need to do is take 8 zeros or ones and create one byte out of them.
What I did is prepare a helper function for it:
@staticmethod
def setup2byte(setup_array):
    """Turn setup array (of 8 booleans) into byte"""
    data = ''
    for b in setup_array:
        data += str(int(b))
    return int(data, 2)
Called like this:
settings = [echo, reply, presenter, presenter_brake, doors_action, header, ticket_sensor, ext_paper_sensor]
data = self.setup2byte(settings)
packet = "{0:s}{1:s}{2:d}{3:s}".format(CONF_STX, 'P04', data, ETX)
self.queue_command.put(packet)
and I wonder if there is an easier way to do it. Some built-in function or something like that. Any ideas?
I believe you want this:
convert2b = lambda ls: bytes("".join([str(int(b)) for b in ls]), 'utf-8')
Where ls is a list of booleans. Works in python 2.7 and 3.x. Alternative more like your original:
convert2b = lambda ls: int("".join([str(int(b)) for b in ls]), 2)
that's basically what you are already doing, but shorter:
data = int(''.join(['1' if i else '0' for i in settings]), 2)
But here is the answer you are looking for:
Bool array to integer
I think the previous answers created 8 bytes. This solution creates one byte only
settings = [False, True, False, True, True, False, False, True]  # LSB first
integerValue = 0  # init value of your settings
for idx, setting in enumerate(settings):
    integerValue += setting * 2 ** idx

# initialize an empty byte
mybyte = bytearray(b'\x00')
mybyte[0] = integerValue
print(mybyte)
For more examples, visit this great site: binary python
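(For what it's worth, a shorter route to the same single byte on Python 3, assuming the settings list is LSB-first as in the example above.)

settings = [False, True, False, True, True, False, False, True]
value = sum(bit << idx for idx, bit in enumerate(settings))  # LSB first
single_byte = bytes([value])                                 # b'\x9a'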

Python memoryerror creating large dictionary

I am trying to process a 3GB XML file, and am getting a MemoryError in the middle of a loop that reads the file and stores some data in a dictionary.
class Node(object):
    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0

context = cElementTree.iterparse(raw_osm_file, events=("start", "end"))
context = iter(context)
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "node":
        lat = float(elem.get('lat'))
        lon = float(elem.get('lon'))
        osm_id = int(elem.get('id'))
        nodes[osm_id] = Node(osm_id, lat, lon)
        root.clear()
I'm using an iterative parsing method so the issue isn't with reading the file. I just want to store the data in a dictionary for later processing, but it seems the dictionary is getting too large. Later in the program I read in links and need to check if the nodes referenced by the links were in the initial batch of nodes, which is why I am storing them in a dictionary.
How can I either greatly reduce the memory footprint (the script isn't even getting close to finishing, so shaving bits and pieces off won't help much) or greatly increase the amount of memory available to Python? Monitoring the memory usage, it looks like Python is topping out at about 1950 MB, while my computer still has about 6 GB of RAM available.
Assuming you have tons of Nodes being created, you might consider using __slots__ to predefine a fixed set of attributes for each Node. This removes the overhead of storing a per-instance __dict__ (in exchange for preventing the creation of undeclared attributes) and can easily cut memory usage per Node by a factor of ~5x (less on Python 3.3+ where shared key __dict__ reduces the per-instance memory cost for free).
It's easy to do, just change the declaration of Node to:
class Node(object):
    __slots__ = 'osmid', 'latitude', 'longitude', 'count'

    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0
For example, on Python 3.5 (where shared key dictionaries already save you something), the difference in object overhead can be seen with:
>>> import sys
>>> ... define Node without __slots__
>>> n = Node(1,2,3)
>>> sys.getsizeof(n) + sys.getsizeof(n.__dict__)
248
>>> ... define Node with __slots__
>>> n = Node(1,2,3)
>>> sys.getsizeof(n) # It has no __dict__ now
72
And remember, this is Python 3.5 with shared key dictionaries; in Python 2, the per-instance cost with __slots__ would be similar (one pointer sized variable larger IIRC), while the cost without __slots__ would go up by a few hundred bytes.
Also, assuming you're on a 64 bit OS, make sure you've installed the 64 bit version of Python to match the 64 bit OS; otherwise, Python will be limited to ~2 GB of virtual address space, and your 6 GB of RAM counts for very little.
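(If you are unsure which build you have, a quick standard-library check; a 64-bit interpreter reports 64 and '64bit' here.)

import platform
import struct

print(struct.calcsize("P") * 8)    # pointer size in bits: 64 on a 64-bit build, 32 otherwise
print(platform.architecture()[0])  # e.g. '64bit'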

Flyweight pattern - Memory footprint

I'm learning Python and I thought it would be a nice excuse to refresh my pattern knowledge, in this case the Flyweight pattern.
I created two small programs, one that is not optimized and one that implements the Flyweight pattern. For my test purposes, I'm creating an army of 1'000'000 Enemy objects. Each enemy can be one of three types (Soldier, Ninja or Chief) and I assign a motto to each type.
What I would like to check is that, with my un-optimized program, I get 1'000'000 enemies each storing its own type and a "long" string containing the motto.
With the optimized code, I'd like to create only three EnemyType objects, one per type, so the motto strings are stored only three times. Then I add a member to each Enemy pointing to the desired EnemyType.
Now the code (excerpts only) :
Un-optimized program
In this version, each enemy stores its type and motto.
enemyList = []
enemyTypes = {'Soldier': 'Sir, yes sir!',
              'Ninja': 'Always behind you !',
              'Chief': 'Always behind ... lot of lines '}

for i in range(1000000):
    randomPosX = 0  # random.choice(range(1000))
    randomPosY = 0  # random.choice(range(1000))
    randomTypeIndex = 0  # random.choice(range(0, len(enemyTypes)))
    enemyType = enemyTypes.keys()[randomTypeIndex]
    # Here, the type and motto are parameters of EACH enemy object.
    enemyList.append(Enemy(randomPosX, randomPosY, enemyType, enemyTypes[enemyType]))
Optimized program
In this version, each enemy holds a reference to an EnemyType object that stores its type and motto. Only three instances of EnemyType are created, and I should see the impact in my memory footprint.
enemyList = []
soldierEnemy = EnemyType('Soldier', 'Sir, yes sir!')
ninjaEnemy = EnemyType('Ninja', 'Always behind you !')
chiefEnemy = EnemyType('Chief', 'Always behind ... lot of lines.')
enemyTypes = {'Soldier': soldierEnemy, 'Ninja': ninjaEnemy, 'Chief': chiefEnemy}
enemyCount = {}

for i in range(1000000):
    randomPosX = 0  # random.choice(range(1000))
    randomPosY = 0  # random.choice(range(1000))
    randomTypeIndex = 0  # random.choice(range(0, len(enemyTypes)))
    enemyType = enemyTypes.values()[randomTypeIndex]
    # Here, each enemy only has a reference to its type.
    enemyList.append(Enemy(randomPosX, randomPosY, enemyType))
Now I'm using this to get my memory footprint (at the very last lines before my application closes itself) :
import os
import psutil
...
# return the memory usage in MB
process = psutil.Process(os.getpid())
print process.get_memory_info()[0] / float(2 ** 20)
My problem is that I don't see any difference between the outputs of my two programs:
Optimized = 384.0859375 MB
Un-optimized = 383.40234375 MB
Is it the proper tool to get the memory footprint? I'm new to Python, so it could be a problem with my code, but I checked my EnemyType objects in the second solution and I do indeed have only three occurrences. I therefore should have 3 motto strings instead of 1'000'000.
I've read about a tool called Heapy for Python, would it be more accurate here ?
As far as I could tell from the code in the question, in both cases you're just using references for the same small number of instances anyway.
Take the "unoptimized" version's:
enemyList.append(Enemy(randomPosX, randomPosY, enemyType, enemyTypes[enemyType]))
Indeed enemyTypes[enemyType] is a string, which might have made you think you have many string instances. But in reality, each of your objects holds a reference to one of the same three string objects.
You can check this by comparing the ids of the members. Make a set of the ids, and see if it is larger than 3.
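(A sketch of that check; 'motto' is a guessed attribute name, since the Enemy class itself isn't shown in the question.)

# If this prints 3 or fewer, the "unoptimized" version is already sharing
# the three motto string objects rather than copying them per enemy.
motto_ids = {id(e.motto) for e in enemyList}  # 'motto' is an assumed attribute name
print(len(motto_ids))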

numpy.choose 32 choice limitation

Lo and behold, I ran into a regression in numpy.choose after upgrading to 1.5.1. Past versions (and Numeric) supported what was, as far as I could tell, an unlimited number of potential choices. The "new" choose is limited to 32. Here is a post where another user laments the regression.
I have a list with 100 choices (0-99) that I was using to modify an array. As a workaround, I am using the following code. Understandably, it is 7 times slower than using choose. I am not a C programmer, and while I would like to get in and fix the numpy issue, I wonder what other potentially faster workarounds exist. Thoughts?
d = {...}  # A dictionary with my keys and their new mappings
for key, value in d.iteritems():
    array[array == key] = value
I gather that d has the keys 0 to 99. In that case, the solution is really simple. First, write the values of d in a NumPy array values, in a way that d[i] == values[i] – this seems to be the natural data structure for these values anyway. Then you can access the new array with the values replaced by
values[array]
If you want to modify array in place, simply do
array[:] = values[array]
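(A small end-to-end sketch of that lookup-table approach, with a made-up 4-entry mapping standing in for the asker's 100-entry dictionary.)

import numpy as np

d = {0: 10, 1: 20, 2: 30, 3: 40}                  # stand-in for the 100-entry mapping
values = np.array([d[i] for i in range(len(d))])  # values[i] == d[i]

array = np.array([[3, 0, 2], [1, 1, 0]])
array[:] = values[array]                          # in-place replacement
print(array)
# [[40 10 30]
#  [20 20 10]]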
In the Numpy documentation, there is an example of what a simplified version of the choose function could look like.
[...] this function is less simple than it might seem from the
following code description (below ndi = numpy.lib.index_tricks):
np.choose(a,c) == np.array([c[a[I]][I] for I in ndi.ndindex(a.shape)]).
See https://docs.scipy.org/doc/numpy/reference/generated/numpy.choose.html
Putting this into a function could look like this:
import numpy

def choose(selector, choices):
    """
    A simplified version of the numpy choose function to work around the 32
    choices limit.
    """
    return numpy.array([choices[selector[idx]][idx]
                        for idx in numpy.lib.index_tricks.ndindex(selector.shape)]
                       ).reshape(selector.shape)
I am not sure how this translates in terms of efficiency, or when exactly it breaks down compared to the numpy.choose function, but it worked fine for me. Note that the patched function assumes that the entries in choices are subscriptable.
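(A quick usage sketch of that simplified choose with 100 choices, assuming each choice is an array of the selector's shape, so it is subscriptable by the index tuples.)

import numpy

selector = numpy.array([0, 5, 42, 99])
choices = [numpy.full(selector.shape, i * 10) for i in range(100)]  # choice i holds the value 10*i

print(choose(selector, choices))
# [  0  50 420 990]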
I'm not sure about efficiency and it's not in-place (nb: I don't use numpy that often - so somewhat rusty):
import numpy as np
d = {0: 5, 1: 3, 2: 20}
data = np.array([[1, 0, 2], [2, 1, 1], [1, 0, 1]])
new_data = np.array([d.get(i, i) for i in data.flat]).reshape(data.shape) # adapt for list/other
When colorizing microscopy images of mouse embryos I ran into a need for a choose implementation where the number of choices was in the hundreds (hundreds of mouse embryo cells).
https://github.com/flatironinstitute/mouse_embryo_labeller
As I was not sure whether the above suggestions were general or fast, I wrote this alternative:
import numpy as np

def big_choose(indices, choices):
    "Alternate to np.choose that supports more than 30 choices."
    indices = np.array(indices)
    if (indices.max() <= 30) or (len(choices) <= 31):
        # optimized fallback
        choices = choices[:31]
        return np.choose(indices, choices)
    result = 0
    while (len(choices) > 0) and not np.all(indices == -1):
        these_choices = choices[:30]
        remaining_choices = choices[30:]
        shifted_indices = indices + 1
        too_large_indices = (shifted_indices > 30).astype(int)  # plain int; np.int is deprecated
        clamped_indices = np.choose(too_large_indices, [shifted_indices, 0])
        choices_with_default = [result] + list(these_choices)
        result = np.choose(clamped_indices, choices_with_default)
        choices = remaining_choices
        if len(choices) > 0:
            indices = indices - 30
            too_small = (indices < -1).astype(int)
            indices = np.choose(too_small, [indices, -1])
    return result
Note that the generalized function uses the underlying implementation when it can.
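(A quick sanity check of big_choose with 100 choices, again assuming each choice broadcasts against the indices array.)

import numpy as np

indices = np.array([0, 31, 64, 99])
choices = [np.full(indices.shape, i) for i in range(100)]  # choice i is the constant i

print(big_choose(indices, choices))
# expected: [ 0 31 64 99]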
