Which database to store very large nested Python dicts? - python

My script produces data in the following format:
dictionary = {
    (.. 42 values: None, 1 or 2 ..): {
        0: 0.4356,  # ints as keys, floats as values
        1: 0.2355,
        2: 0.4352,
        ...
        6: 0.6794
    },
    ...
}
where:
(.. 42 values: None, 1 or 2 ..) is a game state
the inner dict stores the calculated values of the actions that are possible in that state
The problem is that the state space is very big (millions of states), so the whole data structure cannot be stored in memory. That's why I'm looking for a database engine that fits my needs and that I can use from Python. I need to get the list of actions and their values for a given state (the tuple of 42 values mentioned above) and to modify the value of a given action in a given state.

Check out ZODB: http://www.zodb.org/en/latest/
It's a native object database for Python that supports transactions, caching, pluggable storage layers, pack operations (for keeping history) and BLOBs.
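A rough sketch of what that could look like for the state-to-values mapping in the question; the file name and the None-to-0 key encoding are my own choices:
import ZODB, ZODB.FileStorage
import BTrees.OOBTree
import transaction

db = ZODB.DB(ZODB.FileStorage.FileStorage('states.fs'))
connection = db.open()
root = connection.root()

# an OOBTree only loads the buckets you touch, so the whole state
# space never has to sit in memory at once
if 'states' not in root:
    root['states'] = BTrees.OOBTree.OOBTree()

state = (None, 1, 2) + (None,) * 39
key = tuple(0 if v is None else v for v in state)  # BTree keys must be comparable

root['states'][key] = {0: 0.4356, 1: 0.2355}  # action -> value
transaction.commit()
If you later mutate the inner dict in place, reassign it (or use persistent.mapping.PersistentMapping) so ZODB notices the change.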

You can use a key-value store. A good one is Redis. It's very fast and simple, written in C, and it is more than just a key-value cache. Integrating it with Python takes just a few lines of code, and Redis also scales easily to really big data. I have worked in the game industry, so I know what I am talking about.
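A rough sketch with the redis-py client, assuming a local Redis server; storing each state as a hash, keyed by the repr of the state tuple, is my own choice:
import redis

r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

state = (None, 1, 2) + (None,) * 39
key = repr(state)  # any stable serialization of the state tuple works

# set/update the value of a single action in this state
r.hset(key, 3, 0.6794)

# fetch all actions and their values for the state
actions = {int(k): float(v) for k, v in r.hgetall(key).items()}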
Also, as already mentioned here, you can use a more complex solution than a cache: the PostgreSQL database. It now supports a binary JSON field type, JSONB. In my opinion the best Python database ORM is SQLAlchemy, and it supports PostgreSQL out of the box; I will use it in the code below. For example, say you have a table
from sqlalchemy.dialects.postgresql import JSONB

class MobTable(db.Model):
    __tablename__ = 'mobs'

    id = db.Column(db.Integer, primary_key=True)
    stats = db.Column(JSONB, index=True, default={})
If you have a mob with JSON stats such as
{
    "id": 1,
    "title": "UglyOrk",
    "resists": {"cold": 13}
}
you can search for all mobs whose cold resist is not null:
expr = MobTable.stats[("resists", "cold")]
q = (session.query(MobTable.id, expr.label("cold_protected"))
     .filter(expr != None)
     .all())

I recommend you use HDF5. It's a binary data format with excellent Python support (for example through the h5py and PyTables libraries), and the binary storage reduces the size of the stored data to a great extent. More importantly, it gives you random access, which I believe serves your purposes. Also, if you do not use any compression, you will retrieve the data at the highest possible speed.
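A rough sketch with h5py; the file name, dataset names, the None-to-0 encoding and the fixed sizes are my own assumptions, and you would still need a separate index mapping each state tuple to its row number:
import h5py

n_states, n_actions = 1000000, 7

with h5py.File('states.h5', 'w') as f:
    # one row per state: the 42-value key, with None encoded as 0
    states = f.create_dataset('states', shape=(n_states, 42), dtype='i1')
    # one row per state: the calculated value of each action
    values = f.create_dataset('values', shape=(n_states, n_actions), dtype='f8')

    # encode a state (None -> 0) and store it in row 12345
    states[12345] = [0 if v is None else v for v in (None, 1, 2) + (None,) * 39]

    # random access: update the value of action 3 in state row 12345...
    values[12345, 3] = 0.6794
    # ...and read all action values for that state
    print(values[12345, :])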

You can also store it as JSONB in a PostgreSQL database.
For connecting to PostgreSQL you can use psycopg2, which is compliant with the Python Database API Specification v2.0.
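A minimal sketch with psycopg2, assuming PostgreSQL 9.5+ (for the upsert), a local database named game, and table/column names of my own choosing:
import json
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect(dbname='game')
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS states (
        state   TEXT PRIMARY KEY,  -- serialized 42-value tuple
        actions JSONB NOT NULL     -- {"0": 0.4356, "1": 0.2355, ...}
    )
""")

state_key = json.dumps([None, 1, 2] + [None] * 39)
cur.execute(
    "INSERT INTO states (state, actions) VALUES (%s, %s) "
    "ON CONFLICT (state) DO UPDATE SET actions = EXCLUDED.actions",
    (state_key, Json({0: 0.4356, 1: 0.2355})))
conn.commit()

# fetch the action values for a given state
cur.execute("SELECT actions FROM states WHERE state = %s", (state_key,))
print(cur.fetchone()[0])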

Related

PickleType hysteresis: PickleType column sometimes doesn't auto-pickle list, sometimes does

I'm trying to store Python objects (including lists) as blobs in PickleType columns.
re PickleType:
| PickleType builds upon the Binary type to apply Python's
| ``pickle.dumps()`` to incoming objects, and ``pickle.loads()`` on
| the way out, allowing any pickleable Python object to be stored as
| a serialized binary field.
Two ways to populate record fields: the first works, the second fails
When I insert records with (correctly-mapped) fields using the explicit table definitions used to generate the table schema, all works fine -- lists are auto-pickled by the PickleType Column and stored in the table as blobs, every record is submitted as I expect it.
(By "explicit table definitions used to generate the table schema" I mean:
class MyTable1(base_object):
__tablename__ = 'My Table 1'
#
Day = Column(Integer, primary_key= True)
Field1 = Column(PickleType)
Field2 = Column(PickleType)
)
HOWEVER, even with the exact same data in all the same forms, if I...:
1. Declare the database (generate the schema)
2. Then use automap_base(metadata=MetaData(engine)) to extract the db's schema/metadata
3. Use the table definitions stored in the automap_base object (automapBaseObject.classes['My Table']) to generate and submit new (or even any) records. ===> FAILS specifically on PickleType columns populated with list objects. Records submitted to the same columns without any lists work perfectly fine.
(Submitting a float to a PickleType column also failed, but I changed such columns to be Float/VARCHAR instead, circumventing this problem.)
e.g. Step 3.
MyTable1 = automapBaseObject.classes['My Table 1']
record = MyTable1(**{'Day': 1, 'Field1': [1, 2], 'Field2': [np.arange(10), np.arange(20)]})
session.add(record)
session.commit()
>>> sqlalchemy.exc.StatementError: (builtins.TypeError) memoryview: a bytes-like object is required, not 'list'
Two issues here:
Why does SQLAlchemy's PickleType behave differently when populating fields with simple native Python types using the original schema definition tables (resulting in the expected behavior) vs. when using the automap_base retrieved tables?
Why does PickleType not auto-pickle lists, floats, et al in the automap_base() case? (It seems like the error produced comes from the object not being pickled, then being submitted via sqlite where it expects the object to be a binary type but is not and thus fails).
I guess a way to avoid this problem is to manually pickle all the list fields. But that seems clunky and I imagine there'd be a way to turn off this type-determination-before-deciding-to-pickle behavior. That is, ignoring the hysteresis to begin with.
PickleType columns are created as BLOB columns in SQLite. Dumping and loading of objects is handled at the Python level. Automap uses reflection to build its models, and by default it has no way of knowing that some BLOB columns in the database should be PickleType columns in a model.
The simplest way to get round this is to provide automap with metadata that contains the appropriate type mappings. The base_object has such a metadata attribute.
This should work:
auto_base = automap_base(metadata=base_object.metadata)
auto_base.prepare(engine, reflect=True)
This is documented at Generating Mappings from an Existing MetaData.
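With that in place, the reflected class should pick up the PickleType columns again, so the insert from the question ought to work; roughly (reusing the names from the question, untested):
MyTable1 = auto_base.classes['My Table 1']
record = MyTable1(Day=1, Field1=[1, 2], Field2=[np.arange(10), np.arange(20)])
session.add(record)
session.commit()  # the lists are pickled into the BLOB columns as expected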

Store dictionary in database

I created a Berkeley database and operate on it using the bsddb module. I need to store information in it in a style like this:
username = '....'
notes = {
    'name_of_note1': {
        'password': '...',
        'comments': '...',
        'title': '...'
    },
    'name_of_note2': {
        # same keys as above, but other values
    }
}
This is how I open the database:
db = bsddb.btopen['data.db','c']
How do I do that?
So, first, I guess you should open your database using parentheses:
db = bsddb.btopen('data.db','c')
Keep in mind that Berkeley's pattern is key -> value, where both key and value are string objects (not unicode). The best way in your case would be to use:
db[str(username)] = json.dumps(notes)
since your notes are compatible with the JSON syntax.
However, this is not a very good choice if, say, you want to query only the users' comments. For that you should use a relational database, such as SQLite, which is also built into Python.
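A minimal sketch of that relational layout with the built-in sqlite3 module; the file, table and column names are my own:
import sqlite3

conn = sqlite3.connect('notes.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS notes (
        username TEXT NOT NULL,
        name     TEXT NOT NULL,
        password TEXT,
        comments TEXT,
        title    TEXT,
        PRIMARY KEY (username, name)
    )
""")
conn.execute(
    "INSERT INTO notes VALUES (?, ?, ?, ?, ?)",
    ('alice', 'name_of_note1', 'secret', 'some comment', 'a title'))
conn.commit()

# query only the comments of one user
for (comment,) in conn.execute(
        "SELECT comments FROM notes WHERE username = ?", ('alice',)):
    print(comment)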
A simple solution was described by @Falvian.
For a start, there is also a column pattern for ordered key/value stores, so the plain key/value pattern is not the only one.
I think that bsddb is a viable solution when you don't want to rely on sqlite. The first approach is to create a documents = bsddb.btopen('documents.db', 'c') store and put JSON values inside it. Regarding the keys you have several options:
Name the keys yourself, like you do "name_of_note_1", "name_of_note_2"
Generate random identifiers using uuid.uuid4 (don't forget to check it's not already used ;)
Or use a row inside this documents database with key 0 to store a counter that you will use to create uids (unique identifiers).
If you use integers as keys, don't forget to pack them with something like lambda uid: struct.pack('>q', uid) before storing them (see the sketch below).
If you need to create an index, I recommend you have a look at my other answer introducing composite keys to build indices in bsddb.
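A small sketch of the counter-plus-packed-keys idea mentioned above; the file name is illustrative and this assumes the legacy bsddb (or bsddb3) module is available:
import json
import struct
import bsddb  # bsddb3 on Python 3

documents = bsddb.btopen('documents.db', 'c')

def pack(uid):
    # big-endian 8-byte integers keep the B-tree ordered numerically
    return struct.pack('>q', uid)

# key 0 holds the counter used to hand out unique identifiers
if pack(0) in documents:
    counter = struct.unpack('>q', documents[pack(0)])[0]
else:
    counter = 0
counter += 1
documents[pack(0)] = pack(counter)

# store a document under the freshly generated uid
documents[pack(counter)] = json.dumps({'title': '...', 'comments': '...'}).encode('utf-8')
documents.close()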

Creating a table in python

I want to build a table in python with three columns and later on fetch the values as necessary.
I am thinking dictionaries are the best way to do it, with a key mapping to two values.
| Column 1 | Column 2    | Column 3 |
| MAC      | Port Number | DPID     |
| Key      | Value 1     | Value 2  |
Proposed way:
# define a global learning table
globe_learning_table = defaultdict(set)

# add port number and dpid of a switch based on its MAC address as a key
# packet.src will give you the MAC address in this case
globe_learning_table[packet.src].add(event.port)
globe_learning_table[packet.src].add(dpid_to_str(connection.dpid))

# getting the value of DPID based on its MAC address
globe_learning_table[packet.src][????]
I am not sure, if one key points to two values, how I can get the particular value associated with that key.
I am open to use any another data structure as well, if it can build this dynamic table and give me the particular values when necessary.
Why a dictionary? Why not a list of named tuples, or a collection (list, dictionary) of objects from some class which you define (with attributes for each column)?
What's wrong with:
class myRowObj(object):
    def __init__(self, mac, port, dpid):
        self.mac = mac
        self.port = port
        self.dpid = dpid

myTable = list()
for each in some_inputs:
    myTable.append(myRowObj(*each.split()))
... or something like that?
(Note: myTable can be a list, or a dictionary or whatever is suitable to your needs. Obviously if it's a dictionary then you have to ask what sort of key you'll use to access these "rows").
The advantage of this approach is that your "row objects" (which you'd name in some way that makes more sense in your application domain) can implement whatever semantics you choose. These objects can validate and convert any values supplied at instantiation, compute any derived values, etc. You can also define string and code representations of your object (implicit conversions for when one of your rows is used as a string, or for certain types of development, debugging, and serialization; the __str__ and __repr__ special methods, for example).
The named tuples (added in Python 2.6) are a sort of lightweight object class which can offer some performance advantages and lighter memory footprint over normal custom classes (for situations where you only want the named fields without binding custom methods to these objects, for example).
Something like this perhaps?
>>> PortDpidPair = collections.namedtuple("PortDpidPair", ["port", "dpid"])
>>> global_learning_table = collections.defaultdict(PortDpidPair)
>>> global_learning_table["ff:" * 7 + "ff"] = PortDpidPair(80, 1234)
>>> global_learning_table
defaultdict(<class '__main__.PortDpidPair'>, {'ff:ff:ff:ff:ff:ff:ff:ff': PortDpidPair(port=80, dpid=1234)})
>>>
Named tuples might be appropriate for each row, but depending on how large this table is going to be, you may be better off with a sqlite db or something similar.
If it is small enough to store in memory and you want it to be a data structure, you could create a class that contains Values 1 & 2 and use that as the value in your dictionary mapping.
However, as Mr E pointed out, it is probably better design to use a database to store the information and retrieve it as necessary from there. This will likely not result in a significant performance loss.
Another option to keep in mind is an in-memory SQLite table. See the Python sqlite3 docs for a basic example:
http://docs.python.org/2/library/sqlite3.html
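A minimal sketch of the in-memory variant; the column layout follows the table in the question:
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE learning (mac TEXT PRIMARY KEY, port INTEGER, dpid TEXT)")
conn.execute("INSERT INTO learning VALUES (?, ?, ?)", ('00:00:00:00:00:01', 23, '43'))

# look up the DPID for a given MAC address
dpid, = conn.execute(
    "SELECT dpid FROM learning WHERE mac = ?", ('00:00:00:00:00:01',)).fetchone()
print(dpid)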
I think you're getting two distinct objectives mixed up. You want a representative data structure, and (as I read it) you want to print it in a readable form. What gets printed as a table is not stored internally in the computer in two dimensions; the table presentation is a visual metaphor.
Assuming I'm right about what you want to accomplish, the way I'd go about it is by a) keeping it simple and b) using the right modules to save effort.
The simplest data structure that correctly represents your information is in my opinion a dictionary within a dictionary. Like this:
foo = {'00:00:00:00:00:00': {'port': 22, 'dpid': 42},
       '00:00:00:00:00:01': {'port': 23, 'dpid': 43}}
The best module I have found for quick and dirty table printing is prettytable. Your code would look something like this:
from prettytable import PrettyTable

foo = {'00:00:00:00:00:00': {'port': 22, 'dpid': 42},
       '00:00:00:00:00:01': {'port': 23, 'dpid': 43}}

t = PrettyTable(['MAC', 'Port', 'dpid'])
for row in foo:
    t.add_row([row, foo[row]['port'], foo[row]['dpid']])
print t

Loading a huge Python Pickle dictionary

I generated with pickle.dump() a file about 5 GB in size. It takes about half a day to load this file and about 50 GB of RAM. My question is whether it is possible to read this file entry by entry (one at a time) rather than loading it all into memory, or if you have any other suggestion for how to access data in such a file.
Many thanks.
There is absolutely no question that this should be done using a database rather than pickle; databases are designed for exactly this kind of problem.
Here is some code to get you started, which puts a dictionary into a SQLite database and shows an example of retrieving a value. To get this to work with your actual dictionary rather than my toy example, you'll need to learn more about SQL, but fortunately there are many excellent resources available online. In particular, you might want to learn how to use SQLAlchemy, which is an "Object Relational Mapper" that can make working with databases as intuitive as working with objects.
import sqlite3

# an enormous dictionary too big to be stored in pickle
my_huge_dictionary = {"A": 1, "B": 2, "C": 3, "D": 4}

# create a database in the file my.db
conn = sqlite3.connect('my.db')
c = conn.cursor()

# Create table with two columns: k and v (for key and value). Here your key
# is assumed to be a string of length 10 or less, and your value is assumed
# to be an integer. I'm sure this is NOT the structure of your dictionary;
# you'll have to read into SQL data types
c.execute("""
    create table dictionary (
        k char[10] NOT NULL,
        v integer NOT NULL,
        PRIMARY KEY (k))
""")

# dump your enormous dictionary into the database. This will take a while for
# your large dictionary, but you should do it only once, and then in the future
# make changes to your database rather than to a pickled file.
for k, v in my_huge_dictionary.items():
    c.execute("insert into dictionary VALUES (?, ?)", (k, v))
conn.commit()

# retrieve a value from the database
my_key = "A"
c.execute("select v from dictionary where k = ?", (my_key,))
my_value = c.fetchone()[0]
print my_value
Good luck!
You could try an object-oriented database if your data is heterogeneous. ZODB internally uses pickle, but in a way that is designed, and time-proven, to manage large amounts of data, and you will probably need only small changes to your application.
ZODB is the heart of Zope, a Python application server, which today powers Plone among other applications.
It can be used stand-alone, without all of Zope's tools. You should check it out, doubly so if your data is not a good fit for SQL.
http://www.zodb.org/

Is it possible to save a list of values into a SQLite column?

I want 3 columns to have 9 different values, like a list in Python.
Is it possible? If not in SQLite, then on another database engine?
You must serialize the list (or other Python object) into a string of bytes, a.k.a. a "BLOB" ;-), through your favorite means (marshal is good for lists of elementary values such as numbers or strings, cPickle if you want a very general solution, etc.), and deserialize it when you fetch it back. Of course, that basically carries the list (or other Python object) as a passive "payload": you can't meaningfully use it in WHERE clauses, ORDER BY, etc.
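A minimal sketch of that serialize/deserialize round trip with sqlite3 and pickle; the table name is my own:
import pickle
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE rows (id INTEGER PRIMARY KEY, payload BLOB)")

values = [3, 6, 9, 2, 4, 8, 1, 5, 7]

# serialize the list into a binary payload on the way in...
conn.execute("INSERT INTO rows (id, payload) VALUES (?, ?)",
             (1, sqlite3.Binary(pickle.dumps(values))))

# ...and deserialize it on the way out
blob, = conn.execute("SELECT payload FROM rows WHERE id = ?", (1,)).fetchone()
print(pickle.loads(blob))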
Relational databases just don't deal all that well with non-atomic values and prefer other, normalized alternatives (store the list's items in a different table that includes a "listID" column, put the "listID" in your main table, etc.). NON-relational databases, while they typically have limitations with respect to relational ones (e.g., no joins), may offer more direct support for your requirement.
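A sketch of that normalized alternative, with a separate items table keyed by a list id; the schema names are my own:
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE lists (list_id INTEGER PRIMARY KEY);
    CREATE TABLE list_items (
        list_id  INTEGER NOT NULL REFERENCES lists(list_id),
        position INTEGER NOT NULL,
        value    INTEGER NOT NULL,
        PRIMARY KEY (list_id, position)
    );
""")

conn.execute("INSERT INTO lists (list_id) VALUES (1)")
conn.executemany(
    "INSERT INTO list_items (list_id, position, value) VALUES (1, ?, ?)",
    list(enumerate([3, 6, 9, 2, 4, 8, 1, 5, 7])))

# the items can now be used in WHERE / ORDER BY, unlike a pickled blob
rows = conn.execute(
    "SELECT value FROM list_items WHERE list_id = 1 ORDER BY position")
print([v for (v,) in rows])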
Some relational DBs do have non-relational extensions. For example, PostgreSQL supports an array data type (not quite as general as Python's lists; PostgreSQL's arrays are intrinsically homogeneous).
Generally, you do this by stringifying the list (with repr()), and then saving the string. On reading the string from the database, use eval() to re-create the list. Be careful, though, that you are certain no user-generated data can get into the column, or the eval() is a security risk.
Your question is difficult to understand. Here it is again:
I want 3 columns to have 9 different values, like a list in Python. Is it possible? If not in SQLite, then on another database engine?
Here is what I believe you are asking: is it possible to take a Python list of 9 different values, and save the values under a particular column in a database?
The answer to this question is "yes". I suggest using a Python ORM library instead of trying to write the SQL code yourself. This example code uses Autumn:
import autumn
import autumn.util
from autumn.util import create_table

# get a database connection object
my_test_db = autumn.util.AutoConn("my_test.db")

# code to create the database table
_create_sql = """\
DROP TABLE IF EXISTS mytest;
CREATE TABLE mytest (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    value INTEGER NOT NULL,
    UNIQUE(value)
);"""

# create the table, dropping any previous table of same name
create_table(my_test_db, _create_sql)

# create ORM class; Autumn introspects the database to find out columns
class MyTest(autumn.model.Model):
    db = my_test_db

lst = [3, 6, 9, 2, 4, 8, 1, 5, 7]  # list of 9 unique values
for n in lst:
    row = MyTest(value=n)  # create MyTest() row instance with value initialized
    row.save()  # write the data to the database
Run this code, then exit Python and run sqlite3 my_test.db. Then run this SQL command inside SQLite: select * from mytest; Here is the result:
1|3
2|6
3|9
4|2
5|4
6|8
7|1
8|5
9|7
This example pulls values from one list, and uses the values to populate one column from the database. It could be trivially extended to add additional columns and populate them as well.
If this is not the answer you are looking for, please rephrase your request to clarify.
P.S. This example uses autumn.util. The setup.py included with the current release of Autumn does not install util.py in the correct place; you will need to finish the setup of Autumn by hand.
You could use a more mature ORM such as SQLAlchemy or the ORM from Django. However, I really do like Autumn, especially for SQLite.
