I generated by pickle.dump() a file with the size of about 5GB. It takes about half a day to load this file and about 50GM RAM. My question is whether is it possible to read this file by accessing separately entry by entry (one at a time) rather than loading it all into memory, or if you have any other suggestion of how to access data in such a file.
Many thanks.
There is absolutely no question that this should be done using a database, rather than pickle- databases are designed for exactly this kind of problem.
Here is some code to get you started, which puts a dictionary into a sqllite database and shows an example of retrieving a value. To get this to work with your actual dictionary rather than my toy example, you'll need to learn more about SQL, but fortunately there are many excellent resources available online. In particular, you might want to learn how to use SQLAlchemy, which is an "Object Relational Mapper" that can make working with databases as intuitive as working with objects.
import os
import sqlite3
# an enormous dictionary too big to be stored in pickle
my_huge_dictionary = {"A": 1, "B": 2, "C": 3, "D": 4}
# create a database in the file my.db
conn = sqlite3.connect('my.db')
c = conn.cursor()
# Create table with two columns: k and v (for key and value). Here your key
# is assumed to be a string of length 10 or less, and your value is assumed
# to be an integer. I'm sure this is NOT the structure of your dictionary;
# you'll have to read into SQL data types
c.execute("""
create table dictionary (
k char[10] NOT NULL,
v integer NOT NULL,
PRIMARY KEY (k))
""")
# dump your enormous dictionary into a database. This will take a while for
# your large dictionary, but you should do it only once, and then in the future
# make changes to your database rather than to a pickled file.
for k, v in my_huge_dictionary.items():
c.execute("insert into dictionary VALUES ('%s', %d)" % (k, v))
# retrieve a value from the database
my_key = "A"
c.execute("select v from dictionary where k == '%s'" % my_key)
my_value = c.next()[0]
print my_value
Good luck!
You could try an object oriented database, if your data is heterogeneous - using ZODB - which internally uses pickle, but in a way designed, and time proved - to manage large amounts of data, you probably will need little changes to your application
ZODB is the heart of Zope - a Python application server - which today powers Plone among other applications.
It can be used stand-alone, without all of Zope's tools - you should check it, doubly so if your data is not fit for SQL.
http://www.zodb.org/
Related
My script produces data in the following format:
dictionary = {
(.. 42 values: None, 1 or 2 ..): {
0: 0.4356, # ints as keys, floats as values
1: 0.2355,
2: 0.4352,
...
6: 0.6794
},
...
}
where:
(.. 42 values: None, 1 or 2 ..) is a game state
inner dict stores calculated values of actions which are possible in that state
The problem is that the state space is very big (millions of states), so the whole data stucture cannot be stored in memory. That's why I'm looking for a database engine which would fit my needs and I could use with Python. I need to get the list of actions and their values in the given state (previously mentioned tuple of 42 values) and to modify value of given action in given state.
Check out ZODB: http://www.zodb.org/en/latest/
It's natve object DB for Python that supports transactions, caching, pluggable layers, pack operations (for keeping history) and BLOBs.
You can use a key-value cache solution. A good one is Redis. It`s very fast and simple, written on the C and more over than just a key value cache. Integration with python just several lines of code. The redis is also can be scaled very easy for the really big data. I worked in the game industry and understand what I am talking about.
Also, as already mentioned here, you can use more complex solution, not a cache, the database PostgresSQL. Now it supports a JSON binary format field - JSONB. I think the best python database ORM is the SQLAlchemy. It supports PostgresSQL out of the box. I will use this one in my code block. For example, you have a table
class MobTable(db.Model):
tablename = 'mobs'
id = db.Column(db.Integer, primary_key=True)
stats = db.Column(JSONB, index=True, default={})
If your have a mob with such json stats
{
id: 1,
title: 'UglyOrk',
resists: {cold: 13}
}
You can search all mobs with the not null cold resists
expr = MobTable.stats[("resists", "cold")]
q = (session.query(MobTable.id, expr.label("cold_protected"))
.filter(expr != None)
.all())
I recommend you use HD5f. It's a data base format that works perfectly with Python (it is specifically developed for Python) and stores the data in binary format. This reduces the size of the data to be stored a great extent! More importantly it gives you the ability of random access which I believe serves for your purposes. Also, if you do not use any compression method you will retrieve the data with the highest possible speed.
You can also store it as JSONB in PostgreSQL DB.
For connecting with PostgreSQL you can use psycopg2, which is compliant with Python Database API Specification v2.0.
I'm trying to store a compressed dictionary in my sqlite database. First, I convert the dict to a string using json.dumps, which seems to work fine. Storing this string in DB also works.
In the next step, I'm compressing my string using encode("zlib"). But storing the resulting string in my db throws an error.
mydict = {"house":"Haus","cat":"Katze","red":u'W\xe4yn',"dict":{"1":"asdfhgjl ahsugoh ","2":"s dhgsuoadhu gohsuohgsduohg"}}
dbCommand("create table testTable (ch1 varchar);")
# convert dictionary to string
jch1 = json.dumps(mydict,ensure_ascii=True)
print(jch1)
# store uncompressed values
dbCommand("insert into testTable (ch1) values ('%s');"%(jch1))
# compress json strings
cjch1 = jch1.encode("zlib")
print(cjch1)
# store compressed values
dbCommand("insert into testTable (ch1) values ('%s');"%(cjch1))
The first print outputs:
{"house": "Haus", "dict": {"1": "asdfhgjl ahsugoh ", "2": "s dhgsuoadhu gohsuohgsduohg"}, "red": "W\u00e4yn", "cat": "Katze"}
The second print is not readable of course:
xワフ1テPCᆵyfᅠネノ õ
Do I need to do any additional conversion before?
Looking forward to any helping hint!
Let's approach this from behind: why are you using gzip encoding in the first place? Do you think you need to save space in your database? Have you checked how long the dictionary strings will be in production? These strings will need to have a minimal length before compression will actually save storage space (for small input strings the output might even be larger than the input!). If that actually saves some disk space: did you think through whether the additional CPU load and processing time due to gzip encoding and decoding are worth the saved space?
Other than that: the result of gzip/zlib compression is a binary blob. In Python 2, this should be of type str. In Python 3, this should be type bytes. In any case, the database needs to know that whatever you are storing there is binary data! VARCHAR is not the right data type for this endeavor. What follows is a quote from MySQL docs:
Also, if you want to store binary values such as results from an
encryption or compression function that might contain arbitrary byte
values, use a BLOB column rather than a CHAR or VARCHAR column, to
avoid potential problems with trailing space removal that would change
data values.
The same consideration holds true for other databases. Also in case of SQLite you must use the BLOB data type (see docs) for storing binary data (if you want to ensure to get back the exact same data as you have put in before :-)).
Thanks a lot Jan-Philip,
you showed me the right solution. My table needs to have a BLOB entry to store the data. Here is the working code:
mydict = {"house":"Haus","cat":"Katze","red":u'W\xe4yn',"dict":{"1":"asdfhgjl ahsugoh ","2":"s dhgsuoadhu gohsuohgsduohg"}}
curs.execute("create table testTable (ch1 BLOB);")
# convert dictionary to string
jch1 = json.dumps(mydict,ensure_ascii=True)
cjch1 = jch1.encode("zlib")
# store compressed values
curs.execute('insert into testTable values (?);', [buffer(cjch1)])
db.commit()
I want to build a table in python with three columns and later on fetch the values as necessary.
I am thinking dictionaries are the best way to do it, which has key mapping to two values.
|column1 | column 2 | column 3 |
| MAC | PORT NUMBER | DPID |
| Key | Value 1 | Value 2 |
proposed way :
// define a global learning table
globe_learning_table = defaultdict(set)
// add port number and dpid of a switch based on its MAC address as a key
// packet.src will give you MAC address in this case
globe_learning_table[packet.src].add(event.port)
globe_learning_table[packet.src].add(dpid_to_str(connection.dpid))
// getting value of DPID based on its MAC address
globe_learning_table[packket.src][????]
I am not sure if one key points to two values how can I get the particular value associated with that key.
I am open to use any another data structure as well, if it can build this dynamic table and give me the particular values when necessary.
Why a dictionary? Why not a list of named tuples, or a collection (list, dictionary) of objects from some class which you define (with attributes for each column)?
What's wrong with:
class myRowObj(object):
def __init__(self, mac, port, dpid):
self.mac = mac
self.port = port
self.dpid = dpid
myTable = list()
for each in some_inputs:
myTable.append(myRowObj(*each.split())
... or something like that?
(Note: myTable can be a list, or a dictionary or whatever is suitable to your needs. Obviously if it's a dictionary then you have to ask what sort of key you'll use to access these "rows").
The advantage of this approach is that your "row objects" (which you'd name in some way that made more sense to your application domain) can implement whatever semantics you choose. These objects can validate and convert any values supplied at instantiation, compute any derived values, etc. You can also define a string and code representations of your object (implicit conversions for when one of your rows is used as a string or in certain types of development and debugging or serialization (_str_ and _repr_ special methods, for example).
The named tuples (added in Python 2.6) are a sort of lightweight object class which can offer some performance advantages and lighter memory footprint over normal custom classes (for situations where you only want the named fields without binding custom methods to these objects, for example).
Something like this perhaps?
>>> global_learning_table = collections.defaultdict(PortDpidPair)
>>> PortDpidPair = collections.namedtuple("PortDpidPair", ["port", "dpid"])
>>> global_learning_table = collections.defaultdict(collections.namedtuple('PortDpidPair', ['port', 'dpid']))
>>> global_learning_table["ff:" * 7 + "ff"] = PortDpidPair(80, 1234)
>>> global_learning_table
defaultdict(<class '__main__.PortDpidPair'>, {'ff:ff:ff:ff:ff:ff:ff:ff': PortDpidPair(port=80, dpid=1234)})
>>>
Named tuples might be appropriate for each row, but depending on how large this table is going to be, you may be better off with a sqlite db or something similar.
If it is small enough to store in memory and you want it to be a data structure, you could create an class that contains Values 1 & 2 and use that as the value for your dictionary mapping.
However, as Mr E pointed out, it is probably better design to use a database to store the information and retrieve as necessary from there. This will likely not result in significant performance loss.
Another option to keep in mind is an in-memory SQLite table. See the Python SQLite docs for a basic example:
11.13. sqlite3 — DB-API 2.0 interface for SQLite databases — Python v2.7.5 documentation
http://docs.python.org/2/library/sqlite3.html
I think you're getting two distinct objectives mixed up. You want a representative data structure, and (as I read it) you want to print it in a readable form. What gets printed as a table is not stored internally in the computer in two dimensions; the table presentation is a visual metaphor.
Assuming I'm right about what you want to accomplish, the way I'd go about it is by a) keeping it simple and b) using the right modules to save effort.
The simplest data structure that correctly represents your information is in my opinion a dictionary within a dictionary. Like this:
foo = {'00:00:00:00:00:00': {'port':22, 'dpid':42},
'00:00:00:00:00:01': {'port':23, 'dpid':43}}
The best module I have found for quick and dirty table printing is prettytable. Your code would look something like this:
foo = {'00:00:00:00:00:00': {'port':22, 'dpid':42},
'00:00:00:00:00:01': {'port':23, 'dpid':43}}
t = PrettyTable(['MAC', 'Port', 'dpid'])
for row in foo:
t.add_row([row, foo[row]['port'], foo[row]['dpid']])
print t
Similar to my last question, but I ran into problem lets say I have a simple dictionary like below but its Big, when I try inserting a big dictionary using the methods below I get operational error for the c.execute(schema) for too many columns so what should be my alternate method to populate an sql databases columns? Using the alter table command and add each one individually?
import sqlite3
con = sqlite3.connect('simple.db')
c = con.cursor()
dic = {
'x1':{'y1':1.0,'y2':0.0},
'x2':{'y1':0.0,'y2':2.0,'joe bla':1.5},
'x3':{'y2':2.0,'y3 45 etc':1.5}
}
# 1. Find the unique column names.
columns = set()
for _, cols in dic.items():
for key, _ in cols.items():
columns.add(key)
# 2. Create the schema.
col_defs = [
# Start with the column for our key name
'"row_name" VARCHAR(2) NOT NULL PRIMARY KEY'
]
for column in columns:
col_defs.append('"%s" REAL NULL' % column)
schema = "CREATE TABLE simple (%s);" % ",".join(col_defs)
c.execute(schema)
# 3. Loop through each row
for row_name, cols in dic.items():
# Compile the data we have for this row.
col_names = cols.keys()
col_values = [str(val) for val in cols.values()]
# Insert it.
sql = 'INSERT INTO simple ("row_name", "%s") VALUES ("%s", "%s");' % (
'","'.join(col_names),
row_name,
'","'.join(col_values)
)
If I understand you right, you're not trying to insert thousands of rows, but thousands of columns. SQLite has a limit on the number of columns per table (by default 2000), though this can be adjusted if you recompile SQLite. Never having done this, I do not know if you then need to tweak the Python interface, but I'd suspect not.
You probably want to rethink your design. Any non-data warehouse / OLAP application is highly unlikely to need or be terribly efficient with thousands of columns (rows, yes) and SQLite is not a good solution for a data warehouse / OLAP type situation. You may get a bit further with something like an entity-attribute-value setup (not a normal recommendation for genuine relational databases, but a valid application data model and much more likely to accommodate your needs without pushing the limits of SQLite too far).
If you really are adding a massive number of rows and are running into problems, maybe your single transaction is getting too large.
Do a COMMIT (commit()) after a given number of lines (or even after each insert as a test) if that is acceptable.
Thousands of rows should be easily doable with sqlite. Getting to millions and above, at some point there might be need for more. Depends on a lot of things, of course.
I want 3 columns to have 9 different values, like a list in Python.
Is it possible? If not in SQLite, then on another database engine?
You must serialize the list (or other Python object) into a string of bytes, aka "BLOB";-), through your favorite means (marshal is good for lists of elementary values such as numbers or strings &c, cPickle if you want a very general solution, etc), and deserialize it when you fetch it back. Of course, that basically carries the list (or other Python object) as a passive "payload" -- can't meaningfully use it in WHERE clauses, ORDER BY, etc.
Relational databases just don't deal all that well with non-atomic values and would prefer other, normalized alternatives (store the list's items in a different table which includes a "listID" column, put the "listID" in your main table, etc). NON-relational databases, while they typically have limitations wrt relational ones (e.g., no joins), may offer more direct support for your requirement.
Some relational DBs do have non-relational extensions. For example, PostGreSQL supports an array data type (not quite as general as Python's lists -- PgSQL's arrays are intrinsically homogeneous).
Generally, you do this by stringifying the list (with repr()), and then saving the string. On reading the string from the database, use eval() to re-create the list. Be careful, though that you are certain no user-generated data can get into the column, or the eval() is a security risk.
Your question is difficult to understand. Here it is again:
I want 3 columns to have 9 different values, like a list in Python. Is it possible? If not in SQLite, then on another database engine?
Here is what I believe you are asking: is it possible to take a Python list of 9 different values, and save the values under a particular column in a database?
The answer to this question is "yes". I suggest using a Python ORM library instead of trying to write the SQL code yourself. This example code uses Autumn:
import autumn
import autumn.util
from autumn.util import create_table
# get a database connection object
my_test_db = autumn.util.AutoConn("my_test.db")
# code to create the database table
_create_sql = """\
DROP TABLE IF EXISTS mytest;
CREATE TABLE mytest (
id INTEGER PRIMARY KEY AUTOINCREMENT,
value INTEGER NOT NULL,
UNIQUE(value)
);"""
# create the table, dropping any previous table of same name
create_table(my_test_db, _create_sql)
# create ORM class; Autumn introspects the database to find out columns
class MyTest(autumn.model.Model):
db = my_test_db
lst = [3, 6, 9, 2, 4, 8, 1, 5, 7] # list of 9 unique values
for n in lst:
row = MyTest(value=n) # create MyTest() row instance with value initialized
row.save() # write the data to the database
Run this code, then exit Python and run sqlite3 my_test.db. Then run this SQL command inside SQLite: select * from mytest; Here is the result:
1|3
2|6
3|9
4|2
5|4
6|8
7|1
8|5
9|7
This example pulls values from one list, and uses the values to populate one column from the database. It could be trivially extended to add additional columns and populate them as well.
If this is not the answer you are looking for, please rephrase your request to clarify.
P.S. This example uses autumn.util. The setup.py included with the current release of Autumn does not install util.py in the correct place; you will need to finish the setup of Autumn by hand.
You could use a more mature ORM such as SQLAlchemy or the ORM from Django. However, I really do like Autumn, especially for SQLite.