Storing a zipped string in an SQLite database using Python

I'm trying to store a compressed dictionary in my SQLite database. First, I convert the dict to a string using json.dumps, which seems to work fine. Storing this string in the DB also works.
In the next step, I'm compressing my string using encode("zlib"). But storing the resulting string in my db throws an error.
mydict = {"house":"Haus","cat":"Katze","red":u'W\xe4yn',"dict":{"1":"asdfhgjl ahsugoh ","2":"s dhgsuoadhu gohsuohgsduohg"}}
dbCommand("create table testTable (ch1 varchar);")
# convert dictionary to string
jch1 = json.dumps(mydict,ensure_ascii=True)
print(jch1)
# store uncompressed values
dbCommand("insert into testTable (ch1) values ('%s');"%(jch1))
# compress json strings
cjch1 = jch1.encode("zlib")
print(cjch1)
# store compressed values
dbCommand("insert into testTable (ch1) values ('%s');"%(cjch1))
The first print outputs:
{"house": "Haus", "dict": {"1": "asdfhgjl ahsugoh ", "2": "s dhgsuoadhu gohsuohgsduohg"}, "red": "W\u00e4yn", "cat": "Katze"}
The second print is not readable of course:
xワフ1テPCᆵyfᅠネノ õ
Do I need to do any additional conversion beforehand?
Looking forward to any helpful hint!

Let's approach this from the other end: why are you compressing the data in the first place? Do you think you need to save space in your database? Have you checked how long the dictionary strings will be in production? The strings need a certain minimal length before compression actually saves storage space (for small inputs the output can even be larger than the input!). And if it does save some disk space: have you thought through whether the additional CPU load and processing time for compressing and decompressing are worth the saved space?
Other than that: the result of gzip/zlib compression is a binary blob. In Python 2, this should be of type str. In Python 3, this should be type bytes. In any case, the database needs to know that whatever you are storing there is binary data! VARCHAR is not the right data type for this endeavor. What follows is a quote from MySQL docs:
Also, if you want to store binary values such as results from an encryption or compression function that might contain arbitrary byte values, use a BLOB column rather than a CHAR or VARCHAR column, to avoid potential problems with trailing space removal that would change data values.
The same consideration holds true for other databases. In the case of SQLite you must likewise use the BLOB data type (see the docs) for storing binary data (if you want to be sure to get back exactly the same data you put in :-)).

Thanks a lot Jan-Philip,
you showed me the right solution. My table needs a BLOB column to store the data. Here is the working code:
mydict = {"house":"Haus","cat":"Katze","red":u'W\xe4yn',"dict":{"1":"asdfhgjl ahsugoh ","2":"s dhgsuoadhu gohsuohgsduohg"}}
curs.execute("create table testTable (ch1 BLOB);")
# convert dictionary to string
jch1 = json.dumps(mydict,ensure_ascii=True)
cjch1 = jch1.encode("zlib")
# store compressed values
curs.execute('insert into testTable values (?);', [buffer(cjch1)])  # buffer() makes sqlite3 store the bytes as a BLOB (Python 2)
db.commit()
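For anyone reading this with Python 3: str.encode("zlib") and buffer() no longer exist there. A rough Python 3 sketch of the same idea (not from the original thread; the file name test.db is made up) would use zlib.compress() and bind the resulting bytes directly, which sqlite3 stores as a BLOB:
import json
import sqlite3
import zlib

mydict = {"house": "Haus", "cat": "Katze", "red": u'W\xe4yn',
          "dict": {"1": "asdfhgjl ahsugoh ", "2": "s dhgsuoadhu gohsuohgsduohg"}}

db = sqlite3.connect("test.db")
curs = db.cursor()
curs.execute("create table testTable (ch1 BLOB);")

# compress the JSON text into bytes
cjch1 = zlib.compress(json.dumps(mydict, ensure_ascii=True).encode("utf-8"))

# a bytes parameter is stored as a BLOB
curs.execute("insert into testTable values (?);", (cjch1,))
db.commit()

# read back, decompress and decode
blob = curs.execute("select ch1 from testTable").fetchone()[0]
assert json.loads(zlib.decompress(blob).decode("utf-8")) == mydict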

Related

How to partially overwrite blob in an sqlite3 db in python via SQLAlchemy?

I have a db that contains a blob column; its binary representation was shown in a screenshot (not reproduced here).
The value that I'm interested in is encoded as a little-endian unsigned long long (8-byte) value in the marked region of that blob. Reading this value works fine like this:
from struct import unpack  # p.value is the raw bytes of the blob

p = session.query(Properties).filter((Properties.object_id==1817012) & (Properties.name.like("%OwnerUniqueID"))).one()
id = unpack("<Q", p.value[-8:])[0]
id in the above example is 1657266.
Now what I would like to do is the reverse. I have the row object p, I have a number in decimal format (using the same 1657266 for testing purposes), and I want to write that number in little-endian format to those same 8 bytes.
I've been trying to do so via an SQL statement:
UPDATE properties SET value = (SELECT substr(value, 1, length(value)-8) || x'b249190000000000' FROM properties WHERE object_id=1817012 AND name LIKE '%OwnerUniqueID%') WHERE object_id=1817012 AND name LIKE '%OwnerUniqueID%'
But when I do it like that, I can't read it anymore. At least not with SQLAlchemy. When I try the same code as above, I get the error message Could not decode to UTF-8 column 'properties_value' with text '☻', so it looks like it's written in a different format.
Interestingly, using a normal select statement in DB Browser still works fine and the blob is still displayed exactly as before.
Now ideally I'd like to be able to write just those 8 bytes using the SQLAlchemy ORM but I'd settle for a raw SQL statement if that's what it takes.
I managed to get it to work with SQLAlchemy by basically reversing the process that I used to read it. In hindsight using the + to concatenate and the [:-8] to slice the correct part seems pretty obvious.
from struct import pack
p = session.query(Properties).filter((Properties.object_id==1817012) & (Properties.name.like("%OwnerUniqueID"))).one()
p.value = p.value[:-8] + pack("<Q", 1657266)
session.commit()
By turning on ECHO for SQLAlchemy I got the following raw SQL statement:
UPDATE properties SET value=? WHERE properties.object_id = ? AND properties.name = ?
(<memory at 0x000001B93A266A00>, 1817012, 'BP_ThrallComponent_C.OwnerUniqueID')
Which is not particularly helpful if you want to do the same thing manually I suppose.
It's worth noting that the raw SQL statement in my question not only works as far as reading it with DB Browser is concerned, but also with the game client that uses the db in question. It's only SQLAlchemy that seems to have trouble, apparently trying to decode it as UTF-8.
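If you do want to do the same thing manually, here is a sketch using the sqlite3 module directly (the database path is made up; table and column names are the ones from the question): read the blob, patch the last 8 bytes in Python, and write the whole value back as a bound bytes parameter.
from struct import pack
import sqlite3

conn = sqlite3.connect("game.db")  # made-up path to the database file

row = conn.execute(
    "SELECT value FROM properties WHERE object_id = ? AND name LIKE '%OwnerUniqueID%'",
    (1817012,)).fetchone()

new_value = row[0][:-8] + pack("<Q", 1657266)

# binding bytes stores the value with BLOB storage class (the || expression
# apparently ended up stored as TEXT, judging by the UTF-8 error above)
conn.execute(
    "UPDATE properties SET value = ? WHERE object_id = ? AND name LIKE '%OwnerUniqueID%'",
    (new_value, 1817012))
conn.commit()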

insert list,values into mysql table

I need to insert the list and some values into a table.
I have tried executemany but it didn't work.
list1=['a','b','c','d','e',['f','g','h','i']]
query="insert into "+metadata_table_name+"(created_by,Created_on,File_Path,Category,File_name,Fields) values(%s,%s,%s,%s,%s,%s)" # inserting the new record
cursor.executemany(query,list1)
The list should be entered into the last (Fields) column.
Please help me.
Thanks in Advance.
You have to think about data types. Does MySQL have a suitable data type for Python's nested lists? I don't know of one.
A possible solution is to use JSON encoding and store the nested list as a string in the MySQL table. Encode the last element of your list to a JSON string:
import json
list1=['a','b','c','d','e',['f','g','h','i']]
query_params = list1[0:-1] + [json.dumps(list1[-1])]
query="insert into "+metadata_table_name+" (created_by,Created_on,File_Path,Category,File_name,Fields) values(%s,%s,%s,%s,%s,%s)" # inserting the new record
cursor.execute(query, query_params)  # execute, not executemany: this is a single row
To use the stored data later, you have to convert the JSON string back to a list:
fields_list = json.loads(fields_str)
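The fields_str above would come from a SELECT on the same column, for example (column names as in the question; file_name is a hypothetical lookup value):
cursor.execute("SELECT Fields FROM " + metadata_table_name + " WHERE File_name = %s", (file_name,))
(fields_str,) = cursor.fetchone()
fields_list = json.loads(fields_str)  # ['f', 'g', 'h', 'i']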

How to reduce memory footprint for dictionary with 4M+ objects with strings?

How can I reduce the memory footprint for a dictionary containing 4M+ objects with strings?
It currently consumes about 1.5 GBytes of RAM, and I need to add several million more objects to it on systems that have limited resources due to prohibitive cost (cloud-based).
Here's some simplified code illustrating the gist of what I'm doing. Basically I fetch a set of about 4 million users from a database and put all the info into a local dict with all the users for quick access (I must work with a local copy of the user data for performance reasons).
Simplified Code
import pymysql
class User:
    __slots__ = ['user_id', 'name', 'type']

    def __init__(self):
        self.user_id = None
        self.name = None
        self.type = None

cursor.execute("SELECT UserId, Username, Type FROM Users")
db_query_result = cursor.fetchall()
all_users = {}

for db_user in db_query_result:
    user_details = User()
    user_details.name = db_user[1]
    user_details.type = db_user[2]
    db_user_id = db_user[0]
    all_users[str(db_user_id)] = user_details
Data Types
user_id: int
name: string, each one averaging maybe about 13 characters
type: int
From some web searching, it seems to me User.name is consuming the majority of the space due to the large amount of memory required for string objects.
I already decreased the footprint from about 2GB down to 1.5GB by using __slots__, but I need to reduce it further.
If you really need the data locally, consider saving it to a SQLite DB on the host, and letting SQLite load the hot dataset into memory for you, instead of keeping all of it in memory.
import sqlite3

db_conn = sqlite3.connect(path_to_sqlite_file)             # path_to_sqlite_file: your local DB file
db_conn.execute('PRAGMA mmap_size={};'.format(mmap_size))  # mmap_size: bytes to memory-map, e.g. 256 * 1024 * 1024
If you really need all that data in memory, consider configuring swap space on the host as a cheaper alternative. The OS will swap colder memory pages to this swap space.
Of course, you can always compress your strings using gzip, if name is a large string. Other tricks include deduplication with an index, if there are repeated words in your names.
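A sketch of that deduplication idea, interning whole names for simplicity (this only pays off if many users share the same name; the helper and variable names here are made up):
name_index = {}    # name -> position in unique_names
unique_names = []  # one copy of each distinct name

def intern_name(name):
    idx = name_index.get(name)
    if idx is None:
        idx = len(unique_names)
        unique_names.append(name)
        name_index[name] = idx
    return idx

# per user, keep a small int instead of yet another str object
user_details.name = intern_name(db_user[1])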
You can also use structs instead of classes.
import sys, struct

u = User()                                              # a __slots__ instance from the question
sys.getsizeof(u)                                        # 64 bytes
# unsigned short for user ID, unsigned byte for type, 13-byte string for the name
sys.getsizeof(struct.pack('HB13s', 10, 1, b'raymond'))  # 49 bytes
If you know that your user IDs are contiguous and you are using fixed-length structs, you can also index a plain sequence by position (or one big buffer by byte offset) instead of using the dict. (NumPy arrays would be useful here.)
all_users = np.array([structs])
all_users = (struct0, struct1, struct2, ...)  # good old tuples are OK too, e.g. all_users[user_id] would work
For something closer to production quality, you will want a data prep step that appends these structs to a file, which can later be read back when you are actually using the data:
# writing
with open('file.dat', mode='wb') as f:
    for user in users:
        f.write(user)  # where user is a fixed-length struct (bytes)

# reading
length_of_struct = struct.calcsize('HB13s')
with open('file.dat', mode='rb') as f:
    # given some index
    offset = index * length_of_struct
    f.seek(offset)
    record = f.read(length_of_struct)  # avoid naming this 'struct', which would shadow the module
However, I am not convinced that this is the best design for the problem you actually have. Other alternatives include:
inspecting your db design, especially your indexes
using memcache/redis to cache the most frequently used records
A 13-character string's actual string storage takes only 13 bytes if it's all Latin-1, 26 bytes if it's all BMP, 52 bytes if it's got characters from all over Unicode.
However, the overhead for a str object is another 52 bytes. So, assuming you've got mostly Latin-1, you're using about 5x as much storage as you need.
If your strings are, once encoded to UTF-8 or UTF-16-LE or whatever is best for your data, all around the same size, you probably want to store them in a big flat array and pull them out and decode them on the fly as needed, as shown in James Lim's answer. Although I'd probably use a NumPy native structured dtype rather than use the struct module.
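A sketch of that NumPy structured-dtype variant (the field widths are assumptions based on the data types listed in the question, with names capped at 32 UTF-8 bytes):
import numpy as np

user_dtype = np.dtype([('user_id', np.uint32), ('type', np.uint8), ('name', 'S32')])

all_users = np.zeros(len(db_query_result), dtype=user_dtype)
for i, (user_id, name, type_) in enumerate(db_query_result):
    all_users[i] = (user_id, type_, name.encode('utf-8')[:32])

# every record costs exactly user_dtype.itemsize bytes (37 here), plus one array header
print(all_users.itemsize, all_users.nbytes)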
But what if you have a few huge strings, and you don't want to waste 88 bytes for each one when most of them are only 10 bytes long?
Then you want a string table. This is just a giant bytearray where all the (encoded) strings live, and you store indexes into that table instead of storing the strings themselves. Those indexes are just int32 or at worst int64 values that you can pack into an array with no problems.
For example, assuming none of your strings are more than 255 characters, we can store them as "Pascal strings", with a length byte followed by the encoded bytes:
class StringTable:
    def __init__(self):
        self._table = bytearray()

    def add(self, s):
        b = s.encode()
        idx = len(self._table)
        self._table.append(len(b))  # length byte; assumes encoded strings are at most 255 bytes
        self._table.extend(b)
        return idx

    def get(self, idx):
        length = self._table[idx]
        return self._table[idx+1:idx+1+length].decode()
So now:
strings = StringTable()
for db_user in db_query_result:
    user_details = User()
    user_details.name = strings.add(db_user[1])
    user_details.type = strings.add(str(db_user[2]))  # type is an int in the question, so stringify it first
    db_user_id = strings.add(str(db_user[0]))
    all_users[db_user_id] = user_details
Except, of course, you probably still want to replace that all_users with a numpy array.
Instead of using cursor.fetchall(), which stores all the data on the client side, you should use an SSCursor to leave the result set on the server side:
import pymysql
import pymysql.cursors as cursors
conn = pymysql.connect(..., cursorclass=cursors.SSCursor)
so that you can fetch the rows one by one:
cursor = conn.cursor()
cursor.execute('SELECT UserId, Username, Type FROM Users')
for db_user in cursor:
    user_details = User()
    user_details.name = db_user[1]
    user_details.type = db_user[2]
    ...
And depending on what you want to do with the all_users dict, you may not need to store all user info in a dict either. If you can process each user one by one, do it directly inside the for loop above instead of building up a huge dict.
Do you actually need this cached in memory, or just on the local system?
If the latter, just use a local database.
Since you just want something that acts like a dict, you just want a key-value database. The simplest KV database is a dbm, which Python supports out of the box. Using a dbm from Python looks exactly like using a dict, except that the data are on disk instead of in memory.
Unfortunately, dbm has two problems, but they're both solvable:
Depending on the underlying implementation, a huge database might either not work, or go very slowly. You can use a modern variant like KyotoCabinet to solve that, but you'll need a third-party wrapper.
dbm keys and values can only be bytes. Python's dbm module wraps things up to allow storing Unicode strings transparently, but nothing else. But Python comes with another module, shelve, which lets you transparently store any kind of value that can be pickled in a dbm.
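A minimal sketch of the shelve route (the file name and sample record are made up); it reads and writes like a dict, but the data lives on disk in a dbm file:
import shelve

with shelve.open('all_users.shelf') as all_users:
    # keys must be strings; values can be anything picklable
    all_users['1817012'] = {'name': 'raymond', 'type': 1}

with shelve.open('all_users.shelf') as all_users:
    print(all_users['1817012']['name'])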
But you might want to instead use a more powerful key-value database like Dynamo or Couchbase.
In fact, you might even be able to get away with just using a KV database like Redis or Memcached purely in-memory, because they'll store the same data you're storing a lot more compactly.
Alternatively, you could just dump the data from the remote MySQL into a local MySQL, or even a local SQLite (and optionally throw an ORM in front of it).
The memory footprint may be decreased with the help of recordclass:
from recordclass import dataobject
class User(dataobject):
    __fields__ = 'user_id', 'name', 'type'
Each instance of User now requires less memory than a __slots__-based one.
The difference is equal to 24 bytes (the size of PyGC_Head).
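A rough way to see the difference (a sketch: exact numbers vary with the Python and recordclass versions, and the generated constructor is assumed to accept the declared fields positionally):
import sys
from recordclass import dataobject

class SlotsUser:
    __slots__ = ['user_id', 'name', 'type']

class RecordUser(dataobject):
    __fields__ = 'user_id', 'name', 'type'

print(sys.getsizeof(SlotsUser()))              # e.g. 64 bytes
print(sys.getsizeof(RecordUser(1, 'ray', 2)))  # e.g. 40 bytes, i.e. without the GC header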

Zero date in mysql select using python

I have a Python script which selects some rows from a table and inserts them into another table. One field has a date type, and there is a problem when its value is '0000-00-00': Python converts this value to None, which then causes an error while inserting it into the second table.
How can I solve this problem? Why does Python convert that value to None?
Thank you in advance.
This is actually a None value in the database, in a way. MySQL treats '0000-00-00' specially.
From MySQL documentation:
MySQL permits you to store a “zero” value of '0000-00-00' as a “dummy date.” This is in some cases more convenient than using NULL values, and uses less data and index space. To disallow '0000-00-00', enable the NO_ZERO_DATE mode.
It seems that Python's MySQL library is trying to be nice to you and converts this to None.
When writing, it cannot guess that you wanted '0000-00-00' and uses NULL instead. You should convert it yourself. For example, this might work:
if value_read_from_one_table is not None:
    value_written_to_the_other_table = value_read_from_one_table
else:
    value_written_to_the_other_table = '0000-00-00'
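A sketch of applying that conversion while copying rows (pymysql is used here as an example driver; credentials, table and column names are all made up, and note that this also maps genuine NULLs back to the zero date):
import pymysql

src = pymysql.connect(host="localhost", user="u", password="p", database="db1")
dst = pymysql.connect(host="localhost", user="u", password="p", database="db2")
src_cursor, dst_cursor = src.cursor(), dst.cursor()

src_cursor.execute("SELECT id, some_date FROM source_table")
for row_id, some_date in src_cursor.fetchall():
    if some_date is None:         # the driver turned '0000-00-00' into None
        some_date = '0000-00-00'  # write the zero date back instead of NULL
    dst_cursor.execute("INSERT INTO target_table (id, some_date) VALUES (%s, %s)",
                       (row_id, some_date))
dst.commit()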

Loading a huge Python Pickle dictionary

I generated a file of about 5 GB with pickle.dump(). It takes about half a day to load this file, and it needs about 50 GB of RAM. My question is whether it is possible to read this file entry by entry (one at a time) rather than loading it all into memory, or if you have any other suggestion for how to access the data in such a file.
Many thanks.
There is absolutely no question that this should be done using a database rather than pickle: databases are designed for exactly this kind of problem.
Here is some code to get you started, which puts a dictionary into an SQLite database and shows an example of retrieving a value. To get this to work with your actual dictionary rather than my toy example, you'll need to learn more about SQL, but fortunately there are many excellent resources available online. In particular, you might want to learn how to use SQLAlchemy, which is an "Object Relational Mapper" that can make working with databases as intuitive as working with objects.
import os
import sqlite3
# an enormous dictionary too big to be stored in pickle
my_huge_dictionary = {"A": 1, "B": 2, "C": 3, "D": 4}
# create a database in the file my.db
conn = sqlite3.connect('my.db')
c = conn.cursor()
# Create table with two columns: k and v (for key and value). Here your key
# is assumed to be a string of length 10 or less, and your value is assumed
# to be an integer. I'm sure this is NOT the structure of your dictionary;
# you'll have to read into SQL data types
c.execute("""
create table dictionary (
k char[10] NOT NULL,
v integer NOT NULL,
PRIMARY KEY (k))
""")
# dump your enormous dictionary into a database. This will take a while for
# your large dictionary, but you should do it only once, and then in the future
# make changes to your database rather than to a pickled file.
for k, v in my_huge_dictionary.items():
    c.execute("insert into dictionary VALUES (?, ?)", (k, v))  # use placeholders instead of string formatting
conn.commit()
# retrieve a value from the database
my_key = "A"
c.execute("select v from dictionary where k == '%s'" % my_key)
my_value = c.next()[0]
print my_value
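For convenience, a thin dict-like read wrapper over such a table keeps the rest of the code readable (a sketch reusing the table and connection created above):
class SqliteDict:
    """Minimal read-only dict-style view over the 'dictionary' table above."""
    def __init__(self, conn):
        self.conn = conn

    def __getitem__(self, key):
        row = self.conn.execute("select v from dictionary where k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

lookup = SqliteDict(conn)
print(lookup["A"])  # 1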
Good luck!
You could try an object-oriented database if your data is heterogeneous. ZODB internally uses pickle, but in a way that was designed, and has been proven over time, to manage large amounts of data; you will probably need only small changes to your application.
ZODB is the heart of Zope, a Python application server, which today powers Plone among other applications.
It can be used stand-alone, without all of Zope's tools. You should check it out, doubly so if your data is not a good fit for SQL.
http://www.zodb.org/
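A minimal sketch of standalone ZODB use (the file name is made up; the classic FileStorage setup is assumed):
from ZODB import FileStorage, DB
import transaction

storage = FileStorage.FileStorage('users.fs')
db = DB(storage)
conn = db.open()
root = conn.root()

root['my_huge_dictionary'] = {"A": 1, "B": 2}  # any picklable objects
transaction.commit()

print(root['my_huge_dictionary']['A'])
conn.close()
db.close()
For millions of entries you would normally use a BTrees.OOBTree.OOBTree instead of one plain dict, so the mapping itself is not loaded as a single pickle.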
