How to update a saved mapping without loading it into memory?

I'm maintaining some mappings that I need to continually update.
These mappings are implemented as pickle serialized dicts right now.
The update process is like this:
Load the pickle file into memory so that I have access to the dict.
Apply any updates to the dict and serialize it again.
The problem with this solution is that it can consume a lot of memory for large dicts.
I've looked into other solutions like shelve and leveldb, but they can both generate many files instead of one, which is more complex to save to systems like key-value storage.
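For context, the update cycle described above looks roughly like this sketch (the file name is illustrative):

import pickle

# Load the whole mapping into memory (this is the expensive step).
with open('mapping.pickle', 'rb') as f:
    mapping = pickle.load(f)

# Apply whatever updates are needed.
mapping['some-key'] = 'new-value'

# Serialize the entire dict again.
with open('mapping.pickle', 'wb') as f:
    pickle.dump(mapping, f)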

To read and modify your mappings without reading the entire map into memory, you'll need to store it as an indexed structure in some sort of database. There are lots of databases with good Python bindings that store the data on disk as a single file, so you don't have to worry about database servers or separate index files. SQLite is almost certainly the most common choice. However, as you pointed out in the comments, the full functionality of an SQL database is probably unnecessary for your purpose, since you really only need to store key-value pairs.
Based on your particular requirements, then, I'd probably recommend vedis. It's a single-file, key-value database which can support very large database sizes (the documentation claims it can handle on the order of terabytes), and it is transactional and thread-safe to boot.
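A minimal sketch of that approach, assuming the third-party vedis-python bindings and an illustrative file name; check the vedis documentation for the full transaction API:

from vedis import Vedis  # assumes the third-party vedis-python package

db = Vedis('mappings.vedis')  # a single on-disk file; the name is illustrative

# Add or update one entry without loading the rest of the store into memory.
db['some-key'] = 'new-value'

# Read back a single entry.
value = db['some-key']

db.commit()  # flush changes to disk
db.close()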

Related

Caching a large data structure in python across instances while maintaining types

I'm looking to use a distributed cache in Python. I have a FastAPI application and want every instance to have access to the same data, since our load balancer may route incoming requests differently. The problem is that I'm storing / editing information about a relatively big data set from an Arrow Feather file and processing it with Vaex. The Feather file automatically loads the correct types for the data. The data structure I need to store will use a user id as a key and the value will be a large array of arrays of numbers. I've looked at memcache and redis as possible caching solutions, but both seem to store entries as strings / simple values. I'm looking to avoid parsing strings and doing extra processing on a large amount of data. Is there a distributed caching strategy that will let me persist types?
One solution we came up with is to store the data in multiple Feather files in a directory that is accessible to all instances of the app, but this seems messy as you would need to clean up / delete the files after each session.
Redis 'strings' are actually able to store arbitrary binary data; they aren't limited to actual strings. From https://redis.io/topics/data-types:
Redis Strings are binary safe, this means that a Redis string can contain any kind of data, for instance a JPEG image or a serialized Ruby object.
A String value can be at max 512 Megabytes in length.
Another option is to use Flatbuffers, which is a serialisation protocol specifically designed to allow reading/writing serialised objects without expensive deserialisation.
Although I would suggest reconsidering storing large, complex data structures as cache values. The drawback is that any change will lead to having to rewrite the entire thing in cache which can get expensive, so consider breaking it up into smaller k/v pairs if possible. You could use Redis Hash data type to make this easier to implement.
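Here is a sketch of both ideas (one binary-safe string value, and a hash that breaks the structure into smaller pieces). It assumes the redis-py client and a local Redis instance; the key names and the array contents are made up:

import pickle
import redis  # assumes the redis-py client and a local Redis instance

r = redis.Redis(host='localhost', port=6379)

user_id = 'user:123'                     # illustrative key
arrays = [[1.0, 2.5, 3.7], [4.2, 5.1]]   # stand-in for the array of arrays

# Option 1: store the whole structure as one binary-safe string value.
r.set(user_id, pickle.dumps(arrays))
restored = pickle.loads(r.get(user_id))

# Option 2: break it into smaller pieces under a Redis hash, so a single
# row can be rewritten without resending the whole structure.
for i, row in enumerate(arrays):
    r.hset(user_id + ':rows', f'row:{i}', pickle.dumps(row))
row0 = pickle.loads(r.hget(user_id + ':rows', 'row:0'))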

What's the best strategy for dumping very large python dictionaries to a database?

I'm writing something that essentially refines and reports various strings out of an enormous python dictionary (the source file for the dictionary is XML over a million lines long).
I found mongodb yesterday and was delighted to see that it accepts python dictionaries easy as you please... until it refused mine because the dict object is larger than the BSON size limit of 16MB.
I looked at GridFS for a sec, but that won't accept any python object that doesn't have a .read attribute.
Over time, this program will acquire many of these mega dictionaries; I'd like to dump each into a database so that at some point I can compare values between them.
What's the best way to handle this? I'm awfully new to all of this but that's fine with me :) It seems that a NoSQL approach is best; the structure of these is generally known but can change without notice. Schemas would be nightmarish here.
Have you considered using Pandas? Pandas does not natively accept XML, but if you use ElementTree from the xml standard library you should be able to read it into a Pandas DataFrame and do what you need with it, including refining strings and adding more data to the DataFrame as you get it.
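A rough sketch of that ElementTree + Pandas idea; the file name and the <record> tag are hypothetical, so adapt them to the real schema:

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse('huge_source.xml')  # illustrative file name
root = tree.getroot()

# Flatten each <record> element into a dict of its child tags;
# 'record' is a hypothetical tag name.
records = [
    {child.tag: child.text for child in record}
    for record in root.iter('record')
]

df = pd.DataFrame(records)
# From here you can refine strings, filter rows, add columns, etc.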
So I've decided that this problem is more of a data design problem than a Python situation. I'm trying to load a lot of unstructured data into a database when I probably only need 10% of it. I've decided to save the refined XML dictionary as a pickle on a shared filesystem for cold storage and use Mongo to store the refined queries I want from the dictionary.
That'll reduce their size from 22 MB to about 100 KB.
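A sketch of that split, assuming the pymongo driver and made-up names for the share path, database, and collection:

import pickle
from pymongo import MongoClient  # assumes the pymongo driver

refined_dict = {'example-key': 'example-value'}   # stand-in for the big refined dict
refined_results = ['string one', 'string two']    # stand-in for the queried strings

# Cold storage: park the full refined dictionary on the shared filesystem.
with open('/mnt/shared/refined_dict.pickle', 'wb') as f:  # illustrative path
    pickle.dump(refined_dict, f)

# Hot storage: keep only the small, queryable results in MongoDB.
client = MongoClient('mongodb://localhost:27017')  # assumed connection string
client['reports']['refined_queries'].insert_one({
    'source': 'refined_dict.pickle',
    'results': refined_results,
})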
Thanks for chatting with me about this :)

Can you permanently change python code by input?

I'm still learning Python and am currently developing an artificial personal assistant (e.g. Siri or Cortana). I was wondering if there was a way to update code by input. For example, if I had a list, would it be possible to PERMANENTLY add a new item, even after the program has finished running?
I read that you would have to use SQLite, is that true? And are there any other ways?
Hello J Nowak
I think what you want to do is save the input data to a file (e.g. a .txt file); see the sketch below.
You can view the link below which will show you how to read and write to a text file.
How to read and write to text file in Python
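For example, a tiny sketch of persisting new items by appending them to a text file and reloading them at startup (the file name is illustrative):

ITEMS_FILE = 'items.txt'  # illustrative file name

def load_items():
    """Read previously saved items, one per line."""
    try:
        with open(ITEMS_FILE) as f:
            return [line.rstrip('\n') for line in f]
    except FileNotFoundError:
        return []

def add_item(item):
    """Append a new item so it survives after the program exits."""
    with open(ITEMS_FILE, 'a') as f:
        f.write(item + '\n')

items = load_items()
new_item = input('Add an item: ')
add_item(new_item)
items.append(new_item)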
There are plenty of ways to make your data persistent.
It depends on the task, the environment, etc.
Just a couple of examples:
Files (JSON, DBM, Pickle)
NoSQL Databases (Redis, MongoDB, etc.)
SQL Databases (both serverless and client server: sqlite, MySQL, PostgreSQL etc.)
The most simple/basic approach is to use files.
There are even modules that allow you to do it transparently.
You just work with your data as always.
See shelve for example.
From the documentation:
A “shelf” is a persistent, dictionary-like object. The difference with “dbm” databases is that the values (not the keys!) in a shelf can be essentially arbitrary Python objects — anything that the pickle module can handle. This includes most class instances, recursive data types, and objects containing lots of shared sub-objects. The keys are ordinary strings.
Example of usage:
import shelve

s = shelve.open('test_shelf.db')
try:
    s['key1'] = {'int': 10, 'float': 9.5, 'string': 'Sample data'}
finally:
    s.close()
You work with s just as you would with a normal dictionary,
and it is automatically saved on disk (in the file test_shelf.db in this case).
This way your dictionary is persistent
and will not lose its values after the program restarts.
More on it:
https://docs.python.org/2/library/shelve.html
https://pymotw.com/2/shelve/
Another option is to use pickle, which also gives you persistence,
but not magically: you will need to read and write the data on your own (see the sketch below).
Comparison between shelve and pickle:
What is the difference between pickle and shelve?
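As noted, with pickle you handle the file reading and writing yourself; a minimal sketch (the file name is illustrative):

import pickle

data = {'int': 10, 'float': 9.5, 'string': 'Sample data'}

# Writing: you open the file yourself and dump the whole object.
with open('test_data.pickle', 'wb') as f:
    pickle.dump(data, f)

# Reading: you open the file yourself and load the whole object back.
with open('test_data.pickle', 'rb') as f:
    restored = pickle.load(f)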

Moving back and forth between an on-disk database and a fast in-memory database?

Python's sqlite3 :memory: option provides speedier queries and updates than the equivalent on-disk database. How can I load a disk-based database into memory, do fast operations on it, and then write the updated version back to disk?
The question How to browse an in memory sqlite database in python seems related but it focuses on how to use a disk-based browsing tool on an in-memory db. The question How can I copy an in-memory SQLite database to another in-memory SQLite database in Python? is also related but it is specific to Django.
My current solution is to read all of the tables, one-at-a-time, from the disk-based database into lists of tuples, then manually recreate the entire database schema for the in-memory db, and then load the data from the lists of tuples into the in-memory db. After operating on the data, the process is reversed.
There must be a better way!
The answer at How to load existing db file to memory in Python sqlite3? provided the important clues. Building on that answer, here is a simplification and generalization of that code.
It eliminates the unnecessary use of StringIO and is packaged in a reusable form for both reading into and writing from an in-memory database.
import sqlite3

def copy_database(source_connection, dest_dbname=':memory:'):
    '''Return a connection to a new copy of an existing database.

    Raises an sqlite3.OperationalError if the destination already exists.
    '''
    script = ''.join(source_connection.iterdump())
    dest_conn = sqlite3.connect(dest_dbname)
    dest_conn.executescript(script)
    return dest_conn

if __name__ == '__main__':
    from contextlib import closing

    with closing(sqlite3.connect('pepsearch.db')) as disk_db:
        mem_db = copy_database(disk_db)
        mem_db.execute('DELETE FROM documents WHERE uri="pep-3154"')
        mem_db.commit()
        copy_database(mem_db, 'changed.db').close()
Frankly, I wouldn't fool around too much with in-memory databases, unless you really do need an indexed structure that you know will always fit entirely within available memory. SQLite is extremely smart about its I/O, especially when you wrap everything (including reads ...) into transactions, as you should. It will very efficiently keep things in memory as it is manipulating data structures that fundamentally live on external storage, and yet it will never exhaust memory (nor, take too much of it). I think that RAM really does work better as "a buffer" instead of being the primary place where data is stored ... especially in a virtual storage environment, where everything must be considered as "backed by external storage anyway."
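As an illustration of grouping work into transactions on an on-disk database (the table and file names are made up):

import sqlite3
from contextlib import closing

with closing(sqlite3.connect('mappings.db')) as conn:  # illustrative file name
    conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')

    # Group many writes into one transaction: "with conn:" commits on success
    # and rolls back on error, instead of paying for a commit per statement.
    with conn:
        for i in range(1000):
            conn.execute(
                'INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)',
                (f'key-{i}', f'value-{i}'),
            )

    # Reads go against the same on-disk connection.
    row = conn.execute('SELECT value FROM kv WHERE key = ?', ('key-42',)).fetchone()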

Accessing items from a dictionary using pickle efficiently in Python

I have a large dictionary mapping keys (which are strings) to objects. I pickled this large dictionary, and at certain times I want to pull out only a handful of entries from it. The dictionary usually has thousands of entries in total. When I load the dictionary using pickle, as follows:
from cPickle import load

# my dictionary from pickle, containing thousands of entries
mydict = load(open('mypickle.pickle', 'rb'))

# accessing only a handful of entries here
for entry in relevant_entries:
    # find relevant entry
    value = mydict[entry]
I notice that it can take up to 3-4 seconds to load the entire pickle, which I don't need, since I access only a tiny subset of the dictionary entries later on (shown above.)
How can I make it so pickle only loads those entries that I have from the dictionary, to make this faster?
Thanks.
Pickle serializes object hierarchies; it's not an on-disk store. As you have seen, you must unpickle the entire object to use it, which is of course wasteful. Use shelve, dbm or a database (SQLite) for on-disk storage.
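For example, a sketch of the SQLite route using the standard sqlite3 module, storing each value as a pickled BLOB so that single entries can be fetched without loading the rest (the file and table names are illustrative):

import pickle
import sqlite3

conn = sqlite3.connect('entries.db')  # illustrative file name
conn.execute('CREATE TABLE IF NOT EXISTS entries (key TEXT PRIMARY KEY, value BLOB)')

# Store each object individually, pickled into a BLOB column.
with conn:
    conn.execute(
        'INSERT OR REPLACE INTO entries (key, value) VALUES (?, ?)',
        ('some-key', pickle.dumps({'field': 123})),
    )

# Later: fetch and unpickle only the entries you actually need.
row = conn.execute('SELECT value FROM entries WHERE key = ?', ('some-key',)).fetchone()
value = pickle.loads(row[0]) if row else None
conn.close()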
You'll have to have "Ghost" objects, i.e. objects that are only placeholders and load themselves when accessed. This is a Difficult Issue, but it has been solved. You have two options: you can use the persistence library from ZODB, which helps with this, or you can just start using ZODB directly; problem solved.
http://www.zodb.org/
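A rough sketch of the ZODB route, assuming the ZODB and BTrees packages; the names are illustrative and the ZODB documentation has the authoritative API details:

import transaction
from ZODB import DB
from ZODB.FileStorage import FileStorage
from BTrees.OOBTree import OOBTree

db = DB(FileStorage('entries.fs'))  # single storage file; the name is illustrative
conn = db.open()
root = conn.root()

# Keep entries in a BTree so individual keys can be loaded on demand.
if 'entries' not in root:
    root['entries'] = OOBTree()
root['entries']['some-key'] = {'field': 123}
transaction.commit()

# Only the objects you actually touch are loaded from disk;
# untouched ones stay as "ghost" placeholders until accessed.
value = root['entries']['some-key']

conn.close()
db.close()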
If your objects are independent of each other, you could pickle and unpickle them individually, using their key as the filename; in some perverse way a directory is a kind of dictionary mapping filenames to files. This way it is simple to load only the relevant entries.
Basically you use an in-memory dictionary as a cache, and if the key you're looking for is missing, you try to load the corresponding file from the filesystem.
I'm not really saying you should do that. A database (ZODB, SQLite, other) is probably better for persistent storage.
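A minimal sketch of that directory-as-dictionary idea (the directory name is illustrative, and keys are assumed to be filesystem-safe):

import os
import pickle

CACHE_DIR = 'entries'   # illustrative directory name
_memory_cache = {}      # in-memory dictionary used as a cache in front of the files

os.makedirs(CACHE_DIR, exist_ok=True)

def save_entry(key, obj):
    """Pickle one object to its own file, named after its key."""
    with open(os.path.join(CACHE_DIR, key + '.pickle'), 'wb') as f:
        pickle.dump(obj, f)
    _memory_cache[key] = obj

def load_entry(key):
    """Serve from memory if possible, otherwise read just that one file."""
    if key not in _memory_cache:
        with open(os.path.join(CACHE_DIR, key + '.pickle'), 'rb') as f:
            _memory_cache[key] = pickle.load(f)
    return _memory_cache[key]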
