I'm looking for solution to store huge (to big for RAM) array-like/list-like object on HDD. So, basically I'm looking for key-value database with:
- integer (not string!) keys
- ability to store python objects (list of tuples). Appended object will never be changed. There is no relation betweeen objects in array.
- low memory usage (no caching). If I need to load 35235235th object, I want to load only it.
So, I could use SQLite and blobs, but I'm looking for something more elegant and very fast.
Sorry for my bad english. I'm using Python 3.
You could have a look at zodb http://www.zodb.org, which is a mature python db project. It has the btrees package, which provides the IOBTree, which is what you might be looking for.
Related
I have some code that I am working on that scrapes some data from a website, and then extracts certain key information from that website and stores it in an object. I create a couple hundred of these objects each day, each from unique url's. This is working quite well, however, I'm inexperienced in what options are available to me in Python for persistence and what would be best suited for my needs.
Currently I am using pickle. To do so, I am keeping all of these webpage objects and appending them in a list as new ones are created, then saving that list to a pickle (then reloading it whenever the list is to be updated). However, as i'm in the order of some GB of data, i'm finding pickle to be somewhat slow. It's not unworkable, but I'm wondering if there is a more well suited alternative. I don't really want to break apart the structure of my objects and store it in a sql type database, as its important for me to keep the methods and the data as a single object.
Shelve is one option I've been looking into, as my impression is then that I wouldn't have to unpickle and pickle all the old entries (just the most recent day that needs to be updated), but am unsure if this is how shelve works, and how fast it is.
So to avoid rambling on, my question is: what is the preferred persistence method for storing a large number of objects (all of the same type), to keep read/write speed up as the collection grows?
Martijn's suggestion could be one of the alternatives.
You may consider to store the pickle objects directly in a sqlite database which still can manage from the python standard library.
Use a StringIO object to convert between the database column and python object.
You didn't mention the size of each object you are pickling now. I guess it should stay well within sqlite's limit.
The python shelf module only seems to be allow string keys. Is there a way to use arbitrary typed (e.g. numeric) keys? May be an sqlite-backed dictionary?
Thanks!
Why not convert your keys to strings? Numeric keys should be pretty easy to do this with.
You can serialize on the fly (via pickle or cPickle, like shelve.py does) every key, as well as every value. It's not really worth subclassing shelve.Shelf since you'd have to subclass almost every method -- for once, I'd instead recommend copying shelve.py into your own module and editing it to suit. That's basically like coding your new module from scratch but you get a working example to show you the structure and guidelines;-).
sqlite has no real advantage in a sufficiently general case (where the keys could be e.g. arbitrary tuples, of different arity and types for every entry) -- you're going to have to serialize the keys anyway to make them homogeneous. Still, nothing stops you from using sqlite, e.g. to keep several "generalized shelves" into a single file (different tables of the same sqlite DB) -- if you care about performance you should measure it each way, though.
I think you want to overload the [] operator. You can do it by defining the __getitem__ method.
I ended up subclassing the DbfilenameShelf from the shelve-module. I made a shelf which automatically converts non-string-keys into string-keys and returns them in original form when queried. It works well for Python's standard immutable objects: int, float, string, tuple, boolean.
It can be found in: https://github.com/North-Guard/simple_shelve
What is the most efficient method to store a Python dictionary on the disk? The only methods I know of right now are plain-text and the pickle module.
Edit: Sorry for not being very clear. By efficient I meant fastest execution speed. The dictionary will contain mutable objects that will hold information to be parsed and modified.
shelve is pretty nice as well
or this persistent dictionary recipe
for a convenient method that keeps your objects synchronized with storage, there's the ORM SQLAlchemy for python
if you just need a way to store string values by some key, theres the dbm.ndbm and dbm.gnu modules.
if you need a hyper efficient, distributed key value cache, something like memcached for python...
JSON and YAML work well, also.
Depends on what you mean by "efficient"? Size of file? Time required? Amount of code you need to write?
You have the timeit module available to determine what meets your specific criteria for "efficient".
I am building an application to distribute to fellow academics. The application will take three parameters that the user submits and output a list of dates and codes related to those events. I have been building this using a dictionary and intended to build the application so that the dictionary loaded from a pickle file when the application called for it. The parameters supplied by the user will be used to lookup the needed output.
I selected this structure because I have gotten pretty comfortable with dictionaries and pickle files and I see this going out the door with the smallest learning curve on my part. There might be as many as two million keys in the dictionary. I have been satisfied with the performance on my machine with a reasonable subset. I have already thought through about how to break the dictionary apart if I have any performance concerns when the whole thing is put together. I am not really that worried about the amount of disk space on their machine as we are working with terabyte storage values.
Having said all of that I have been poking around in the docs and am wondering if I need to invest some time to learn and implement an alternative data storage file. The only reason I can think of is if there is an alternative that could increase the lookup speed by a factor of three to five or more.
The standard shelve module will give you a persistent dictionary that is stored in a dbm style database. Providing that your keys are strings and your values are picklable (since you're using pickle already, this must be true), this could be a better solution that simply storing the entire dictionary in a single pickle.
Example:
>>> import shelve
>>> d = shelve.open('mydb')
>>> d['key1'] = 12345
>>> d['key2'] = value2
>>> print d['key1']
12345
>>> d.close()
I'd also recommend Durus, but that requires some extra learning on your part. It'll let you create a PersistentDictionary. From memory, keys can be any pickleable object.
To get fast lookups, use the standard Python dbm module (see http://docs.python.org/library/dbm.html) to build your database file, and do lookups in it. The dbm file format may not be cross-platform, so you may want to to distrubute your data in Pickle or repr or JSON or YAML or XML format, and build the dbm database the user runs your program.
How much memory can your application reasonably use? Is this going to be running on each user's desktop, or will there just be one deployment somewhere?
A python dictionary in memory can certainly cope with two million keys. You say that you've got a subset of the data; do you have the whole lot? Maybe you should throw the full dataset at it and see whether it copes.
I just tested creating a two million record dictionary; the total memory usage for the process came in at about 200MB. If speed is your primary concern and you've got the RAM to spare, you're probably not going to do better than an in-memory python dictionary.
See this solution at SourceForge, esp. the "endnotes" documentation:
y_serial.py module :: warehouse Python objects with SQLite
"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."
http://yserial.sourceforge.net
Here are three things you can try:
Compress the pickled dictionary with zlib. pickle.dumps(dict).encode("zlib")
Make your own serializing format (shouldn't be too hard).
Load the data in a sqlite database.
I was running some dynamic programming code (trying to brute-force disprove the Collatz conjecture =P) and I was using a dict to store the lengths of the chains I had already computed. Obviously, it ran out of memory at some point. Is there any easy way to use some variant of a dict which will page parts of itself out to disk when it runs out of room? Obviously it will be slower than an in-memory dict, and it will probably end up eating my hard drive space, but this could apply to other problems that are not so futile.
I realized that a disk-based dictionary is pretty much a database, so I manually implemented one using sqlite3, but I didn't do it in any smart way and had it look up every element in the DB one at a time... it was about 300x slower.
Is the smartest way to just create my own set of dicts, keeping only one in memory at a time, and paging them out in some efficient manner?
The 3rd party shove module is also worth taking a look at. It's very similar to shelve in that it is a simple dict-like object, however it can store to various backends (such as file, SVN, and S3), provides optional compression, and is even threadsafe. It's a very handy module
from shove import Shove
mem_store = Shove()
file_store = Shove('file://mystore')
file_store['key'] = value
Hash-on-disk is generally addressed with Berkeley DB or something similar - several options are listed in the Python Data Persistence documentation. You can front it with an in-memory cache, but I'd test against native performance first; with operating system caching in place it might come out about the same.
The shelve module may do it; at any rate, it should be simple to test. Instead of:
self.lengths = {}
do:
import shelve
self.lengths = shelve.open('lengths.shelf')
The only catch is that keys to shelves must be strings, so you'll have to replace
self.lengths[indx]
with
self.lengths[str(indx)]
(I'm assuming your keys are just integers, as per your comment to Charles Duffy's post)
There's no built-in caching in memory, but your operating system may do that for you anyway.
[actually, that's not quite true: you can pass the argument 'writeback=True' on creation. The intent of this is to make sure storing lists and other mutable things in the shelf works correctly. But a side-effect is that the whole dictionary is cached in memory. Since this caused problems for you, it's probably not a good idea :-) ]
Last time I was facing a problem like this, I rewrote to use SQLite rather than a dict, and had a massive performance increase. That performance increase was at least partially on account of the database's indexing capabilities; depending on your algorithms, YMMV.
A thin wrapper that does SQLite queries in __getitem__ and __setitem__ isn't much code to write.
With a little bit of thought it seems like you could get the shelve module to do what you want.
I've read you think shelve is too slow and you tried to hack your own dict using sqlite.
Another did this too :
http://sebsauvage.net/python/snyppets/index.html#dbdict
It seems pretty efficient (and sebsauvage is a pretty good coder). Maybe you could give it a try ?
You should bring more than one item at a time if there's some heuristic to know which are the most likely items to be retrieved next, and don't forget the indexes like Charles mentions.
For simple use cases sqlitedict
can help. However when you have much more complex databases you might one to try one of the more upvoted answers.
It isn't exactly a dictionary, but the vaex module provides incredibly fast dataframe loading and lookup that is lazy-loading so it keeps everything on disk until it is needed and only loads the required slices into memory.
https://vaex.io/docs/tutorial.html#Getting-your-data-in