Fast, searchable dict storage for Python

Currently I use SQLite (with SQLAlchemy) to store about 5000 dict objects. Each dict corresponds to a PyPI entry, with keys such as name, version and summary; sometimes 'description' can be as big as the project documentation.
Writing these entries (from JSON) back to the disk (SQLite format) takes several seconds, and it feels slow.
Writing happens as frequently as once a day, but reading/searching for a particular entry by a key (usually name or description) happens very often.
Just like apt-get.
Is there a storage library for use with Python that will suit my needs better than SQLite?

Did you put indices on name and description? Searching 5000 indexed entries should be essentially instantaneous. Of course, ORMs will make your life much harder, as they usually do (even relatively good ones such as SQLAlchemy), but try "raw sqlite" and it absolutely should fly.
Writing just the updated entries (again with real SQL) should also be basically instantaneous. Ideally a single UPDATE statement should do it, but even a thousand should be no real problem; just make sure to turn off autocommit at the start of the loop (and, if you want, turn it back on afterwards).
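For concreteness, here is a minimal sketch of the "raw sqlite" approach with the standard sqlite3 module. The table and column names are just illustrative placeholders for your PyPI entries, and the whole batch write happens in a single transaction:

    import json
    import sqlite3

    conn = sqlite3.connect("pypi.db")

    # One-time schema setup: index the columns you search on.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS packages (
            name        TEXT PRIMARY KEY,
            version     TEXT,
            summary     TEXT,
            description TEXT
        );
        CREATE INDEX IF NOT EXISTS idx_packages_summary ON packages(summary);
    """)

    def write_entries(entries):
        # One commit for the whole batch instead of one per row.
        with conn:
            conn.executemany(
                "INSERT OR REPLACE INTO packages VALUES (?, ?, ?, ?)",
                ((e["name"], e["version"], e.get("summary"), e.get("description"))
                 for e in entries),
            )

    def find_by_name(name):
        row = conn.execute(
            "SELECT name, version, summary, description FROM packages WHERE name = ?",
            (name,),
        ).fetchone()
        return dict(zip(("name", "version", "summary", "description"), row)) if row else None

    # entries = json.load(open("pypi_dump.json"))  # a list of dicts
    # write_entries(entries)
    # print(find_by_name("requests"))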

It might be overkill for your application, but you ought to check out schema-free/document-oriented databases. Personally I'm a fan of CouchDB. Basically, rather than storing records as rows in a table, something like CouchDB stores key-value pairs, and then (in the case of CouchDB) you write views in JavaScript to cull the data you need. These databases are usually easier to scale than relational databases, and in your case may be much faster, since you don't have to hammer your data into a shape that fits a relational database. On the other hand, it means there is another service running.

Given the approximate number of objects stated (around 5,000), SQLite is probably not the cause of the slowness. It's the intermediate steps; for example, the JSON handling, or possibly non-optimal use of SQLAlchemy.
Try this out (fairly fast even for a million objects):
y_serial.py module :: warehouse Python objects with SQLite
"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."
http://yserial.sourceforge.net
The y_serial search on your keys is done using regular-expression ("regex") code on the SQLite side, not in Python, so there's another substantial speed improvement.
Let us know how it works out.

I'm solving a very similar problem for myself right now, using Nucular, which might suit your needs. It's a file-system-based storage and seems very fast indeed. (It comes with an example app that indexes the whole Python source tree.) It's concurrency-safe, requires no external libraries and is pure Python. It searches rapidly and has powerful full-text search, indexing and so on; it's kind of a specialised, in-process, native Python dict store in the manner of the trendy CouchDB and MongoDB, but much lighter.
It does have limitations, though - it can't store or query on nested dictionaries, so not every JSON type can be stored in it. Moreover, although its text searching is powerful, its numerical queries are weak and unindexed. Nonetheless, it may be precisely what you are after.

Related

How to store a dynamic python dictionary in MySQL database?

I am doing a mini-project on a web crawler plus search engine. I already know how to scrape data using the Scrapy framework. Now I want to do indexing. For that, I figured a Python dictionary is the best option for me. I want the mapping to be from the name/title of an object (a string) to the object itself (a Python object).
Now the problem is that I don't know how to store a dynamic dict in a MySQL database, and I definitely want to store this dict as it is!
Some example commands showing how to go about doing that would be very much appreciated!
As others have already pointed out, a NoSQL solution would be more natural in this case. And since we are talking about schemaless dictionaries, a JSON document database such as MongoDB would be a good fit.
There is a scrapy-mongodb package that provides a pipeline into a MongoDB database.
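As a rough illustration of how directly a dict maps onto MongoDB (this is plain pymongo rather than the scrapy-mongodb pipeline itself; the database and collection names are made up):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    pages = client["crawler"]["pages"]   # database/collection names are illustrative

    # Store an arbitrary JSON-serializable dict as-is.
    pages.insert_one({"title": "Example page",
                      "url": "http://example.com",
                      "meta": {"depth": 2, "links": ["a", "b"]}})

    # Look it up later by its title, much like a keyed dict.
    doc = pages.find_one({"title": "Example page"})
    print(doc["meta"]["depth"])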
If you want to store dynamic data in a database, here are a few options. It really depends on what you need out of this.
First, you could go with a NoSQL solution, like MongoDB. NoSQL allows you to store unstructured data in a database without an explicit data schema. It's a pretty big topic, with far better guides/information than I could provide you. NoSQL may not be suited to the rest of your project, though.
Second, if possible, you could switch to PostgreSQL and use its HSTORE column type (unavailable in MySQL). An HSTORE column is designed to store a bunch of key/value pairs, and it supports BTREE, GIST, GIN and HASH indexing. You'll need to make sure you're familiar with PostgreSQL and how it differs from MySQL; some of your other SQL may no longer work as you'd expect.
Third, you can serialize the data, then store the serialized entity. Both json and pickle come to mind. The viability and reliability of this will of course depend on how complicated your dictionaries are. Serializing data, especially with pickle, can be dangerous, so make sure you're familiar with how it works from a security perspective. (A rough sketch of this option follows after this answer.)
Fourth, use an "Entity-Attribute-Value" table. This mimics a dictionary's key/value pairing. Essentially, you create a new table with three columns: "Related_Object_ID", "Attribute" and "Value". You lose a lot of the object metadata you'd normally get in a table, and SQL queries can become much more complicated.
Any of these options can be a double-edged sword. Make sure you've read up on the downsides of whichever option you want to go with, or, in looking into the options more, perhaps you'll find something that better suits you and your project.
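To make the third option concrete, here is a hedged sketch of serializing each dict to JSON and storing it in a TEXT column. It assumes the MySQLdb/mysqlclient driver, and the table, column and connection details are placeholders:

    import json
    import MySQLdb  # the mysqlclient driver; any DB-API driver works the same way

    conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="crawler")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS objects (
            name VARCHAR(255) PRIMARY KEY,
            data TEXT NOT NULL
        )
    """)

    def save(name, obj):
        # Serialize the dict and upsert it under its name/title.
        cur.execute("REPLACE INTO objects (name, data) VALUES (%s, %s)",
                    (name, json.dumps(obj)))
        conn.commit()

    def load(name):
        cur.execute("SELECT data FROM objects WHERE name = %s", (name,))
        row = cur.fetchone()
        return json.loads(row[0]) if row else None

    save("my_page", {"title": "Example", "links": ["a", "b"]})
    print(load("my_page"))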

Searchable database-like structure for python objects?

I am looking for a database-like structure which can hold Python objects, each of which has various fields that can be searched. Some searching yields ZODB, but I'm not sure that's what I want.
To explain, I have objects which can be written to/read from disk in a given file format. I would like a way to organize and search many sets of those objects. Currently, I am storing them in nested dictionaries, which are populated based on the structure of the file system and the file names.
I would like a more database-like approach, but I don't think I need a full database. I would like to be able to save the structure to disk, but I don't want to interface with a server or anything like that.
Perhaps I just want to use NumPy's structured arrays? http://docs.scipy.org/doc/numpy/user/basics.rec.html
Python comes with a database... http://docs.python.org/library/sqlite3.html
I think what you are looking for is Python's shelve module, if you are just looking to save the data to disk. It is also almost 100% compatible with the dict interface, so you will have to change very little code.
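For example (the file name is arbitrary):

    import shelve

    # A shelf behaves like a dict that lives on disk.
    with shelve.open("objects.db") as store:
        store["spam"] = {"version": "1.0", "summary": "an example entry"}

    with shelve.open("objects.db") as store:
        print(store["spam"]["summary"])
        print("spam" in store, len(store))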
ZODB will do what you are looking for. You can structure your data there exactly as you have it already. Instead of using built-in dicts, you'll have to switch to persistent mappings, and if they tend to be large, you should probably start with BTrees right from the beginning. Here is a good starting point for further reading, with simple examples and tutorials. As you'll see, ZODB doesn't provide an SQL-like access language, but since you have already structured your data to fit into dictionaries, you probably won't miss that anyway. Compared to plain shelve, ZODB can also compress your data on the fly before writing it to disk. In general, it provides many more options than shelve and shows no significant downside compared to it, so I can warmly recommend it. I switched to it a year ago because typical relational DBs together with an ORM performed horribly slowly, especially for insert-or-update operations. With ZODB I'm about an order of magnitude faster here, as there is no communication overhead between the application and the database: the database is already part of your application.
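For orientation, a minimal ZODB sketch might look roughly like this (the file name, keys and values are made up; check the ZODB documentation for the exact setup, as the API has shifted a little between versions):

    import ZODB
    import ZODB.FileStorage
    import transaction
    from persistent.mapping import PersistentMapping
    from BTrees.OOBTree import OOBTree

    # Open (or create) a file-based database; no server process is involved.
    db = ZODB.DB(ZODB.FileStorage.FileStorage("data.fs"))
    connection = db.open()
    root = connection.root()

    # Use a BTree instead of a plain dict if the mapping may grow large.
    if "objects" not in root:
        root["objects"] = OOBTree()

    root["objects"]["sample.txt"] = PersistentMapping({"size": 1024, "tags": ["raw"]})
    transaction.commit()

    print(dict(root["objects"]["sample.txt"]))
    db.close()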

Can a Python list, set or dictionary be implemented invisibly using a database?

The Python native capabilities for lists, sets & dictionaries totally rock. Is there a way to continue using the native capability when the data becomes really big? The problem I'm working on involves matching (intersection) of very large lists. I haven't pushed the limits yet -- actually I don't really know what the limits are -- and I don't want to be surprised by a big reimplementation after the data grows as expected.
Is it reasonable to deploy on something like Google App Engine that advertises no practical scale limit and continue using the native capability as-is forever and not really think about this?
Is there some Python magic that can hide whether the list, set or dictionary is in Python-managed memory vs. in a DB -- so physical deployment of data can be kept distinct from what I do in code?
How do you, Mr. or Ms. Python Super Expert, deal with lists, sets & dicts as data volume grows?
I'm not quite sure what you mean by native capabilities for lists, sets & dictionaries. However, you can create classes that emulate container types and sequence types by defining some methods with special names. That means you could create a class that behaves like a list but stores its data in a SQL database or the GAE datastore. Simply speaking, this is what an ORM does. However, mapping objects to a database is very complicated, and it is probably not a good idea to invent your own ORM but rather to use an existing one.
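As a rough illustration of the "emulate a container type" idea (not any particular ORM), here is a dict-like class backed by a SQLite table; the table name and the JSON encoding of values are arbitrary choices:

    import json
    import sqlite3
    from collections.abc import MutableMapping

    class SQLiteDict(MutableMapping):
        """A dict-like object whose items live in a SQLite table."""

        def __init__(self, path="mapping.db"):
            self.conn = sqlite3.connect(path)
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

        def __getitem__(self, key):
            row = self.conn.execute(
                "SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
            if row is None:
                raise KeyError(key)
            return json.loads(row[0])

        def __setitem__(self, key, value):
            with self.conn:
                self.conn.execute(
                    "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                    (key, json.dumps(value)))

        def __delitem__(self, key):
            with self.conn:
                if self.conn.execute(
                        "DELETE FROM kv WHERE key = ?", (key,)).rowcount == 0:
                    raise KeyError(key)

        def __iter__(self):
            for (key,) in self.conn.execute("SELECT key FROM kv"):
                yield key

        def __len__(self):
            return self.conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]

    d = SQLiteDict()
    d["answer"] = {"value": 42}
    print(d["answer"], len(d), "answer" in d)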
I'm afraid there is no one-size-fits-all solution. In particular, GAE is not some kind of Magic Fairy Dust you can sprinkle on your code to make it scale. There are several limitations you have to keep in mind to create an application that can scale. Some of them are general, like computational complexity; others are specific to the environment your code runs in. E.g. on GAE the maximum response time is limited to 30 seconds, and querying the datastore works differently than on other databases.
It's hard to give any concrete advice without knowing your specific problem, but I doubt that GAE is the right solution.
In general, if you want to work with large datasets, you either have to keep that in mind from the start or you will have to rework your code, algorithms and data structures as the datasets grow.
You are describing my dreams! However, I think you cannot do it. I always wanted something just like LINQ for Python, but AFAIK the language does not permit using Python syntax for native database operations. If it were possible, you could just write code using lists and then use the same code for retrieving data from a database.
I would not recommend writing a lot of code based only on lists and sets, because it will not be easy to migrate it to a scalable platform. I recommend using something like an ORM. GAE even has its own ORM-like system, and you can use others such as SQLAlchemy and SQLObject with e.g. SQLite.
Unfortunately, you cannot use awesome stuff such as list comprehensions to filter data from the database. Sure, you can filter data after it has been fetched from the DB, but you'll still need to build a query in some SQL-like language for querying objects, or else return a lot of objects from the database.
OTOH, there is Buzhug, a curious non-relational database system written in Python which allows the use of natural Python syntax. I have never used it and I do not know if it is scalable so I would not put my money on it. However, you can test it and see if it can help you.
You can use an ORM (Object-Relational Mapping): a class gets a table, an object gets a row. I like the Django ORM. You can use it for non-web apps, too. I have never used it on GAE, but I think it is possible.

Is there a B-Tree Database or framework in Python?

I heard that B-tree databases are faster than hash tables, so I thought of using a B-tree database for my project. Is there an existing framework in Python which lets us use such a data structure, or will I have to code it from scratch?
The only reason to choose a B-tree over a hash table, either in memory or with block storage (as in a database), is to support queries other than equality. A B-tree permits you to perform range queries with good performance. Many key-value stores (such as Berkeley DB) don't make this externally visible, though, because they still hash the keys, but it still lets you iterate over the whole dataset quickly and stably (iterators remain valid even if there are adds or deletes, or the tree must be rebalanced).
If you don't need range queries and you don't need concurrent iteration, then you don't need B-trees; use a hash table, it will be faster at any scale.
Edit: I have since had occasion to actually need the above; for this, the blist package seems to be the most complete implementation of a sorted-container library.
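To make the range-query point concrete without any external library, the standard bisect module over a sorted key list does the same job in memory (blist's sorted containers give you the same idea with cheaper insertion):

    import bisect

    # Keys kept in sorted order, values in a parallel dict: a toy ordered index.
    keys = []
    data = {}

    def insert(key, value):
        if key not in data:
            bisect.insort(keys, key)
        data[key] = value

    def range_query(low, high):
        """All items with low <= key <= high, in key order."""
        start = bisect.bisect_left(keys, low)
        stop = bisect.bisect_right(keys, high)
        return [(k, data[k]) for k in keys[start:stop]]

    for name in ["ant", "bee", "cat", "dog", "eel"]:
        insert(name, len(name))

    print(range_query("bee", "dog"))   # [('bee', 3), ('cat', 3), ('dog', 3)]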
You should really check out ZODB.
http://www.zodb.org/en/latest/
I wrote a monograph about it long ago, though it's in Spanish: http://sourceforge.net/projects/banta/files/Labs/zodb/Monografia%20-%20ZODB.pdf/download
Information in English is all over the place.
Program what you are trying to do first, then optimize if needed. Period.
EDIT:
http://pypi.python.org/pypi/blist
A drop-in replacement for Python's built-in list.
SQLite3 uses B+ trees internally, but it sounds like you may want a key-value store. Try Berkeley DB for that. If you don't need transactions, try HDF5. If you want a distributed key-value store, there is also http://scalien.com/keyspace/, but that is a server-client type of system, which opens the door to all sorts of NoSQL key-value stores.
All of these systems will be O(log(n)) for insertion and retrieval, so they will probably be slower than the hash tables that you're currently using.
Kyoto Cabinet offers a hash database, so that may be more of what you're looking for, since it should be O(1) for insertion and retrieval; but you can't do in-order traversal if you need that (although since you're currently using hash tables, this shouldn't be an issue).
http://fallabs.com/kyotocabinet/
If you're looking for performance, you will need to have the speed critical items implemented in a compiled language and then have a wrapper API in Python.
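If all you need is a simple on-disk key-value store and would rather avoid extra dependencies, the standard library's dbm module (a hash-based store in the ndbm/gdbm family, so no range queries) is a reasonable baseline to try before any of the above; the file name and keys below are arbitrary:

    import dbm
    import json

    # Keys and values are stored as bytes; serialize structured values yourself.
    with dbm.open("cache.db", "c") as db:
        db["alpha"] = json.dumps({"score": 1.5})
        db["beta"] = json.dumps({"score": 2.5})

    with dbm.open("cache.db", "r") as db:
        print(json.loads(db["alpha"]))
        print(db.keys())   # keys come back as bytes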
Here is a good pure-Python B-tree implementation. You can adapt it if needed.
You might want to have a look at mxBeeBase which is part of the eGenix mx Base Distribution. It includes a fast on-disk B+Tree implementation and provides storage classes which allow building on-disk dictionaries or databases in Python.

Memory issues: Should I be writing to file/database if I'm using swap? (Python)

I'm creating and processing a very large data set, with about 34 million data points, and I'm currently storing them in Python dictionaries in memory (about 22,500 dictionaries, with 15 dictionaries in each of 1,588 class instances). While I'm able to manage this all in memory, I'm using up all of my RAM and most of my swap.
I need to be able to first generate all of this data and then do analysis on select portions of it at a time. Would it be beneficial from an efficiency standpoint to write some of this data to file, or to store it in a database? Or am I better off just taking the efficiency hit that comes with using my swap space? If I should be writing to a file/database, are there any Python tools that you would recommend for doing so?
Get a relational database, fast! Or a whole lot more RAM.
If you're using Python, then start with Python Database Programming. SQLite would be a choice, but I'd suggest MySQL based on the amount of data you're dealing with. If you want an object-oriented approach to storing your data, you might want to look at SQLAlchemy, but you'll probably get more efficiency if you end up mapping each of your object classes to a table yourself and just coping with rows and columns.
Because you will be looking at "select portions", your application will be able to make better use of core than Virtual Memory will. VM is convenient, but - by definition - kinda stupid about locality of reference.
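If you do take the SQLAlchemy suggestion above, a minimal declarative mapping of one class to one table might look roughly like this (recent SQLAlchemy; the model, column names and the SQLite URL are placeholders, so swap in a MySQL URL as appropriate):

    from sqlalchemy import Column, Float, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class DataPoint(Base):
        __tablename__ = "data_points"
        id = Column(Integer, primary_key=True)
        series = Column(String(64), index=True)   # index the field you filter on
        value = Column(Float)

    engine = create_engine("sqlite:///points.db")  # or e.g. "mysql://user:pass@host/db"
    Base.metadata.create_all(engine)
    Session = sessionmaker(bind=engine)

    session = Session()
    session.add_all(DataPoint(series="temperature", value=float(v)) for v in range(1000))
    session.commit()

    # Pull back only the portion you want to analyse.
    subset = session.query(DataPoint).filter_by(series="temperature").limit(10).all()
    print([p.value for p in subset])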
Use a database.
I'd probably start with the sqlite3 module, on the basis of simplicity, unless or until I find that it is a bottleneck.
If you have this data in Python data structures already, assuming you're not doing a lot of in-memory indexing (more than the obvious dictionary keys index), you really don't want to use a relational database - you'll pay a considerable performance penalty for no particular benefit.
You just need to get your already key-value-pair data out of memory, not change its format. You should look into key-value stores like BDB, Voldemort, MongoDB, or Scalaris (just to name a few; some are more involved and featureful than others, but all should easily handle your dataset), or, for a dataset that you think might grow even larger or more complex, into systems like Cassandra, Riak, or CouchDB (among others). All of these systems will offer you vastly superior performance to a relational database and map much more directly to an in-memory data model.
All that being said, of course, if your dataset really could be more performant by leveraging the benefits of a relational database (complex relationships, multiple views, etc.), then go for it, but you shouldn't use a relational database if all you're trying to do is get your data structures out of memory.
(It's also possible that just marshaling/pickling your data in segments and managing it yourself would offer better performance than a relational database, assuming your access pattern makes paging in/out a relatively infrequent event. It's a long shot, but if you're just holding old data around and no one really looks at it, you might as well throw it to disk yourself.)
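For that last point, here is a toy sketch of managing the paging yourself: pickle each segment of the dataset to its own file and load it back only when it is analysed (the directory layout and segment IDs are made up for illustration):

    import pickle
    from pathlib import Path

    SEGMENT_DIR = Path("segments")
    SEGMENT_DIR.mkdir(exist_ok=True)

    def dump_segment(segment_id, records):
        """Write one chunk of the dataset to its own file, freeing the RAM it used."""
        with open(SEGMENT_DIR / f"{segment_id}.pkl", "wb") as f:
            pickle.dump(records, f, protocol=pickle.HIGHEST_PROTOCOL)

    def load_segment(segment_id):
        """Bring a single chunk back into memory only when it is needed."""
        with open(SEGMENT_DIR / f"{segment_id}.pkl", "rb") as f:
            return pickle.load(f)

    dump_segment("batch_0001", [{"x": i, "y": i * i} for i in range(100000)])
    subset = load_segment("batch_0001")
    print(len(subset), subset[0])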
