I'm migrating a GAE/Java app to Python (non-GAE) due to the new pricing, so I'm getting a small server and I would like to find a database that fits the following requirements:
Low memory usage (or at least tunable/predictable memory usage)
Fastest querying capability for simple document/tree-like data identified by key (I don't care about write performance, and I assume it will have indexes)
Bindings with Pypy 1.6 compatibility (or Python 2.7 at least)
My data goes something like this:
Id: short key string
Title
Creators: an array of another data structure which has an id (used as key), a name, a site address, etc.
Tags: array of tags. Each of them can have multiple parent tags, a name, an id, etc.
License: a data structure which describes its license (CC, GPL, ... you name it) with name, associated URL, etc.
Addition time: when it was added to our site.
Translations: pointers to other entries that are translations of one creation.
My queries are very simple. Usual cases are:
Filter by tag ordered by addition time.
Select a few (pagination) ordered by addition time.
(Maybe, not done already) filter by creator.
(Not done but planned) some autocomplete features in forms, so I'm going to need to search whether some fields contain a substring ('LIKE' queries).
The data volume is not big. Right now I have about 50MB of data but I'm planning to have a huge dataset around 10GB.
Also, I want to rebuild this from scratch, so I'm open to any option. What database do you think can meet my requirements?
Edit: I want to do some benchmarks around different options and share the results. I have selected, so far, MongoDB, PostgreSQL, MySQL, Drizzle, Riak and Kyoto Cabinet.
The path of least resistance for migrating an app engine application will probably be using AppScale, which implements a major portion of the app engine API. In particular, you might want to use the HyperTable data-store, which closely mirrors the Google App Engine datastore.
Edit: ok, so you're going for a redesign. I'd like to go over some of the points you make in your question.
Low memory usage
That's pretty much the opposite of what you want in a database: you want as much of your dataset in core memory as possible. This could mean tuning the dataset itself to fit efficiently, or adding memcached nodes so that you can spread the dataset across several hosts, with each host holding a small enough fraction of the dataset that it fits in core.
To drive this point home, consider that reading a value from ram is about 1000 times faster than reading it from disk; A database that can satisfy every query from core can handle 10 times the workload compared with a database that has to visit the disk for just 1% of its queries.
I'm planning to have a huge dataset around 10GB.
I don't think that you could call 10GB a 'huge dataset'. In fact, that's something that could probably fit in the RAM of a reasonably large database server; you wouldn't need more than one memcached node, much less additional persistence nodes (typical disk sizes are in the terabytes, 100 times larger than this expected dataset).
Based on this information, I would definitely advise using a mature database product like PostgreSQL, which would give you plenty of performance for the data you're describing and easily provides all of the features you're talking about. If the time comes that you need to scale past what PostgreSQL can provide, you'll have a real workload to analyse to know what the bottlenecks really are.
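The queries in the question all map to one-line SQL statements. Here is a minimal sketch using the stdlib sqlite3 module as a stand-in (the SQL is essentially the same in PostgreSQL); the schema, table names, and sample rows are invented for illustration, not taken from the actual app:

```python
import sqlite3

# Hypothetical schema mirroring the data described in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE creations (id TEXT PRIMARY KEY, title TEXT, added_at TEXT);
    CREATE TABLE creation_tags (creation_id TEXT, tag TEXT);
    CREATE INDEX idx_tags ON creation_tags (tag);
    CREATE INDEX idx_added ON creations (added_at);
""")
conn.executemany("INSERT INTO creations VALUES (?, ?, ?)", [
    ("a1", "First work", "2011-01-01"),
    ("b2", "Second work", "2011-02-01"),
    ("c3", "Third work", "2011-03-01"),
])
conn.executemany("INSERT INTO creation_tags VALUES (?, ?)", [
    ("a1", "music"), ("b2", "music"), ("c3", "video"),
])

# Filter by tag, ordered by addition time:
rows = conn.execute("""
    SELECT c.id, c.title FROM creations c
    JOIN creation_tags t ON t.creation_id = c.id
    WHERE t.tag = 'music' ORDER BY c.added_at
""").fetchall()

# Pagination, newest first:
page = conn.execute(
    "SELECT id FROM creations ORDER BY added_at DESC LIMIT 2 OFFSET 0"
).fetchall()

# Autocomplete via a LIKE prefix match:
matches = conn.execute(
    "SELECT title FROM creations WHERE title LIKE 'Sec%'"
).fetchall()
```

With indexes on the tag and addition-time columns, all three query shapes stay fast well past the 10GB mark on any mature SQL engine.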
I would recommend PostgreSQL, only because it does what you want, can scale, is fast, is rather easy to work with, and is stable.
It is exceptionally fast at the example queries given, and could be even faster with document querying.
Related
I am coding a psychology experiment in Python. I need to store user information and scores somewhere, and I need it to work as a web application (and be secure).
Don't know much about this - I'm considering XML databases, BerkeleyDB, sqlite, an OpenOffice spreadsheet, and I'm very interested in the python "shelve" library.
(most of my info coming from this thread: http://developers.slashdot.org/story/08/05/20/2150246/FOSS-Flat-File-Database)
DATA: I figure that I'm going to have maximally 1000 users. For each user I've got to store...
Username / Pass
User detail fields (for a simple profile)
User scores on the exercise (2 datapoints: each trial gets a score (correct/incorrect/timeout) and has an associated number from 0.1 to 1.0 that I need to record)
Metadata about the trials (when, who, etc.)
Results of data analysis for user
VERY rough estimate, each user generates 100 trials / day. So maximum of 10k datapoints / day. It needs to run that way for about 3 months, so about 1m datapoints. Safety multiplier 2x gives me a target of a database that can handle 2m datapoints.
((note: I could either store trial response data as individual data points, or group trials into Python list objects of varying length (user "sessions"). The latter would dramatically bring down the number of database entries, though not the amount of data. Does it matter? How?))
I want a solution that will work (at least) until I get to this 1000 users level. If my program is popular beyond that level, I'm alright with doing some work modding in a beefier DB. Also reiterating that it must be easily deployable as a web application.
Beyond those basic requirements, I just want the easiest thing that will make this work. I'm pretty green.
Thanks for reading
Tr3y
SQLite can certainly handle that amount of data; it has a very large userbase, with a few very well-known users, on all the major platforms; it's fast and light, and there are awesome GUI clients that allow you to browse and extract/filter data with a few clicks.
SQLite won't scale indefinitely, of course, but severe performance problems begin only when simultaneous inserts are needed, which I would guess is a problem appearing several orders of magnitude beyond your prospected load.
I've been using it for a few years now, and I have never had a problem with it (although for larger sites I use MySQL). Personally I find that "Small. Fast. Reliable. Choose any three." (the tagline on SQLite's site) is quite accurate.
As for the ease of use... SQLite3 bindings (site temporarily down) are part of the Python standard library. Here you can find a small tutorial. Interestingly enough, simplicity is a design criterion for SQLite. From here:
Many people like SQLite because it is small and fast. But those qualities are just happy accidents. Users also find that SQLite is very reliable. Reliability is a consequence of simplicity. With less complication, there is less to go wrong. So, yes, SQLite is small, fast, and reliable, but first and foremost, SQLite strives to be simple.
There's a pretty spot-on discussion of when to use SQLite here. My favorite line is this:
Another way to look at SQLite is this: SQLite is not designed to replace Oracle. It is designed to replace fopen().
It seems to me that for your needs, SQLite is perfect. Indeed, it seems to me very possible that you will never need anything else:
With the default page size of 1024 bytes, an SQLite database is limited in size to 2 terabytes (2^41 bytes).
It doesn't sound like you'll have that much data at any point.
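To show how little is involved, here is a minimal sketch of the kind of schema this experiment might use, with the sqlite3 module from the standard library; the table and column names are assumptions for illustration, not from the question:

```python
import sqlite3

# Illustrative schema for per-trial experiment data.
db = sqlite3.connect(":memory:")  # use a filename instead for persistence
db.execute("""CREATE TABLE trials (
    username TEXT, outcome TEXT, score REAL, recorded_at TEXT)""")

# Each trial is one row: outcome plus its associated 0.1-1.0 number.
db.executemany(
    "INSERT INTO trials VALUES (?, ?, ?, ?)",
    [("tr3y", "correct", 0.9, "2010-06-01 10:00"),
     ("tr3y", "timeout", 0.1, "2010-06-01 10:01")],
)
db.commit()

# Analysis queries become one-liners, e.g. a per-user average score:
avg = db.execute(
    "SELECT AVG(score) FROM trials WHERE username = ?", ("tr3y",)
).fetchone()[0]
```

Storing one row per trial (rather than grouping trials into pickled session lists) keeps queries like this trivial; two million rows of this shape is well within SQLite's comfort zone.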
I would consider MongoDB. It's very easy to get started, and is built for multi-user setups (unlike SQLite).
It also has a much simpler model. Instead of futzing around with tables and fields, you simply take all the data in your form and stuff it in the database. Even if your form changes (oops, forgot a field) you won't need to change MongoDB.
Is it feasible to store data for a web application inside the program itself, e.g. as a large dictionary? The data would mostly be just a few hundred short-ish text blocks (roughly blog post size), and it will not be altered/added to at all by the users (although I would want to be able to update it myself every so often).
Up until now I've been looking at standard database storage solutions, but since (for this set of objects at least) they will not be modified, is it possible simply to store them as a dictionary? Or are there serious downsides I have not considered?
Thanks.
It is certainly possible. The amount of data you can store will be limited mostly by memory available.
If you are planning to perform database-like operations on the data then you are better off with an in-memory database like SQLite. If the data is picked up from a database and hashed then you might want to use Memcached. If you are limiting yourself to plain text and the data won't grow then you can stick to memory.
The approach will start losing its charm if the data will grow in the future. In that case you are better off with other solutions.
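If the data really is read-only, the dictionary approach amounts to nothing more than a module-level constant and a lookup function; a sketch, with invented keys and structure:

```python
# A module-level store for static content; keys and fields are
# illustrative, not from the original question.
POSTS = {
    "intro": {"title": "Welcome", "body": "A few hundred words..."},
    "faq":   {"title": "FAQ", "body": "Common questions..."},
}

def get_post(slug):
    """Return the post dict for a slug, or None if it is unknown."""
    return POSTS.get(slug)
```

Updating the content then just means editing this module and reloading the application, which matches the "I update it myself every so often" workflow described above.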
I'm creating and processing a very large data set, with about 34 million data points, and I'm currently storing them in python dictionaries in memory (about 22,500 dictionaries, with 15 dictionaries in each of 1588 class instances). While I'm able to manage this all in memory, I'm using up all of my RAM and most of my swap.
I need to be able to first generate all of this data, and then do analysis on select portions of it at a time. Would it be beneficial from an efficiency standpoint to write some of this data to file, or store it in a database? Or am I better off just taking the efficiency hit that comes with using my swap space? If I should be writing to a file or database, are there any Python tools that you would recommend?
Get a relational database, fast! Or a whole lot more RAM.
If you're using Python, then start with Python Database Programming. SQLite would be a choice, but I'd suggest MySQL based upon the amount of data you're dealing with. If you want an object-oriented approach to storing your data, you might want to look at SQLAlchemy, but you'll probably get more efficiency if you end up mapping each of your object classes to a table yourself and just coping with rows and columns.
Because you will be looking at "select portions", your application will be able to make better use of core than Virtual Memory will. VM is convenient, but - by definition - kinda stupid about locality of reference.
Use a database.
I'd probably start with the sqlite3 module on the basis of simplicity, unless or until I find that it is a bottleneck.
If you have this data in Python data structures already, assuming you're not doing a lot of in-memory indexing (more than the obvious dictionary keys index), you really don't want to use a relational database - you'll pay a considerable performance penalty for no particular benefit.
You just need to get your already key-value-pair data out of memory, not change its format. You should look into key-value stores like BDB, Voldemort, MongoDB, or Scalaris (just to name a few - some more involved and functional than others, but all should easily handle your dataset), or for a dataset that you think might grow even larger or more complex you can look into systems like Cassandra, Riak, or CouchDB (among others). ALL of these systems will offer you vastly superior performance to a relational database and more directly map to an in-memory data model.
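To make the key-value idea concrete, here is a minimal sketch using the stdlib shelve module as the simplest stand-in for the stores named above (it persists pickled values by key, so the data no longer has to live in RAM); the keys and values are invented for illustration:

```python
import os
import shelve
import tempfile

# Open a disk-backed key-value store; the path is illustrative.
path = os.path.join(tempfile.mkdtemp(), "dataset")
store = shelve.open(path)

# Values can be arbitrary picklable Python structures, so the
# existing in-memory dictionaries map over directly.
store["instance-0001"] = {"width": 3.2, "scores": [0.1, 0.4, 0.9]}
store.sync()                 # flush to disk

value = store["instance-0001"]   # page in just the portion under analysis
store.close()
```

The dedicated stores listed above add concurrency, replication, and richer queries on top of this same get/put-by-key model.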
All that being said, of course, if your dataset really could be more performant by leveraging the benefits of a relational database (complex relationships, multiple views, etc.), then go for it, but you shouldn't use a relational database if all you're trying to do is get your data structures out of memory.
(It's also possible that just marshaling/pickling your data in segments and managing it yourself would offer better performance than a relational database, assuming your access pattern made paging in/out a relatively infrequent event. It's a long shot, but if you're just holding old data around and no one really looks at it, you might as well just throw that to disk yourself.)
I am an occasional Python programmer who has so far only worked with MySQL or SQLite databases. I am the computer person for everything in a small company and I have started a new project where I think it is about time to try new databases.
The sales department makes a CSV dump every week, and I need to make a small scripting application that allows people from other departments to mix the information, mostly linking the records. I have all this solved; my problem is the speed. I am using just plain text files for all this, and unsurprisingly it is very slow.
I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way to play with big amounts of data in a decent time.
Update: I think I was not being very detailed about my database usage, thus explaining my problem badly. I am reading all the data (~900 MB or more) from a CSV into a Python dictionary and then working with it. My problem is storing and, mostly, reading the data quickly.
Many thanks!
Quick Summary
You need enough memory (RAM) to solve your problem efficiently, so I think you should upgrade your memory. When reading the excellent High Scalability blog you will notice that big sites solve their problems efficiently by storing the complete problem set in memory.
You do need a central database solution. I don't think doing this by hand with Python dictionaries only will get the job done.
How to solve "your problem" depends on your queries. What I would try first is putting your data in elasticsearch (see below) and querying the database to see how it performs. I think this is the easiest way to tackle your problem, but as you can read below there are a lot of ways to tackle it.
We know:
You use Python as your programming language.
Your database is ~900MB (I think that's pretty large, but absolutely manageable).
You have loaded all the data into a Python dictionary. I assume the problem lies here: Python tries to store the dictionary in memory (and Python dictionaries aren't the most memory-friendly structures), but you don't have enough memory (how much memory do you have?). When that happens, you end up relying heavily on virtual memory, and when you read the dictionary you are constantly swapping data from disk into memory. This swapping causes thrashing. I am assuming that your computer does not have enough RAM; if so, I would first upgrade it with at least 2 gigabytes of extra RAM. When your problem set is able to fit in memory, solving the problem is going to be a lot faster. I opened my computer architecture book, where the chapter on the memory hierarchy says that main memory access time is about 40-80ns, while disk access time is about 5ms. That is a BIG difference.
Missing information
Do you have a central server? You should use/have a server.
What kind of operating system does your server run: Linux/Unix/Windows/Mac OS X? In my opinion your server should run Linux, Unix, or Mac OS X.
How much memory does your server have?
Could you specify your data set (CSV) a little better?
What kind of data mining are you doing? Do you need full-text-search capabilities? I assume you are not doing any complicated (SQL) queries. Performing that task with only Python dictionaries would be a complicated problem. Could you formalize the queries that you would like to perform? For example:
"get all users who work for department x"
"get all sales from user x"
Database needed
I am the computer person for everything in a small company and I have started a new project where I think it is about time to try new databases.
You are certainly right that you need a database to solve your problem. Doing it yourself using only Python dictionaries is difficult, especially when your problem set can't fit in memory.
MySQL
I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way to play with big amounts of data in a decent time.
A centralized (client-server) database is exactly what you need to solve your problem: let all the users access the database on one PC which you manage. You can use MySQL to solve your problem.
Tokyo Tyrant
You could also use Tokyo Tyrant to store all your data. Tokyo Tyrant is pretty fast, and the data does not have to fit in RAM; it handles retrieving data more efficiently than Python dictionaries do. However, if your problem can fit completely in memory, I think you should have a look at Redis (below).
Redis:
You could, for example, use Redis (quick start in 5 minutes) to store all sales in memory. Redis is extremely powerful and can do these kinds of queries insanely fast. The only problem with Redis is that the dataset has to fit completely in RAM, but I believe the author is working on that (a nightly build already supports going beyond RAM). Also, as I said previously, serving your problem set completely from memory is how big sites solve their problems in a timely manner.
Document stores
This article evaluates kv-stores against document stores like CouchDB/Riak/MongoDB. Document stores are better at searching (a little slower than KV stores), but aren't good at full-text search.
Full-text-search
If you want to do full-text-search queries, you could look at:
elasticsearch (videos): When I saw the video demonstration of elasticsearch, it looked pretty cool. You could try putting your data in elasticsearch (POST simple JSON) and see how fast it is. I am following elasticsearch on github, and the author is committing a lot of new code to it.
solr (tutorial): A lot of big companies are using solr (github, digg) to power their search. They got a big boost going from MySQL full-text search to solr.
You probably do need a full relational DBMS, if not right now, then very soon. If you start now, while your problems and data are simple and straightforward, then when they become complex and difficult you will have plenty of experience with at least one DBMS to help you. You probably don't need MySQL on all desktops; you might install it on a server, for example, and feed data out over your network. But you perhaps need to provide more information about your requirements, toolset and equipment to get better suggestions.
And, while the other DBMSes have their strengths and weaknesses too, there's nothing wrong with MySQL for large and complex databases. I don't know enough about SQLite to comment knowledgeably about it.
EDIT: @Eric, from your comments to my answer and the other answers, I hold even more strongly the view that it is time you moved to a database. I'm not surprised that trying to do database operations on a 900MB Python dictionary is slow. I think you first have to convince yourself, then your management, that you have reached the limits of what your current toolset can cope with, and that future developments are threatened unless you rethink matters.
If your network really can't support a server-based database, then (a) you really need to make your network robust, reliable and performant enough for such a purpose, but (b) if that is not an option, or not an early option, you should be thinking along the lines of a central database server passing out digests/extracts/reports to other users, rather than a simultaneous, full RDBMS working in a client-server configuration.
The problems you are currently experiencing are problems of not having the right tools for the job. They are only going to get worse. I wish I could suggest a magic way in which this is not the case, but I can't and I don't think anyone else will.
Have you done any benchmarking to confirm that it is the text files that are slowing you down? If you haven't, there's a good chance that tweaking some other part of the code will speed things up so that it's fast enough.
It sounds like each department has their own feudal database, and this implies a lot of unnecessary redundancy and inefficiency.
Instead of transferring hundreds of megabytes to everyone across your network, why not keep your data in MySQL and have the departments upload their data to the database, where it can be normalized and accessible by everyone?
As your organization grows, having completely different departmental databases that are unaware of each other, and contain potentially redundant or conflicting data, is going to become very painful.
Does the machine this process runs on have sufficient memory and bandwidth to handle this efficiently? Putting MySQL on a slow machine and recoding the tool to use MySQL rather than text files could potentially be far more costly than simply adding memory or upgrading the machine.
Here is a performance benchmark of different databases:
Database Speed Comparison
I'm not sure how objective the above comparison is, though, seeing as it's hosted on sqlite.org. SQLite only seems to be a bit slower when dropping tables; otherwise you shouldn't have any problems using it. Both SQLite and MySQL seem to have their own strengths and weaknesses: in some tests one is faster than the other, and in other tests the reverse is true.
If you've been experiencing lower-than-expected performance, perhaps it is not sqlite that is causing it; have you done any profiling or otherwise made sure nothing else is causing your program to misbehave?
EDIT: Updated with a link to a slightly more recent speed comparison.
It has been a couple of months since I posted this question and I wanted to let you all know how I solved this problem. I am using Berkeley DB with the module bsddb instead of loading all the data into a Python dictionary. I am not fully happy, but my users are.
My next step is trying to get a shared server with redis, but unless users start complaining about speed, I doubt I will get it.
Many thanks everybody who helped here, and I hope this question and answers are useful to somebody else.
If you have that problem with a CSV file, maybe you can just pickle the dictionary and generate a pickle "binary" file with the pickle.HIGHEST_PROTOCOL option. It can be faster to read, and you get a smaller file. You can load the CSV file once, then generate the pickled file, allowing faster loads on subsequent accesses.
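A sketch of that CSV-to-pickle caching approach; the file names and columns are invented for illustration:

```python
import csv
import os
import pickle
import tempfile

workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "sales.csv")
pkl_path = os.path.join(workdir, "sales.pkl")

# Stand-in for the weekly CSV dump.
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows([["id", "amount"], ["a1", "10"], ["b2", "25"]])

# One slow pass: parse the CSV into a dictionary...
with open(csv_path, newline="") as f:
    reader = csv.reader(f)
    next(reader)                                   # skip the header row
    data = {row[0]: int(row[1]) for row in reader}

# ...then cache it as a compact binary pickle for fast reloads.
with open(pkl_path, "wb") as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

# Subsequent runs skip the CSV parsing entirely.
with open(pkl_path, "rb") as f:
    reloaded = pickle.load(f)
```

Since the dump arrives weekly, regenerating the pickle once per dump and reading it everywhere else amortizes the parsing cost across the whole week.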
Anyway, with 900 MB of information, you're going to spend some time loading it into memory. Another approach is not to load it into memory in one step, but to load only the information when needed, maybe splitting it into different files by date or any other category (company, type, etc.).
Take a look at mongodb.
Currently I use SQLite (w/ SQLAlchemy) to store about 5000 dict objects. Each dict object corresponds to an entry in PyPI with keys (name, version, summary... sometimes 'description' can be as big as the project documentation).
Writing these entries (from JSON) back to the disk (SQLite format) takes several seconds, and it feels slow.
Writing is done as frequently as once a day, but reading/searching for a particular entry based on a key (usually name or description) is done very often.
Just like apt-get.
Is there a storage library for use with Python that will suit my needs better than SQLite?
Did you put indices on name and description? Searching 5000 indexed entries should be essentially instantaneous (of course ORMs will make your life much harder, as they usually do, even relatively good ones such as SQLAlchemy; but try "raw sqlite" and it absolutely should fly).
Writing just the updated entries (again with real SQL) should also be basically instantaneous: ideally a single UPDATE statement should do it, but even a thousand should be no real problem; just make sure to turn off autocommit at the start of the loop (and, if you want, turn it back on again later).
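A minimal sketch of batching the updates into a single transaction with the stdlib sqlite3 module; the schema is invented for illustration, not the asker's actual PyPI table:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE packages (name TEXT PRIMARY KEY, summary TEXT)")
db.executemany("INSERT INTO packages VALUES (?, ?)",
               [("alpha", "old"), ("beta", "old")])
db.commit()

updates = [("new summary for alpha", "alpha"),
           ("new summary for beta", "beta")]

# Using the connection as a context manager wraps all the statements
# in one transaction: a single commit instead of one per UPDATE.
with db:
    db.executemany("UPDATE packages SET summary = ? WHERE name = ?", updates)

summary = db.execute(
    "SELECT summary FROM packages WHERE name = 'alpha'").fetchone()[0]
```

Committing per statement forces a disk sync each time, which is where the "several seconds" usually goes; one transaction pays that cost once.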
It might be overkill for your application, but you ought to check out schema-free/document-oriented databases. Personally I'm a fan of couchdb. Basically, rather than store records as rows in a table, something like couchdb stores key-value pairs, and then (in the case of couchdb) you write views in javascript to cull the data you need. These databases are usually easier to scale than relational databases, and in your case may be much faster, since you don't have to hammer your data into a shape that will fit into a relational database. On the other hand, it means that there is another service running.
Given the approximate number of objects stated (around 5,000), SQLite is probably not the problem behind the slowness. It's the intermediary measures; for example, JSON, or possibly non-optimal use of SQLAlchemy.
Try this out (fairly fast even for a million objects):
y_serial.py module :: warehouse Python objects with SQLite
"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."
http://yserial.sourceforge.net
The yserial search on your keys is done using regular-expression ("regex") code on the SQLite side, not in Python, so there's another substantial speed improvement.
Let us know how it works out.
I'm solving a very similar problem for myself right now, using Nucular, which might suit your needs. It's a file-system based storage and seems very fast indeed. (It comes with an example app that indexes the whole Python source tree.) It's concurrent-safe, requires no external libraries and is pure python. It searches rapidly and has powerful fulltext search, indexing and so on - kind of a specialised, in-process, native python-dict store after the manner of the trendy CouchDB and MongoDB, but much lighter.
It does have limitations, though - it can't store or query on nested dictionaries, so not every JSON type can be stored in it. Moreover, although its text searching is powerful, its numerical queries are weak and unindexed. Nonetheless, it may be precisely what you are after.