I am coding a psychology experiment in Python. I need to store user information and scores somewhere, and I need it to work as a web application (and be secure).
I don't know much about this. I'm considering XML databases, Berkeley DB, SQLite, an OpenOffice spreadsheet, and I'm also very interested in the Python "shelve" library.
(Most of my info comes from this thread: http://developers.slashdot.org/story/08/05/20/2150246/FOSS-Flat-File-Database)
DATA: I figure that I'm going to have maximally 1000 users. For each user I've got to store...
Username / Pass
User detail fields (for a simple profile)
User scores on the exercise (2 datapoints per trial: a result (correct/incorrect/timeout) and an associated number from 0.1 to 1.0 that I need to record)
Metadata about the trials (when, who, etc.)
Results of data analysis for user
VERY rough estimate, each user generates 100 trials / day. So maximum of 10k datapoints / day. It needs to run that way for about 3 months, so about 1m datapoints. Safety multiplier 2x gives me a target of a database that can handle 2m datapoints.
((note: I could either store trial response data as individual datapoints, or group trials into Python list objects of varying length (user "sessions"). The latter would dramatically bring down the number of database entries, though not the amount of data. Does it matter? How?))
I want a solution that will work (at least) until I get to this 1000 users level. If my program is popular beyond that level, I'm alright with doing some work modding in a beefier DB. Also reiterating that it must be easily deployable as a web application.
Beyond those basic requirements, I just want the easiest thing that will make this work. I'm pretty green.
Thanks for reading
Tr3y
SQLite can certainly handle that amount of data. It has a very large user base, with a few very well-known users, on all the major platforms; it's fast and light, and there are excellent GUI clients that let you browse and extract/filter data with a few clicks.
SQLite won't scale indefinitely, of course, but severe performance problems begin only when you need many simultaneous writes, which I would guess is a load several orders of magnitude beyond what you project.
I've been using it for a few years now and I've never had a problem with it (although for larger sites I use MySQL). Personally I find that "Small. Fast. Reliable. Choose any three." (the tagline on SQLite's site) is quite accurate.
As for ease of use: the sqlite3 bindings (site temporarily down) are part of the Python standard library. Here you can find a small tutorial. Interestingly enough, simplicity is a design criterion for SQLite. From here:
Many people like SQLite because it is small and fast. But those qualities are just happy accidents. Users also find that SQLite is very reliable. Reliability is a consequence of simplicity. With less complication, there is less to go wrong. So, yes, SQLite is small, fast, and reliable, but first and foremost, SQLite strives to be simple.
There's a pretty spot-on discussion of when to use SQLite here. My favorite line is this:
Another way to look at SQLite is this: SQLite is not designed to replace Oracle. It is designed to replace fopen().
It seems to me that for your needs, SQLite is perfect. Indeed, it seems to me very possible that you will never need anything else:
With the default page size of 1024 bytes, an SQLite database is limited in size to 2 terabytes (2^41 bytes).
It doesn't sound like you'll have that much data at any point.
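If it helps, here is a minimal sketch of what storing those trials might look like with the standard-library sqlite3 module; the table and column names are just illustrative guesses at the data described above, not a prescribed schema:

```python
import sqlite3

# Hypothetical schema for the experiment described above (names are illustrative).
conn = sqlite3.connect("experiment.db")
conn.execute("""CREATE TABLE IF NOT EXISTS users (
                    id INTEGER PRIMARY KEY,
                    username TEXT UNIQUE NOT NULL,
                    password_hash TEXT NOT NULL,
                    profile TEXT)""")
conn.execute("""CREATE TABLE IF NOT EXISTS trials (
                    id INTEGER PRIMARY KEY,
                    user_id INTEGER NOT NULL REFERENCES users(id),
                    result TEXT NOT NULL,        -- 'correct', 'incorrect', or 'timeout'
                    score REAL NOT NULL,         -- the 0.1-1.0 number per trial
                    recorded_at TEXT NOT NULL)""")

# Record one trial; parameterized queries (the ? placeholders) avoid SQL injection.
conn.execute(
    "INSERT INTO trials (user_id, result, score, recorded_at) "
    "VALUES (?, ?, ?, datetime('now'))",
    (1, "correct", 0.8),
)
conn.commit()

# Pull one user's scores back out.
rows = conn.execute(
    "SELECT result, score, recorded_at FROM trials WHERE user_id = ?", (1,)
).fetchall()
print(rows)
conn.close()
```

Using parameterized queries everywhere, as above, also goes a long way toward the "be secure" requirement for the web front end.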
I would consider MongoDB. It's very easy to get started, and is built for multi-user setups (unlike SQLite).
It also has a much simpler model. Instead of futzing around with tables and fields, you simply take all the data in your form and stuff it in the database. Even if your form changes (oops, forgot a field) you won't need to change MongoDB.
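For a sense of how little ceremony that involves, here is a rough sketch with the pymongo driver; the database, collection and field names are invented, and insert_one assumes a reasonably recent pymongo:

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["experiment"]          # database and collection are created lazily

# Store whatever fields the form happened to submit; no schema migration needed.
db.responses.insert_one({
    "username": "alice",
    "trial": 42,
    "result": "correct",
    "score": 0.8,
})

# Later, query by any field.
for doc in db.responses.find({"username": "alice"}):
    print(doc["trial"], doc["result"], doc["score"])
```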
I'm writing a web application in Python and PostgreSQL. Users access a lot of information during a session, and almost all of that information is indexed in the database. My question is: should I litter the code with specific queries, or is it better practice to query larger chunks of information, cache them, and let Python process the chunks for the finer pieces?
For example: a user asks for entries in a payment log. Either one writes a query asking for the specific entries requested, or one collects the payment history of the user and then uses Python to select the specific entries.
Of course caching is preferred when working with heavy queries, but since nearly all my data is indexed, direct database access is fast and the caching approach would not yield much, if any, extra speed. Are there other factors that may still make the caching approach preferable?
Database designers spend a lot of time on caching and optimization. Unless you hit a specific problem, it's probably better to let the database do the database stuff, and your code do the rest instead of having your code try to take over some of the database functionality.
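To make the trade-off concrete, here is a hedged sketch of the two approaches with psycopg2; the table and column names are invented for the example, and the answer above argues for the first function:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # hypothetical connection settings

def entries_via_db(user_id, since):
    # Approach 1: let the database do the filtering (usually preferable when indexed).
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, amount, paid_at FROM payments "
            "WHERE user_id = %s AND paid_at >= %s",
            (user_id, since),
        )
        return cur.fetchall()

def entries_via_python(user_id, since):
    # Approach 2: fetch the whole history once and filter in Python (the cached-chunk idea).
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, amount, paid_at FROM payments WHERE user_id = %s",
            (user_id,),
        )
        history = cur.fetchall()
    return [row for row in history if row[2] >= since]
```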
I'm migrating a GAE/Java app to Python (non-GAE) due to the new pricing, so I'm getting a little server and I would like to find a database that fits the following requirements:
Low memory usage (or tunable or predictable)
Fastest querying capability for simple document/tree-like data identified by key (I don't care about performance on writing and I assume it will have indexes)
Bindings with Pypy 1.6 compatibility (or Python 2.7 at least)
My data goes something like this:
Id: short key string
Title
Creators: an array of another data structure, each with an id (used as key), a name, a site address, etc.
Tags: array of tags. Each of them can have multiple parent tags, a name, an id too, etc.
License: a data structure which describes its license (CC, GPL, ... you say it) with name, associated URL, etc.
Addition time: when it was added to our site.
Translations: pointers to other entries that are translations of one creation.
My queries are very simple. Usual cases are:
Filter by tag ordered by addition time.
Select a few (pagination) ordered by addition time.
(Maybe, not done already) filter by creator.
(Not done but planned) some autocomplete features in forms, so I'm going to need search if some fields contains a substring ('LIKE' queries).
The data volume is not big. Right now I have about 50MB of data but I'm planning to have a huge dataset around 10GB.
Also, I want to rebuild this from scratch, so I'm open to any option. What database do you think can meet my requirements?
Edit: I want to do some benchmarks around different options and share the results. I have selected, so far, MongoDB, PostgreSQL, MySQL, Drizzle, Riak and Kyoto Cabinet.
The path of least resistance for migrating an app engine application will probably be using AppScale, which implements a major portion of the app engine API. In particular, you might want to use the HyperTable data-store, which closely mirrors the Google App Engine datastore.
Edit: ok, so you're going for a redesign. I'd like to go over some of the points you make in your question.
Low memory usage
That's pretty much the opposite of what you want in a database; you want as much of your dataset in core memory as possible. This could mean tuning the dataset itself to fit efficiently, or adding memcached nodes so that you can spread the dataset across several hosts, each holding a small enough fraction that it fits in core.
To drive this point home, consider that reading a value from RAM is about 1000 times faster than reading it from disk. A database that can satisfy every query from core can handle roughly 10 times the workload of one that has to visit the disk for just 1% of its queries, because that 1% of disk hits dominates the average query cost (0.99 + 0.01 × 1000 ≈ 11).
I'm planning to have a huge dataset around 10GB.
I don't think you could call 10GB a 'huge dataset'. In fact, it's something that could probably fit in the RAM of a reasonably large database server; you wouldn't need more than one memcached node, much less additional persistence nodes (typical disk sizes are in the terabytes, 100 times larger than this expected dataset).
Based on this information, I would definitely advise using a mature database product like PostgreSQL, which would give you plenty of performance for the data you're describing and easily provides all of the features you're talking about. If the time comes that you need to scale past what PostgreSQL can provide, you'll have a real workload to analyse so you know what the bottlenecks really are.
I would recommend PostgreSQL, simply because it does what you want, can scale, is fast, is rather easy to work with, and is stable.
It is exceptionally fast at the example queries given, and could be even faster with document querying.
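As a rough illustration of how directly the question's query patterns map onto SQL, here is a sketch using psycopg2 against a guessed-at schema (all table, column and tag names are assumptions):

```python
import psycopg2

conn = psycopg2.connect("dbname=creations user=app")  # hypothetical settings
cur = conn.cursor()

# Filter by tag, ordered by addition time (the schema is a guess).
cur.execute("""
    SELECT c.id, c.title, c.added_at
    FROM creations c
    JOIN creation_tags ct ON ct.creation_id = c.id
    JOIN tags t ON t.id = ct.tag_id
    WHERE t.name = %s
    ORDER BY c.added_at DESC
""", ("painting",))
print(cur.fetchall())

# Pagination, ordered by addition time.
cur.execute(
    "SELECT id, title FROM creations ORDER BY added_at DESC LIMIT %s OFFSET %s",
    (20, 40),
)

# Autocomplete-style substring match (a 'LIKE' query) on the title.
cur.execute("SELECT id, title FROM creations WHERE title ILIKE %s", ("%paint%",))
```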
I am reading a CSV file into a list of lists in Python. It is around 100 MB right now; in a couple of years that file will grow to 2-5 GB. I am doing lots of log calculations on the data, and the 100 MB file takes the script around 1 minute to process. After the script does a lot of fiddling with the data, it creates URLs that point to Google Charts and then downloads the charts locally.
Can I continue to use Python with a 2 GB file, or should I move the data into a database?
I don't know exactly what you are doing, but a database will just change how the data is stored. In fact it might take longer, since most reasonable databases put constraints on columns and do additional processing for those checks. In many cases, having the whole file local and going through it doing calculations is going to be more efficient than querying and writing back to a database (subject to disk speeds, network and database contention, etc.). But in some cases the database may speed things up, especially because with indexing it is easy to get subsets of the data.
You mentioned logs, so before you go database crazy, here are a few ideas to check out. I'm not sure whether you have to keep going through every log since the beginning of time to download the charts and simply expect the file to grow to 2 GB, or whether you eventually expect 2 GB of traffic per day or per week.
ARCHIVING -- you can archive old logs, say every few months. Copy the production logs to an archive location and clear the live logs out. This will keep the file size reasonable. If you are wasting time accessing the file to find the small piece you need then this will solve your issue.
You might want to consider converting to Java or C. Especially on loops and calculations you might see a factor of 30 or more speedup, which would probably reduce the time immediately. But over time, as the data creeps up, some day this will slow down as well: if there is no bound on the amount of data, eventually even hand-optimized assembly by the world's greatest programmer will be too slow. Still, it might buy you 10x the headroom...
You also may want to think about finding the bottleneck (is it disk access, is it CPU time?) and, based on that, figuring out a scheme to do this task in parallel. If it is processing, look into multi-threading (and eventually multiple computers); if it is disk access, consider splitting the file among multiple machines. It really depends on your situation, but I suspect archiving might eliminate the need here.
As was suggested, if you are doing the same calculations over and over again, then just store them. Whether you use a database or a file this will give you a huge speedup.
If you are downloading stuff and that is a bottleneck, look into conditional GETs using the If-Modified-Since header, and only download changed items. If you are only processing new charts, ignore this suggestion.
Oh and if you are sequentially reading a giant log file, looking for a specific place in the log line by line, just make another file storing the last file location you worked with and then do a seek each run.
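A minimal sketch of that resume-from-last-position idea (the file names and the per-line processing are placeholders):

```python
import os

LOG_FILE = "app.log"
OFFSET_FILE = "app.log.offset"   # stores where the previous run stopped

def process(line):
    pass  # placeholder: whatever per-line work the script does

# Start where the last run left off, if we have a saved offset.
offset = 0
if os.path.exists(OFFSET_FILE):
    with open(OFFSET_FILE) as f:
        offset = int(f.read().strip() or 0)

with open(LOG_FILE) as log:
    log.seek(offset)
    for line in log:
        process(line)
    offset = log.tell()

# Remember how far we got for the next run.
with open(OFFSET_FILE, "w") as f:
    f.write(str(offset))
```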
Before moving to a full database, you may want to think of SQLite.
Finally, a "couple of years" seems like a long time in programmer time. Even if it is just 2, a lot can change. Maybe your department/division will be laid off. Maybe you will have moved on, and your boss too. Maybe the system will be replaced by something else. Maybe there will no longer be a need for what you are doing. If it were 6 months, I'd say fix it. But for a couple of years, in most cases, I'd say just use the solution you have now, and once it gets too slow, look at doing something else. You could leave a comment in the code with your thoughts on the issue, and even an e-mail to your boss so he knows about it as well. But as long as it works and will continue doing so for a reasonable amount of time, I would consider it "done" for now. No matter what solution you pick, if data grows unbounded you will need to reconsider it: adding more machines, more disk space, new algorithms/systems/developments. Solving it for a "couple of years" is probably pretty good.
If you need to go through all the lines each time you perform the "fiddling", it wouldn't really make much difference, assuming the actual "fiddling" is what's eating your cycles.
If, however, you could store the results of your calculations somehow, then a database would probably be nice. Also, databases have methods for ensuring data integrity and the like, so a database is often a great place for storing large sets of data (duh! ;)).
I'd only put it into a relational database if:
The data is actually relational and expressing it that way helps shrink the size of the data set by normalizing it.
You can take advantage of triggers and stored procedures to offload some of the calculations that your Python code is performing now.
You can take advantage of queries to only perform calculations on data that's changed, cutting down on the amount of work done by Python.
If none of those things is true, I don't see much difference between a database and a file. Both ultimately have to be stored on the file system.
If Python has to process all of it, and getting it into memory means loading an entire data set, then there's no difference between a database and a flat file.
2GB of data in memory could mean page swapping and thrashing by your application. I would be careful and get some data before I blamed the problem on the file. Just because you access the data from a database won't solve a paging problem.
If your data's flat, I see less advantage in a database, unless "flat" == "highly denormalized".
I'd recommend some profiling to see what's consuming CPU and memory before I made a change. You're guessing about the root cause right now. Better to get some data so you know where the time is being spent.
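For the profiling suggestion, the standard library is enough; a quick sketch, where main() stands in for whatever currently drives the CSV crunching:

```python
import cProfile
import pstats

def main():
    pass  # stand-in for the existing CSV-crunching entry point

# Profile the run and show the 15 most expensive calls by cumulative time.
cProfile.run("main()", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(15)
```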
I always reach for a database for larger datasets.
A database gives me some stuff for "free"; that is, I don't have to code it.
searching
sorting
indexing
language-independent connections
Something like SQLite might be the answer for you.
Also, you should investigate the "nosql" databases; it sounds like your problem might fit well into one of them.
At 2 gigs, you may start running up against speed issues. I work with model simulations that read hundreds of CSV files, and it takes about an hour to go through 3 iterations, or about 20 minutes per loop.
This is a matter of personal preference, but I would go with something like PostgreSQL because it combines the speed of Python with the capacity of a SQL-driven relational database. I encountered the same issue a couple of years ago when my Access db was corrupting itself and crashing on a daily basis. It was either MySQL or Postgres, and I chose Postgres because of its Python friendliness. Not to say MySQL would not work with Python, because it does, which is why I say it's personal preference.
Hope that helps with your decision-making!
I am an occasional Python programmer who has only worked so far with MySQL or SQLite databases. I am the computer person for everything in a small company, and I have just started a new project where I think it is about time to try a new database.
The sales department makes a CSV dump every week, and I need to make a small scripting application that lets people from other departments mix the information, mostly linking the records. I have all this solved; my problem is the speed. I am using just plain text files for all this, and unsurprisingly it is very slow.
I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way to play with large amounts of data in a decent time.
Update: I think I was not being very detailed about my database usage, and thus explained my problem badly. I am reading all the data (~900 MB or more) from a CSV file into a Python dictionary and then working with it. My problem is storing and, mostly, reading the data quickly.
Many thanks!
Quick Summary
You need enough memory (RAM) to solve your problem efficiently; I think you should upgrade your memory. Reading the excellent High Scalability blog, you will notice that big sites solve their problems efficiently by storing the complete problem set in memory.
You do need a central database solution. I don't think doing this by hand with Python dictionaries alone will get the job done.
How to solve "your problem" depends on your queries. What I would try first is to put your data in Elasticsearch (see below) and query the database to see how it performs. I think this is the easiest way to tackle your problem, but as you can read below there are a lot of ways to tackle it.
We know:
You use Python as your programming language.
Your database is ~900MB (I think that's pretty large, but absolutely manageable).
You have loaded all the data into a Python dictionary. Here is where I assume the problem lies. Python tries to hold the dictionary in memory (and Python dictionaries aren't the most memory-friendly structures), but you don't have enough memory (how much memory do you have?). When that happens, you end up relying heavily on virtual memory, and every time you read the dictionary you are constantly swapping data between disk and memory. This swapping causes "thrashing". I am assuming that your computer does not have enough RAM; if so, I would first upgrade it with at least 2 gigabytes of extra RAM. When your problem set fits in memory, solving the problem is going to be a lot faster. My computer architecture book (the chapter on the memory hierarchy) says that main-memory access time is about 40-80 ns, while disk access time is about 5 ms. That is a BIG difference.
Missing information
Do you have a central server? You should use/have a server.
What operating system does your server run: Linux/Unix/Windows/Mac OS X? In my opinion your server should run Linux/Unix/Mac OS X.
How much memory does your server have?
Could you describe your data set (the CSV) a little better?
What kind of data mining are you doing? Do you need full-text-search capabilities? I am assuming you are not doing any complicated (SQL-like) queries. Performing that task with only Python dictionaries would be a complicated problem. Could you formalize the queries that you would like to perform? For example:
"get all users who work for departement x"
"get all sales from user x"
Database needed
I am the computer person for everything in a small company and I have just started a new project where I think it is about time to try a new database.
You are right that you need a database to solve your problem. Doing it yourself with only Python dictionaries is difficult, especially when your problem set can't fit in memory.
MySQL
I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way to play with large amounts of data in a decent time.
A centralized (client-server) database is exactly what you need to solve your problem. Let all the users access the database hosted on one machine that you manage. You can use MySQL to solve your problem.
Tokyo Tyrant
You could also use Tokyo Tyrant to store all your data. Tokyo Tyrant is pretty fast, and the data does not have to fit in RAM. It handles retrieving data more efficiently than Python dictionaries do. However, if your problem can fit completely in memory, I think you should have a look at Redis (below).
Redis:
You could, for example, use Redis (quick start in 5 minutes) to store all sales in memory; Redis is extremely fast and powerful and can do these kinds of queries insanely quickly. The only problem with Redis is that the data set has to fit completely in RAM, but I believe the author is working on that (the nightly build already supports it). Also, as I said before, solving your problem set completely from memory is how big sites get timely results.
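As a tiny, hedged sketch of what that could look like with the redis-py client (the key names and the sales example are invented):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Store one sale as a hash keyed by an id, and index it under the salesperson.
r.hset("sale:1001", "user", "alice")
r.hset("sale:1001", "amount", "250.00")
r.sadd("sales:by_user:alice", "sale:1001")

# "Get all sales from user alice" becomes a set lookup plus hash reads.
for sale_key in r.smembers("sales:by_user:alice"):
    print(r.hgetall(sale_key))
```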
Document stores
This article evaluates key-value stores against document stores like CouchDB/Riak/MongoDB. Document stores are better at searching (a little slower than KV stores), but aren't good at full-text search.
Full-text-search
If you want to do full-text-search queries, you could look at:
elasticsearch (videos): When I saw the video demonstration of Elasticsearch it looked pretty cool. You could try putting (POSTing simple JSON) your data into Elasticsearch and see how fast it is (a small sketch follows after this list). I am following Elasticsearch on GitHub, and the author is committing a lot of new code to it.
solr (tutorial): A lot of big companies are using Solr (GitHub, Digg) to power their search. They got a big boost going from MySQL full-text search to Solr.
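To give a feel for how little code the Elasticsearch route needs, here is a hedged sketch that talks to its HTTP API with nothing but requests; the index and field names are made up, and the URL assumes a default local install using the classic index/type/id layout:

```python
import json
import requests

ES = "http://localhost:9200"   # assumes a default local Elasticsearch instance

# Index one sales record as plain JSON.
doc = {"user": "alice", "department": "sales", "amount": 250.0}
requests.put("%s/company/sales/1" % ES,
             data=json.dumps(doc),
             headers={"Content-Type": "application/json"})

# Query it back: "get all sales from user alice".
resp = requests.get("%s/company/sales/_search" % ES, params={"q": "user:alice"})
print(resp.json()["hits"]["hits"])
```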
You probably do need a full relational DBMS, if not right now, very soon. If you start now while your problems and data are simple and straightforward then when they become complex and difficult you will have plenty of experience with at least one DBMS to help you. You probably don't need MySQL on all desktops, you might install it on a server for example and feed data out over your network, but you perhaps need to provide more information about your requirements, toolset and equipment to get better suggestions.
And, while the other DBMSes have their strengths and weaknesses too, there's nothing wrong with MySQL for large and complex databases. I don't know enough about SQLite to comment knowledgeably about it.
EDIT: #Eric, from your comments to my answer and the other answers, I hold even more strongly the view that it is time you moved to a database. I'm not surprised that trying to do database operations on a 900MB Python dictionary is slow. I think you have to first convince yourself, then your management, that you have reached the limits of what your current toolset can cope with, and that future developments are threatened unless you rethink matters.
If your network really can't support a server-based database, then (a) you really need to make your network robust, reliable and performant enough for such a purpose, but (b) if that is not an option, or not an early option, you should be thinking along the lines of a central database server passing out digests/extracts/reports to other users, rather than simultaneous, full RDBMS working in a client-server configuration.
The problems you are currently experiencing are problems of not having the right tools for the job. They are only going to get worse. I wish I could suggest a magic way in which this is not the case, but I can't and I don't think anyone else will.
Have you done any benchmarking to confirm that it is the text files that are slowing you down? If you haven't, there's a good chance that tweaking some other part of the code will speed things up so that it's fast enough.
It sounds like each department has their own feudal database, and this implies a lot of unnecessary redundancy and inefficiency.
Instead of transferring hundreds of megabytes to everyone across your network, why not keep your data in MySQL and have the departments upload their data to the database, where it can be normalized and accessible by everyone?
As your organization grows, having completely different departmental databases that are unaware of each other, and contain potentially redundant or conflicting data, is going to become very painful.
Does the machine this process runs on have sufficient memory and bandwidth to handle this efficiently? Putting MySQL on a slow machine and recoding the tool to use MySQL rather than text files could potentially be far more costly than simply adding memory or upgrading the machine.
Here is a performance benchmark of different database suites ->
Database Speed Comparison
I'm not sure how objective the above comparison is, though, seeing as it's hosted on sqlite.org. SQLite only seems to be a bit slower when dropping tables; otherwise you shouldn't have any problems using it. Both SQLite and MySQL seem to have their own strengths and weaknesses: in some tests one is faster than the other, and in other tests the reverse is true.
If you've been experiencing lower-than-expected performance, perhaps it is not SQLite that is causing it. Have you done any profiling or other checks to make sure nothing else is causing your program to misbehave?
EDIT: Updated with a link to a slightly more recent speed comparison.
It has been a couple of months since I posted this question and I wanted to let you all know how I solved the problem. I am using Berkeley DB via the bsddb module instead of loading all the data into a Python dictionary. I am not fully happy, but my users are.
My next step is trying to get a shared server with Redis, but unless users start complaining about speed, I doubt I will get it.
Many thanks everybody who helped here, and I hope this question and answers are useful to somebody else.
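For anyone curious what that resolution looks like in practice, here is a rough sketch with the Python 2-era bsddb module; the file and key names are invented, and since bsddb stores only strings, structured records are pickled first:

```python
import bsddb
import pickle

# Open (or create) a hash-based Berkeley DB file; it lives on disk, not in RAM.
db = bsddb.hashopen("records.db", "c")

# bsddb stores only string keys and values, so pickle each record.
record = {"name": "Acme Corp", "total": 1250.0}
db["customer:42"] = pickle.dumps(record)

# Look a record up by key without loading the whole data set into memory.
if db.has_key("customer:42"):
    print(pickle.loads(db["customer:42"]))

db.close()
```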
If you have that problem with a CSV file, maybe you can just pickle the dictionary and generate a "binary" pickle file with the pickle.HIGHEST_PROTOCOL option. It can be faster to read and you get a smaller file. You can load the CSV file once, then generate the pickled file, allowing faster loads on subsequent accesses.
Anyway, with 900 MB of information, you're going to spend some time loading it into memory. Another approach is not to load it all in one step, but to load only the information when needed, maybe splitting it into different files by date or any other category (company, type, etc.).
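A minimal sketch of that pickle-as-cache idea (the file names are placeholders, and the code assumes the CSV has an 'id' column to key the dictionary on):

```python
import csv
import os
import pickle

CSV_FILE = "sales.csv"
CACHE_FILE = "sales.pickle"

def load_data():
    # Reuse the pickled copy when it is newer than the CSV dump.
    if (os.path.exists(CACHE_FILE)
            and os.path.getmtime(CACHE_FILE) >= os.path.getmtime(CSV_FILE)):
        with open(CACHE_FILE, "rb") as f:
            return pickle.load(f)

    # Otherwise parse the CSV once and cache the resulting dictionary.
    data = {}
    with open(CSV_FILE, newline="") as f:
        for row in csv.DictReader(f):
            data[row["id"]] = row        # assumes an 'id' column exists
    with open(CACHE_FILE, "wb") as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    return data
```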
Take a look at mongodb.
Currently I use SQLite (with SQLAlchemy) to store about 5000 dict objects. Each dict object corresponds to an entry in PyPI with keys (name, version, summary... sometimes 'description' can be as big as the project documentation).
Writing these entries (from JSON) back to the disk (SQLite format) takes several seconds, and it feels slow.
Writing is done about once a day, but reading/searching for a particular entry based on a key (usually name or description) is done very often.
Just like apt-get.
Is there a storage library for use with Python that will suit my needs better than SQLite?
Did you put indices on name and description? Searching 5000 indexed entries should be essentially instantaneous (of course ORMs will make your life much harder, as they usually do, even relatively good ones such as SQLAlchemy; try "raw" sqlite and it absolutely should fly).
Writing just the updated entries (again with real SQL) should also be basically instantaneous -- ideally a single UPDATE statement should do it, but even a thousand should be no real problem. Just make sure to turn off autocommit at the start of the loop (and, if you want, turn it back on again later).
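A hedged sketch of both suggestions with raw sqlite3; the table and column names are guesses at the PyPI-entry schema described in the question:

```python
import sqlite3

conn = sqlite3.connect("pypi.db")

# One-off: index the columns that searches actually hit.
conn.execute("CREATE INDEX IF NOT EXISTS idx_packages_name ON packages(name)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_packages_summary ON packages(summary)")

# Write only the changed entries, inside a single transaction.
updated = [("1.2.0", "new summary", "requests"), ("0.9", "other summary", "flask")]
with conn:   # one implicit BEGIN/COMMIT around the whole batch
    conn.executemany(
        "UPDATE packages SET version = ?, summary = ? WHERE name = ?",
        updated,
    )

# Indexed lookup: essentially instantaneous at 5,000 rows.
row = conn.execute("SELECT * FROM packages WHERE name = ?", ("requests",)).fetchone()
print(row)
```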
It might be overkill for your application, but you ought to check out schema-free/document-oriented databases. Personally I'm a fan of CouchDB. Basically, rather than storing records as rows in a table, something like CouchDB stores key-value pairs, and then (in the case of CouchDB) you write views in JavaScript to cull the data you need. These databases are usually easier to scale than relational databases, and in your case may be much faster, since you don't have to hammer your data into a shape that will fit a relational database. On the other hand, it means there is another service running.
Given the approximate number of objects stated (around 5,000), SQLite is probably not the cause of the slowness. It's the intermediate steps; for example, the JSON handling or possibly non-optimal use of SQLAlchemy.
Try this out (fairly fast even for million objects):
y_serial.py module :: warehouse Python objects with SQLite
"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."
http://yserial.sourceforge.net
The yserial search on your keys is done using the regular expression ("regex") code on the SQLite side, not Python, so there's another substantial speed improvement.
Let us know how it works out.
I'm solving a very similar problem for myself right now, using Nucular, which might suit your needs. It's a file-system-based store and seems very fast indeed. (It comes with an example app that indexes the whole Python source tree.) It's concurrency-safe, requires no external libraries, and is pure Python. It searches rapidly and has powerful full-text search, indexing, and so on -- kind of a specialised, in-process, native Python-dict store in the manner of the trendy CouchDB and MongoDB, but much lighter.
It does have limitations, though - it can't store or query on nested dictionaries, so not every JSON type can be stored in it. Moreover, although its text searching is powerful, its numerical queries are weak and unindexed. Nonetheless, it may be precisely what you are after.