I am researching a project that would require hundreds of database writes per minute. I have never dealt with this level of write volume before and I am looking for good, scalable techniques and technologies.
I am a comfortable Python developer with experience in Django and SQLAlchemy. I am thinking of building the data interface on Django, but I don't think it is a good idea to go through the ORM for the volume of writes I will require. I am definitely open to learning new technologies.
The solution will live on Amazon Web Services, so I have access to all their tools. Ultimately I am looking for advice on database selection, data-writing techniques, and anything else I may need that I have not realized.
Any advice on where to start?
Thanks,
CG
Follow the trends; in other words, enter the world of NoSQL. Technologies worth looking at include MongoDB and Redis. They are fast, scalable, and have decent Python drivers. MongoDB, for example, plays really nicely with Django and has a lot in common with traditional SQL databases like MySQL. Redis, on the other hand, offers more primitive data structures but is superior in terms of raw speed (which of course depends somewhat on the driver). Using either of them (or both together, which can work nicely), you are free, and sometimes forced, to write your own low-level logic to accomplish what you need.
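As a rough illustration (my addition, not part of the original answer), here is a minimal sketch of writing the same record to MongoDB via pymongo and to Redis via redis-py; the connection settings, database name, and field names are assumptions:

```python
import redis
from pymongo import MongoClient

# Hypothetical event record; the field names are placeholders.
event = {"user_id": 42, "action": "click", "value": 3.14}

# MongoDB: one document per write, no schema required.
mongo = MongoClient("localhost", 27017)        # assumed local instance
mongo.mydb.events.insert_one(event)

# Redis: store the same record as a hash, keyed by an id you manage yourself.
r = redis.Redis(host="localhost", port=6379)   # assumed local instance
r.hset("event:42:1", mapping={k: str(v) for k, v in event.items()})
```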
You should actually be okay with low hundreds of writes per minute through SQLAlchemy (that's only a couple per second); if you're talking more like a thousand a minute, then yes, that might be problematic.
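For reference (my addition, not part of the original answer), here is a hedged sketch of batching rows through SQLAlchemy Core, which keeps per-row overhead low; the table definition and connection URL are assumptions:

```python
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

# Assumed schema and connection string, for illustration only.
engine = create_engine("postgresql://user:pass@localhost/mydb")
metadata = MetaData()
events = Table(
    "events", metadata,
    Column("id", Integer, primary_key=True),
    Column("payload", String),
)
metadata.create_all(engine)

# Batch several rows into a single execute instead of one round trip per row.
rows = [{"payload": "event %d" % i} for i in range(200)]
with engine.begin() as conn:          # commits on success, rolls back on error
    conn.execute(events.insert(), rows)
```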
What kind of data do you have? If it's fairly flat (few tables, few relations), you might want to investigate a non-relational database such as CouchDB or Mongo. If you want to use SQL, I strongly recommend PostgreSQL; it seems to deal with large databases and frequent writes a lot better than MySQL does.
It also depends on how complex the data you're inserting is.
I think, unfortunately, you're just going to have to try a couple of things and run benchmarks, as each situation is different and query optimizers are basically magic.
If it's just a few hundred writes you can still manage with a relational DB. I'd pick PostgreSQL (8.0+), which has a separate background writer process. It also has tunable serialization levels, so you can trade some strict ACID compliance for speed, in some cases even at the transaction level. Postgres is well documented, but it assumes some deeper understanding of SQL and relational database theory to fully understand and make the most of it.
The alternative would be a newfangled "NoSQL" system, which can probably scale even better, but at the cost of buying into a very different technology stack.
Anyway, if you are using Python, it is not 100% critical that every write survives a shutdown or power loss, and you need low latency, then use a thread-safe Queue.Queue and worker threads to decouple the writes from your main application thread(s), as sketched below.
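A minimal sketch of that producer/consumer pattern (my illustration, using Python 3's queue module rather than the Python 2 Queue.Queue named above; the database call is a placeholder):

```python
import queue
import threading

write_queue = queue.Queue()          # thread-safe FIFO shared by all producers

def save_to_database(item):
    pass                             # placeholder, e.g. an INSERT via SQLAlchemy

def writer_worker():
    """Drain the queue and persist items; runs in a background thread."""
    while True:
        item = write_queue.get()
        if item is None:             # sentinel used to shut the worker down
            break
        save_to_database(item)
        write_queue.task_done()

worker = threading.Thread(target=writer_worker, daemon=True)
worker.start()

# Main application threads just enqueue and return immediately (low latency).
write_queue.put({"user_id": 1, "score": 0.9})

# On shutdown: wait for pending writes, then stop the worker.
write_queue.join()
write_queue.put(None)
worker.join()
```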
Related
So I'm moving from PostgreSQL to Redis as the main database in my project to get the best performance. Now my code contains a lot of HSET, HMSET, HGET, SADD, SREM, SINTER, SORT and other Redis calls spread throughout the code. We don't have any ORM with Redis, nor any data model description. As long as I have the structure of my data in my head, it's not a big problem for me to maintain this code. But once another dev needs to work with it, or even me after a year, it will become a huge pain to piece it all together and be sure nothing has been missed.
Hence my question: what are the best practices for organizing data-storage code without ORMs and proper data models? I see a lot of articles on how to call Redis here and there to store data in it, but no recommendations whatsoever on how to do it without making your code a mess.
Ended up using rom: https://github.com/josiahcarlson/rom
It allowed me to describe models clearly and use an active-record style to work with them. It takes care of indexes and such under the hood, and it's quite simple and pleasant to use.
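A rough sketch of what a rom model declaration looks like (my illustration; the model name and attributes are made up, the snippet assumes a Redis server on the default local port, and the exact column types and query methods should be checked against the rom documentation):

```python
import time
import rom   # assumes a Redis server reachable on localhost:6379

class UserScore(rom.Model):
    # Hypothetical model: each attribute becomes a typed column stored in Redis.
    username = rom.Text(required=True)
    score = rom.Float(index=True)       # indexed attributes can be queried via UserScore.query
    created = rom.Float(default=time.time)

# Active-record style usage.
entry = UserScore(username="alice", score=0.85)
entry.save()

# Fetch by the automatically created primary key.
same = UserScore.get(entry.id)
```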
I am coding a psychology experiment in Python. I need to store user information and scores somewhere, and I need it to work as a web application (and be secure).
I don't know much about this: I'm considering XML databases, BerkeleyDB, SQLite, an OpenOffice spreadsheet, and I'm very interested in Python's "shelve" library.
(Most of my info is coming from this thread: http://developers.slashdot.org/story/08/05/20/2150246/FOSS-Flat-File-Database)
DATA: I figure that I'm going to have at most 1000 users. For each user I've got to store...
Username / Pass
User detail fields (for a simple profile)
User scores on the exercise (2 data points per trial: each trial gets a score (correct/incorrect/timeout) and has an associated number from 0.1 to 1.0 that I need to record)
Metadata about the trials (when, who, etc.)
Results of data analysis for user
VERY rough estimate, each user generates 100 trials / day. So maximum of 10k datapoints / day. It needs to run that way for about 3 months, so about 1m datapoints. Safety multiplier 2x gives me a target of a database that can handle 2m datapoints.
((Note: I could either store trial response data as individual data points, or group trials into Python list objects of varying length (user "sessions"). The latter would dramatically bring down the number of database entries, though not the amount of data. Does it matter? How?))
I want a solution that will work (at least) until I get to this 1000 users level. If my program is popular beyond that level, I'm alright with doing some work modding in a beefier DB. Also reiterating that it must be easily deployable as a web application.
Beyond those basic requirements, I just want the easiest thing that will make this work. I'm pretty green.
Thanks for reading
Tr3y
SQLite can certainly handle that amount of data. It has a very large user base, with a few very well-known users, on all the major platforms; it's fast and light, and there are excellent GUI clients that allow you to browse and extract/filter data with a few clicks.
SQLite won't scale indefinitely, of course, but severe performance problems only begin when simultaneous inserts are needed, which I would guess is a problem several orders of magnitude beyond your projected load.
I've been using it for a few years now, and I have never had a problem with it (although for larger sites I use MySQL). Personally I find that "Small. Fast. Reliable. Choose any three." (the tagline on SQLite's site) is quite accurate.
As for ease of use: the sqlite3 bindings (site temporarily down) are part of the Python standard library. Here you can find a small tutorial. Interestingly enough, simplicity is a design criterion for SQLite. From here:
Many people like SQLite because it is small and fast. But those qualities are just happy accidents. Users also find that SQLite is very reliable. Reliability is a consequence of simplicity. With less complication, there is less to go wrong. So, yes, SQLite is small, fast, and reliable, but first and foremost, SQLite strives to be simple.
There's a pretty spot-on discussion of when to use SQLite here. My favorite line is this:
Another way to look at SQLite is this: SQLite is not designed to replace Oracle. It is designed to replace fopen().
It seems to me that for your needs, SQLite is perfect. Indeed, it seems to me very possible that you will never need anything else:
With the default page size of 1024 bytes, an SQLite database is limited in size to 2 terabytes (2^41 bytes).
It doesn't sound like you'll have that much data at any point.
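To make the "part of the standard library" point concrete, here is a small sketch (my addition) of storing trial results with the built-in sqlite3 module; the file name and table layout are assumptions loosely based on the data described in the question:

```python
import sqlite3

conn = sqlite3.connect("experiment.db")   # assumed file name
conn.execute("""
    CREATE TABLE IF NOT EXISTS trials (
        id       INTEGER PRIMARY KEY,
        username TEXT NOT NULL,
        outcome  TEXT NOT NULL,          -- 'correct' / 'incorrect' / 'timeout'
        score    REAL NOT NULL,          -- the 0.1 .. 1.0 value per trial
        run_at   TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

with conn:  # commits on success, rolls back on error
    conn.execute(
        "INSERT INTO trials (username, outcome, score) VALUES (?, ?, ?)",
        ("tr3y", "correct", 0.7),
    )

for row in conn.execute("SELECT username, outcome, score FROM trials"):
    print(row)
```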
I would consider MongoDB. It's very easy to get started with, and it is built for multi-user setups (unlike SQLite).
It also has a much simpler model. Instead of futzing around with tables and fields, you simply take all the data in your form and stuff it into the database. Even if your form changes (oops, forgot a field), you won't need to change MongoDB.
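A hedged sketch of that "just stuff the form in" idea with pymongo (my addition; the database and collection names and the form fields are made up):

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)   # assumed local MongoDB instance
trials = client.experiment.trials          # database/collection names are arbitrary

# Whatever the form submitted goes in as-is; no schema migration is needed
# when a field is added later.
form_data = {"username": "tr3y", "outcome": "correct", "score": 0.7}
trials.insert_one(form_data)

# Query by any field.
for doc in trials.find({"username": "tr3y"}):
    print(doc)
```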
I've been developing with Django for the last year or so and really enjoy it. But sometimes I find that the ORM is a bit of a straitjacket. All the data that I shuffle back and forth to the database would easily fit into 1 GB of RAM; even if the project grew a couple of orders of magnitude, it would still fit into 1 GB.
I'd like a solution where my application only needs to read from disk at startup but writes to disk implicitly as I update my objects. I don't care too much about any speed increase this might give me. What I'm really after is the added flexibility: if I have a problem that would fit nicely with a linked list, a tree, or some other data structure, I shouldn't have to graft that onto a relational database.
Python would be nice but other languages are fine. I'm in the exploratory phase on this one; I want to get a feel for what solutions are out there. When googling this question I got a lot of hits related to different NoSQL projects. But NoSQL, as I understand it, is all about what you do when you outgrow relational databases because you've got too much data. I'm really at the other end of the spectrum: I've got so little data that a relational database is actually overkill.
Object databases are another thing that came up when googling this question, which reminded me of Zope and ZODB. I dabbled a bit with Zope an eon ago and really disliked it. But reading up a bit on object databases made me think they might be what I'm looking for. Then again, their general failure to attract users makes me suspicious. Object databases have been around for a very long time and still haven't caught on. I guess that means there's something wrong with them?
If you are looking to store data structures in memory and back them up to disk, you are really looking for a persistent cache system, and Redis fits the bill perfectly.
If you want to use Django, there is an excellent cache system built in that can be plugged into a Redis backend via the Redis-Cache project. Moreover, the data structures Redis provides map almost one-to-one onto Python data structures, so it is pretty seamless.
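A minimal sketch of that Redis-to-Python mapping using redis-py (my addition; the key names are placeholders):

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Redis hash <-> Python dict
r.hset("user:1", mapping={"name": "alice", "level": "7"})
print(r.hgetall("user:1"))             # {'name': 'alice', 'level': '7'}

# Redis list <-> Python list
r.rpush("user:1:history", "login", "play", "logout")
print(r.lrange("user:1:history", 0, -1))

# Redis set <-> Python set
r.sadd("user:1:badges", "novice", "streak-3")
print(r.smembers("user:1:badges"))
```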
I am not so sure that skipping the concept of a database altogether is a good idea. A database provides a great deal of power in terms of aggregation, annotation, relations, etc., all within acceptable performance levels until you hit a really large scale.
Perhaps another idea would be to use an SQLite in-memory database. SQLite is so ubiquitous these days that it has disappeared into the infrastructure: it is built into Android apps and iPhone apps, and it has support in all the standard libraries. It is also such well-developed and well-tested software that it is very hard to make a case against using it.
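For illustration (my addition), an in-memory SQLite database in Python is essentially a one-liner; the table is a placeholder:

```python
import sqlite3

# ":memory:" keeps the whole database in RAM; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO items (name) VALUES (?)", ("example",))
print(conn.execute("SELECT * FROM items").fetchall())   # [(1, 'example')]
```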
The company I work for (Starcounter) makes a database that works exactly in the way you describe. We have been running the database for a few years with our partner customers and are about to make the product publicly available. The main reasons we created it are ease of use and performance. I'll be happy to send you a copy if you send me a message on our corporate forum (I'm Starcounter Jack).
On the subject of OO databases: the reason OO databases failed is mainly that they were more experiments than real products. They were poorly implemented, supported only the OO paradigm, and ignored standards such as SQL and ODBC. They also lacked stability, performance, and maturity. Their story is analogous to the early tablets, eBooks, and smartphones of a decade or two before the iPhone, iPad, and Kindle.
But just as with any technology, there are two waves (look up "the hype cycle"). While the first wave will disappoint, the second wave will be good. The first one will be driven by the concepts and ideas and will lack commercial success and real-life usability. The second wave will want nothing to do with the musty smell of the failures of the first one and will therefore use new and fresh acronyms and buzzwords.
The future database will spring out of the NoSQL movement. It will have added SQL support and many will think this is novel. It will have added good language integration (and most languages are oo) and many will think this is also novel. It will support documents and many will think this will be novel. Many will rediscover the need for transactions etc. etc.
Some grumpy old men will try to tell us that all we have done is to reinvent existing ideas. In some way they will be right; in some ways they won’t. This time around, the concepts are matured. New ideas will be added and pragmatism will be allowed.
But then again, an iPad is still, in a way, a PDA.
Python's native capabilities for lists, sets, and dictionaries totally rock. Is there a way to continue using those native capabilities when the data becomes really big? The problem I'm working on involves matching (intersecting) very large lists. I haven't pushed the limits yet -- actually I don't really know what the limits are -- and I don't want to be surprised by a big reimplementation after the data grows as expected.
Is it reasonable to deploy on something like Google App Engine, which advertises no practical scale limit, and continue using the native capabilities as-is forever without really thinking about this?
Is there some Python magic that can hide whether a list, set, or dictionary lives in Python-managed memory or in a DB, so that the physical deployment of the data can be kept distinct from what I do in code?
How do you, Mr. or Ms. Python Super Expert, deal with lists, sets & dicts as data volume grows?
I'm not quite sure what you mean by native capabilities for lists, sets, and dictionaries. However, you can create classes that emulate container and sequence types by defining some specially named methods. That means you could create a class that behaves like a list but stores its data in a SQL database or in the GAE datastore. Simply speaking, that is what an ORM does. However, mapping objects to a database is very complicated, and it is probably not a good idea to invent your own ORM; use an existing one instead.
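As a toy illustration of those special methods (my addition, and nothing like a full ORM): a class that looks like an append/read list but keeps its items in an SQLite table; the table name and string-only values are assumptions:

```python
import sqlite3

class SQLiteList:
    """List-like wrapper that stores string items in an SQLite table.

    Negative indices and slices are not supported in this sketch.
    """

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (pos INTEGER PRIMARY KEY, value TEXT)"
        )

    def append(self, value):
        with self.conn:
            self.conn.execute("INSERT INTO items (value) VALUES (?)", (value,))

    def __len__(self):
        return self.conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]

    def __getitem__(self, index):
        row = self.conn.execute(
            "SELECT value FROM items ORDER BY pos LIMIT 1 OFFSET ?", (index,)
        ).fetchone()
        if row is None:
            raise IndexError(index)
        return row[0]

items = SQLiteList()
items.append("hello")
items.append("world")
print(len(items), items[1])   # 2 world
```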
I'm afraid there is no one-size-fits-all solution. In particular, GAE is not some kind of Magic Fairy Dust you can sprinkle on your code to make it scale. There are several limitations you have to keep in mind to create an application that can scale. Some of them are general, like computational complexity; others are specific to the environment your code runs in. E.g., on GAE the maximum response time is limited to 30 seconds, and querying the datastore works differently than on other databases.
It's hard to give any concrete advice without knowing your specific problem, but I doubt that GAE is the right solution.
In general, if you want to work with large datasets, you either have to keep that in mind from the start or you will have to rework your code, algorithms and data structures as the datasets grow.
You are describing my dream! However, I think you cannot do it. I have always wanted something just like LINQ for Python, but the language does not permit using Python syntax for native database operations, AFAIK. If it were possible, you could just write code using lists and then use the same code for retrieving data from a database.
I would not recommend writing a lot of code based only on lists and sets, because it will not be easy to migrate it to a scalable platform. I recommend using something like an ORM. GAE even has its own ORM-like system, and you can use others such as SQLAlchemy and SQLObject with, e.g., SQLite.
Unfortunately, you cannot use nice things such as list comprehensions to filter data from the database. You can certainly filter data after it has been fetched from the DB, but you'll still need to build a query in some SQL-like language for querying objects, or else return a lot of objects from the database.
OTOH, there is Buzhug, a curious non-relational database system written in Python which allows the use of natural Python syntax. I have never used it and I do not know whether it is scalable, so I would not put my money on it. However, you can test it and see if it helps you.
You can use an ORM (Object-Relational Mapping): a class gets a table, an object gets a row. I like the Django ORM. You can use it for non-web apps, too. I never used it on GAE, but I think it is possible.
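A hedged sketch (my addition) of the "class gets a table, object gets a row" idea with the Django ORM. The model and field names are made up, and this would live in an app's models.py inside a configured Django project; using the ORM completely outside a project additionally requires configuring settings and running django.setup(), which is omitted here:

```python
from django.db import models

class Player(models.Model):
    # Each attribute becomes a column; each saved instance becomes a row.
    name = models.CharField(max_length=100)
    score = models.FloatField(default=0.0)

# Typical usage once the app and database are set up (e.g. after `manage.py migrate`):
#   Player.objects.create(name="alice", score=12.5)
#   top = Player.objects.filter(score__gte=10).order_by("-score")
```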
I am an occasional Python programmer who has so far only worked with MySQL or SQLite databases. I am the computer person for everything in a small company, and I have started a new project where I think it is about time to try a new database.
The sales department makes a CSV dump every week, and I need to make a small scripting application that allows people from other departments to mix the information, mostly by linking the records. I have all of this solved; my problem is speed. I am using just plain text files for all of this, and unsurprisingly it is very slow.
I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way to play with large amounts of data in a decent time.
Update: I think I was not being very detailed about my database usage and thus explained my problem badly. I am reading all the data, ~900 MB or more, from a CSV into a Python dictionary and then working with it. My problem is storing and, mostly, reading the data quickly.
Many thanks!
Quick Summary
You need enough memory (RAM) to solve your problem efficiently; I think you may need to upgrade your memory. When reading the excellent High Scalability blog you will notice that big sites store the complete problem set in memory in order to solve their problems efficiently.
You do need a central database solution. I don't think doing this by hand with Python dictionaries alone will get the job done.
How to solve "your problem" depends on your "query's". What I would try to do first is put your data in elastic-search(see below) and query the database(see how it performs). I think this is the easiest way to tackle your problem. But as you can read below there are a lot of ways to tackle your problem.
We know:
You use Python as your programming language.
Your database is ~900 MB (I think that's pretty large, but absolutely manageable).
You have loaded all the data into a Python dictionary. This is where I assume the problem lies. Python tries to hold the dictionary in memory (and Python dictionaries aren't the most memory-friendly structures), but you don't have enough memory (how much do you have?). When that happens you end up relying heavily on virtual memory, and when you read from the dictionary you are constantly swapping data between disk and memory. This swapping causes thrashing. I am assuming that your computer does not have enough RAM; if that is true, I would first upgrade it with at least 2 GB of extra RAM. Once your problem set fits in memory, solving the problem is going to be a lot faster. My computer architecture book (in the chapter on the memory hierarchy) says that main memory access time is about 40-80 ns, while disk access time is about 5 ms. That is a BIG difference.
Missing information
Do you have a central server? You should use/have a server.
What operating system does your server run: Linux/Unix/Windows/Mac OS X? In my opinion your server should run Linux, Unix, or Mac OS X.
How much memory does your server have?
Could you describe your data set (the CSV) a little better?
What kind of data mining are you doing? Do you need full-text search capabilities? I am assuming you are not doing any complicated (SQL) queries. Performing that task with only Python dictionaries is a complicated problem. Could you formalize the queries that you would like to perform? For example:
"get all users who work for departement x"
"get all sales from user x"
Database needed
I am the computer person for everything in a small company and I have started a new project where I think it is about time to try a new database.
You are absolutely right that you need a database to solve your problem. Doing it yourself with only Python dictionaries is difficult, especially when your problem set can't fit in memory.
MySQL
I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way to play with large amounts of data in a decent time.
A centralized (client-server) database is exactly what you need to solve your problem. Let all the users access the database on one machine that you manage. You can use MySQL to solve your problem.
Tokyo Tyrant
You could also use Tokyo Tyrant to store all your data. Tokyo Tyrant is pretty fast and the data does not have to fit in RAM; it handles data retrieval more efficiently than Python dictionaries do. However, if your problem can fit completely in memory, I think you should have a look at Redis (below).
Redis:
You could, for example, use Redis (quick start in 5 minutes) to store all the sales in memory; Redis is extremely fast and powerful and can answer these kinds of queries insanely quickly. The only problem with Redis is that the data set has to fit completely in RAM, although I believe the author is working on that (a nightly build already supports going beyond RAM). Also, as I said above, solving your problem set completely from memory is how big sites get answers in a timely manner.
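A minimal sketch (my addition) of the kind of per-user / per-department lookups described above, using redis-py; the key layout and field names are assumptions:

```python
import redis

r = redis.Redis(decode_responses=True)   # assumed local Redis instance

def record_sale(sale_id, user, department, amount):
    # Store the sale as a hash and index its id under user and department.
    r.hset("sale:%s" % sale_id, mapping={"user": user,
                                         "department": department,
                                         "amount": amount})
    r.sadd("sales:by_user:%s" % user, sale_id)
    r.sadd("sales:by_department:%s" % department, sale_id)

record_sale("1001", "alice", "east", 250)
record_sale("1002", "alice", "west", 120)

# "Get all sales from user X" becomes a set lookup.
print(r.smembers("sales:by_user:alice"))

# "Sales by alice in the east department" is a set intersection.
print(r.sinter("sales:by_user:alice", "sales:by_department:east"))
```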
Document stores
This article tries to evaluate KV stores against document stores like CouchDB/Riak/MongoDB. Document stores are better at searching (a little slower than plain KV stores), but they aren't good at full-text search.
Full-text-search
If you want to do full-text search queries, you could look at:
Elasticsearch (videos): when I saw the video demonstration of Elasticsearch it looked pretty cool. You could try putting your data into Elasticsearch (POSTing simple JSON) and see how fast it is. I am following Elasticsearch on GitHub and the author is committing a lot of new code to it.
Solr (tutorial): a lot of big companies use Solr (GitHub, Digg) to power their search. They got a big boost going from MySQL full-text search to Solr.
You probably do need a full relational DBMS, if not right now then very soon. If you start now, while your problems and data are simple and straightforward, then when they become complex and difficult you will have plenty of experience with at least one DBMS to help you. You probably don't need MySQL on all the desktops; you might, for example, install it on a server and feed data out over your network. But you perhaps need to provide more information about your requirements, toolset and equipment to get better suggestions.
And, while the other DBMSes have their strengths and weaknesses too, there's nothing wrong with MySQL for large and complex databases. I don't know enough about SQLite to comment knowledgeably about it.
EDIT: @Eric, from your comments to my answer and the other answers, I hold even more strongly the view that it is time you moved to a database. I'm not surprised that trying to do database operations on a 900 MB Python dictionary is slow. I think you have to first convince yourself, then your management, that you have reached the limits of what your current toolset can cope with, and that future developments are threatened unless you rethink matters.
If your network really can't support a server-based database, then (a) you really need to make your network robust, reliable and performant enough for such a purpose, but (b) if that is not an option, or not an early option, you should be thinking along the lines of a central database server passing out digests/extracts/reports to the other users, rather than simultaneous, full RDBMS use in a client-server configuration.
The problems you are currently experiencing are problems of not having the right tools for the job, and they are only going to get worse. I wish I could suggest a magic way around this, but I can't, and I don't think anyone else can.
Have you done any benchmarking to confirm that it is the text files that are slowing you down? If you haven't, there's a good chance that tweaking some other part of the code will speed things up enough.
It sounds like each department has their own feudal database, and this implies a lot of unnecessary redundancy and inefficiency.
Instead of transferring hundreds of megabytes to everyone across your network, why not keep your data in MySQL and have the departments upload their data to the database, where it can be normalized and accessible by everyone?
As your organization grows, having completely different departmental databases that are unaware of each other, and contain potentially redundant or conflicting data, is going to become very painful.
Does the machine this process runs on have sufficient memory and bandwidth to handle this efficiently? Putting MySQL on a slow machine and recoding the tool to use MySQL rather than text files could potentially be far more costly than simply adding memory or upgrading the machine.
Here is a performance benchmark of different database suites:
Database Speed Comparison
I'm not sure how objective the above comparison is, though, seeing as it's hosted on sqlite.org. SQLite only seems to be a bit slower when dropping tables; otherwise you shouldn't have any problems using it. Both SQLite and MySQL seem to have their own strengths and weaknesses: in some tests one is faster than the other, and in other tests the reverse is true.
If you've been experiencing lower-than-expected performance, perhaps it is not SQLite that is causing it. Have you done any profiling or otherwise made sure that nothing else is causing your program to misbehave?
EDIT: Updated with a link to a slightly more recent speed comparison.
It has been a couple of months since I posted this question and I wanted to let you all know how I solved the problem. I am using Berkeley DB through the bsddb module instead of loading all the data into a Python dictionary. I am not fully happy, but my users are.
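For anyone curious, a rough sketch of what that can look like (my addition; bsddb ships with Python 2, works on string keys and values, and the file name and key scheme here are assumptions):

```python
import bsddb

# 'c' opens the file for read/write, creating it if necessary.
db = bsddb.hashopen("sales.db", "c")

# Keys and values must be strings, so structured records need some
# serialization (e.g. a delimited string or a pickled/JSON blob).
db["sale:1001"] = "alice|east|250"
db["sale:1002"] = "alice|west|120"
db.sync()                        # flush changes to disk

print(db["sale:1001"])           # read back one record without loading
                                 # the whole data set into memory
db.close()
```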
My next step is trying to get a shared server with Redis, but unless users start complaining about speed, I doubt I will get it.
Many thanks everybody who helped here, and I hope this question and answers are useful to somebody else.
If you have that problem with a CSV file, maybe you can just pickle the dictionary and generate a "binary" pickle file using the pickle.HIGHEST_PROTOCOL option. It can be faster to read and you get a smaller file. You can load the CSV file once and then generate the pickled file, allowing faster loads on subsequent accesses.
Anyway, with 900 MB of information, you're going to spend some time loading it into memory. Another approach is not to load it in one step, but to load only the information when needed, maybe by splitting it into different files by date or by some other category (company, type, etc.).
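A small sketch of the pickle approach (my addition, in Python 3 syntax; the file names and the assumed 'id' CSV column are placeholders):

```python
import csv
import pickle

# One-time conversion: read the weekly CSV dump into a dict and pickle it.
def csv_to_pickle(csv_path="sales.csv", pickle_path="sales.pkl"):
    data = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            data[row["id"]] = row          # assumes an 'id' column exists
    with open(pickle_path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

# Subsequent runs load the binary pickle, which is typically much faster
# than re-parsing the CSV.
def load(pickle_path="sales.pkl"):
    with open(pickle_path, "rb") as f:
        return pickle.load(f)
```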
Take a look at MongoDB.