I am an occasional Python programmer who has only worked with MySQL or SQLite databases so far. I am the computer person for everything in a small company, and I have started a new project where I think it is about time to try a new database.
The sales department makes a CSV dump every week, and I need to make a small scripting application that allows people from other departments to mix the information, mostly linking the records. I have all of this solved; my problem is the speed. I am using plain text files for everything, and unsurprisingly it is very slow.
I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way to play with big amounts of data in a decent time.
Update: I think I was not very detailed about my database usage and so explained my problem badly. I am reading all the data (~900 MB or more) from a CSV into a Python dictionary and then working with it. My problem is storing and, above all, reading the data quickly.
Many thanks!
Quick Summary
You need enough memory (RAM) to solve your problem efficiently; I think you should upgrade your memory. If you read the excellent High Scalability blog, you will notice that big sites solve their problems efficiently by keeping the complete problem set in memory.
You do need a central database solution. I don't think doing this by hand with Python dictionaries alone will get the job done.
How to solve your problem depends on your queries. What I would try first is putting your data in Elasticsearch (see below) and querying it to see how it performs. I think this is the easiest way to tackle your problem, but as you can read below, there are a lot of ways to tackle it.
We know:
You use Python as your programming language.
Your dataset is ~900 MB (I think that's pretty large, but absolutely manageable).
You have loaded all the data into a Python dictionary. I assume this is where the problem lies. Python tries to hold the dictionary in RAM (and Python dictionaries aren't the most memory-friendly structures), but if you don't have enough memory (how much do you have?), the excess spills into virtual memory. Reading the dictionary then means constantly swapping data between disk and memory, and this swapping causes thrashing. I am assuming that your computer does not have enough RAM. If that is true, I would first add at least 2 GB of extra RAM. When your problem set fits in memory, solving the problem is going to be a lot faster. I opened my computer architecture book, and the chapter on the memory hierarchy says main memory access time is about 40-80 ns while disk access time is around 5 ms. That is a BIG difference.
Missing information
Do you have a central server? You should use/have a server.
What operating system does your server run: Linux, Unix, Windows, or Mac OS X? In my opinion your server should run Linux, Unix, or Mac OS X.
How much memory does your server have?
Could you describe your data set (CSV) a little better?
What kind of data mining are you doing? Do you need full-text search capabilities? I am assuming you are not doing any complicated (SQL-style) queries; performing that kind of task with only Python dictionaries would be a complicated problem. Could you formalize the queries you would like to perform? For example:
"get all users who work for department x"
"get all sales from user x"
Database needed
I am the computer person for everything in a small company, and I have started a new project where I think it is about time to try a new database.
You are right that you need a database to solve your problem. Doing it yourself with only Python dictionaries is difficult, especially when your problem set can't fit in memory.
MySQL
I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way to play with big amounts of data in a decent time.
A centralized (client-server) database is exactly what you need to solve your problem. Let all the users access the database on one PC that you manage. You can use MySQL for this.
Tokyo Tyrant
You could also use Tokyo Tyrant to store all your data. Tokyo Tyrant is pretty fast and the data does not have to fit in RAM; it handles retrieving data more efficiently than Python dictionaries. However, if your problem set can fit completely in memory, I think you should have a look at Redis (below).
Redis:
You could, for example, use Redis (quick start in 5 minutes) to store all sales in memory; Redis is extremely fast and powerful and can run these kinds of queries insanely fast. The only problem with Redis is that the data set has to fit completely in RAM, but I believe the author is working on that (a nightly build already supports it). Also, as I said above, solving your problem entirely from memory is how big sites get results in a timely manner.
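A minimal sketch of what this might look like from Python, assuming a local Redis server and the redis-py client (the key names and fields below are made up for illustration):

    # Assumes a Redis server on localhost and the redis-py client (pip install redis).
    # Key names and fields are hypothetical, just to show the shape of the idea.
    import redis

    r = redis.Redis(host='localhost', port=6379, db=0)

    # Store one sale as a hash and index it by the user who made it.
    sale = {'user': 'alice', 'department': 'sales', 'amount': '199.90'}
    for field, value in sale.items():
        r.hset('sale:1001', field, value)
    r.sadd('sales_by_user:alice', 'sale:1001')

    # "Get all sales from user alice" becomes a set lookup plus hash reads.
    for key in r.smembers('sales_by_user:alice'):
        print(r.hgetall(key))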
Document stores
This article tries to evaluate key-value stores against document stores like CouchDB, Riak and MongoDB. Document stores are better at searching (though a little slower than KV stores), but they aren't good at full-text search.
Full-text-search
If you want to do full-text-search queries you could look at:
elasticsearch (videos): When I saw the video demonstration of elasticsearch it looked pretty cool. You could try putting (POSTing simple JSON) your data into elasticsearch and see how fast it is; a rough sketch follows below. I am following elasticsearch on GitHub and the author is committing a lot of new code to it.
solr (tutorial): A lot of big companies use solr (GitHub, Digg) to power their search. They got a big boost going from MySQL full-text search to solr.
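A rough sketch of what that elasticsearch POST could look like with the requests library (the index name, the fields, and the /_doc endpoint are assumptions and depend on the Elasticsearch version you install):

    # Rough sketch: index one CSV row into Elasticsearch and search it back.
    # Assumes Elasticsearch on localhost:9200 and the requests library; the
    # "sales" index, the fields, and the /_doc endpoint layout are assumptions.
    import requests

    record = {'user': 'alice', 'department': 'sales', 'amount': 199.90}
    # refresh=true makes the document searchable immediately (fine for a demo)
    requests.post('http://localhost:9200/sales/_doc?refresh=true', json=record)

    # Query in the spirit of "get all sales from user alice"
    query = {'query': {'match': {'user': 'alice'}}}
    hits = requests.post('http://localhost:9200/sales/_search', json=query).json()
    print(hits['hits']['hits'])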
You probably do need a full relational DBMS, if not right now, then very soon. If you start now, while your problems and data are simple and straightforward, then by the time they become complex and difficult you will have plenty of experience with at least one DBMS to help you. You probably don't need MySQL on all desktops; you might install it on a server, for example, and feed data out over your network. But you should perhaps provide more information about your requirements, toolset and equipment to get better suggestions.
And while the other DBMSes have their strengths and weaknesses too, there's nothing wrong with MySQL for large and complex databases. I don't know enough about SQLite to comment knowledgeably about it.
EDIT: @Eric, from your comments to my answer and the other answers, I hold even more strongly the view that it is time you moved to a database. I'm not surprised that trying to do database operations on a 900 MB Python dictionary is slow. I think you first have to convince yourself, then your management, that you have reached the limits of what your current toolset can cope with, and that future developments are threatened unless you rethink matters.
If your network really can't support a server-based database, then (a) you really need to make your network robust, reliable and performant enough for such a purpose, but (b) if that is not an option, or not an early option, you should be thinking along the lines of a central database server passing out digests/extracts/reports to other users, rather than a simultaneous, full RDBMS working in a client-server configuration.
The problems you are currently experiencing are problems of not having the right tools for the job. They are only going to get worse. I wish I could suggest a magic way in which this is not the case, but I can't and I don't think anyone else will.
Have you done any benchmarking to confirm that it is the text files that are slowing you down? If you haven't, there's a good chance that tweaking some other part of the code will speed things up so that it's fast enough.
It sounds like each department has their own feudal database, and this implies a lot of unnecessary redundancy and inefficiency.
Instead of transferring hundreds of megabytes to everyone across your network, why not keep your data in MySQL and have the departments upload their data to the database, where it can be normalized and accessible by everyone?
As your organization grows, having completely different departmental databases that are unaware of each other, and contain potentially redundant or conflicting data, is going to become very painful.
Does the machine this process runs on have sufficient memory and bandwidth to handle this efficiently? Putting MySQL on a slow machine and recoding the tool to use MySQL rather than text files could potentially be far more costly than simply adding memory or upgrading the machine.
Here is a performance benchmark of different database suites: Database Speed Comparison
I'm not sure how objective that comparison is, though, seeing as it's hosted on sqlite.org. SQLite only seems to be a bit slower when dropping tables; otherwise you shouldn't have any problems using it. Both SQLite and MySQL seem to have their own strengths and weaknesses: in some tests one is faster than the other, and in other tests the reverse is true.
If you've been experiencing lower-than-expected performance, perhaps it is not SQLite that is causing it. Have you done any profiling or other checks to make sure nothing else is causing your program to misbehave?
EDIT: Updated with a link to a slightly more recent speed comparison.
It has been a couple of months since I posted this question and I wanted to let you all know how I solved the problem. I am using Berkeley DB through the bsddb module instead of loading all the data into a Python dictionary. I am not fully happy, but my users are.
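For reference, here is roughly what the bsddb version looks like (the file name and record layout are simplified for illustration; bsddb only stores plain strings, so I pickle the records, and the module is Python 2 only):

    # Minimal sketch of the Berkeley DB approach via Python 2's bsddb module.
    # bsddb stores plain string keys/values, so structured records are pickled.
    import bsddb
    import pickle

    db = bsddb.hashopen('sales.db', 'c')  # 'c' creates the file if it is missing

    record = {'user': 'alice', 'department': 'sales', 'amount': 199.90}
    db['sale:1001'] = pickle.dumps(record, pickle.HIGHEST_PROTOCOL)
    db.sync()

    print(pickle.loads(db['sale:1001']))
    db.close()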
My next step is trying to set up a shared server running Redis, but unless users start complaining about speed, I doubt I will get it.
Many thanks everybody who helped here, and I hope this question and answers are useful to somebody else.
If you have this problem with a CSV file, maybe you can just pickle the dictionary and generate a pickled binary file using the pickle.HIGHEST_PROTOCOL option. It can be faster to read and you get a smaller file. You can load the CSV file once, generate the pickled file, and then get faster loads on subsequent accesses.
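A sketch of that idea (the file names and the 'id' column are assumptions about your CSV):

    # Sketch of the "convert the CSV once, then load the pickle" idea.
    # File names and the 'id' column are assumptions for illustration.
    import csv
    import pickle

    def csv_to_pickle(csv_path, pickle_path):
        data = {}
        with open(csv_path, 'rb') as f:       # 'rb' for the csv module on Python 2
            for row in csv.DictReader(f):
                data[row['id']] = row
        with open(pickle_path, 'wb') as f:
            pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

    def load_pickle(pickle_path):
        with open(pickle_path, 'rb') as f:
            return pickle.load(f)

    # csv_to_pickle('sales.csv', 'sales.pkl')   # run once per weekly dump
    # data = load_pickle('sales.pkl')           # much faster on later runs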
Anyway, with 900 MB of information, you're going to spend some time loading it into memory. Another approach is not loading it all in one step, but loading only the information when needed, perhaps splitting it into different files by date or some other category (company, type, etc.).
Take a look at mongodb.
Related
Due to several edits, this question might have become a bit incoherent. I apologize.
I'm currently writing a Python server. It will never see more than 4 active users, but I'm a computer science student, so I'm planning for it anyway.
Currently, I'm about to implement a function to save a backup of the current state of all relevant variables into CSV files. Of those I currently have 10, and they will never be really big, but... well, computer science student and so on.
So, I am currently thinking about two things:
When to run a backup?
What kind of backup?
When to run:
I can either run a backup every time a variable changes, which has the advantage of always having the current state in the backup, or something like once every minute, which has the advantage of not rewriting the file hundreds of times per minute if the server gets busy, but will create a lot of useless rewrites of the same data if I don't implement detection of which variables have changed since the last backup.
Directly related to that is the question what kind of backup I should do.
I can either do a full backup of all variables (Which is pointless if I'm running a backup every time a variable changes, but might be good if I'm running a backup every X minutes), or a full backup of a single variable (Which would be better if I'm backing up each time the variables change, but would involve either multiple backup functions or a smart detection of the variable that is currently backed up), or I can try some sort of delta-backup on the files (Which would probably involve reading the current file and rewriting it with the changes, so it's probably pretty stupid, unless there is a trick for this in Python I don't know about).
I cannot use shelves because I want the data to be portable between different programming languages (java, for example, probably cannot open python shelves), and I cannot use MySQL for different reasons, mainly that the machine that will run the Server has no MySQL support and I don't want to use an external MySQL-Server since I want the server to keep running when the internet connection drops.
I am also aware of the fact that there are several ways to do this with preimplemented functions of python and / or other software (sqlite, for example). I am just a big fan of building this stuff myself, not because I like to reinvent the wheel, but because I like to know how the things I use work. I'm building this server partly just for learning python, and although knowing how to use SQLite is something useful, I also enjoy doing the "dirty work" myself.
In my usage scenario of possibly a few requests per day I am tending towards the "backup on change" idea, but that would quickly fall apart if, for some reason, the server gets really, really busy.
So, my question basically boils down to this: Which backup method would be the most useful in this scenario, and have I possibly missed another backup strategy? How do you decide on which strategy to use in your applications?
Please note that I raise this question mostly out of a general curiosity for backup strategies and the thoughts behind them, and not because of problems in this special case.
Use sqlite. You're asking about building persistent storage using CSV files, and about how to update the files as things change. What you're asking for is a lightweight, portable, relational (as in, table-based) database. Sqlite is perfect for this situation.
Python has had sqlite support in the standard library since version 2.5 with the sqlite3 module. Since a sqlite database is implemented as a single file, it's simple to move between machines, and Java has a number of different ways to interact with sqlite.
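A minimal sketch of that (the table layout is made up for illustration, and the same file can be opened from Java through one of the SQLite JDBC drivers):

    # Minimal sqlite3 sketch: one file on disk, no server, usable from Python.
    # The table layout here is made up for illustration.
    import sqlite3

    conn = sqlite3.connect('state.db')
    conn.execute('CREATE TABLE IF NOT EXISTS state (name TEXT PRIMARY KEY, value TEXT)')

    # Upsert a variable's current value.
    conn.execute('INSERT OR REPLACE INTO state (name, value) VALUES (?, ?)',
                 ('player_count', '3'))
    conn.commit()

    for name, value in conn.execute('SELECT name, value FROM state'):
        print(name, value)

    conn.close()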
I'm all for doing things for the sake of learning, but if you really want to learn about data persistence, I wouldn't marry yourself to the idea of a "csv database". I would start by looking at the wikipedia page for Persistence. What you're thinking about is basically a "System Image" for your data. The Wikipedia article describes some of the same shortcomings of this approach that you've mentioned:
State changes made to a system after its last image was saved are lost in the case of a system failure or shutdown. Saving an image for every single change would be too time-consuming for most systems
Rather than trying to update your state wholesale at every change, I think you'd be better off looking at some other form of persistence. For example, some sort of journal could work well. This makes it simple to just append any change to the end of a log-file, or some similar construct.
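A rough sketch of the journal idea: append one JSON line per change and replay the file at startup (the field names and file name are assumptions):

    # Rough sketch of a journal/log approach: append one JSON line per change,
    # then replay the journal at startup. Field names are assumptions.
    import json
    import time

    JOURNAL = 'changes.log'

    def record_change(name, value):
        entry = {'ts': time.time(), 'name': name, 'value': value}
        with open(JOURNAL, 'a') as f:
            f.write(json.dumps(entry) + '\n')

    def replay():
        state = {}
        try:
            with open(JOURNAL) as f:
                for line in f:
                    entry = json.loads(line)
                    state[entry['name']] = entry['value']
        except IOError:
            pass  # no journal written yet
        return state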
However, if you end up with many concurrent users, with processes running on multiple threads, you'll run into concerns about whether or not your changes are atomic, or whether they conflict with one another. While operating systems generally have some ways of dealing with locking files for edits, you're opening up a can of worms trying to learn how that works and interacts with your system. At this point you're back to needing a database.
So sure, play around with a couple different approaches. But as soon as you're looking to just get it working in a clear and consistent manner, go with sqlite.
If your data is in CSV files, why not use a revision control system on those files? E.g. git would be pretty fast and give excellent history. The repository would be wholly contained in the directory where the files reside, so it's pretty easy to handle. You could also replicate that repository to other machines or directories easily.
I am researching a project that would require hundreds of database writes per minute. I have never dealt with this level of data writes before and I am looking for good scalable techniques and technologies.
I am a comfortable Python developer with experience in Django and SQLAlchemy. I am thinking I will build the data interface on Django, but I don't think it is a good idea to go through the ORM for the amount of data writes I will require. I am definitely open to learning new technologies.
The solution will live on Amazon web services, so I have access to all their tools. Ultimately I am looking for advice on database selection, data writing techniques, and any other needs I may have that I do not realize.
Any advice on where to start?
Thanks,
CG
Follow the trends; in other words, enter the world of NoSQL. Some technologies worth considering include MongoDB and Redis. They are really fast and scalable, with decent Python drivers. For example, MongoDB plays really nicely with Django and has many things in common with traditional SQL databases like MySQL. Redis, on the other hand, has more "primitive" data structures but is superior in terms of speed (which of course depends somewhat on the driver). Using either of them (or both, which is a clever idea for something glorious), you are free (and sometimes forced) to write your own "low-level" logic to accomplish your needs.
You should actually be okay with low hundreds of writes per minute through SQLAlchemy (that's only a couple per second); if you're talking more like a thousand per minute, then yes, that might be problematic.
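For what it's worth, here is a sketch of batching those writes through SQLAlchemy Core rather than issuing one ORM save per row (the connection URL, table name and columns are placeholders for your real setup):

    # Sketch: batch writes through SQLAlchemy Core instead of one ORM save per row.
    # The connection URL, table name and columns are placeholders.
    from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

    engine = create_engine('sqlite:///events.db')   # swap for your real database URL
    metadata = MetaData()
    events = Table('events', metadata,
                   Column('id', Integer, primary_key=True),
                   Column('payload', String))
    metadata.create_all(engine)

    rows = [{'payload': 'event %d' % i} for i in range(500)]
    with engine.begin() as conn:        # one transaction for the whole batch
        conn.execute(events.insert(), rows)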
What kind of data do you have? If it's fairly flat (few tables, few relations), you might want to investigate a non-relational database such as CouchDB or Mongo. If you want to use SQL, I strongly recommend PostgreSQL; it seems to deal with large databases and frequent writes a lot better than MySQL.
It also depends how complex the data is that you're inserting.
I think unfortunately, you're going to just have to try a couple things and run benchmarks, as each situation is different and query optimizers are basically magic.
If it's just a few hundred writes per minute you can still do it with a relational DB. I'd pick PostgreSQL (8.0+), which has a separate background writer process. It also has tunable serialization levels, so you can trade some strict ACID compliance for speed, even at the transaction level.
Postgres is well documented, but it assumes some deeper understanding of SQL and relational DB theory to fully understand and make the most of it.
The alternative would be a newfangled "NoSQL" system, which can probably scale even better, but at the cost of buying into a very different technology stack.
Anyway, if you are using Python, it is not 100% critical to never lose a write on shutdown or power loss, and you need low latency, use a thread-safe Queue.Queue and worker threads to decouple the writes from your main application thread(s).
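A minimal sketch of that pattern (the write_to_db function is a placeholder for whatever database write you end up doing; on Python 3 the module is called queue):

    # Sketch of decoupling writes from the application threads with a queue
    # and a worker thread. write_to_db() is a placeholder.
    import threading
    import Queue  # on Python 3 this module is called "queue"

    write_queue = Queue.Queue()

    def write_to_db(record):
        pass  # placeholder: the actual INSERT/commit would go here

    def writer():
        while True:
            record = write_queue.get()
            if record is None:          # sentinel value used to stop the worker
                break
            write_to_db(record)
            write_queue.task_done()

    worker = threading.Thread(target=writer)
    worker.start()

    # Application threads just enqueue, which is cheap and non-blocking:
    write_queue.put({'user': 'alice', 'amount': 199.90})
    write_queue.put(None)               # shut the worker down when finished
    worker.join()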
I have a situation where various analysis programs output large amounts of data, but I may only need to manipulate or access certain parts of the data in a particular Excel workbook.
The numbers might often change as well as newer analyses are run, and I'd like these changes to be reflected in Excel in as automated a manner as possible. Another important consideration is that I'm using Python to process some of the data too, so putting the data somewhere where it's easy for Python and Excel to access would be very beneficial.
I know only a little about databases, but I'm wondering if using one would be a good solution for my needs. Excel has database interaction capability as far as I'm aware, as does Python. The devil is in the details of course, so I need some help figuring out what system I'd actually set up.
From what I've read so far (in the last hour), here's the simple plan I've come up with:
1) Set up an SQLite managed database. Why SQLite? Well, I don't need a database that can manage large volumes of concurrent accesses, but I do need something that is simple to set up, easy to maintain and good enough for use by 3-4 people at most. I can also use the SQLite Administrator to help design the database files.
2 a) Use ODBC/ADO.NET (I have yet to figure out the difference between the two) to help Excel access the database. This is going to be the trickiest part, I think.
2 b) Python already has the built-in sqlite3 module, so no worries with the interface there. I can use it to write the output data into the SQLite-managed database as well!
Putting down some concrete questions:
1) Is a server-less database a good solution for managing my data given my access requirements? If not, I'd appreciate alternative suggestions. Suggested reading? Things worth looking at?
2) Excel-SQLite interaction: I could do with some help fleshing out the details there... ODBC or ADO.NET? Pointers to some good tutorials? etc.
3) Last, but not least, and definitely of concern: will it be easy enough to teach a non-programmer how to set up spreadsheets using queries to the database (assuming they're willing to put in some time getting familiar with it, but not very much)?
I think that about covers it for now, thank you for your time!
Although you could certainly use a database to do what you're asking, I'm not sure you really want to add that complexity. I don't see much benefit in adding a database to your mix. If you were pulling data from a database as well, then it would make more sense to add some tables for this and use it.
From what I currently understand of your requirements, since you're using Python anyway, you could do your preprocessing in Python and then just dump the processed/augmented values out into other CSV files for Excel to import. For a more automated solution, you could even write the results directly to the spreadsheets from Python using something like xlwt.
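A small sketch of the xlwt route (the sheet layout and values are made up for illustration):

    # Small sketch of writing processed results straight to an .xls file with xlwt.
    # The sheet layout and values are made up for illustration.
    import xlwt

    wb = xlwt.Workbook()
    ws = wb.add_sheet('Results')

    ws.write(0, 0, 'sample')    # header row
    ws.write(0, 1, 'value')
    for row, (sample, value) in enumerate([('A1', 0.42), ('A2', 0.57)], start=1):
        ws.write(row, 0, sample)
        ws.write(row, 1, value)

    wb.save('analysis_results.xls')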
I've been developing with Django for the last year or so and really enjoy it. But sometimes I find that the ORM is a bit of a straitjacket. All the data that I shuffle back and forth to the database would easily fit into 1 GB of RAM. Even if the project grew a couple of orders of magnitude, it would still fit into 1 GB.
I'd like a solution where my application only needs to read from disk at startup but writes to disk implicitly as I update my objects. I don't care too much about any speed increase this might give me. What I'm really after is the added flexibility. If I have a problem that would fit nicely with a linked list or a tree or some other data structure, I shouldn't have to graft that onto a relational database.
Python would be nice but other languages are fine. I'm in the exploratory phase on this one. I want to get a feel for what solutions are out there. When googling this question I got a lot of hits related to different NoSQL projects. But NoSQL, as I understand it, is all about what you do when you outgrow relational databases because you've got too much data. I'm really at the other end of the spectrum: I've got so little data that a relational database is actually overkill.
Object databases are another thing that came up when googling this question, which reminded me of Zope and ZODB. I dabbled a bit with Zope an eon ago and really disliked it. But reading up a bit on object databases made me think that they might be what I'm looking for. Then again, their general failure to attract users makes me suspicious. Object databases have been around for a very long time and still haven't caught on. I guess that means there's something wrong with them?
If you are looking to store data structures in memory and back them up to disk, you are really looking for a persistent cache system, and Redis fits the bill perfectly.
If you want to use Django, there is an awesome cache framework built in, and it is pluggable to a Redis backend via the Redis-Cache project. Moreover, the data structures provided by Redis map one-to-one to Python data structures, so it is pretty seamless.
I am not so sure that skipping the concept of a database altogether is a good idea. A database provides so much power in terms of aggregating, annotating, relations and so on, all within acceptable performance levels until you hit really large scale.
Perhaps another idea would be to use an SQLite in-memory database. SQLite is so ubiquitous these days that it has disappeared into the infrastructure: it is built into Android and iPhone apps and has support from all standard libraries. It is also awesome software, developed and tested so well that it is very hard to make a case against using it.
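A tiny sketch of the in-memory SQLite idea (the schema is made up; iterdump is one way to write a snapshot back to disk):

    # Sketch of the in-memory SQLite idea: full SQL over data that lives in RAM,
    # with an explicit dump to disk when you want a snapshot. Schema is made up.
    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE items (id INTEGER PRIMARY KEY, data TEXT)')
    conn.executemany('INSERT INTO items (data) VALUES (?)', [('a',), ('b',)])

    print(conn.execute('SELECT COUNT(*) FROM items').fetchone())

    # Persist a snapshot by dumping the SQL to a file.
    with open('snapshot.sql', 'w') as f:
        for line in conn.iterdump():
            f.write(line + '\n')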
The company I work for (Starcounter) creates a database that works exactly in the way that you describe. We have been running the database for a few years with our partner customers and are about to make the product publicly available. The main reason we created it is ease of use and performance. I'll be happy to send you a copy if you send me a message on our corporate forum (I'm Starcounter Jack).
On the subject of OO databases: the reason that OO databases failed is mainly that they were more experiments than real products. They were poorly implemented, supported only the OO paradigm, and ignored standards such as SQL and ODBC. They also lacked stability, performance and maturity. Their story is analogous to the early tablets, eBooks and smartphones a decade or two before the iPhone, iPad and Kindle.
But just as with any technology, there are two waves (look up "the hype cycle"). While the first wave disappoints, the second wave will be good. The first is driven by the concepts and ideas and lacks commercial success and real-life usability. The second wave wants nothing to do with the musty smell of the failures of the first, and will therefore use new and fresh acronyms and buzzwords.
The future database will spring out of the NoSQL movement. It will have added SQL support and many will think this is novel. It will have added good language integration (and most languages are oo) and many will think this is also novel. It will support documents and many will think this will be novel. Many will rediscover the need for transactions etc. etc.
Some grumpy old men will try to tell us that all we have done is to reinvent existing ideas. In some way they will be right; in some ways they won’t. This time around, the concepts are matured. New ideas will be added and pragmatism will be allowed.
But then again, an iPad is still, in a way, a PDA.
I am reading a CSV file into a list of lists in Python. It is around 100 MB right now; in a couple of years that file will grow to 2-5 GB. I am doing lots of log calculations on the data. The 100 MB file takes the script around 1 minute to process. After the script does a lot of fiddling with the data, it creates URLs that point to Google Charts and then downloads the charts locally.
Can I continue to use Python on a 2 GB file, or should I move the data into a database?
I don't know exactly what you are doing, but a database will just change how the data is stored. In fact it might take longer, since most reasonable databases put constraints on columns and do additional processing for the checks. In many cases, having the whole file locally and going through it doing calculations is going to be more efficient than querying and writing it back to the database (subject to disk speeds, network and database contention, etc.). But in some cases the database may speed things up, especially because indexing makes it easy to get subsets of the data.
Anyway, you mentioned logs, so before you go database-crazy, here are some ideas to check out. I'm not sure whether you have to keep going through every log since the beginning of time to download charts, or whether you expect the file to grow to 2 GB in total versus 2 GB of traffic per day/week.
ARCHIVING -- you can archive old logs, say every few months. Copy the production logs to an archive location and clear the live logs out. This will keep the file size reasonable. If you are wasting time accessing the file to find the small piece you need then this will solve your issue.
You might want to consider converting to Java or C. Especially on loops and calculations you might see a factor of 30 or more speedup, which would probably reduce the time immediately. But over time, as the data creeps up, some day this will slow down as well: if there is no bound on the amount of data, eventually even hand-optimized assembly by the world's greatest programmer will be too slow. Still, it might buy you 10x the time...
You may also want to figure out the bottleneck (is it disk access or CPU time?) and, based on that, work out a scheme to do the task in parallel. If it is processing, look into multi-threading (and eventually multiple computers); if it is disk access, consider splitting the file among multiple machines. It really depends on your situation, but I suspect archiving might eliminate the need here.
As was suggested, if you are doing the same calculations over and over again, then just store them. Whether you use a database or a file this will give you a huge speedup.
If you are downloading stuff and that is a bottleneck, look into conditional GETs using the If-Modified-Since header, so you only download changed items. If you are just processing new charts, ignore this suggestion.
Oh, and if you are sequentially reading a giant log file, looking for a specific place line by line, just make another file that stores the last file position you worked with and then do a seek on each run.
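A rough sketch of that trick (the file paths are placeholders):

    # Sketch of the "remember where you left off" idea for a growing log file.
    # The offset is kept in a small side file; paths are placeholders.
    import os

    LOG = 'access.log'
    OFFSET_FILE = 'access.log.offset'

    def read_new_lines():
        offset = 0
        if os.path.exists(OFFSET_FILE):
            with open(OFFSET_FILE) as f:
                offset = int(f.read() or 0)
        with open(LOG) as f:
            f.seek(offset)              # jump straight past what was already processed
            new_lines = f.readlines()
            offset = f.tell()
        with open(OFFSET_FILE, 'w') as f:
            f.write(str(offset))
        return new_lines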
Before an entire database, you may want to think of SQLite.
Finally a "couple of years" seems like a long time in programmer time. Even if it is just 2, a lot can change. Maybe your department/division will be laid off. Maybe you will have moved on and your boss. Maybe the system will be replaced by something else. Maybe there will no longer be a need for what you are doing. If it was 6 months I'd say fix it. but for a couple of years, in most cases, I'd say just use the solution you have now and once it gets too slow then look to do something else. You could make a comment in the code with your thoughts on the issue and even an e-mail to your boss so he knows it as well. But as long as it works and will continue doing so for a reasonable amount of time, I would consider it "done" for now. No matter what solution you pick, if data grows unbounded you will need to reconsider it. Adding more machines, more disk space, new algorithms/systems/developments. Solving it for a "couple of years" is probably pretty good.
If you need to go through all the lines each time you perform the "fiddling", it wouldn't really make much difference, assuming the actual "fiddling" is what's eating your cycles.
Perhaps you could store the results of your calculations somehow; then a database would probably be nice. Also, databases have methods for ensuring data integrity and things like that, so a database is often a great place for storing large sets of data (duh! ;)).
I'd only put it into a relational database if:
The data is actually relational and expressing it that way helps shrink the size of the data set by normalizing it.
You can take advantage of triggers and stored procedures to offload some of the calculations that your Python code is performing now.
You can take advantage of queries to only perform calculations on data that's changed, cutting down on the amount of work done by Python.
If none of those things is true, I don't see much difference between a database and a file. Both ultimately have to be stored on the file system.
If Python has to process all of it, and getting it into memory means loading an entire data set, then there's no difference between a database and a flat file.
2GB of data in memory could mean page swapping and thrashing by your application. I would be careful and get some data before I blamed the problem on the file. Just because you access the data from a database won't solve a paging problem.
If your data's flat, I see less advantage in a database, unless "flat" == "highly denormalized".
I'd recommend some profiling to see what's consuming CPU and memory before I made a change. You're guessing about the root cause right now. Better to get some data so you know where the time is being spent.
I always reach for a database for larger datasets.
A database gives me some stuff for "free"; that is, I don't have to code it.
searching
sorting
indexing
language-independent connections
Something like SQLite might be the answer for you.
Also, you should investigate the "nosql" databases; it sounds like your problem might fit well into one of them.
At 2 GB, you may start running up against speed issues. I work with model simulations that read hundreds of CSV files, and it takes about an hour to go through 3 iterations, or about 20 minutes per loop.
This is a matter of personal preference, but I would go with something like PostgreSQL because it combines the speed of Python with the capacity of a SQL-driven relational database. I encountered the same issue a couple of years ago when my Access DB was corrupting itself and crashing on a daily basis. It was either MySQL or Postgres, and I chose Postgres because of its Python friendliness. Not to say MySQL would not work with Python, because it does, which is why I say it's personal preference.
Hope that helps with your decision-making!