As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I have huge tables of data that I need to manipulate (sort, calculate new quantities, select specific rows according to some conditions, and so on). So far I have been using spreadsheet software to do the job, but this is really time-consuming and I am trying to find a more efficient way.
I use Python, but I could not figure out how to use it for such things. Can anybody suggest something to use? SQL?
This is a very general question, but there are multiple things that you can do to possibly make your life easier.
1. CSV. These files are very useful if you are storing data that is ordered in columns and you want text files that are easy to read.
2. SQLite3. SQLite3 is a database system that does not require a server (it uses a file instead), and you interact with it just like any other database system. However, it is not recommended for very large scale projects that handle massive amounts of data.
3. MySQL. MySQL is a database system that requires a server, but it can be tuned for very large scale projects as well as small ones.
There are many other types of systems, though, so I suggest you search around for the best fit. If you want to experiment with SQLite3 or CSV, the sqlite3 and csv modules both ship in the standard library of Python 2.7 and 3.x.
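As a rough illustration of the CSV route, here is a minimal sketch using only the standard csv module. The column names and sample rows are made up for the example, and the data is read from an in-memory string rather than a real file.

```python
import csv
import io

# Hypothetical sample data; in practice you would use open("data.csv", newline="").
raw = io.StringIO(
    "name,price,quantity\n"
    "widget,2.50,10\n"
    "gadget,10.00,4\n"
    "doohickey,0.75,40\n"
)

rows = list(csv.DictReader(raw))

# Calculate a new quantity for every row.
for row in rows:
    row["total"] = float(row["price"]) * int(row["quantity"])

# Select the rows that meet a condition, then sort them by the new column.
selected = [row for row in rows if row["total"] >= 30.0]
selected.sort(key=lambda row: row["total"], reverse=True)

names = [row["name"] for row in selected]
print(names)
```

This keeps everything in memory, so it suits tables that fit comfortably in RAM; larger tables are where a database starts to pay off.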
You will probably appreciate the sqlite3 module in Python standard library:
http://docs.python.org/library/sqlite3.html
You get a SQL database that's stored in a file on disk, with no need to configure a separate database server. It's not appropriate for multiple clients accessing at once, but for a single-threaded analysis application like yours, it's a good fit.
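A minimal sketch of what that looks like, assuming a made-up measurements table. It uses an in-memory database so the example is self-contained; passing a file path to sqlite3.connect would store it on disk instead.

```python
import sqlite3

# ":memory:" keeps the example self-contained; use a file path to persist data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sample TEXT, mass REAL, volume REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [("a", 10.0, 4.0), ("b", 3.0, 6.0), ("c", 9.0, 3.0)],
)

# Select rows by a condition, calculate a new quantity, and sort by it.
query = """
    SELECT sample, mass / volume AS density
    FROM measurements
    WHERE mass > 5.0
    ORDER BY density DESC
"""
results = conn.execute(query).fetchall()
conn.close()

print(results)
```

The selecting, calculating, and sorting all happen in a single SQL statement, which is the main gain over doing the same steps by hand in a spreadsheet.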
I'm building a finance application in Python to do time series analysis on security prices (among other things). The heavy lifting will be done in Python, mainly using NumPy, SciPy, and pandas (pandas has an interface for SQLite and MySQL), with a web interface to present results. There will be a few hundred GB of data.
I'm curious what is the better option for database in terms of performance, ease of accessing the data (queries), and interface with Python. I've seen the posts about the general pros and cons of SQLite v. MySQL but I'm looking for feedback that's more specific to a Python application.
The correct answer is PostgreSQL. For most platforms it's just as easy to install as MySQL, but it is a better database, and it's especially an improvement on MySQL when it comes to handling large amounts of data, which you are doing.
I wouldn't even begin to consider handling a few hundred GB of data in SQLite.
SQLite is great for embedded databases, but it's not really great for anything that requires access by more than one process at a time. For this reason it cannot be taken seriously for your application.
MySQL is a much better alternative. I'm also in agreement that Postgres would be an even better option.
For many 'research' oriented time series database loads, it is far faster to do as much analysis as possible in the database than to copy the data to a client and analyze it using a regular programming language. Copying 10 GB across the network is far slower than reading it from disk.
Relational databases were not designed around time series operations, so generating something as simple as security returns from security prices is awkward in MySQL and SQLite, particularly in versions that lack window functions.
Postgres has windowing operations, as do several other relational-like databases; the trade-off is that they don't do as many transactions per second. Many others use K or Q.
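To make the windowing point concrete, here is a hedged sketch of computing simple returns with the LAG window function. It runs against SQLite through Python's sqlite3 module purely so the example is self-contained, and it assumes the bundled SQLite is version 3.25 or newer (the first release with window functions). The prices table and its columns are invented for illustration; the same SELECT should work in Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one closing price per ticker per day.
conn.execute("CREATE TABLE prices (ticker TEXT, day TEXT, close REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("XYZ", "2020-01-01", 100.0),
        ("XYZ", "2020-01-02", 110.0),
        ("XYZ", "2020-01-03", 99.0),
    ],
)

# LAG(close) fetches the previous row's close within each ticker, ordered by day.
# The first day has no previous row, so its return is NULL (None in Python).
query = """
    SELECT day,
           close / LAG(close) OVER (PARTITION BY ticker ORDER BY day) - 1.0 AS ret
    FROM prices
    ORDER BY day
"""
returns = conn.execute(query).fetchall()
conn.close()

for day, ret in returns:
    print(day, ret)
```

Doing this in the database means only the computed returns cross the network, not the raw prices.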
The financial services web apps that I've seen used multiple databases; the raw data was stored in 'research' databases that were multiply indexed and designed for flexibility, while the web-apps interacted directly with in-memory caches and higher-speed RDBs; the tradeoff was that data had to be copied from the 'research' databases to the 'production' databases.
Do you know what the speed differences are between:
pickle
shelve
sqlite
some MySQL connector
MongoDB
...
If I wanted to store many dicts, which would be the preferred way, and what are the differences?
I don't want to drag out the comments, so I'll answer.
Given what you have said:
No server, no existing code. I only want to write a program that locally stores string to string dicts in a file or whatever. No more fancy than that.
I would say your best bet for something so simple is probably something like JSON.
However, if you need it to be super-fast, it may not be the best solution (or it may be; I honestly don't know how it performs in comparison). It's simple, and there are implementations of it for most platforms, which covers a lot of the ground you want. If you want the best speed possible, my advice would be to test it; that's the only way you'll know for sure. Of course, simple is usually a good sign for speed.
You haven't given enough information to know how important performance is here. Remember, unless you need the performance (provably) then don't bother optimising until you do. Go for something easy to read and maintain code-side, and easy to work with file-side. This is why I recommend JSON.
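For reference, a minimal sketch of the JSON approach with the standard json module. The dict contents and the file name are placeholders, and a temporary directory is used only to keep the example self-contained.

```python
import json
import os
import tempfile

# Hypothetical string-to-string dict to persist.
settings = {"colour": "blue", "language": "en"}

# A temporary directory keeps the example from touching real files.
path = os.path.join(tempfile.mkdtemp(), "store.json")

# Write the dict out as human-readable JSON.
with open(path, "w", encoding="utf-8") as f:
    json.dump(settings, f, indent=2, sort_keys=True)

# Read it back into a regular dict.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded)
```

The whole dict is rewritten on every save, which is fine for small data but worth measuring if the dicts grow large.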
For persistent string-to-string dicts, anydbm is pretty reasonable. bsddb can be used from anydbm, and is fast but a bit sensitive to being interrupted. gdbm can be used from anydbm, and is slower but not likely to yield a corrupted database.
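A small sketch of this approach, with one caveat: it uses the dbm module, which is the Python 3 name for what Python 2 called anydbm. The keys, values, and file name are placeholders, and which backend gets picked depends on what is installed.

```python
import dbm
import os
import tempfile

# A temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "store")

# "c" opens the database for reading and writing, creating it if needed.
with dbm.open(path, "c") as db:
    db["greeting"] = "hello"
    db["farewell"] = "goodbye"

# Reopen read-only; keys and values come back as bytes.
with dbm.open(path, "r") as db:
    greeting = db["greeting"].decode("utf-8")
    keys = sorted(k.decode("utf-8") for k in db.keys())

print(greeting, keys)
```

Unlike a JSON file, individual keys can be read and written without loading the whole store into memory.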
Also, if you want to read an entire dict into memory, make a lot of changes, and write the resulting dict back out, there's: http://stromberg.dnsalias.org/svn/dohdbm/trunk/ I'm using this one in a backup software project. It'll compress your dictionaries if you want, which can be a performance win if your I/O is particularly slow, or you have a lot of modifications to make.
I am a decent C/C++ programmer, but I don't know much about web development. I am interested in Twitter/social data mining. So which is the better tool: RoR or Django? I am at level zero in both Ruby and Python, but Python's syntax seemed easier to understand and learn. The main question is: which tool has better mining-related APIs?
Thanks!!
They both have everything you need, but I think Python does better here. Python has a very interesting library for text mining called NLTK, and NumPy/SciPy for analytical computations, which let you achieve performance almost comparable to C. For pure data mining I'd suggest Python plus pandas (pandas is really well written and fast, and there is no Ruby equivalent as far as I know), or Python plus some R code called through rpy. If your data mining code needs to compute some symbolic math, you can use SymPy (slower because it's written in Python, but very complete) or Theano (much faster but with fewer features; it can even make your code run on the GPU through CUDA).
If you are merely collecting data from Twitter, you don't need an MVC framework like Django or RoR. You can use C++ libraries to collect data from Twitter, store it in a database, build the indexing and so on, and then use C or C++ to perform the data mining tasks against your data. Or you can perform the analysis on the go.
If you want to build your own web interface to present your work, or the like, Django and RoR are both very good frameworks that are easy to pick up.
This is not a real question, please read the faq
I want to ask you what programming language I should use to develop a horizontally scalable database. I don't care too much about performance.
Currently, I only know PHP and Python, but I wonder if Python is good for scalability.
Or is this even possible in Python?
The reason I don't use an existing system is that I need deep insight into the system, and there is no database out there that can store indexes the way I want. (It's a mix of non relational, sparse free multidimensional, and graph design)
EDIT:
I already have most of the core code written in Python, and I have investigated ways to improve adding data for that type of database design, which limits the use of other databases even more.
EDIT 2:
Forgot to note, the database tables are several hundred gigabytes.
The development of a scalable database is language independent. I cannot say much about PHP, but I can tell you good things about Python: it's easy to read, easy to learn, etc. In my opinion it makes the code much cleaner than other languages.
Between PHP and Python, definitely Python. Where I work, the entire system is written in Python and it scales quite well.
P.S.: Do take a look at MongoDB though.
You're looking for MongoDB.
MongoDB has some excellent Python drivers. It is a joy to work with.
Since this is clearly a request for "opinion", I thought I'd offer my $.02
We looked at MongoDB 12 months ago, and started to really like it, but for one issue. MongoDB limits the largest database to the amount of physical RAM installed on the MongoDB server. For our tests, this meant we were limited to 4 GB databases. This didn't fit our needs, so we walked away (too bad really, because Mongo looked great).
We moved back to home turf, and went with PostgreSQL for our project. It is an exceptional system, with lots to like.
But we've kept an eye on the NoSQL crowd ever since, and it looks like Riak is doing some really interesting work.
(fyi -- it's also possible the MongoDB project has resolved the DB size issue -- we haven't kept up with that project).
Writing an app in Python, and been playing with various ORM setups and straight SQL. All of which are ugly as sin.
I have been looking at ZODB as an object store, and it looks a promising alternative... would you recommend it? What are your experiences, problems, and criticism, particularly regarding developer's perspectives, scalability, integrity, long-term maintenance and alternatives? Anyone start a project with it and ditch it? Why?
Whilst the ideas behind ZODB, Pypersyst and others are interesting, there seems to be a lack of enthusiasm around for them :(
I've used ZODB for more than ten years now, in Zope and outside. It's great if your data is hierarchical. The largest data store a customer operates has maybe, I don't know, 100 GB in it? Something on that order of magnitude anyway.
Here is a performance comparison against Postgres.
If you're writing a WSGI web app, these packages may be useful:
repoze.tm2 (docs)
repoze.zodbconn (docs)
Compared to "any key-value store", the key features for ZODB would be automatic integration of attribute changes with real ACID transactions, and clean, "arbitrary" references to other persistent objects.
The ZODB is bigger than just the FileStorage used by default in Zope:
The RelStorage backend lets you put your data in an RDBMS which can be backed up, replicated, etc. using standard tools.
ZEO allows easy scaling of appservers and off-line jobs.
The two-phase commit support allows coordinating transactions among multiple databases, including RDBMSes (assuming that they provide a TPC-aware layer).
Easy hierarchy based on object attributes or containment: you don't need to write recursive self-joins to emulate it.
Filesystem-based BLOB support makes serving large files trivial to implement.
Overall, I'm very happy using ZODB for nearly any problem where the shape of the data is not obviously "square".
I would recommend it.
I really don't have any criticisms. If it's an object store you're looking for, this is the one to use. I've stored 2.5 million objects in it before and didn't feel a pinch.
ZODB has been used for plenty of large databases
Most ZODB usage is, or was, probably by Zope users, who tend to drop ZODB when they migrate away from Zope.
Performance is not as good as a relational database plus ORM, especially if you have lots of writes.
Long-term maintenance is not so bad: you want to pack the database from time to time, but that can be done live.
You have to use ZEO if you are going to use more than one process with your ZODB, which is quite a lot slower than using ZODB directly.
I have no idea how ZODB performs on flash disks.
With pickling you should be able to use any key value database in a similar fashion.
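One way to see the pickling idea in practice is the standard shelve module, which pickles values into a dbm-style key-value file. This is a sketch of the general technique, not of ZODB itself; the records and file name are placeholders.

```python
import os
import shelve
import tempfile

# A temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "objects")

# Values are pickled on assignment, so nested Python structures can be stored.
with shelve.open(path) as store:
    store["alice"] = {"balance": 120.5, "tags": ["admin", "staff"]}
    store["bob"] = {"balance": 0.0, "tags": []}

# Reopen the shelf and unpickle the stored objects.
with shelve.open(path) as store:
    alice = store["alice"]
    names = sorted(store.keys())

print(alice, names)
```

Note that a plain shelf does not track changes made inside a stored object: by default you have to assign the modified value back to its key, whereas ZODB's persistent objects pick up attribute changes automatically.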