ZODB In Real Life [closed] - python

I'm writing an app in Python and have been playing with various ORM setups and straight SQL, all of which are ugly as sin.
I have been looking at ZODB as an object store, and it looks a promising alternative... would you recommend it? What are your experiences, problems, and criticism, particularly regarding developer's perspectives, scalability, integrity, long-term maintenance and alternatives? Anyone start a project with it and ditch it? Why?
Whilst the ideas behind ZODB, Pypersyst and others are interesting, there seems to be a lack of enthusiasm around them :(

I've used ZODB for more than ten years now, in Zope and outside. It's great if your data is hierarchical. The largest data store a customer operates has maybe, I don't know, 100GB in it? Something on that order of magnitude, anyway.
Here is a performance comparison against Postgres.
If you're writing a WSGI web app, these packages may be useful:
repoze.tm2 (docs)
repoze.zodbconn (docs)

Compared to "any key-value store", the key features for ZODB would be automatic integration of attribute changes with real ACID transactions, and clean, "arbitrary" references to other persistent objects.
The ZODB is bigger than just the FileStorage used by default in Zope:
The RelStorage backend lets you put your data in an RDBMS which can be backed up, replicated, etc. using standard tools.
ZEO allows easy scaling of appservers and off-line jobs.
The two-phase commit support allows coordinating transactions among multiple databases, including RDBMSes (assuming that they provide a TPC-aware layer).
Easy hierarchy based on object attributes or containment: you don't need to write recursive self-joins to emulate it.
Filesystem-based BLOB support makes serving large files trivial to implement.
Overall, I'm very happy using ZODB for nearly any problem where the shape of the data is not obviously "square".

I would recommend it.
I really don't have any criticisms. If it's an object store you're looking for, this is the one to use. I've stored 2.5 million objects in it before and didn't feel a pinch.

ZODB has been used for plenty of large databases.
Most ZODB usage is/was probably by Zope users, who migrate away from ZODB if they migrate away from Zope.
Performance is not as good as a relational database + ORM, especially if you have lots of writes.
Long-term maintenance is not so bad: you want to pack the database from time to time, but that can be done live.
You have to use ZEO if you are going to use more than one process with your ZODB, which is quite a lot slower than using ZODB directly (see the sketch after this list).
I have no idea how ZODB performs on flash disks.
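To illustrate the ZEO point, here is a hedged sketch of how each process would connect to a shared ZEO server instead of opening the FileStorage directly; the address and port are placeholders, and the server itself is assumed to be started separately (e.g. with runzeo):

import ZODB
from ZEO.ClientStorage import ClientStorage
import transaction

storage = ClientStorage(('localhost', 8100))   # assumed ZEO server address
db = ZODB.DB(storage)
connection = db.open()
root = connection.root()

# Concurrent writers raise conflict errors to be retried, rather than corrupting data.
root['counter'] = root.get('counter', 0) + 1
transaction.commit()

connection.close()
db.close()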

With pickling you should be able to use any key value database in a similar fashion.
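A rough sketch of that idea, using the standard-library dbm module as the key-value store and pickle for serialization (class and key names are illustrative); note that you give up ZODB's transactions and change tracking this way:

import dbm
import pickle

class Account:
    def __init__(self, owner, balance=0):
        self.owner = owner
        self.balance = balance

with dbm.open('objects.db', 'c') as store:
    store['alice'] = pickle.dumps(Account('alice', 100))   # serialize the object to bytes
    alice = pickle.loads(store['alice'])                   # deserialize it back
    print(alice.balance)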

Related

MySQL v. SQLite for Python based financial web app [closed]

I'm building a finance application in Python to do time series analysis on security prices (among other things). The heavy lifting will be done in Python, mainly using NumPy, SciPy, and pandas (pandas has an interface for SQLite and MySQL), with a web interface to present results. There will be a few hundred GB of data.
I'm curious which is the better database option in terms of performance, ease of accessing the data (queries), and interfacing with Python. I've seen the posts about the general pros and cons of SQLite vs. MySQL, but I'm looking for feedback that's more specific to a Python application.
The correct answer is PostgreSQL. For most platforms it's just as easy to install as MySQL, but it is a better database, and it's especially an improvement on MySQL when it comes to handling large amounts of data, which you are doing.
I wouldn't even begin to consider handling a few hundred GB of data in SQLite.
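Since the question mentions pandas, here is a hedged sketch of pulling price data from PostgreSQL into a DataFrame via SQLAlchemy (with the psycopg2 driver assumed to be installed); the connection string, table and column names are invented for the example:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/marketdata')

prices = pd.read_sql_query(
    "SELECT trade_date, close_price FROM daily_prices WHERE ticker = 'AAPL'",
    engine,
    parse_dates=['trade_date'],
)
returns = prices.set_index('trade_date')['close_price'].pct_change()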
SQLite is great for embedded databases, but it's not really great for anything that requires access by more than one process at a time. For this reason it cannot be taken seriously for your application.
MySQL is a much better alternative. I'm also in agreement that Postgres would be an even better option.
For many 'research'-oriented time series database loads, it is far faster to do as much of the analysis as possible in the database than to copy the data to a client and analyze it in a general-purpose programming language. Copying 10GB across the network is far slower than reading it from disk.
Relational databases do not natively support time series operations, so generating something as simple as security returns from security prices is either impossible or very difficult in both MySQL and SQLite.
Postgres has windowing operations, as do several other relational-like databases; the trade-off is that they don't do as many transactions per second. Many others use K or Q.
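To make the windowing point concrete, here is a hedged illustration of computing daily returns from prices with a LAG window function in Postgres, run from Python via psycopg2; the table and column names are made up:

import psycopg2

query = """
    SELECT ticker,
           trade_date,
           close_price / LAG(close_price) OVER (
               PARTITION BY ticker ORDER BY trade_date
           ) - 1 AS daily_return
    FROM daily_prices
    ORDER BY ticker, trade_date;
"""

conn = psycopg2.connect("dbname=marketdata user=postgres")
with conn, conn.cursor() as cur:
    cur.execute(query)
    for ticker, trade_date, daily_return in cur.fetchall():
        print(ticker, trade_date, daily_return)
conn.close()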
The financial services web apps that I've seen used multiple databases; the raw data was stored in 'research' databases that were multiply indexed and designed for flexibility, while the web-apps interacted directly with in-memory caches and higher-speed RDBs; the tradeoff was that data had to be copied from the 'research' databases to the 'production' databases.

sorting and selecting data [closed]

I have huge tables of data that I need to manipulate (sort, calculate new quantities, select specific rows according to some conditions, and so on...). So far I have been using spreadsheet software to do the job, but this is really time consuming and I am trying to find a more efficient way to do it.
I use Python, but I could not figure out how to use it for such things. I am wondering if anybody can suggest something to use. SQL?!
This is a very general question, but there are multiple things that you can do to possibly make your life easier.
1. CSV: very useful if you are storing data that is ordered in columns and you are looking for easy-to-read text files.
2. SQLite3: a database system that does not require a server (it uses a file instead) and is interacted with just like any other database system. However, for very large-scale projects that handle massive amounts of data, it is not recommended.
3. MySQL: a database system that requires a server to interact with, but can be tweaked for very large-scale projects as well as small-scale ones.
There are many other types of systems, though, so I suggest you search around and find the perfect fit. However, if you want to mess around with SQLite3 or CSV, both the sqlite3 and csv modules are supplied in the standard library with Python 2.7 and 3.x, I believe.
You will probably appreciate the sqlite3 module in the Python standard library:
http://docs.python.org/library/sqlite3.html
You get a SQL database that's stored in a file on disk, with no need to configure a separate database server. It's not appropriate for multiple clients accessing at once, but for a single-threaded analysis application like yours, it's a good fit.
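A small sketch of that workflow: load the rows once, then sort, filter and compute with SQL instead of spreadsheet formulas (the table and column names are illustrative):

import sqlite3

conn = sqlite3.connect('measurements.db')
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS samples (name TEXT, value REAL, weight REAL)")
cur.executemany("INSERT INTO samples VALUES (?, ?, ?)",
                [('a', 1.2, 10.0), ('b', 3.4, 2.5), ('c', 0.7, 8.1)])
conn.commit()

# Select specific rows according to a condition, compute a new quantity, and sort.
for name, density in cur.execute(
        "SELECT name, value / weight AS density FROM samples "
        "WHERE value > 1.0 ORDER BY density DESC"):
    print(name, density)

conn.close()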

Speed of standard Python data persistence methods [closed]

Do you know what the speed differences are between:
pickle
shelve
sqlite
some MySQL connector
MongoDB
...
If I wanted to store many dicts, which would be the preferred way, and what are the differences?
I don't want to drag out the comments, so I'll answer.
Given what you have said:
No server, no existing code. I only want to write a program that locally stores string to string dicts in a file or whatever. No more fancy than that.
I would say your best bet for something so simple is probably something like JSON.
However, if you need it to be super-fast, it may not be the best solution (or it may be - I honestly don't know how it performs in comparison). It's simple, and there are implementations of it for most platforms, which covers a lot of the ground you want. If you want the best speed possible, my advice would be to test it; that's the only way you'll know for sure. Of course, simple is usually a good sign for speed.
You haven't given enough information to know how important performance is here. Remember, unless you provably need the performance, don't bother optimising until you do. Go for something easy to read and maintain code-side, and easy to work with file-side. This is why I recommend JSON.
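For what it's worth, the JSON route for a local string-to-string dict really is just a few lines (the file name is arbitrary):

import json

data = {'alpha': 'one', 'beta': 'two'}

with open('store.json', 'w') as f:
    json.dump(data, f)      # write the whole dict out

with open('store.json') as f:
    data = json.load(f)     # read it back in one go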
For persistent string-to-string dicts, anydbm is pretty reasonable. bsddb can be used from anydbm and is fast, but a bit sensitive to being interrupted. gdbm can be used from anydbm and is slower, but not likely to yield a corrupted database.
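A minimal sketch of that (anydbm in Python 2 is the dbm package in Python 3): a persistent string-to-string dict backed by a file, with keys and file name chosen for illustration:

import dbm

with dbm.open('mapping.db', 'c') as db:   # 'c' creates the file if needed
    db['alpha'] = 'one'
    db['beta'] = 'two'

with dbm.open('mapping.db') as db:        # reopen read-only
    print(db['alpha'])                    # values come back as bytes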
Also, if you want to read an entire dict into memory, make a lot of changes, and write the resulting dict back out, there's: http://stromberg.dnsalias.org/svn/dohdbm/trunk/ I'm using this one in a backup software project. It'll compress your dictionaries if you want, which can be a performance win if your I/O is particularly slow, or you have a lot of modifications to make.

Programming a scalable database [closed]

I want to ask you what programming language I should use to develop a horizontally scalable database. I don't care too much about performance.
Currently, I only know PHP and Python, but I wonder if Python is good for scalability.
Or is this even possible in Python?
The reason I don't use an existing system is that I need deep insight into the system, and there is no database out there that can store indexes the way I want. (It's a mix of non-relational, sparse free multidimensional, and graph design.)
EDIT:
I already have most of the core code written in Python and have investigated ways to improve adding data for that type of database design, which limits the use of other databases even more.
EDIT 2:
Forgot to note, the database tables are several hundred gigabytes.
The development of a scalable database is language independent. I cannot say much about PHP, but I can tell you good things about Python: it's easy to read, easy to learn, etc. In my opinion it makes the code much cleaner than other languages.
Between PHP & Python, definitely Python. Where I work, the entire system is written in Python and it scales quite well.
P.S.: Do take a look at MongoDB though.
You're looking for MongoDB.
MongoDB has some excellent Python drivers. It is a joy to work with.
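For anyone curious, a hedged sketch of what that looks like with pymongo; the database, collection and field names are made up:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
collection = client.scratch.items

collection.insert_one({'name': 'widget', 'tags': ['a', 'b'], 'count': 3})
for doc in collection.find({'count': {'$gt': 1}}):
    print(doc['name'], doc['count'])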
Since this is clearly a request for "opinion", I thought I'd offer my $.02
We looked at MongoDB 12 months ago and started to really like it... but for one issue: MongoDB limits the largest database to the amount of physical RAM installed on the MongoDB server. For our tests, this meant we were limited to 4 GB databases. This didn't fit our needs, so we walked away (too bad really, because Mongo looked great).
We moved back to home turf, and went with PostgreSQL for our project. It is an exceptional system, with lots to like.
But we've kept an eye on the NoSQL crowd ever since, and it looks like Riak is doing some really interesting work.
(fyi -- it's also possible the MongoDB project has resolved the DB size issue -- we haven't kept up with that project).

What are the limitations of Django's ORM? [closed]

I've heard of developers not wanting to use an ORM, but I don't know why. What are the shortcomings of the ORM?
Let me start by saying that I fully advocate the use of the ORM for most simple cases. It offers a lot of convenience when working with a very straightforward (relational) data model.
But, since you asked for shortcomings...
From a conceptual point of view, an ORM can never be an effective representation of the underlying data model. It will, at best, be an approximation of your data - and most of the time, this is enough.
The problem is that an ORM will map on a "one class -> one table" basis, which doesn't always work.
If you have a very complex data model - one which, ideally, cannot be properly represented by a single DB table - then you may find that you spend a lot of time fighting against the ORM, rather than having it work for you.
On a practical level, you'll find that there is always a workaround; some developers will be partisan in their support for/against an ORM, but I favour a hybrid approach. Django works well for this, as you can easily drop into raw SQL as needed. Something like:
Model.objects.raw("SELECT ...")
ORMs take a lot of the work out of the other 99.99% of cases, when you're performing simple CRUD operations against your data.
In my experience, the two best reasons to avoid an ORM altogether are:
When you have complex data that is frequently retrieved via multiple joins and aggregations. Often, writing the SQL by hand will be clearer.
Performance. ORMs are pretty good at constructing optimised queries, but nothing can compete with writing a nice, efficient piece of SQL.
But, when all's said and done, after working extensively with Django, I can count on one hand the number of occasions that the ORM hasn't allowed me to do what I want.
The creator of SQLAlchemy's response to the question "Is Django now considered Pythonic?" shows a lot of the differences and a deep understanding of the system.
The sqlalchemy_vs_django_db discussion on Reddit.
Note: both links are pretty long and will take time to read. I am not writing a gist of them, as that may lead to misunderstanding.
Another answer from a Django fan, but:
If you use inheritance and query for parent classes, you can't get children (while you can with SQLAlchemy).
Group By and Having clauses are really hard to translate using aggregate/annotate (see the sketch after this list).
Some queries the ORM makes are just ridiculously long, and sometimes you end up with stuff like model.id IN [1, 2, 3... ludicrously long list].
There is a way to ask for "stuff is in field" using __contains, but not "field is in stuff". Since there is no portable way to do this across DBMSes, writing raw SQL for it is really annoying. A lot of small edge cases like this one appear if your application starts to be complex, because, as @Gary Chambers said, data in the DBMS doesn't always match the OO model.
It's an abstraction, and sometimes, the abstraction leaks.
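As a hedged sketch of the Group By / Having point from the list above, using a hypothetical Order model: values() + annotate() emulates GROUP BY, and filtering on the annotation plays the role of HAVING:

from django.db.models import Count, Sum
from myapp.models import Order   # hypothetical app and model

big_customers = (
    Order.objects
    .values('customer_id')                        # GROUP BY customer_id
    .annotate(total=Sum('amount'), n=Count('id'))
    .filter(total__gt=1000)                       # roughly: HAVING SUM(amount) > 1000
)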
But more often than not, the people I meet who don't want to use an ORM do it for the wrong reason: intellectual laziness. Some people won't make the effort to give a fair try to something because they know something else and want to stick to it. And it's scary how many of them you can find in computer science, where a good part of the job is about keeping up with the new stuff.
Of course, in some areas it just makes sense. But usually someone with a good reason not to use it will still use it in other cases. I've never met any serious computer scientist saying no to it entirely, just people not using it in some cases and being able to explain why.
And to be fair, a lot of programmers are not computer scientists; there are biologists, mathematicians, teachers, or Bob, the guy next door who just wants to help. From their point of view, it's perfectly logical not to spend hours learning new stuff when you can do what you want with your toolbox.
There are various problems that seem to arise with every Object-Relational Mapping system, about which I think the classic article is by Ted Neward, who described the subject as "The Vietnam of Computer Science". (There's also a followup in response to comments on that post and some comments from Stack Overflow's own Jeff Atwood here.)
In addition, one simple practical problem with ORM systems is they make it hard to see how many queries (and which queries) are actually being run by a given bit of code, which obviously can lead to performance problems. In Django, using the assertNumQueries assertion in your unit tests really helps to avoid this, as does using django-devserver, a replacement for runserver that can output queries as they're being performed.
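A short sketch of the assertNumQueries idea in a Django TestCase; the URL and the expected count are placeholders:

from django.test import TestCase

class ArticleListQueryTest(TestCase):
    def test_list_view_query_count(self):
        # Fails if the view runs more (or fewer) queries than expected,
        # which catches accidental N+1 query regressions from the ORM.
        with self.assertNumQueries(3):
            self.client.get('/articles/')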
One of the biggest problems that comes to mind is that building inheritance into the Django ORM is difficult. Essentially this is due to the fact that (Django) ORM layers are trying to bridge the gap by being both relational and OO. Another thing is, of course, multiple-field foreign keys.
One charge leveled at the Django ORM is that it abstracts away so much of the database engine that writing efficient, scalable applications with it is impossible. For some kinds of applications - those with millions of accesses and highly interrelated models - this assertion is often true.
The vast majority of Web applications never reach such huge audiences and don't achieve that level of complexity. Django's ORM is designed to get projects off the ground quickly and to help developers jump into database-driven projects without requiring a deep knowledge of SQL. As your Web site gets bigger and more popular, you will certainly need to audit performance as described in the first section of this article. Eventually, you may need to start replacing ORM-driven code with raw SQL or stored procedures (read: SQLAlchemy etc.).
Happily, the capabilities of Django's ORM continue to evolve. Django 1.1's aggregation library is a major step forward, allowing efficient query generation while still providing a familiar object-oriented syntax. For even greater flexibility, Python developers should also look at SQLAlchemy, especially for Python Web applications that don't rely on Django.
IMHO the bigger issue with the Django ORM is the lack of composite primary keys; this prevents me from using some legacy databases with django.contrib.admin.
I prefer SQLAlchemy over the Django ORM; for projects where django.contrib.admin is not important, I tend to use Flask instead of Django.
Django 1.4 is adding some nice "batch" tools to the ORM.
