There's an API for Twisted apps to talk to a database in a scalable way: twisted.enterprise.dbapi
The confusing thing is, which database to pick?
The database will have a Twisted app that is mostly making inserts and updates and relatively few selects, and then other strictly-read-only clients that are accessing the database directly making selects.
(The read-only users are not necessarily selecting the data that the Twisted app is inserting; its not as though the database is being used as a message-queue)
My understanding - which I'd like corrected/adviced - is that:
Postgres is a great DB, but almost all the Python bindings - and there is a confusing maze of them - are abandonware
There is psycopg2 for postgres, but that makes a lot of noise about doing its own connection-pooling and things; does this co-exist gracefully/usefully/transparently with the Twisted async database connection pooling and such?
SQLLite is a great database for little things but if used in a multi-user way it does whole-database locking, so performance would suck in the usage pattern I envisage; it also has different mechanisms for typing column values?
MySQL - after the Oracle takeover, who'd want to adopt it now or adopt a fork?
Is there anything else out there?
Scalability
twisted.enterprise.adbapi isn't necessarily an interface for talking to databases in a scalable way. Scalability is a problem you get to solve separately. The only thing twisted.enterprise.adbapi really claims to do is let you use DB-API 2.0 modules without the blocking that normally implies.
Postgres
Yes. This is the correct answer. I don't think all of the Python bindings are abandonware - psycopg2, for example, seems to be actively maintained. In fact, they just added some new bindings for async access which Twisted might eventually offer an interface.
SQLite3 is pretty cool too. You might want to make it possible to use either Postgres or SQLite3 in your app; your unit tests will definitely be happier running against SQLite3, for example, even if you want to deploy against Postgres.
Other?
It's hard to know if another database entirely (something non-relational, perhaps) would fit your application better than Postgres. That depends a lot on the specific data you're going to be storing and the queries you need to run against it. If there are interesting relationships in your database, Postgres does seem like a pretty good answer. If all your queries look like "SELECT foo, bar FROM baz" though, there might be a simpler, higher performance option.
There is the txpostgres library which is a drop in replacement for twisted.enterprise.dbapi, —instead of a thread pool and blocking DB IO, it is fully asynchronous, leveraging the built in async capabilities of psycopg2.
We are using it in production in a big corporation and it's been serving us very well so far. Also, it's actively developed—a bug we reported recently was solved very quickly.
http://pypi.python.org/pypi/txpostgres
https://github.com/wulczer/txpostgres
You could look at nosql databases like mongodb or couchdb with twisted.
Scaling out could be rather easier with nosql based databases than with mysql or postgres.
Related
This question is more on architecture and libs, than on implementation.
I am currently working at project, which requires a local long-term cache storage (updated once a day) at client kept in sync with a remote db at server. For client side sqlite has been chosen as a lightweight approach and postgresql as feature rich db at server. Native replication mechanisms of postgres are no-opt cause I need to keep client really lightweight and free of relying on external components like db servers.
The implementation language would be Python. Now I'm looking at ORMs like SQLAlchemy, but haven't worked with any before.
Does SQLAlchemy have any tools to keep sqlite and postgres dbs in sync?
If not, are there any other Python libraries which have such tools?
Any ideas about how should the architecture look like, if the task must be solved "by hand"?
Added:
It's like telemetry, cause client would have internet connection only for approximately 20 minutes a day
So, the main question is about architecure of such a system
It doesn't usually fall within the tasks of an ORM to sync data between databases, so you will likely have to implement it yourself. I don't know of any solution that will handle syncing for you given your choice of databases.
There are a couple important design choices to consider:
how do you figure out what data changed ( i.e. inserted, updated or deleted )
what is the most efficient way to package the change-log
will you have to deal with conflicts ? and how will you do that.
The most efficient way to figure out what changed is to have the database tell you that directly. Bottled water can offer some inspiration in this regard. The idea is to tap into the event log postgres would use for replication. You will need something like Kafka to keep track of what each of your clients already knows. This will allow you to optimize your server for writes, as you won't have clients querying trying to figure out what changed since they were last online.
The same can be achieved on the sqlight end with event callbacks, you'll just have to trade some storage space on the client to retain the changes to be sent to the server. If that sounds like too much infrastructure for your needs, it's something that you can easily implement with SQL and pooling as well, but I would still think of it as an event log, and consider how it's implemented a detail - possibly allowing for a more efficient implementation lather on.
The best way to structure and package your change log will depend on your applications requirements, available band-with, etc. You could use standard formats such as json, compress and encrypt if needed.
It will be much simpler to design your application as such to avoid conflicts, and possibly flow data in a single direction, or partition your data so that it always flows in a single direction for a specific partition.
One final taught is that with such an architecture you would be getting incremental updates, some of which might be missed for unplanned reasons ( system failure, bugs, dropped messages, etc ). You could have some built in heuristic to check that your data matches, like at least checking the number of records on each side, with some way to recover such a fault, at a minimal a way to manually re-fetch the data from the authoritative source, i.e. if the server is authoritative, the client should be able to discard it's data and re-fetch it. You might need such a mechanism anyway for cases wen the client is reinstalled, etc.
I'm currently building a web service using python / flask and would like to build my data layer on top of neo4j, since my core data structure is inherently a graph.
I'm a bit confused by the different technologies offered by neo4j for that case. Especially :
i originally planned on using the REST Api through py2neo , but the lack of transaction is a bit of a problem.
The "embedded database" neo4j doesn't seem to suit my case very well. I guess it's useful when you're working with batch and one-time analytics, and don't need to store the database on a different server from the web server.
I've stumbled upon the neo4django project, but i'm not sure this one offers transaction support (since there are no native client to neo4j for python), and if it would be a problem to use it outside django itself. In fact, after having looked at the project's documentation, i feel like it has exactly the same limitations, aka no transaction (but then, how can you build a real-world service when you can corrupt your model upon a single connection timeout ?). I don't even understand what is the use for that project.
Could anyone could recommend anything ? I feel completely stuck.
Thanks
None of the REST API clients will be able to explicitly support (proper) transactions since that functionality is not available through the Neo4j REST API interface. There are a few alternatives such as Cypher queries and batched execution which all operate within a single atomic transaction on the server side; however, my general approach for client applications is to try to build code which can gracefully handle partially complete data, removing the need for explicit transaction control.
Often, this approach will make heavy use of unique indexing and this is one reason that I have provided a large number of "get_or_create" type methods within py2neo. Cypher itself is incredibly powerful and also provides uniqueness capabilities, in particular through the CREATE UNIQUE clause. Using these, you can make your writes idempotent and you can err on the side of "doing it more than once" safe in the knowledge that you won't end up with duplicate data.
Agreed, this approach doesn't give you transactions per se but in most cases it can give you an equivalent end result. It's certainly worth challenging yourself as to where in your application transactions are truly necessary.
Hope this helps
Nigel
I think neo4django makes use of neo4j-rest-client, that does support transactions through the batch resource in the Neo4j REST interface.
The syntax is quite similar to the one used by Neo4j Python emebedded API:
>>> n = gdb.nodes.create()
>>> n["age"] = 25
>>> n["place"] = "Houston"
>>> n.properties
{'age': 25, 'place': 'Houston'}
>>> with gdb.transaction():
....: n.delete("age")
....:
>>> n.properties
{u'place': u'Houston'}
More information can be found in the neo4j-rest-client documentation about transactions.
I have a Pylons application using SQLAlchemy with SQLite as backend. I would like to know if every read operation going to SQLite will always lead to a hard disk read (which is very slow compared to RAM) or some caching mechanisms are already involved.
does SQLite maintain a subset of the database in RAM for faster access ?
Can the OS (Linux) do that automatically ?
How much speedup could I expect by using a production database (MySQL or PostgreSQL) instead of SQLite?
Yes, SQLite has its own memory cache. Check PRAGMA cache_size for instance. Also, if you're looking for speedups, check PRAGMA temp_store. There is also API for implementing your own cache.
The SQLite database is just a file to the OS. Nothing is 'automatically' done for it. To ensure caching does happen, there are sqlite.h defines and runtime pragma settings.
It depends, there are a lot of cases when you'll get a slowdown instead.
How much speedup could I expect by using a production database (Mysql or postgres) instead of sqlite?
Are you using sqlite in a production server environment? You probably shouldn't be:
From Appropriate Uses for Sqlite:
SQLite will normally work fine as the database backend to a website.
But if you website is so busy that you are thinking of splitting the
database component off onto a separate machine, then you should
definitely consider using an enterprise-class client/server database
engine instead of SQLite.
SQLite is not designed well for, and was never intended to scale well; SQLite trades convenience for performance; if performance is a concern, you should consider another DBMS
I put the first steps in BlueBream framework. In my project I must get data from RDBMS - MySQL, PostgreSQL and MS Server. For now, I made a simple tutorial helloworld :) I know how to write Interfaces and Implementations, etc.
My question is: How to set up a connections to RDBMS and multiple connections? Could you give me a simple "step by step" tutorial ?
How to bake a database? Is in BlueBream something like a command "syncdb" in Django?
Take a look at zope.sqlalchemy; it integrates SQLAlchemy into the Zope transaction manager. SQLAlchemy in turn lets you access many different databases, including MySQL, PostgreSQL and MS Server.
The zope.sqlalchemy package explains how to obtain a SQLAlchemy session. From there on out you use bog-standard SQLAlchemy operations for which there are plenty of tutorials and help here on SO. SQLAlchemy can also take care of setting up the database schema for you, if that is needed.
I want to write some unittests for an application that uses MySQL. However, I do not want to connect to a real mysql database, but rather to a temporary one that doesn't require any SQL server at all.
Any library (I could not find anything on google)? Any design pattern? Note that DIP doesn't work since I will still have to test the injected class.
There isn't a good way to do that. You want to run your queries against a real MySQL server, otherwise you don't know if they will work or not.
However, that doesn't mean you have to run them against a production server. We have scripts that create a Unit Test database, and then tear it down once the unit tests have run. That way we don't have to maintain a static test database, but we still get to test against the real server.
I've used python-mock and mox for such purposes (extremely lightweight tests that absolutely cannot require ANY SQL server), but for more extensive/in-depth testing, starting and populating a local instance of MySQL isn't too bad either.
Unfortunately SQLite's SQL dialect and MySQL's differ enough that trying to use the former for tests is somewhat impractical, unless you're using some ORM (Django, SQLObject, SQLAlchemy, ...) that can hide the dialect differences.