pymongo: Advantage of using MongoReplicaSetClient?

pymongo: Advantage of using MongoReplicaSetClient? - python

It seems that both MongoClient and MongoReplicaSetClient can connect to mongo replica sets. In fact, their documentation pages are nearly identical - same options, same methods, etc - except that the latter's constructor requires me to specify a replicaSet.
In both cases, we may specify a read preference. In both cases, we must handle the AutoReconnect exception if a stepdown occurs.
So my questions are:
Why would one use one versus the other, since one can perform the exact same operations with both?
Both can perform secondary reads, correct? The documentation says that the advantage of a ReplicaSetClient is that we can do secondary reads, but clearly they are supported in both.
The documentation says that the ReplicaSetClient features "replica set health monitoring." What exactly does that mean? Are there new methods I can invoke which tell me about a replset's health that I cannot otherwise do with MongoClient?
In theory a MongoReplicaSetClient will connect to all members of the replset, rather than just one. This is false: you may munge or omit any of the servers in the connection string, and both MongoClient and MongoReplicaSetClient are still able to connect. Am I missing something?

This is was a confusing API choice that we regret in PyMongo 2.x. We will merge all the client classes into MongoClient in PyMongo 3, in April 2015:
http://emptysqua.re/blog/good-idea-at-the-time-pymongo-mongoreplicasetclient/
Meanwhile:
Use MongoReplicaSetClient when you plan to connect to a whole replica set. MongoClient only connects to one member.
A single MongoReplicaSetClient can be used to perform primary or secondary reads, as well as more sophisticated decision-making with read preferences, see my blog post on the subject. A MongoClient will connect to one member of the replica set (the primary) and always read from it, unless you make a direct connection to a secondary using MongoClient, in which case it will always read from that secondary.
MongoReplicaSetClient monitors the set's health with a background thread that periodically checks on all the members. The client tracks whether members are up, it tracks their ping times, and it notices when a member is added. This will reduce the number of exceptions you see on a flaky network or when the replica set's configuration changes, and it allows the client to correctly implement read preferences.
A MongoReplicaSetClient does in fact connect to all members, whereas a MongoClient only connects to one member. MongoReplicaSetClient tries to connect to each member listed in the connection string; as soon as it connects to one it asks that member for a list of all other members. From this point forward it ignores your connection string and uses the list it got from the member it connected to.

Related

Etags used in RESTful APIs are still susceptible to race conditions

Maybe I'm overlooking something simple and obvious here, but here goes:
So one of the features of the Etag header in a HTTP request/response it to enforce concurrency, namely so that multiple clients cannot override each other's edits of a resource (normally when doing a PUT request). I think that part is fairly well known.
The bit I'm not so sure about is how the backend/API implementation can actually implement this without having a race condition; for example:
Setup:
RESTful API sits on top of a standard relational database, using an ORM for all interactions (SQL Alchemy or Postgres for example).
Etag is based on 'last updated time' of the resource
Web framework (Flask) sits behind a multi threaded/process webserver (nginx + gunicorn) so can process multiple requests concurrently.
The problem:
Client 1 and 2 both request a resource (get request), both now have the same Etag.
Both Client 1 and 2 sends a PUT request to update the resource at the same time. The API receives the requests, proceeds to uses the ORM to fetch the required information from the database then compares the request Etag with the 'last updated time' from the database... they match so each is a valid request. Each request continues on and commits the update to the database.
Each commit is a synchronous/blocking transaction so one request will get in before the other and thus one will override the others changes.
Doesn't this break the purpose of the Etag?
The only fool-proof solution I can think of is to also make the database perform the check, in the update query for example. Am I missing something?
P.S Tagged as Python due to the frameworks used but this should be a language/framework agnostic problem.

This is really a question about how to use ORMs to do updates, not about ETags.
Imagine 2 processes transferring money into a bank account at the same time -- they both read the old balance, add some, then write the new balance. One of the transfers is lost.
When you're writing with a relational DB, the solution to these problems is to put the read + write in the same transaction, and then use SELECT FOR UPDATE to read the data and/or ensure you have an appropriate isolation level set.
The various ORM implementations all support transactions, so getting the read, check and write into the same transaction will be easy. If you set the SERIALIZABLE isolation level, then that will be enough to fix race conditions, but you may have to deal with deadlocks.
ORMs also generally support SELECT FOR UPDATE in some way. This will let you write safe code with the default READ COMMITTED isolation level. If you google SELECT FOR UPDATE and your ORM, it will probably tell you how to do it.
In both cases (serializable isolation level or select for update), the database will fix the problem by getting a lock on the row for the entity when you read it. If another request comes in and tries to read the entity before your transaction commits, it will be forced to wait.

Etag can be implemented in many ways, not just last updated time. If you choose to implement the Etag purely based on last updated time, then why not just use the Last-Modified header?
If you were to encode more information into the Etag about the underlying resource, you wouldn't be susceptible to the race condition that you've outlined above.
The only fool proof solution I can think of is to also make the database perform the check, in the update query for example. Am I missing something?
That's your answer.
Another option would be to add a version to each of your resources which is incremented on each successful update. When updating a resource, specify both the ID and the version in the WHERE. Additionally, set version = version + 1. If the resource had been updated since the last request then the update would fail as no record would be found. This eliminates the need for locking.

You are right that you can still get race conditions if the 'check last etag' and 'make the change' aren't in one atomic operation.
In essence, if your server itself has a race condition, sending etags to the client won't help with that.
You already mentioned a good way to achieve this atomicity:
The only fool-proof solution I can think of is to also make the database perform the check, in the update query for example.
You could do something else, like using a mutex lock. Or using an architecture where two threads cannot deal with the same data.
But the database check seems good to me. What you describe about ORM checks might be an addition for better error messages, but is not by itself sufficient as you found.

Give each user their own database

I'm constructing my app such that each user has their own database (for easy isolation, and to minimize the need for sharding). This means that each web request, and all of the background scripts, need to connect to a different database based on which user is making the request, and use that connection for all function calls.
I figure I can make some sort of middleware that would pass the right connection to my web requests by attaching it to the request variable, but I don't know how I should ensure that all functions and model methods called by the request use this connection.

Well how to "ensure that all functions and model methods called by the request use this connection" is easy. You pass the connection into your api as with any well-designed code that isn't relying on global variables for such things. So you have a database session object loaded per-request, and you pass it down. It's very easy for model objects to turtle that session object further without explicitly passing it because each managed object knows what session owns it, and you can query it from there.
db = request.db
user = db.query(User).get(1)
user.add_group('foo')
class User(Base):
def add_group(self, name):
db = sqlalchemy.orm.object_session(self)
group = Group(name=name)
db.add(group)
I'm not recommending you use that exact pattern but it serves as an example of how to grab the session from a managed object, avoiding having to pass the session everywhere explicitly.
On to your original question, how to handle multi-tenancy... In your data model! Designing a system where you are splitting things up at that low of a level is a big maintenance burden and it does not scale well. For example it becomes very difficult to use any type of connection pooling when you have an arbitrary number of independent connections. To get around that people commonly use the SQL SCHEMA feature supported by some databases. That allows you to use the same connection but have access to a different table structure per session. That's better, but again managing all of those schemas independently should raise some red flags, violating DRY with all of that duplication in your data model. Any duplication at that level quickly becomes a burden that you need to be ready for.

Why django and python MySQLdb have one cursor per database?

Example scenario:
MySQL running a single server -> HOSTNAME
Two MySQL databases on that server -> USERS , GAMES .
Task -> Fetch 10 newest games from GAMES.my_games_table , and fetch users playing those games from USERS.my_users_table ( assume no joins )
In Django as well as Python MySQLdb , why is having one cursor for each database more preferable ?
What is the disadvantage of an extended cursor which is single per MySQL server and can switch databases ( eg by querying "use USERS;" ), and then work on corresponding database
MySQL connections are cheap, but isn't single connection better than many , if there is a linear flow and no complex tranasactions which might need two cursors ?

A shorter answer would be, "MySQL doesn't support that type of cursor", so neither does Python-MySQL, so the reason one connection command is preferred is because that's the way MySQL works. Which is sort of a tautology.
However, the longer answer is:
A 'cursor', by your definition, would be some type of object accessing tables and indexes within an RDMS, capable of maintaining its state.
A 'connection', by your definition, would accept commands, and either allocate or reuse a cursor to perform the action of the command, returning its results to the connection.
By your definition, a 'connection' would/could manage multiple cursors.
You believe this would be the preferred/performant way to access a database as 'connections' are expensive, and 'cursors' are cheap.
However:
A cursor in MySQL (and other RDMS) is not a the user-accessible mechanism for performing operations. MySQL (and other's) perform operations in as "set", or rather, they compile your SQL command into an internal list of commands, and do numerous, complex bits depending on the nature of your SQL command and your table structure.
A cursor is a specific mechanism, utilized within stored procedures (and there only), giving the developer a way to work with data in a procedural way.
A 'connection' in MySQL is what you think of as a 'cursor', sort of. MySQL does not expose it's internals for you as an iterator, or pointer, that is merely moving over tables. It exposes it's internals as a 'connection' which accepts SQL and other commands, translates those commands into an internal action, performs that action, and returns it's result to you.
This is the difference between a 'set' and a 'procedural' execution style (which is really about the granularity of control you, the user, is given access to, or at least, the granularity inherent in how the RDMS abstracts away its internals when it exposes them via an API).

As you say, MySQL connections are cheap, so for your case, I'm not sure there is a technical advantage either way, outside of code organization and flow. It might be easier to manage two cursors than to keep track of which database a single cursor is currently talking to by painstakingly tracking SQL 'USE' statements. Mileage with other databases may vary -- remember that Django strives to be database-agnostic.
Also, consider the case where two different databases, even on the same server, require different access credentials. In such a case, two connections will be necessary, so that each connection can successfully authenticate.

One cursor per database is not necessarily preferable, it's just the default behavior.
The rationale is that different databases are more often than not on different servers, use different engines, and/or need different initialization options. (Otherwise, why should you be using different "databases" in the first place?)
In your case, if your two databases are just namespaces of tables (what should be called "schemas" in SQL jargon) but reside on the same MySQL instance, then by all means use a single connection. (How to configure Django to do so is actually an altogether different question.)
You are also right that a single connection is better than two, if you only have a single thread and don't actually need two database workers at the same time.

Managing a pool of connections to a hosted Elastic Search provider

I need a way to manage connections to a hosted Elastic Search provider, to speed up search on my website. We are running Django on Heroku, using the Found ElasticSearch add-on, and pyes, which is an ElasticSearch Python library.
The standard way of setting up a connection to ElasticSearch with pyes is by passing the provider URL into an ES object, like so:
(1) connection = ES(my_elasticsearch_url)
Pyes uses the ES object behind the scenes to establish an open HTTP connection to my ElasticSearch provider, so I can run searches like this:
(2) results = connection.search(some_query, index_name)
Previously, I was doing both those steps in my Django view for search -- every time a user did a search, it opened a new HTTP connection then ran the search. Consequentially the search call was slow.
I sped up search by moving (1) into my app's __init__.py file -- now, I am setting up the connection only once, and importing it into the search view. But I'm worried it will choke that HTTP connection if lots of people are trying search at once.
I'm looking for ideas on how to set up a pool of connections, initiate them once on app start up, and then dole them out to my search view as needed. Ideally I'd like to be able to scale the size of the pool up and down easily with minimal changes to my code.
I can think of a few ways to approach it, but it seems like a common computing related problem, so I'm sure that a lot of you have ideas on good design and best practices for such a system. I'd love to hear them.
Thanks a lot!
Clay

If your running in a multi-threaded environment, it's merely a matter of extending Queue.Queue to create an instance that can fetch and instantiate connections on demand, from multiple threads in which your views are handling the request-response flow. You'll probably want to have a certain cap on how many connections your retaining by limiting the maximum size of the queue, although you can instantiate more connections beyond that and simply discard them if you can put them back into the queue.
The downside of using Queue.Queue is that it can create cross-cutting concerns if your views are responsible for retrieving connections from and returning them back into the queue. You can get a healthier design if you only queue the actual object from pyes.ES that holds the connection and create a wrapper for ES that, when performing a query, creates a new ES instance, fetches a connection from the queue, sets it on the instance, performs the query, returns the connection back into the queue, discards the ES instance and returns the query results.

How do I create a D-Bus service that dynamically creates multiple objects?

I'm new to D-Bus (and to Python, double whammy!) and I am trying to figure out the best way to do something that was discussed in the tutorial.
However, a text editor application
could as easily own multiple bus names
(for example, org.kde.KWrite in
addition to generic TextEditor), have
multiple objects (maybe
/org/kde/documents/4352 where the
number changes according to the
document), and each object could
implement multiple interfaces, such as
org.freedesktop.DBus.Introspectable,
org.freedesktop.BasicTextField,
org.kde.RichTextDocument.
For example, say I want to create a wrapper around flickrapi such that the service can expose a handful of Flickr API methods (say, urls_lookupGroup()). This is relatively straightforward if I want to assume that the service will always be specifying the same API key and that the auth information will be the same for everyone using the service.
Especially in the latter case, I cannot really assume this will be true.
Based on the documentation quoted above, I am assuming there should be something like this:
# Get the connection proxy object.
flickrConnectionService = bus.get_object("com.example.FlickrService",
"/Connection")
# Ask the connection object to connect, the return value would be
# maybe something like "/connection/5512" ...
flickrObjectPath = flickrConnectionService.connect("MY_APP_API_KEY",
"MY_APP_API_SECRET",
flickrUsername)
# Get the service proxy object.
flickrService = bus.get_object("com.example.FlickrService",
flickrObjectPath);
# As the flickr service object to get group information.
groupInfo = flickrService.getFlickrGroupInfo('s3a-belltown')
So, my questions:
1) Is this how this should be handled?
2) If so, how will the service know when the client is done? Is there a way to detect if the current client has broken connection so that the service can cleanup its dynamically created objects? Also, how would I create the individual objects in the first place?
3) If this is not how this should be handled, what are some other suggestions for accomplishing something similar?
I've read through a number of D-Bus tutorials and various documentation and about the closest I've come to seeing what I am looking for is what I quoted above. However, none of the examples look to actually do anything like this so I am not sure how to proceed.

1) Mostly yes, I would only change one thing in the connect method as I explain in 2).
2) D-Bus connections are not persistent, everything is done with request/response messages, no connection state is stored unless you implement this in third objects as you do with your flickerObject. The d-bus objects in python bindings are mostly proxies that abstract the remote objects as if you were "connected" to them, but what it really does is to build messages based on the information you give to D-Bus object instantiation (object path, interface and so). So the service cannot know when the client is done if client doesn't announce it with other explicit call.
To handle unexpected client finalization you can create a D-Bus object in the client and send the object path to the service when connecting, change your connect method to accept also an ObjectPath parameter. The service can listen to NameOwnerChanged signal to know if a client has died.
To create the individual object you only have to instantiate an object in the same service as you do with your "/Connection", but you have to be sure that you are using an unexisting name. You could have a "/Connection/Manager", and various "/Connection/1", "/Connection/2"...
3) If you need to store the connection state, you have to do something like that.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.