What's a distributed lock and why use it?

Why do people need a distributed lock?
If the shared resource is already protected by a lock on its local machine, does that mean we do not need a distributed lock?
I mean, when the shared resource is exposed to others through some kind of API or service, and that API or service protects it using its own local locks, then we do not need a distributed lock; am I right?

After asking people on Quora, I think I got the answer.
Let's say N worker servers access a database server. There are two parts here:
The database has its own locking mechanisms to protect its data from being corrupted by the concurrent access of its clients (the N worker servers). This is where the database's local lock comes in.
The N worker servers may also need some coordination to make sure they are doing the right thing, and this is application specific. This is where a distributed lock comes in. Say two worker servers run the same job, which drops a table in the database and then adds some records to it. The database server can certainly guarantee that its internal data stays consistent, but these two jobs need a distributed lock to coordinate with each other; otherwise, one job will drop the other job's table.
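For illustration, here is a minimal sketch of a distributed lock built on Redis's SET NX/PX primitive, assuming a Redis instance every worker can reach; the host, key name, and timeout are made up for the example:

import time
import uuid

import redis  # third-party client: pip install redis

r = redis.Redis(host='redis-host', port=6379)  # assumed shared instance

def acquire_lock(name, ttl_ms=30000):
    # SET key value NX PX ttl: succeeds only if the key does not exist,
    # and expires automatically so a crashed worker can't hold it forever.
    token = str(uuid.uuid4())
    if r.set(name, token, nx=True, px=ttl_ms):
        return token
    return None

def release_lock(name, token):
    # Compare-and-delete: only remove the lock if we still own it.
    script = (
        "if redis.call('get', KEYS[1]) == ARGV[1] then "
        "return redis.call('del', KEYS[1]) end return 0"
    )
    r.eval(script, 1, name, token)

token = None
while token is None:
    token = acquire_lock('lock:rebuild-table')
    if token is None:
        time.sleep(0.1)  # another worker holds the lock; retry shortly
try:
    pass  # drop and rebuild the table here; only one worker runs this
finally:
    release_lock('lock:rebuild-table', token)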

Yes and no. If you're exposing your information through a local API, and that API uses a lock to enforce mutual exclusion, then depending on how the lock is set up your implementation might accomplish exactly what a distributed lock is trying to accomplish. But if you haven't developed the API yourself, you'll have to dig into its source to find out whether its locking is localized or distributed. Honestly, a lock is a lock is a lock: it's attempting to do the same thing no matter what. The benefit of a distributed lock over a localized one is that it already accounts for queueing, preventing clients from over-accessing expensive cache points.

Related

Share state between threads in Bottle

In my Bottle app running on pythonanywhere, I want objects to be persisted between requests.
If I write something like this:
from bottle import route, SimpleTemplate

X = {'count': 0}

@route('/count')
def count():
    X['count'] += 1
    tpl = SimpleTemplate('Hello {{count}}!')
    return tpl.render(count=X['count'])
The count increments, meaning that X persists between requests.
I am currently running this on pythonanywhere, which is a managed service where I have no control over the web server (nginx I presume?) threading, load balancing (if any) etc...
My question is: is this a coincidence, because it's only using one thread while under the minimal load of me doing my tests?
More generally, at which point will this stop working? E.g. I have more than one thread/socket/instance/load-balanced server etc...?
Beyond that, what are my best options to make something like this work (sticking with Bottle), even if I have to move to a barebones server?
Here's what Bottle docs have to say about their request object:
A thread-safe instance of LocalRequest. If accessed from within a request callback, this instance always refers to the current request (even on a multi-threaded server).
But I don't fully understand what that means, or where global variables like the one I used stand with regards to multi-threading.
TL;DR: You'll probably want to use an external database to store your state.
If your application is tiny, and you're planning to always have exactly one server process running, then your current approach can work; "all" you need to do is acquire a lock around every (!) access to the shared state (the dict X in your sample code). (I put "all" in scare quotes there because it's likely to become more complicated than it sounds at first.)
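A minimal sketch of what that locking looks like in the single-process, multi-threaded case (reusing the question's counter; the lock variable is my own addition):

import threading

from bottle import route, SimpleTemplate

X = {'count': 0}
X_lock = threading.Lock()  # guards every access to the shared dict

@route('/count')
def count():
    with X_lock:  # only one request thread touches X at a time
        X['count'] += 1
        current = X['count']
    tpl = SimpleTemplate('Hello {{count}}!')
    return tpl.render(count=current)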
But, since you're asking about multithreading, I'll assume that your application is more than a toy, meaning that you plan to receive substantial traffic and/or want to handle multiple requests concurrently. In this case, you'll want multiple processes, which means that your approach--storing state in memory--cannot work. Memory is not shared across processes. The (general) way to share state across processes is to store the state externally, e.g. in a database.
Are you familiar with Redis? That'd be on my short list of candidates.
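For example, here is a minimal sketch of the same counter kept in Redis instead of a Python dict, assuming a Redis server on localhost and an illustrative key name:

import redis  # pip install redis

from bottle import route, SimpleTemplate

r = redis.Redis(host='localhost', port=6379)  # shared by every worker

@route('/count')
def count():
    # INCR is atomic on the Redis server, so concurrent processes
    # can't lose updates; no application-level lock is needed.
    current = r.incr('hits')  # 'hits' is an illustrative key name
    tpl = SimpleTemplate('Hello {{count}}!')
    return tpl.render(count=current)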
I got the answer by contacting PythonAnywhere support, who had this to say:
When you run a website on a free PythonAnywhere account, just
one process handles all of your requests -- so a global variable like
the one you use there will be fine. But as soon as you want to scale
up, and get (say) a hacker account, then you'll have multiple processes
(note, not threads) -- and of course each one will have its own global
variables, so things will go wrong.
So that part deals with the PythonAnywhere specifics on why it works, and when it would stop working on there.
The answer to the second part, about how to share variables between multiple Bottle processes, I also got from their support (most helpful!) once they understood that a database would not work well in this situation.
Different processes cannot of course share variables, and the most viable solution would be to:
write your own kind of caching server to handle keeping stuff in memory [...] You'd have one process that ran all of the time, and web API requests would access it somehow (an internal REST API?). It could maintain stuff in memory [...]
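A minimal sketch of what such a state-holding process could look like, using Bottle itself for the internal API (the routes and port are my own illustration, not what support prescribed):

# state_server.py: a single long-running process that owns the state.
# Web workers call it over HTTP instead of keeping their own globals.
import json

from bottle import route, run, request

STORE = {}  # lives exactly as long as this one process

@route('/state/<key>')
def get_value(key):
    return json.dumps(STORE.get(key))

@route('/state/<key>', method='PUT')
def set_value(key):
    STORE[key] = request.json
    return 'ok'

if __name__ == '__main__':
    # Bind to localhost so only workers on this box can reach it.
    run(host='127.0.0.1', port=8081)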
PS: I didn't expect the other replies to tell me to store state in a database; I figured the fact that I'm asking this means I have a good reason not to use one. Apologies for the time wasted!

Tornado multi-process mode: how can I send users to the same process?

I have a stateful proof-of-concept Tornado application which enables users (a small user base) to run CPU-bound tasks on potentially large in-memory objects. This poses a problem in a single-threaded configuration, because one user can impact the experience of another.
I could use multiprocessing to ship the task to another process, but this would require repeatedly copying this large data and would not be ideal.
I found that Tornado can be configured to be multi-process. I thought this would resolve my issue for the time being: different users get different processes. However, what I've found is that references to objects go missing when interacting with the web application. I assume this is because Tornado sends each API invocation to a potentially different process, and the object I interacted with previously does not exist in the current process.
Thus my question: can I configure Tornado to send a client/user to the same process repeatedly?
You can't do this with Tornado's multi-process mode, or with any solution that involves multiple processes all listening on a single port. Instead, you need to run your Tornado processes independently on different ports, and use a separate load balancer that can distribute requests to them intelligently (for example, nginx with the ip_hash option).
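A minimal sketch of that setup: each Tornado process is started independently on its own port, and the sticky routing happens in the load balancer (the handler is illustrative):

# app.py: run one copy per core, each on its own port, e.g.
#   python app.py 8001 & python app.py 8002 & ...
# then point a balancer with sticky routing (e.g. nginx's ip_hash) at
# them so a given client always lands on the same process.
import sys

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        # Per-process state survives between a client's requests
        # because the balancer keeps sending that client here.
        self.write('served by port %s' % self.application.settings['port'])

if __name__ == '__main__':
    port = int(sys.argv[1])
    app = tornado.web.Application([(r'/', MainHandler)], port=port)
    app.listen(port)  # each process binds its own port; no shared socket
    tornado.ioloop.IOLoop.current().start()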

How to ensure several Python processes access the database one by one?

I have a lot of scripts running: scrapers, checkers, cleaners, etc. They have some things in common:
they are forever running;
they have no time constraint to finish their job;
they all access the same MySQL DB, writing and reading.
Accumulated, they are starting to slow down the website, which runs on the same system but depends on these scripts.
I can use queues with Kombu to serialize all the writes.
But do you know a way to do the same with reads?
E.g.: if one script needs to read from the DB, its request is sent to a blocking queue, and it resumes when it gets the answer. This way everybody makes requests to one process, and that process is the only one talking to the DB, making one request at a time.
I have no idea how to do this.
Of course, in the end I may have to add more servers to the mix, but before that, is there something I can do at the software level?
You could use a connection pooler and make the connections from the scripts go through it. It would limit the number of real connections hitting your DB while being transparent to your scripts (their connections would be held in a "wait" state until a real connection is freed).
I don't know which DB you use, but for Postgres I'm using PgBouncer for similar reasons; see http://pgfoundry.org/projects/pgbouncer/
You say that your dataset is <1GB and the problem is CPU bound.
Now start analyzing what is eating CPU cycles:
Which queries are really slow and executed often? MySQL can log those queries.
What about the slow queries? Can they be accelerated by using an index?
Are there unused indices? Drop them!
Nothing helps? Can you solve it by denormalizing/precomputing stuff?
You could create a function that each process must call in order to talk to the DB, and rewrite the scripts so that they call that function rather than talking to the DB directly. Within that function, you could have a scope-based lock so that only one process talks to the DB at a time.
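A minimal sketch of such a guard, assuming all the scripts run on one POSIX machine and can share a lock file (the path and names are my own):

# db_guard.py, shared by all the scripts. The file lock serializes
# DB access across separate processes (POSIX-only, via fcntl).
import fcntl
from contextlib import contextmanager

LOCK_PATH = '/tmp/db.lock'  # illustrative path

@contextmanager
def db_lock():
    with open(LOCK_PATH, 'w') as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until we hold the lock
        try:
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

# In each script, wrap every DB conversation:
#   with db_lock():
#       cursor.execute('SELECT ...')
#       rows = cursor.fetchall()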

Error when using multithreading and MySQLdb

I get these errors when my multithreaded program accesses data:
Exception in thread Thread-2:
ProgrammingError: (2014, "Commands out of sync; you can't run this command now")
Exception in thread Thread-3:
ProgrammingError: execute() first
According to PEP 249, data access modules have a module level constant threadsafety:
Integer constant stating the level of thread safety the interface supports. Possible values are:
0: Threads may not share the module.
1: Threads may share the module, but not connections.
2: Threads may share the module and connections.
3: Threads may share the module, connections and cursors.
Sharing in the above context means that two threads may use a resource
without wrapping it using a mutex semaphore to implement resource
locking. Note that you cannot always make external resources thread
safe by managing access using a mutex: the resource may rely on global
variables or other external sources that are beyond your control.
According to MySQLdb User's Guide, the module supports level 1.
The MySQL protocol can not handle multiple threads using the same
connection at once. Some earlier versions of MySQLdb utilized locking
to achieve a threadsafety of 2. While this is not terribly hard to
accomplish using the standard Cursor class (which uses
mysql_store_result()), it is complicated by SSCursor (which uses
mysql_use_result()); with the latter you must ensure all the rows have
been read before another query can be executed. It is further
complicated by the addition of transactions, since transactions start
when a cursor executes a query, but end when COMMIT or ROLLBACK is
executed by the Connection object. Two threads simply cannot share a
connection while a transaction is in progress, in addition to not
being able to share it during query execution. This excessively
complicated the code to the point where it just isn't worth it.
The general upshot of this is: Don't share connections between
threads. It's really not worth your effort or mine, and in the end,
will probably hurt performance, since the MySQL server runs a separate
thread for each connection. You can certainly do things like cache
connections in a pool, and give those connections to one thread at a
time. If you let two threads use a connection simultaneously, the
MySQL client library will probably upchuck and die. You have been
warned.
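You can check the module's level at runtime:

import MySQLdb

print(MySQLdb.threadsafety)  # prints 1: share the module, not connections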
Here are details about the error: http://dev.mysql.com/doc/refman/5.0/en/commands-out-of-sync.html.
MySQLdb's manual suggests the same (the full passage is quoted above): don't share connections between threads.
For threaded applications, try using a connection pool. This can be done using the Pool module
For more information, search for the keyword threadsafety in the MySQLdb manual.
With the little information you've given, I can only guess.
Probably you access the database from several threads without locking. That's bad.
You should hold a threading.Lock() or threading.RLock() while accessing your DB. This prevents several threads from interfering with each other's actions.
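Given MySQLdb's threadsafety of 1 (share the module, not connections), the simpler fix is usually one connection per thread instead of one shared connection; a minimal sketch (connection parameters are placeholders):

import threading

import MySQLdb

def worker(query):
    # Each thread opens its own connection: MySQLdb lets threads share
    # the module, but not a connection.
    conn = MySQLdb.connect(host='localhost', user='user',
                           passwd='secret', db='mydb')  # placeholders
    try:
        cur = conn.cursor()
        cur.execute(query)
        print(cur.fetchall())
    finally:
        conn.close()

threads = [threading.Thread(target=worker, args=('SELECT 1',))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()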

Evaluate my Python server structure

I'm building a game server in Python and I just wanted to get some input on the architecture of the server that I was thinking up.
So, as we all know, Python cannot scale across cores with a single process. Therefore, on a server with 4 cores, I would need to spawn 4 processes.
Here are the steps taken when a client wishes to connect to the server cluster:
The IP the client initially communicates with is the Gateway node. The gateway keeps track of how many clients are on each machine, and forwards the connection request to the machine with the lowest client count.
On each machine, there is one Manager process and X Server processes, where X is the number of cores on the processor (since Python cannot scale across cores, we need to spawn 4 processes to use 100% of a quad-core processor).
The manager's job is to keep track of how many clients are on each process, and to restart any process that crashes. When a connection request is sent from the gateway to a manager, the manager looks at the server processes on that machine (3 in the diagram) and forwards the request to whichever process has the fewest clients.
The Server process is what actually does the communicating with the client.
Here is what a 3 machine cluster would look like. For the sake of the diagram, assume each node has 3 cores.
[Diagram of the three-machine cluster: http://img152.imageshack.us/img152/5412/serverlx2.jpg]
This also got me thinking - could I implement hot swapping this way? Since each process is controlled by the manager, when I want to swap in a new version of the server process I just let the manager know that it should not send any more connections to it, and then I will register the new version process with the old one. The old version is kept alive as long as clients are connected to it, then terminates when there are no more.
Phew. Let me know what you guys think.
Sounds like you'll want to look at PyProcessing, now included in Python 2.6 and beyond as multiprocessing. It takes care of a lot of the machinery of dealing with multiple processes.
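For instance, a minimal sketch of the manager/worker split with multiprocessing (the job format and handler are illustrative):

import multiprocessing

def serve(worker_id, jobs):
    # Each process owns its own clients; only picklable job data
    # crosses the process boundary.
    for job in iter(jobs.get, None):  # None is the shutdown sentinel
        print('worker %d handling %r' % (worker_id, job))

if __name__ == '__main__':
    jobs = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=serve, args=(i, jobs))
               for i in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    for job in ['login:alice', 'move:bob']:
        jobs.put(job)
    for _ in workers:
        jobs.put(None)  # one sentinel per worker
    for w in workers:
        w.join()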
An alternative architectural model is to set up a work queue using something like beanstalkd and have each of the "servers" pull jobs from the queue. That way you can add servers as you wish, swap them out, etc., without having to worry about registering them with the manager (this assumes the work you're spreading over the servers can be quantified as "jobs").
Finally, it may be worthwhile to build the whole thing on HTTP and take advantage of existing, well-known and highly scalable load-distribution mechanisms such as nginx. If you can make the communication HTTP-based, you'll be able to use lots of off-the-shelf tools to handle most of what you describe.
