Evaluate my Python server structure


I'm building a game server in Python and I just wanted to get some input on the architecture of the server that I was thinking up.
So, as we all know, Python cannot scale across cores with a single process. Therefore, on a server with 4 cores, I would need to spawn 4 processes.
Here are the steps taken when a client wishes to connect to the server cluster:
The IP the client initially communicates with is the Gateway node. The gateway keeps track of how many clients are on each machine, and forwards the connection request to the machine with the lowest client count.
On each machine, there is one Manager process and X Server processes, where X is the number of cores on the processor (since Python cannot scale across cores, we need to spawn 4 processes to use 100% of a quad-core processor).
The manager's job is to keep track of how many clients are on each process, as well as to restart the processes if any of them crash. When a connection request is sent from the gateway to a manager, the manager looks at its server processes on that machine (3 in the diagram) and forwards the request to whichever process has the fewest clients.
The Server process is what actually does the communicating with the client.
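The selection rule shared by the gateway and the manager (hand the new connection to whichever target currently has the fewest clients) could be sketched roughly like this. The names `pick_least_loaded`, `assign_client` and the sample counts are made up for illustration; the real bookkeeping would live inside the gateway and manager processes.

```python
def pick_least_loaded(client_counts):
    """Return the key (machine or process id) with the fewest clients."""
    if not client_counts:
        raise ValueError("no targets registered")
    # min() over the keys, ordered by their current client count
    return min(client_counts, key=client_counts.get)


def assign_client(client_counts):
    """Pick a target, record the new client against it, and return it."""
    target = pick_least_loaded(client_counts)
    client_counts[target] += 1
    return target


# Hypothetical per-process counts on one machine with 3 server processes.
counts = {"server-0": 12, "server-1": 7, "server-2": 9}
chosen = assign_client(counts)
```

The same function works at both levels: the gateway would pass per-machine counts, the manager per-process counts.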
Here is what a 3 machine cluster would look like. For the sake of the diagram, assume each node has 3 cores.
[Diagram of the 3-machine cluster: http://img152.imageshack.us/img152/5412/serverlx2.jpg]
This also got me thinking - could I implement hot swapping this way? Since each process is controlled by the manager, when I want to swap in a new version of the server process I just let the manager know that it should not send any more connections to it, and then I will register the new version process with the old one. The old version is kept alive as long as clients are connected to it, then terminates when there are no more.
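One way to picture the hot-swap idea: the manager keeps a "draining" flag per process, skips flagged processes when routing, and drops them once their client count reaches zero. This is only a sketch under those assumptions; `ProcessSlot`, `route` and `reap` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class ProcessSlot:
    name: str
    version: str
    clients: int = 0
    draining: bool = False  # True once a newer version has been swapped in


def route(slots):
    """Send a new connection to the least-loaded slot that is not draining."""
    live = [s for s in slots if not s.draining]
    if not live:
        raise RuntimeError("no process is accepting connections")
    target = min(live, key=lambda s: s.clients)
    target.clients += 1
    return target


def reap(slots):
    """Drop draining slots whose last client has disconnected."""
    return [s for s in slots if not (s.draining and s.clients == 0)]


old = ProcessSlot("server-0", "v1", clients=2)
new = ProcessSlot("server-0b", "v2")
old.draining = True      # swap: stop sending new clients to v1
slots = [old, new]
first = route(slots)     # goes to v2 even though v1 is still alive
old.clients = 0          # the remaining v1 clients disconnect
slots = reap(slots)      # v1 can now be terminated
```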
Phew. Let me know what you guys think.

Sounds like you'll want to look at PyProcessing, now included in Python 2.6 and beyond as multiprocessing. It takes care of a lot of the machinery of dealing with multiple processes.
An alternative architectural model is to set up a work queue using something like beanstalkd and have each of the "servers" pull jobs from the queue. That way you can add servers as you wish, swap them out, etc., without having to worry about registering them with the manager (this is assuming the work you're spreading over the servers can be quantified as "jobs").
Finally, it may be worthwhile to build the whole thing on HTTP and take advantage of existing well known and highly scalable load distribution mechanisms, such as nginx. If you can make the communication HTTP based then you'll be able to use lots of off-the-shelf tools to handle most of what you describe.

Related

Python - Querying/Controlling multiple hosts across WAN link?

We have several data-centres located in several countries (Japan, Hong Kong, Singapore etc.).
We run applications on multiple hosts at each of these locations - probably around 50-100 hosts in total.
I'm working on a Python script that queries the status of each application, sends various triggers to them, and retrieves other things from them during runtime. This script could conceivably query a central server, which would then send the request to an agent running on each host.
One of the requirements is that the script is as responsive as possible - e.g. if I query the status of applications on all hosts in all locations, I would like the result within 1-3 seconds, as opposed to 20-30 seconds.
Hence, querying each hosts sequentially would be too slow, particularly considering the WAN hops we'd need to make.
We can assume that the query on each host itself is fairly trivial (e.g. is process running or not).
I'm fairly new to concurrent programming or asynchronous programming, so would value any input at all here. What is the "best" approach to tackling this problem?
Use a multi-threaded or multi-process approach - e.g. spawn a new thread for each host, send them all out, then wait for replies?
Use asyncore, twisted, tornado - any comments on which if any are suitable here? (I get the impression that asyncore isn't that popular. Tornado might be fun to try, but not sure how it could be used here?)
Use some kind of message queue (e.g. Kombu/RabbitMQ)?
Use celery, somehow? Would it be responsive enough for the responsive times we want? (e.g. under 3 seconds for the above).
Cheers,
Victor

Use gevent.
How?
from gevent import monkey; monkey.patch_socket()  # This should be the first line of your code,
# so that anything socket-based now works asynchronously.
import gevent

def query_server(server_ip):
    # do something with server_ip and sockets, and return the result
    pass

server_ips = [....]
jobs = [gevent.spawn(query_server, server_ip) for server_ip in server_ips]
gevent.joinall(jobs)
print([job.value for job in jobs])  # .value holds each greenlet's return value
Why bother?
All your code will run in a single process and a single thread. This means you won't have to bother with locks, semaphores and message passing.
Your task seems to be mostly network-bound. Gevent will let you do network-bound work asynchronously, which means your code won't busy-wait on network connections, and instead will let OS notify it when the data is received.
It's a personal preference, but I think that gevent is the easiest asynchronous library to use when you want to do one-off work. (Like, you don't have to start a reactor a-la twisted).
Will it work?
The response-time will be the response time of your slowest server.
If using gevent doesn't do it, then you'll have to fix your network.

Use multiprocessing.Pool, especially the map() or map_async() members.
Write a function that takes a single argument (e.g. the hostname, or a list/tuple of hostname and other data). Let that function query a host and return relevant data.
Now compile a list of input variables (hostnames), and use multiprocessing.Pool.map() or multiprocessing.Pool.map_async() to execute the functions in parallel. The async variant will start returning data sooner, but there is a limit to the amount of work you can do in a callback.
This will automatically use as many cores as your machine has to process the functions in parallel.
If there are network delays however, there is not much the python program can do about that.
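A rough sketch of that map() pattern follows. Note the swap: it uses `multiprocessing.pool.ThreadPool`, which exposes the same map()/map_async() interface as `multiprocessing.Pool` but runs threads, on the assumption that the work is network-bound. Switching to `multiprocessing.Pool` should only require that the check function is an importable top-level function. `query_all`, `fake_check` and the host names are hypothetical.

```python
from multiprocessing.pool import ThreadPool


def query_all(hosts, check, workers=20):
    """Run check(host) for every host in parallel and return {host: result}."""
    if not hosts:
        return {}
    pool = ThreadPool(processes=min(workers, len(hosts)))
    try:
        results = pool.map(check, hosts)  # blocks until every host has answered
    finally:
        pool.close()
        pool.join()
    return dict(zip(hosts, results))


def fake_check(host):
    # Placeholder for the real status query (hypothetical).
    return "down" if host.startswith("down") else "up"


statuses = query_all(["tokyo-01", "hk-01", "down-sg-01"], fake_check)
```

The total wall-clock time is then roughly that of the slowest host rather than the sum over all hosts.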

Python: OpenMPI Vs. RabbitMQ

Suppose that one is interested to write a python app where there should be communication between different processes. The communications will be done by sending strings and/or numpy arrays.
What are the considerations to prefer OpenMPI vs. a tool like RabbitMQ?

There is no single correct answer to such a question. It all depends on a large number of different factors. For example:
What kind of communications do you have? Are you sending large packets or small packets, do you need good bandwidth or low latency?
What kind of delivery guarantees do you need?
OpenMPI can instantly deliver messages only to a running process, while different MQ solutions can queue messages and allow fancy producer-consumer configurations.
What kind of network do you have? If you are running on localhost, something like ZeroMQ would probably be the fastest. If you are running on a set of hosts, it depends on the interconnections available. E.g. OpenMPI can utilize InfiniBand/Myrinet links.
What kind of processing are you doing? With MPI all processes are usually started at the same time, do the processing and terminate all at once.

This is exactly the scenario I was in a few months ago and I decided to use AMQP with RabbitMQ using topic exchanges, in addition to memcache for large objects.
The AMQP messages are all strings, in JSON object format so that it is easy to add attributes to a message (like number of retries) and republish it. JSON objects are a subset of JSON that correspond to Python dicts. For instance {"recordid": "272727"} is a JSON object with one attribute. I could have just pickled a Python dict but that would have locked us into only using Python with the message queues.
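To make the "easy to add attributes and republish" point concrete, here is a small sketch using only the standard `json` module. The attribute name `retries` and the helper names are assumptions for illustration; the actual publish/consume calls of an AMQP client are left out.

```python
import json


def make_message(record_id):
    """Build the JSON string that would be published to the exchange."""
    return json.dumps({"recordid": record_id, "retries": 0})


def bump_retries(raw_message):
    """Decode a message, increment its retry count, and re-encode it."""
    message = json.loads(raw_message)
    message["retries"] = message.get("retries", 0) + 1
    return json.dumps(message)


original = make_message("272727")
retried = bump_retries(original)  # ready to be republished
```

Because the payload is plain JSON, a consumer written in another language can read the same message.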
The large objects don't get routed by AMQP, instead they go into a memcache where they are available for another process to retrieve them. You could just as well use Redis or Tokyo Tyrant for this job. The idea is that we did not want short messages to get queued behind large objects.
In the end, my Python processes ended up using both AMQP and ZeroMQ for two different aspects of the architecture. You may find that it makes sense to use both OpenMPI and AMQP but for different types of jobs.
In my case, a supervisor process runs forever and starts a whole flock of workers, which also run forever unless they die or hang, in which case the supervisor restarts them. The work constantly flows in as messages via AMQP, and each process handles just one step of the work, so that when we identify a bottleneck we can have multiple instances of the process, possibly on separate machines, to remove the bottleneck. In my case, I have 15 instances of one process, 4 of two others, and about 8 other single instances.

Python/Redis Multiprocessing

I'm using Pool.map from the multiprocessing library to iterate through a large XML file and save word and ngram counts into a set of three redis servers. (which sit completely in memory) But for some reason all 4 cpu cores sit around 60% idle the whole time. The server has plenty of RAM and iotop shows that there is no disk IO happening.
I have 4 python threads and 3 redis servers running as daemons on three different ports. Each Python thread connects to all three servers.
The number of redis operations on each server is well below what it's benchmarked as capable of.
I can't find the bottleneck in this program. What would be likely candidates?

Network latency may be contributing to your idle CPU time in your python client application. If the network latency between client to server is even as little as 2 milliseconds, and you perform 10,000 redis commands, your application must sit idle for at least 20 seconds, regardless of the speed of any other component.
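The arithmetic behind that estimate, as a tiny sketch. The batch size of 100 in the second call is an assumed figure, included only to show how the lower bound shrinks when several commands share one round trip.

```python
def min_idle_seconds(round_trips, round_trip_seconds):
    """Lower bound on time spent waiting when round trips happen one after another."""
    return round_trips * round_trip_seconds


# 10,000 commands, one round trip each, 2 ms per round trip.
one_by_one = min_idle_seconds(10_000, 0.002)

# Same commands, but 100 commands per round trip (assumed batch size).
batched = min_idle_seconds(10_000 // 100, 0.002)
```

With these numbers the one-at-a-time case waits about 20 seconds and the batched case about 0.2 seconds.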
Using multiple python threads can help, but each thread will still go idle when a blocking command is sent to the server. Unless you have very many threads, they will often synchronize and all block waiting for a response. Because each thread is connecting to all three servers, the chances of this happening are reduced, except when all are blocked waiting for the same server.
Assuming you have uniformly distributed random access across the servers to service your requests (by hashing on key names to implement sharding or partitioning), the odds that two random requests will hash to the same redis server are inversely proportional to the number of servers. For 1 server, 100% of the time you will hash to the same server, for 2 it's 50% of the time, for 3 it's 33% of the time. What may be happening is that 1/3 of the time, all of your threads are blocked waiting for the same server. Redis is single-threaded when handling data operations, so it must process each request one after another. Your observation that your CPU only reaches 60% utilization agrees with the probability that your requests are all blocked on network latency to the same server.
Continuing the assumption that you are implementing client-side sharding by hashing on key names, you can eliminate the contention between threads by assigning each thread a single server connection, and evaluate the partitioning hash before passing a request to a worker thread. This will ensure all threads are waiting on different network latency. But there may be an even better improvement by using pipelining.
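A sketch of evaluating the partitioning hash up front, so that each worker only ever receives keys for its own server. The choice of `zlib.crc32` as the hash and the names `shard_for` and `partition` are assumptions; any stable hash that your existing sharding already uses would take its place.

```python
import zlib
from collections import defaultdict


def shard_for(key, num_servers):
    """Map a key name to a server index with a stable hash."""
    # zlib.crc32 is deterministic across runs, unlike the built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_servers


def partition(keys, num_servers):
    """Group keys by the server that owns them, one bucket per worker."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[shard_for(key, num_servers)].append(key)
    return dict(buckets)


buckets = partition(["the", "quick", "brown", "fox"], num_servers=3)
```

Each bucket can then be handed to the single worker that holds the connection to that server.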
You can reduce the impact of network latency by using the pipeline feature of the redis-py module, if you don't need an immediate result from the server. This may be viable for you, since you are storing the results of data processing into redis, it seems. To implement this using redis-py, periodically obtain a pipeline handle to an existing redis connection object using the .pipeline() method and invoke multiple store commands against that new handle the same as you would for the primary redis.Redis connection object. Then invoke .execute() to block on the replies. You can get orders of magnitude improvement by using pipelining to batch tens or hundreds of commands together. Your client thread won't block until you issue the final .execute() method on the pipeline handle.
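A minimal sketch of that batching, assuming the counts are stored with INCRBY and that `conn` is a redis.Redis object (only its `.pipeline()`, `.incrby()` and `.execute()` methods are used). The function name `store_counts` and the default batch size are made up for illustration.

```python
def store_counts(conn, counts, batch_size=100):
    """Increment counters in Redis, sending batch_size commands per round trip.

    conn is expected to be a redis.Redis instance, or anything offering the
    same pipeline() interface.
    """
    pipe = conn.pipeline()
    queued = 0
    for key, amount in counts.items():
        pipe.incrby(key, amount)  # buffered locally, nothing is sent yet
        queued += 1
        if queued == batch_size:
            pipe.execute()        # one round trip for the whole batch
            queued = 0
    if queued:
        pipe.execute()            # flush the final partial batch
```

Usage would look something like `store_counts(redis.Redis(host="localhost", port=6379), {"word:the": 3})`, where the host, port and key naming are placeholders.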
If you apply both changes, and each worker thread communicates to just one server, pipelining multiple commands together (at least 5-10 to see a significant result), you may see greater CPU usage in the client (nearer to 100%). The CPython GIL will still limit the client to one core, but it sounds like you are already using other cores for the XML parsing by using the multiprocessing module.
There is a good writeup about pipelining on the redis.io site.

Python "Task Server"

My question is: which python framework should I use to build my server?
Notes:
This server talks HTTP with its clients: GET and POST (via pyAMF)
Clients "submit" "tasks" for processing and, then, sometime later, retrieve the associated "task_result"
submit and retrieve might be separated by days - different HTTP connections
The "task" is a lump of XML describing a problem to be solved, and a "task_result" is a lump of XML describing an answer.
When a server gets a "task", it queues it for processing
The server manages this queue and, when tasks get to the top, organises that they are processed.
the processing is performed by a long-running (15 mins?) external program (via subprocess) which is fed the task XML and which produces a "task_result" lump of XML which the server picks up and stores (for later Client retrieval).
it serves a couple of basic HTML pages showing the Queue and processing status (admin purposes only)
I've experimented with twisted.web, using SQLite as the database and threads to handle the long running processes.
But I can't help feeling that I'm missing a simpler solution. Am I? If you were faced with this, what technology mix would you use?

I'd recommend using an existing message queue. There are many to choose from (see below), and they vary in complexity and robustness.
Also, avoid threads: let your processing tasks run in a different process (why do they have to run in the webserver?)
By using an existing message queue, you only need to worry about producing messages (in your webserver) and consuming them (in your long running tasks). As your system grows you'll be able to scale up by just adding webservers and consumers, and worry less about your queuing infrastructure.
Some popular python implementations of message queues:
http://code.google.com/p/stomper/
http://code.google.com/p/pyactivemq/
http://xph.us/software/beanstalkd/

I'd suggest the following. (Since it's what we're doing.)
A simple WSGI server (wsgiref or werkzeug). The HTTP requests coming in will naturally form a queue. No further queueing needed. You get a request, you spawn the subprocess as a child and wait for it to finish. A simple list of children is about all you need.
I used a modification of the main "serve forever" loop in wsgiref to periodically poll all of the children to see how they're doing.
A simple SQLite database can track request status. Even this may be overkill because your XML inputs and results can just lie around in the file system.
That's it. Queueing and threads don't really enter into it. A single long-running external process is too complex to coordinate. It's simplest if each request is a separate, stand-alone, child process.
If you get immense bursts of requests, you might want a simple governor to prevent creating thousands of children. The governor could be a simple queue, built using a list with append() and pop(). Every request goes in, but only requests that fit within some "max number of children" limit are taken out.
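Such a governor might look roughly like the sketch below. It assumes each child handle has a `poll()` method that returns None while the child is still running (the `subprocess.Popen` convention). The class name `Governor` and the `start` callable are hypothetical.

```python
class Governor:
    """Queue requests and keep at most max_children children running."""

    def __init__(self, max_children, start):
        self.max_children = max_children
        self.start = start   # callable: request -> child handle with poll()
        self.pending = []    # plain list used as a FIFO queue
        self.running = []

    def submit(self, request):
        self.pending.append(request)
        self.tick()

    def tick(self):
        """Forget finished children, then start as many pending requests as fit."""
        self.running = [c for c in self.running if c.poll() is None]
        while self.pending and len(self.running) < self.max_children:
            self.running.append(self.start(self.pending.pop(0)))
```

Calling `tick()` from the server's periodic polling loop is one place to hook this in; `start` could be a thin wrapper around `subprocess.Popen`.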

My reaction is to suggest Twisted, but you've already looked at this. Still, I stick by my answer. Without knowing your personal pain points, I can at least share some things that helped me reduce almost all of the deferred-madness that arises when you have several dependent, blocking actions you need to perform for a client.
Inline callbacks (lightly documented here: http://twistedmatrix.com/documents/8.2.0/api/twisted.internet.defer.html) provide a means to make long chains of deferreds much more readable (to the point of looking like straight-line code). There is an excellent example of the complexity reduction this affords here: http://blog.mekk.waw.pl/archives/14-Twisted-inlineCallbacks-and-deferredGenerator.html
You don't always have to get your bulk processing to integrate nicely with Twisted. Sometimes it is easier to break a large piece of your program off into a stand-alone, easily testable/tweakable/implementable command line tool and have Twisted invoke this tool in another process. Twisted's ProcessProtocol provides a fairly flexible way of launching and interacting with external helper programs. Furthermore, if you suddenly decide you want to cloudify your application, it is not all that big of a deal to use a ProcessProtocol to simply run your bulk processing on a remote server (random EC2 instances perhaps) via ssh, assuming you have the keys set up already.

You can have a look at celery

It seems any python web framework will suit your needs. I work with a similar system on a daily basis and I can tell you, your solution with threads and SQLite for queue storage is about as simple as you're going to get.
Assuming order doesn't matter in your queue, then threads should be acceptable. It's important to make sure you don't create race conditions with your queues or, for example, have two of the same job type running simultaneously. If this is the case, I'd suggest a single threaded application to do the items in the queue one by one.

how to process long-running requests in python workers?

I have a python (well, it's php now but we're rewriting) function that takes some parameters (A and B) and computes some results (finds the best path from A to B in a graph; the graph is read-only). In a typical scenario one call takes 0.1s to 0.9s to complete. This function is accessed by users as a simple REST web-service (GET bestpath.php?from=A&to=B). The current implementation is quite stupid - it's a simple php script+apache+mod_php+APC; every request needs to load all the data (over 12MB in php arrays), create all structures, compute a path and exit. I want to change it.
I want a setup with N independent workers (X per server with Y servers), each worker is a python app running in a loop (getting request -> processing -> sending reply -> getting req...), each worker can process one request at a time. I need something that will act as a frontend: get requests from users, manage queue of requests (with configurable timeout) and feed my workers with one request at a time.
How should I approach this? Can you propose some setup? nginx + fcgi or wsgi or something else? haproxy? As you can see I'm a newbie in python, reverse-proxy, etc. I just need a starting point about architecture (and data flow).
btw. workers are using read-only data so there is no need to maintain locking and communication between them

The typical way to handle this sort of arrangement using threads in Python is to use the standard library module Queue. An example of using the Queue module for managing workers can be found here: Queue Example

Looks like you need the "workers" to be separate processes (at least some of them, and therefore might as well make them all separate processes rather than bunches of threads divided into several processes). The multiprocessing module in Python 2.6 and later's standard library offers good facilities to spawn a pool of processes and communicate with them via FIFO "queues"; if for some reason you're stuck with Python 2.5 or even earlier there are versions of multiprocessing on the PyPI repository that you can download and use with those older versions of Python.
The "frontend" can and should be pretty easily made to run with WSGI (with either Apache or Nginx), and it can deal with all communications to/from worker processes via multiprocessing, without the need to use HTTP, proxying, etc, for that part of the system; only the frontend would be a web app per se, the workers just receive, process and respond to units of work as requested by the frontend. This seems the soundest, simplest architecture to me.
There are other distributed processing approaches available in third party packages for Python, but multiprocessing is quite decent and has the advantage of being part of the standard library, so, absent other peculiar restrictions or constraints, multiprocessing is what I'd suggest you go for.
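A bare-bones sketch of that worker arrangement with multiprocessing queues follows. The toy graph, the breadth-first `best_path` (standing in for the real path computation) and all names here are assumptions for illustration. On platforms that spawn rather than fork, `start_workers` should be called from under an `if __name__ == "__main__":` guard.

```python
import multiprocessing
from collections import deque

# Hypothetical read-only graph; each worker process works on its own copy.
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}


def best_path(graph, start, goal):
    """Breadth-first search, standing in for the real path computation."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def worker(tasks, results, graph=GRAPH):
    """Loop: take one request, compute, reply. A None task means shut down."""
    while True:
        task = tasks.get()
        if task is None:
            break
        request_id, start, goal = task
        results.put((request_id, best_path(graph, start, goal)))


def start_workers(n):
    """Start n worker processes sharing one task queue and one result queue."""
    tasks, results = multiprocessing.Queue(), multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(tasks, results))
             for _ in range(n)]
    for p in procs:
        p.start()
    return tasks, results, procs
```

The frontend would put `(request_id, start, goal)` tuples on `tasks`, read answers from `results`, and send one `None` per worker to shut the pool down.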

There are many FastCGI modules with preforked mode and a WSGI interface for python around; the best known is flup. My personal preference for such a task is superfcgi with nginx. Both will launch several processes and will dispatch requests to them. 12 MB is not too much to load separately in each process, but if you'd like to share data among workers you need threads, not processes. Note that heavy math in python with a single process and many threads won't use several CPUs/cores efficiently due to the GIL. Probably the best approach is to use several processes (as many as you have cores), each running several threads (the default mode in superfcgi).

The most simple solution in this case is to use the webserver to do all the heavy lifting. Why should you handle threads and/or processes when the webserver will do all that for you?
The standard arrangement in deployments of Python is:
The webserver starts a number of processes, each running a complete python interpreter and loading all your data into memory.
HTTP request comes in and gets dispatched off to some process
Process does your calculation and returns the result directly to the webserver and user
When you need to change your code or the graph data, you restart the webserver and go back to step 1.
This is the architecture used by Django and other popular web frameworks.
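In WSGI terms, the steps above might look like the sketch below: the data is built at import time, once per server process, and every request handled by that process reuses it. The tiny `GRAPH` dict and the yes/no reply are placeholders for the real data and path computation.

```python
from urllib.parse import parse_qs

# Loaded once when the server process imports this module, then reused by
# every request that process handles. The contents are a hypothetical stand-in.
GRAPH = {"A": {"B": 1}, "B": {}}


def application(environ, start_response):
    """Minimal WSGI app: answers GET /?from=A&to=B from the preloaded data."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    src = params.get("from", [""])[0]
    dst = params.get("to", [""])[0]
    reachable = dst in GRAPH.get(src, {})
    body = ("yes" if reachable else "no").encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

For a quick local trial the standard library's `wsgiref.simple_server.make_server("", 8000, application).serve_forever()` will serve it; in production the same `application` callable would be handed to whichever WSGI server you deploy.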

I think you can configure mod_wsgi/Apache so it will have several "hot" Python interpreters in separate processes ready to go at all times and also reuse them for new accesses (and spawn a new one if they are all busy).
In this case you could load all the preprocessed data as module globals and they would only get loaded once per process and get reused for each new access. In fact I'm not sure this isn't the default configuration for mod_wsgi/Apache.
The main problem here is that you might end up consuming a lot of "core" memory (but that may not be a problem either).
I think you can also configure mod_wsgi for single process/multiple thread -- but in that case you may only be using one CPU because of the Python Global Interpreter Lock (the infamous GIL), I think.
Don't be afraid to ask at the mod_wsgi mailing list -- they are very responsive and friendly.

You could use nginx load balancer to proxy to PythonPaste paster (which serves WSGI, for example Pylons), that launches each request as separate thread anyway.

Another option is a queue table in the database.
The worker processes run in a loop or off cron and poll the queue table for new jobs.
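A sketch of such a queue table using the standard `sqlite3` module. The table and column names are invented for illustration, and a real deployment would likely use whatever database the application already has.

```python
import sqlite3


def create_queue(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()


def enqueue(conn, payload):
    conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    conn.commit()


def claim_next(conn):
    """Mark the oldest pending job as running and return (id, payload), or None."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE status = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        cur = conn.execute(
            "UPDATE jobs SET status = 'running' "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        if cur.rowcount == 0:
            return None  # another worker claimed it first
        return row
```

A worker would call `claim_next()` in its loop (or on each cron run), process the payload, and then update the row's status again when it finishes.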
