Backend processing for Django

Backend processing for Django - python

I'm working on a turn-based web game that will perform all world updates (player orders, physics, scripted events, etc.) on the server. For now, I could simply update the world in a web request callback. Unfortunately, that naive approach is not at all scalable. I don't want to bog down my web server when I start running many concurrent games.
So what is the best way to separate the load from the web server, ideally in a way that could even be run on a separate machine?
A simple python module with infinite loop?
A distributed task in something like Celery?
Some sort of cross-platform Cron scheduler?
Some other fancy Django feature or third-party library that I don't know about?
I also want to minimize code duplication by using the same model layer. That probably means my service would need access to the Django model code, so that definitely determines how I architect the service.

I think Celery, which you mention in your question, is the way to go here. It will interface nicely with the rest of your setup, support your eventual aim of separating out the systems, and is compatible with Django.

I'd just write the backend to just use the Django database interface (look at the setup code in your manage.py), spawn it as its own process, and interface to it with Protocol Buffers. That route should move to a separate machine with little work. MPI may be an option, too.
Pipes, FIFOs, and most other IPC requires both processes to be on the same box.
Though I have to point out a flaw in your premise:
Unfortunately, that naive approach is not at all scalable. I don't want to bog down my web server when I start running many concurrent games.
If you run concurrent games, so long as you keep all the parts for a given game on the same server, this is a non-issue, unless there's a common resource needed by all games. Then the real issue becomes load balancing across the servers.

Related

Splitting a Django project

I have a django project with various apps, which are completely independent. I'd like to make them run each one in their own process, as some of them spawn background threads to precalculate periodically some data and now they are competing for the CPU (the machine has loads of cores, but you know, the GIL and such...)
So, is there an easy way to split automatically the project into different ones, or at least to make each app live in its own process?

You can always have different settings files, but that would be like having multiple projects and even multiple endpoints. With some effort you could configure a reverse proxy to forward to the right Django server, based on the request's path and so on, but I don't think that's what you want and it would be an ugly solution to your problem.
The solution to this is to move the heavy processing to a jobs queue. A lot of people and projects prefer Celery for this.
If that seems like overkill for some reason, you can always implement your own based on simple cron jobs. You can take a look at my small project that does this.

The simplest of the simple is probably to write a custom management command that observes given model (database table) for new entries and processes them. The model is written to by e.g. Django view and the management command is launched periodically from cron (e.g. every 5 minutes).
Example: user registers on the site, but the account creation is an expensive operation (allocating some space, pinging remote services etc.). Therefore you just write a new record to AccountRequest table (AccountRequest.objects.create(...)). Then, cron periodically launches your management script (./manage.py account_creator), which checks for new AccountRequest-s (AccountRequest.objects.filter(unprocessed=True)), does its job and marks those requests as processed.

How to implement a long-running, event-driven python program?

I have a series of maintenance tasks for a python WSGI application that are a bit too complex for a crontab (jobs need to be run at frequencies derived from the size of the job queue, manage a connection pool to a group of EC2 instances, etc).
How should I implement a long-running, event-driven python program? I've never needed this functionality before, so I'm not even sure what to google.

Most of the large, modern python sites are using Celery for this type of work. It is a distributed task queue that supports scheduling of tasks as well.
Though probably a bit heavyweight for a small site, it'll grow with you. I'm looking to implement it myself (sans Rabbit) shortly.
I recently found another choice for django users, django-tasks which is focused on fewer, longer, batch processing type jobs. There is also django-ztask using zeromq.
Addendum: Just came across gearman which has python bindings.

Design pattern for multiple consumers and a single data source

I am designing a web interface to a certain hardware appliance that provides its own custom API. Said web interface can manage multiple appliances at once. The data is retrieved from appliance through polling with the custom API so it'd be preferable to make it asynchronous.
The most obvious thing is to have a poller thread that polls for data, saves into a process wide singleton with semaphores and then the web server threads will retrieve data from said singleton and show it. I'm not a huge fan of singletons or mashed together designs, so I was thinking of maybe separating the poller datasource from the web server, looping it back on the local interface and using something like XML-RPC to consume data.
The application need not be 'enterprisey' or scalable really since it'll at most be accessed by a couple people at a time, but I'd rather make it robust by not mixing two kinds of logic together. There's a current implementation in python using CherryPy and it's the biggest mishmash of terrible design I've ever seen. I feel that if I go with the most obvious design I'll just end up reimplementing the same horrible thing my own way.

If you use Django and celery, you can create a Django project to be the web interface and a celery job to run in the background and poll. In that job, you can import your Django models so it can save the results of the polling very simply.

having to run multiple instances of a web service for ruby/python seems like a hack to me

Is it just me or is having to run multiple instances of a web server to scale a hack?
Am I wrong in this?
Clarification
I am referring to how I read people run multiple instances of a web service on a single server. I am not talking about a cluster of servers.

Not really, people were running multiple frontends across a cluster of servers before multicore cpus became widespread
So there has been all the infrastructure for supporting sessions properly across multiple frontends for quite some time before it became really advantageous to run a bunch of threads on one machine.
Infact using asynchronous style frontends gives better performance on the same hardware than a multithreaded approach, so I would say that not running multiple instances in favour of a multithreaded monster is a hack

Since we are now moving towards more cores, rather than faster processors - in order to scale more and more, you will need to be running more instances.
So yes, I reckon you are wrong.
This does not by any means condone brain-dead programming with the excuse that you can just scale it horizontally, that just seems retarded.

With no details, it is very difficult to see what you are getting at. That being said, it is quite possible that you are simply not using the right approach for your problem.
Sometimes multiple separate instances are better. Sometimes, your Python services are actually better deployed behind a single Apache instance (using mod_wsgi) which may elect to use more than a single process. I don't know about Ruby to opinionate there.
In short, if you want to make your service scalable then the way to do so depends heavily on additional details. Is it scaling up or scaling out? What is the operating system and available or possibly installable server software? Is the service itself easily parallelized and how much is it database dependent? How is the database deployed?

Even if Ruby/Python interpreters were perfect, and could utilize all avail CPU with single process, you would still reach maximal capability of single server sooner or later and have to scale across several machines, going back to running several instances of your app.

I would hesitate to say that the issue is a "hack". Or indeed that threaded solutions are necessarily superior.
The situation is a result of design decisions used in the interpreters of languages like Ruby and Python.
I work with Ruby, so the details may be different for other languages.
BUT ... essentially, Ruby uses a Global Interpreter Lock to prevent threading issues:
http://en.wikipedia.org/wiki/Global_Interpreter_Lock
The side-effect of this is that to achieve concurrency with frameworks like Rails, rather than relying on multiple threads within the VM, we use multiple processes, each with its own interpreter and instance of your framework and application code
Each instance of the app handles a single request at a time. To achieve concurrency we have to spin up multiple instances.
In the olden days (2-3 years ago) we would run multiple mongrel (or similar) instances behind a proxy (generally apache). Passenger changed some of this because it is smart enough to manage the processes itself, rather than requiring manual setup. You tell Passenger how many processes it can use and off it goes.
The whole structure is actually not as bad as the thread-orthodoxy would have you believe. For a start, it's pretty easy to make this type of architecture work in a multicore environment. Any modern database is designed to handle highly concurrent loads, so having multiple processes has very little if any effect at that level.
If you use a language like JRuby you can deploy into a threaded app server like Tomcat and have a deployment that looks much more "java-like". However, this is not as big a win as you might think, because now your application needs to be much more thread-aware and you can see side effects and strangeness from threading issues.

Your assumption that Tomcat's and IIS's single process per server is superior is flawed. The choice of a multi-threaded server and a multi-process server depends on a lot of variables.
One main thing is the underlying operating system. Unix systems have always had great support for multi-processing because of the copy-on-write nature of the fork system call. This makes multi-processes a really attractive option because web-serving is usually very shared-nothing and you don't have to worry about locking. Windows on the other hand had much heavier processes and lighter threads so programs like IIS would gravitate to a multi-threading model.
As for the question to wether it's a hack to run multiple servers really depends on your perspective. If you look at Apache, it comes with a variety of pluggable engines to choose from. The MPM-prefork one is the default because it allows the programmer to easily use non-thread-safe C/Perl/database libraries without having to throw locks and semaphores all over the place. To some that might be a hack to work around poorly implemented libraries. To me it's a brilliant way of leaving it to the OS to handle the problems and letting me get back to work.
Also a multi-process model comes with a few features that would be very difficult to implement in a multi-threaded server. Because they are just processes, zero-downtime rolling-updates are trivial. You can do it with a bash script.
It also has it's short-comings. In a single-server model setting up a singleton that holds some global state is trivial, while on a multi-process model you have to serialize that state to a database or Redis server. (Of course if your single-process server outgrows a single server you'll have to do that anyway.)
Is it a hack? Yes and no. Both original implementations (MRI, and CPython) have Global Interpreter Locks that will prevent a multi-core server from operating at it's 100% potential. On the other hand multi-process has it's advantages (especially on the Unix-side of the fence).
There's also nothing inherent in the languages themselves that makes them require a GIL, so you can run your application with Jython, JRuby, IronPython or IronRuby if you really want to share state inside a single process.

CherryPy for a webhosting control panel application

For quite a long time I've wanted to start a pet project that will aim in
time to become a web hosting control panel, but mainly focused on Python hosting --
meaning I would like to make a way for users to generate/start Django/
other frameworks projects right from the panel. I seemed to have
found the perfect tool to build my app with it: CherryPy.
This would allow me to do it the way I want, building the app with its own HTTP/
HTTPS server and also all in my favorite programming language.
But now a new question arises: As CherryPy is a threaded server, will
it be the right for this kind of task?
There will be lots of time consuming tasks so if one of the
tasks blocks, the rest of the users trying to access other pages will
be left waiting and eventually get timed out.
I imagine that this kind of problem wouldn't happen on a fork based server.
What would you advise?

"Threaded" and "Fork based" servers are equivalent. A "threaded" server has multiple threads of execution, and if one blocks then the others will continue. A "Fork based" server has multiple processes executing, and if one blocks then the others will continue. The only difference is that threaded servers by default will share memory between the threads, "fork based" ones by default will not share memory.
One other point - the "subprocess" module is not thread safe, so if you try to use it from CherryPy you will get wierd errors. (This is Python Bug 1731717)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.