I am working on a project which is I/O bound.
I have 3 dependent tasks:
1. scraping a site + extracting the main content (removing comments/ads etc.)
2. as soon as 1 completes it sends the data to a summarizer
3. as soon as 2 completes it calls a view and renders a page
I know Python and Django at the moment. What technologies do you recommend for this project? (I know that Python + Twisted or node.js are ideal for I/O bound projects).
If you're already using Python, you're probably better off sticking with a Python library, especially when there are so many powerful asynchronous Python libraries. Node.js is fine, but switching between Python and JavaScript is unnecessary.
Anyway, your question is very vague. You can absolutely use Twisted and it will probably do what you want just fine, as long as you learn the API well enough. Other asynchronous options include gevent and a web server called Tornado.
There's also Celery which is used specifically for asynchronous processing of queues. It may or may not be helpful to what you want.
I recommend you do a lot of research, look at the documentation of the above libraries, and decide what'll fit your project best. If you have more specific questions you can ask the respective IRC channels of the library, or post a clearer question here.
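To make the three dependent steps concrete, here is a minimal sketch using asyncio from the standard library (not one of the libraries named above, and chosen only so the example runs without extra installs). The scrape and summarize functions are stubs, and the page names are made up for illustration; a real version would do the HTTP request and the summarizing inside them.

```python
import asyncio


async def scrape(url):
    """Stage 1 (stub): pretend to download the page and return its main content."""
    await asyncio.sleep(0.01)  # stands in for network latency
    return "Content of %s. More detail follows. And even more." % url


async def summarize(text):
    """Stage 2 (stub): keep only the first sentence."""
    await asyncio.sleep(0.01)  # stands in for a call to a summarizer
    return text.split(". ")[0] + "."


def render(url, summary):
    """Stage 3: build the page (a real project would call a Django view/template)."""
    return "<h1>%s</h1><p>%s</p>" % (url, summary)


async def handle(url):
    # Each stage starts as soon as the previous one has finished for this URL.
    content = await scrape(url)
    summary = await summarize(content)
    return render(url, summary)


async def main(urls):
    # Different URLs move through the pipeline concurrently.
    return await asyncio.gather(*(handle(u) for u in urls))


if __name__ == "__main__":
    for page in asyncio.run(main(["page-1", "page-2"])):
        print(page)
```

The same shape carries over to Twisted, gevent or Tornado: the stages stay sequential per URL, while the event loop interleaves the waiting of many URLs.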
I finally settled on django-socketio.
https://github.com/stephenmcd/django-socketio
In case websockets are not supported, socketio falls back to long polling.
Related
I intend to start a new chat web application which allows users to join a chatroom and participate in the chat. I've heard a lot about how Node.js will be perfect for this. Plus, there are a lot of tutorials online that demonstrate building a Node + socket.io chat application. Personally, I have never given Node a shot. I know JavaScript well enough to work with jQuery and Backbone, but I've been avoiding Node due to my preference for Python for web development. What do you guys suggest? Should I try the app in Python (I have no idea where to get started) or should I spend some time and learn Node?
Thanks a lot!
I'm personally not a big fan of writing Python, and while I love Node and would recommend giving it a shot sometime, if you already know Python there's no reason you can't use it for this task; you may be interested in checking out Twisted or Tornado.
I will say that one of the big pluses of using Node.js for evented programming (as compared to doing it in other languages) is that all I/O is asynchronous by default in Node.js. In other environments, you need to make sure you only use non-blocking libraries.
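To illustrate the point about non-blocking libraries in Python, here is a small sketch with asyncio from the standard library. The function names polite, rude and timed are made up for this example. Two workers that wait with asyncio.sleep overlap, while two workers that wait with the blocking time.sleep run one after the other, because each one stalls the whole event loop.

```python
import asyncio
import time


async def polite(delay):
    await asyncio.sleep(delay)  # yields to the event loop while waiting


async def rude(delay):
    time.sleep(delay)  # blocking call: nothing else can run meanwhile


async def timed(worker, delay=0.2, count=2):
    """Run `count` workers together and return the elapsed wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(*(worker(delay) for _ in range(count)))
    return time.perf_counter() - start


if __name__ == "__main__":
    print("non-blocking: %.2fs" % asyncio.run(timed(polite)))
    print("blocking:     %.2fs" % asyncio.run(timed(rude)))
```

With the default arguments the non-blocking run should take roughly one delay and the blocking run roughly two.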
Node.js is a popular choice for a chat-like application because it is very good at handling workloads that are data-intensive (I/O bound) rather than CPU bound. Personally, I am a big fan of Node.js myself. BUT I am going to step up here and tell you this:
The syntax of Node.js for handling asynchronous events becomes a pain once your project grows out of a simple example into a full-grown application. I mean, how long can you keep doing this?
    // hypothetical callback-style API, for illustration only
    response.onComplete(function (data) {
      data.parseJson(function (json) {
        json.getElement('hoo', function (value) {
          value.HowDoIEscapeNow();
          // .....
        });
      });
    });
I do not mean to say anything against Node.js, but IMHO it's a completely different beast once you get into complex applications.
I have to write a CPU-bound server in Python, to distribute workloads for many cores. I want to use Twisted as the server (requests coming in via TCP).
Are there any better options - using Ampoule, perhaps? I also saw an article using Twisted's pb for communication, in conjunction with Popen - or perhaps combine it with multiprocessing?
Ampoule is a good building block for a multiprocess CPU-bound server. It uses the simpler AMP protocol, rather than PB (the complexity of which is usually not needed just to move job data into another process and then retrieve the results). It handles process creation, lifetime management, restarting, etc.
You generally want to avoid using Popen or the multiprocessing standard library module if you're using Twisted. They can cooperate, but they both present blocking-oriented APIs which somewhat defeat the purpose of using Twisted in the first place. Twisted's native child process API, reactor.spawnProcess, is just as capable and avoids blocking. Ampoule is based on this.
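I can't show Ampoule itself in a few lines without guessing at its API, so here is the same idea sketched with the standard library instead: asyncio.create_subprocess_exec plays a role similar to reactor.spawnProcess, starting child processes without blocking the event loop. This is a stand-in, not Twisted or Ampoule code; CHILD_CODE, run_job and run_jobs are names made up for the example.

```python
import asyncio
import sys

# Child program: the CPU-bound work runs in a separate interpreter process.
CHILD_CODE = "import sys; n = int(sys.argv[1]); print(sum(i * i for i in range(n)))"


async def run_job(n):
    """Spawn one child process for one job and wait for its result without blocking."""
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", CHILD_CODE, str(n),
        stdout=asyncio.subprocess.PIPE,
    )
    out, _ = await proc.communicate()
    return int(out)


async def run_jobs(sizes):
    # One process per job keeps the sketch short; a real server would cap
    # the number of children at roughly the number of cores and reuse them.
    return await asyncio.gather(*(run_job(n) for n in sizes))


if __name__ == "__main__":
    print(asyncio.run(run_jobs([10, 100, 1000])))
```

A process pool like Ampoule's adds what this sketch leaves out: long-lived workers, restarts, and a proper protocol (AMP) for passing jobs and results.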
Ampoule is not as widely used as multiprocessing, though. You may find it to have some quirks in your development or deployment environments. I don't think these will be obstacles you can't overcome, though. I developed a service which used Ampoule to distribute the work of parsing large quantities of HTML across multiple CPUs and it eventually worked fine. If you do come across any problems, though, I encourage you to report them upstream! Eventually I would like to be able to say that Ampoule is as robust as anything (or more so) instead of attaching a disclaimer about its use. :)
I have been trying to determine which combination of packages to use for a push messaging service behind a web site...
My current idea is to go with Tornado + Socket.IO (Tornadio) and ZMQ. But I was also looking at involving Mongrel2. Then there is also a similar project called Brubeck, that takes from Tornado, using ZMQ and Eventlet.
My main question is this... I'm trying to understand where the benefit of Mongrel2 would come into play if I were to use Tornado. At that point, is Tornado even necessary? I figured at that point I would just be writing a Mongrel2 Python handler and that's it. I would like to focus on using websockets/jssockets, which is why using Socket.IO was interesting, since it handles all the backwards compatibility under the hood for you.
If the tools in the mix for consideration are: Python focus, Tornado, Mongrel2, ZMQ, Brubeck, and Socket.IO, what recommendations would you have for the best mix to support websockets? Having Mongrel2 was really appealing for the idea of scalability, and just turning on more python handlers.
Update 1/1/2012
At first I went with Tornado + TornadIO + ZeroMQ, and had a working server. But ultimately I ended up learning Go (www.golang.org) and rewrote my server using pure Go with its built-in concurrency. It ended up being over 10x faster than my Python version, even with more features: http://www.justinfx.com/2011/07/28/go-language-for-python-programmers/
It seems to keep on picking up speed as the Go team makes more releases towards Go 1.0.
Sounds like a job for the Flash/Javascript binding. http://www.zeromq.org/bindings:javascript
That way you have a ZMQ app in the browser that is a SUB to whatever PUB sockets are pushing relevant messages.
I am adding my own update to this question as the answer, since I never received any other answers, and so I can close this one down...
At first I went with Tornado + TornadIO + ZeroMQ, and had a working server. But ultimately I ended up learning Go (www.golang.org) and rewrote my server using pure Go with its built-in concurrency. It ended up being over 10x faster than my Python version, even with more features: http://www.justinfx.com/2011/07/28/go-language-for-python-programmers/
It seems to keep on picking up speed as the Go team makes more releases towards Go 1.0.
UPDATE: after much laboring with Py3, including writing my own asynchronous webserver (following a presentation given by Dave Beazley), I finally dumped Python (and a huge stack of my code )-: in favor of CoffeeScript running on NodeJS. Check it out: GitHub (where you'll find like 95% of all interesting code these days), npm (a package manager that couldn't be any more user-friendly; good riddance, easy_install, you never lived up to your name), an insanely huge repository of modules (with tons of new stuff being published virtually 24/7), a huge and vibrant community, out-of-the-box asynchronous HTTP and file handling..., all that (thanks to V8) at one third the speed of light. What's not to like? Read more propaganda: "The future of Scripting" (slide hosting courtesy SpreeWebdesign).
I am looking for a way to serve HTTP (and do HTTP requests) in an asynchronous, non-blocking fashion. This seems to be hard to do when you’ve decided on Stackless Python 3.1 (also see here for docs) as I did.
There are some basic examples, like the pretty informative and detailed article How To Use Linux epoll with Python, and there is a Google Code project named stacklessexamples which contains some valuable information (but no Python 3.x compatible code).
So, after many days of doing research on the web and trying to put together the pieces I’ve found so far: does anyone know of a fairly usable asynchronous HTTP library? It doesn’t have to be WSGI-compliant (I am not interested in that).
The server part should be able to serve multiple non-blocking HTTP requests (and possibly do the basics of HTTP header processing); the HTTP client part should be able to retrieve, in a non-blocking way, web content via HTTP requests (also doing basic header processing, but no fancy stuff like authorization or so).
My research so far has shown me that non-blocking HTTP:
is the only way that makes sense in a stackless, cooperatively scheduled environment;
is feasible in Stackless Python 3 by virtue of the standard library’s select.epoll (introduced in Py2.6; some solutions prefer libevent, but that means another hurdle as the pyevent project seems to have stopped developing at Py2.5);
is sadly still not a household item, with most people relying on blocking HTTP.
The way it looks now, I would have to learn the basics of socket programming and roll my own HTTP server/client library. I still shy away from that task as I have very little background in that area and am bound to ‘repeat history’ that way.
I would be very happy about any relevant pointers. I very much prefer solutions that make use of select.epoll; I seem to remember it is much more scalable than the older asyncore (but maybe someone has more precise data on this). As a minimum requirement, solutions should run on Ubuntu 9.10.
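For reference, the core of an epoll-style HTTP server is small. The sketch below uses the standard library's selectors module (added in Python 3.4, so later than the Python 3.1 discussed here), whose DefaultSelector picks epoll on Linux. It is a toy under stated assumptions: it only answers GET requests with a fixed body, serve and demo are made-up names, and it takes a shortcut when writing the response.

```python
import http.client
import selectors
import socket
import threading

RESPONSE_BODY = b"hello from the event loop\n"


def serve(listener, max_requests):
    """Answer max_requests HTTP requests on one thread without blocking on any client."""
    sel = selectors.DefaultSelector()  # EpollSelector on Linux
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ)
    buffers = {}  # client socket -> bytes received so far
    served = 0
    while served < max_requests:
        for key, _mask in sel.select():
            sock = key.fileobj
            if sock is listener:
                conn, _addr = listener.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ)
                buffers[conn] = b""
                continue
            chunk = sock.recv(4096)
            buffers[sock] += chunk
            # Headers end with a blank line; keep reading until we have seen it.
            if chunk and b"\r\n\r\n" not in buffers[sock]:
                continue
            if chunk:
                head = b"HTTP/1.0 200 OK\r\nContent-Length: %d\r\n\r\n" % len(RESPONSE_BODY)
                # Shortcut: the reply is tiny, so we send it in one go. A real
                # server would register for EVENT_WRITE and send piecewise.
                sock.sendall(head + RESPONSE_BODY)
                served += 1
            sel.unregister(sock)
            sock.close()
            del buffers[sock]
    sel.close()


def demo(count=3):
    """Start the server in a helper thread and fetch `count` pages from it."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen()
    port = listener.getsockname()[1]
    # The thread only exists so the client below can run in the same process.
    server = threading.Thread(target=serve, args=(listener, count))
    server.start()
    bodies = []
    for _ in range(count):
        client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        client.request("GET", "/")
        bodies.append(client.getresponse().read())
        client.close()
    server.join()
    listener.close()
    return bodies


if __name__ == "__main__":
    print(demo())
```

In a Stackless setting, the select loop would typically live in one tasklet that wakes up the tasklets waiting on each socket; that part is not shown here.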
I know this is like resurrecting the dead (and flow has probably long since solved his problem), but for completeness: Stackless is available for 3.1.3:
http://www.stackless.com/download
For information on implementing an HTTP server using stacklesssocket:
http://code.google.com/p/stacklessexamples/wiki/StacklessNetworking
The non-blocking HTTP case is handled very well by Twisted. What it does is create a series of callbacks and register them with a Deferred. The Twisted documentation is worth checking out. Stackless uses microthreads, whereas Twisted builds the entire web framework out of fragments of non-blocking code, chained together with callbacks, errbacks and Deferreds, all running in a main reactor loop on a single thread. I think this should handle the async HTTP case better.
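To show what "chained with callbacks and errbacks" means without requiring Twisted to be installed, here is a toy model of the idea. ToyDeferred is a made-up class and is not Twisted's Deferred (the real one has a different API and is fired by the reactor when I/O completes); the sketch only mimics how a result travels down the callback path and how a raised exception switches to the errback path.

```python
class ToyDeferred:
    """A toy stand-in for the idea behind a Deferred (not Twisted's implementation)."""

    def __init__(self):
        self._chain = []  # list of (callback, errback) pairs, in order

    def add_callbacks(self, callback=None, errback=None):
        self._chain.append((callback, errback))
        return self  # allow chaining

    def fire(self, result):
        """Run the chain: each step receives the previous step's result."""
        for callback, errback in self._chain:
            handler = errback if isinstance(result, Exception) else callback
            if handler is None:
                continue  # no handler for this kind of result, pass it along
            try:
                result = handler(result)
            except Exception as exc:
                result = exc  # a raised error switches to the errback path
        return result


def parse_status(status_line):
    return int(status_line.split()[1])


def check_ok(status):
    if status != 200:
        raise ValueError("unexpected status %d" % status)
    return "ok"


def report(error):
    return "failed: %s" % error


def handle_response(status_line):
    d = ToyDeferred()
    d.add_callbacks(parse_status).add_callbacks(check_ok).add_callbacks(errback=report)
    return d.fire(status_line)


if __name__ == "__main__":
    print(handle_response("HTTP/1.0 200 OK"))
    print(handle_response("HTTP/1.0 404 Not Found"))
```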
I'm writing a simple site spider and I've decided to take this opportunity to learn something new in concurrent programming in Python. Instead of using threads and a queue, I decided to try something else, but I don't know what would suit me.
I have heard about Stackless, Celery, Twisted, Tornado, and other things. I don't want to have to set up a database and the whole other dependencies of Celery, but I would if it's a good fit for my purpose.
My question is: What is a good balance between suitability for my app and usefulness in general? I have taken a look at the tasklets in Stackless, but I'm not sure that the urlopen() call won't block or that they will execute in parallel; I haven't seen that mentioned anywhere.
Can someone give me a few details on my options and what would be best to use?
Thanks.
Tornado is a web server, so it wouldn't help you much in writing a spider. Twisted is much more general (and, inevitably, complex), good for all kinds of networking tasks (and with good integration with the event loop of several GUI frameworks). Indeed, there used to be a twisted.web.spider (but it was removed years ago, since it was unmaintained -- so you'll have to roll your own on top of the facilities Twisted does provide).
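As a rough idea of what "rolling your own" spider looks like on top of an event loop, here is a sketch using asyncio from the standard library rather than Twisted (so it runs without extra installs). To keep it self-contained, fetch reads from FAKE_SITE, a made-up in-memory site, instead of doing real HTTP; crawl, fetch and LINK_RE are likewise names invented for the example.

```python
import asyncio
import re

# Hypothetical in-memory site standing in for real HTTP responses.
FAKE_SITE = {
    "/": '<a href="/a">a</a> <a href="/b">b</a>',
    "/a": '<a href="/b">b</a> <a href="/c">c</a>',
    "/b": '<a href="/">home</a>',
    "/c": "no links here",
}

LINK_RE = re.compile(r'href="([^"]+)"')


async def fetch(url):
    await asyncio.sleep(0.01)  # stands in for network latency
    return FAKE_SITE.get(url, "")


async def crawl(start, max_concurrency=2):
    """Visit every page reachable from `start`, at most max_concurrency fetches at a time."""
    seen = {start}
    limit = asyncio.Semaphore(max_concurrency)

    async def visit(url):
        async with limit:
            html = await fetch(url)
        new_links = []
        for link in LINK_RE.findall(html):
            if link not in seen:
                seen.add(link)  # mark before scheduling so no page is fetched twice
                new_links.append(link)
        await asyncio.gather(*(visit(link) for link in new_links))

    await visit(start)
    return sorted(seen)


if __name__ == "__main__":
    print(asyncio.run(crawl("/")))
```

Replacing fetch with a real non-blocking HTTP client is the only part that depends on which library you pick.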
I must say that Twisted gets my vote.
Performing event-driven tasks is fairly straightforward in Twisted. Integration with other important system components such as GTK+ and DBus is very easy.
The HTTP client support is basic for now but improving (>9.0.0): see related question.
The added bonus is that Twisted is available in the Ubuntu default repository ;-)
For a quick look at package sizes, see ohloh.net/p/compare.
Of course, source size is only a rough metric (what I'd really like is number of pages of docs, number of pages of examples, and dependencies), but it can help.