I'm writing a simple site spider and I've decided to take this opportunity to learn something new in concurrent programming in Python. Instead of using threads and a queue, I decided to try something else, but I don't know what would suit me.
I have heard about Stackless, Celery, Twisted, Tornado, and other things. I don't want to have to set up a database and the whole other dependencies of Celery, but I would if it's a good fit for my purpose.
My question is: What is a good balance between suitability for my app and usefulness in general? I have taken a look at the tasklets in Stackless but I'm not sure that the urlopen() call won't block or that they will execute in parallel, I haven't seen that mentioned anywhere.
Can someone give me a few details on my options and what would be best to use?
Thanks.
Tornado is a web server, so it wouldn't help you much in writing a spider. Twisted is much more general (and, inevitably, complex), good for all kinds of networking tasks (and with good integration with the event loop of several GUI frameworks). Indeed, there used to be a twisted.web.spider (but it was removed years ago, since it was unmaintained -- so you'll have to roll your own on top of the facilities Twisted does provide).
I must say that Twisted gets my vote.
Performing event-drive tasks is fairly straightforward in Twisted. Integration with other important system components such as GTK+ and DBus is very easy.
The HTTP client support is basic for now but improving (>9.0.0): see related question.
The added bonus is that Twisted is available in the Ubuntu default repository ;-)
For a quick look at package sizes, see
ohloh.net/p/compare .
Of course source size is only a rough metric (what I'd really like is nr pages doc, nr pages examples,
dependencies), but it can help.
Related
I am working on a project which is I/O bound.
I have 3 dependent tasks:
1. scraping a site + extracting the main content(removing comments/ads etc)
2. as soon as 1 completes it sends the data to a summerizer
3. as soon as 2 completes it calls a view and renders a page
I know Python and Django at the moment. What technologies do you recommend me for this project? (I know that Python + Twisted or node.js are ideal for I/O bound projects).
If you're already using Python, you're probably better off sticking with a Python library, especially when there are so many powerful asynchronous Python libraries. Node.js is fine, but switching between Python and Javascript is unnecessary.
Anyway, your question is very very vague. You can absolutely use Twisted and it will probably do what you want just fine, as long as you learn the API well enough. Other asynchronous frameworks include gevent and a web server called Tornado.
There's also Celery which is used specifically for asynchronous processing of queues. It may or may not be helpful to what you want.
I recommend you do a lot of research, look at the documentation of the above libraries, and decide what'll fit your project best. If you have more specific questions you can ask the respective IRC channels of the library, or post a clearer question here.
I am finally using django-socketio.
https://github.com/stephenmcd/django-socketio
In case websockets are not supported, socketio falls back to long polling.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
I'm increasingly hearing that Python's Twisted framework rocks and other frameworks pale in comparison.
Can anybody shed some light on this and possibly compare Twisted with other network programming frameworks.
There are a lot of different aspects of Twisted that you might find cool.
Twisted includes lots and lots of protocol implementations, meaning that more likely than not there will be an API you can use to talk to some remote system (either client or server in most cases) - be it HTTP, FTP, SMTP, POP3, IMAP4, DNS, IRC, MSN, OSCAR, XMPP/Jabber, telnet, SSH, SSL, NNTP, or one of the really obscure protocols like Finger, or ident, or one of the lower level protocol-building-protocols like DJB's netstrings, simple line-oriented protocols, or even one of Twisted's custom protocols like Perspective Broker (PB) or Asynchronous Messaging Protocol (AMP).
Another cool thing about Twisted is that on top of these low-level protocol implementations, you'll often find an abstraction that's somewhat easier to use. For example, when writing an HTTP server, Twisted Web provides a "Resource" abstraction which lets you construct URL hierarchies out of Python objects to define how requests will be responded to.
All of this is tied together with cooperating APIs, mainly due to the fact that none of this functionality is implemented by blocking on the network, so you don't need to start a thread for every operation you want to do. This contributes to the scalability that people often attribute to Twisted (although it is the kind of scalability that only involves a single computer, not the kind of scalability that lets your application grow to use a whole cluster of hosts) because Twisted can handle thousands of connections in a single thread, which tends to work better than having thousands of threads, each for a single connection.
Avoiding threading is also beneficial for testing and debugging (and hence reliability in general). Since there is no pre-emptive context switching in a typical Twisted-based program, you generally don't need to worry about locking. Race conditions that depend on the order of different network events happening can easily be unit tested by simulating those network events (whereas simulating a context switch isn't a feature provided by most (any?) threading libraries).
Twisted is also really, really concerned with quality. So you'll rarely find regressions in a Twisted release, and most of the APIs just work, even if you aren't using them in the common way (because we try to test all the ways you might use them, not just the common way). This is particularly true for all of the code added to Twisted (or modified) in the last 3 or 4 years, since 100% line coverage has been a minimum testing requirement since then.
Another often overlooked strength of Twisted is its ten years of figuring out different platform quirks. There are lots of undocumented socket errors on different platforms and it's really hard to learn that they even exist, let alone handle them. Twisted has gradually covered more and more of these, and it's pretty good about it at this point. Younger projects don't have this experience, so they miss obscure failure modes that will probably only happen to users of any project you release, not to you.
All that say, what I find coolest about Twisted is that it's a pretty boring library that lets me ignore a lot of really boring problems and just focus on the interesting and fun things. :)
Well it's probably according to taste.
Twisted allows you to easily create event driven network servers/clients, without really worrying about everything that goes into accomplishing this. And thanks to the MIT License, Twisted can be used almost anywhere. But I haven't done any benchmarking so I have no idea how it scales, but I'm guessing quite good.
Another plus would be the Twisted Projects, with which you can quickly see how to implement most of the server/services that you would want to.
Twisted also has some great documentation, when I started with it a couple of weeks ago I was able to quickly get a working prototype.
Quite new to the python scene please correct me if i'm in the wrong.
http://www.artima.com/weblogs/viewpost.jsp?thread=208528
Bruce Eckel talked about using Flex and Python together. Since then, we have had PyAMF and the likes.
It has been almost three years, but googling does not reveal much more than a bunch of articles/comments linking to that article above (or related ones). There is no buzz, no excitement. Not much on SO either.
I am thinking of attempting something using Flex/Python which would require me to be heavily invested in it. What I worry about is that the support system is very weak and activity is almost nonexistent.
I really want to do this. Can anyone direct me towards some useful resource?
An application written in Flex/Flash is server agnostic...and it should be easy to replace the server side language with another one. The client application will consume some web services exposed by the server(REST/SOAP), or it can use as an alternative remote method invocation. The last one is implemented for the most important languages, from what I know.
There are some exceptions..if you want to use messaging the professional solutions are offered mainly by the frameworks build on top of Java.
So if you do not rely heavy on messaging the heavily investment is going to be mainly of the client side, especially if you haven't worked before with the so called "fat" clients. But not on the integration side..it not so complicated.
Regarding useful Flex resources, my suggestion is to take a look at http://www.adobe.com/devnet/flex.html
We're using Twisted extensively for apps requiring a great deal of asynchronous io. There are some cases where stuff is cpu bound instead and for that we spawn a pool of processes to do the work and have a system for managing these across multiple servers as well - all done in Twisted. Works great. The problem is that it's hard to bring new team members up to speed. Writing asynchronous code in Twisted requires a near vertical learning curve. It's as if humans just don't think that way naturally.
We're considering a mixed approach perhaps. Maybe keep the xmlrpc server part and process management in Twisted and implement the other stuff in code that at least looks synchronous to some extent while not being as such. Then again I like explicit over implicit so I have to think about this a bit more. Anyway onto greenlets - how well does that stuff work? So there's Stackless and as you can see from my Gallentean avatar I'm well aware of the tremendous success in it's use for CCP's flagship EVE Online game first hand. What about Eventlet or gevent? Well for now only Eventlet works with Twisted. However gevent claims to be faster as it's not a pure python implementation but rather relies upon libevent instead. It also claims to have fewer idiosyncrasies and defects. gevent It's maintained by 1 guy as far as I can tell. This makes me somewhat leery but all great projects start this way so... Then there's PyPy - I haven't even finished reading about that one yet - just saw it in this thread: Drawbacks of Stackless.
So confusing - I'm wondering what the heck to do - sounds like Eventlet is probably the best bet but is it really stable enough? Anyone out there have any experience with it? Should we go with Stackless instead as it's been around and is proven technology - just like Twisted is as well - and they do work together nicely. But still I hate having to have a separate version of Python to do this. what to do....
This somewhat obnoxious blog entry hit the nail on the head for me though: Asynchronous IO for Grownups I don't get the Twisted is being like Java remark as to me Java is typically where you are in the threading mindset but whatever. Nevertheless if that monkey patch thing really works just like that then wow. Just wow!
You might want to check out:
Comparing gevent to eventlet
Reports from users who moved from twisted or eventlet to gevent
Eventlet and gevent are not really comparable to Stackless, because Stackless ships with a standard library that is not aware of tasklets. There are implementations of socket for Stackless but there isn't anything as comprehensive as gevent.monkey. CCP does not use bare bones Stackless, it has something called Stackless I/O which AFAIK is windows-only and was never open sourced (?).
Both eventlet and gevent could be made to run on Stackless rather than on greenlet. At some point we even tried to do this as a GSoC project but did not find a student.
Answering part of your question - if you look at http://speed.pypy.org you'll see that using twisted on top of PyPy may give you some speedups. This depends of course on your workload, but it's probably worth checking out.
Cheers,
fijal
I've built a little real time web app on top of eventlet and repoze.bfg (I gave up on django quite a while ago). I've found eventlet and monkey patching to be just as easy as Ted says.
Gevent isn't pure Python, and it strictly depends on CPython.
From web frameworks you mentioned Eventlet (OpenStack) and Tornado (FriendsFeed, Quora) has the biggest deploy.
UPDATE: after much laboring with Py3, including writing my own asynchronous webserver (following a presentation given by Dave Beazley), i finally dumped Python (and a huge stack of my code )-: in favor of CoffeeScript running on NodeJS. Check it out: GitHub (where you'll find like 95% of all interesting code these days), npm (package manager that couldn't be any user friendly; good riddance, easy_install, you never lived up to your name), an insanely huge repository of modules (with tons of new stuff being published virtually 24/7), a huge and vibrant community, out-of-the-box asynchronous HTTP and filehandling..., all that (thanks to V8) at one third the speed of light — what's not to like? read more propaganda: "The future of Scripting" (slide hosting courtesy SpreeWebdesign).
I am looking for a way to serve HTTP (and do HTTP requests) in an asynchronous, non-blocking fashion. This seems to be hard to do when you’ve decided on Stackless Python 3.1 (also see here for docs) as i did.
There are some basic examples, like the pretty informative and detailed article How To Use Linux epoll with Python, and there is a a Google code project named stacklessexamples which contains some valuable information (but no Python 3.x compatible code).
So, after many days of doing research on the web and trying to put together the pieces i’ve found so far: does anyone know of a fairly usable asynchronous HTTP library? It doesn’t have to be WSGI-compliant (I am not interested in that).
The server part should be able to serve multiple non-blocking HTTP requests (and possibly do the basics of HTTP header processing); the HTTP client part should be able to retrieve, in a non-blocking way, web content via HTTP requests (also doing basic header processing, but no fancy stuff like authorization or so).
My research so far has shown me that non-blocking HTTP
is the only way that makes sense in a stackless, cooperatively scheduled environment;
is feasible in Stackless Python 3 by virtue of the standard library’s select epoll (introduced in Py2.6; some solutions prefer libevent, but that means another hurdle as the pyevent project seems to have stopped developing at Py2.5);
is sadly still not a household item, with most people relying on blocking HTTP.
The way it looks like now, i would have to learn the basics of socket programming and roll my own HTTP server/client library. I still shy away from that task as i have very little background in that area and am bound to ‘repeat history’ that way.
I would be very happy about any relevant pointers. I prefer very much solutions that make use of select.epoll; i seem to remember it is much more scalable that the older asyncore (but maybe someone has more precise data on this). As a minimum requirement, solutions should run on Ubuntu 9.10.
I know this is like resurrecting the dead (and flow has probably long since solved his problem), but for completeness stackless is available for 3.1.3:
http://www.stackless.com/download
For information on implementing a HTTP server using stacklesssocket:
http://code.google.com/p/stacklessexamples/wiki/StacklessNetworking
Non blocking HTTP case is very well handled with twisted, what is does is creating a series of callbacks, and registering those callbacks with deferred. Twisted documentation is worth checking out. Stackless uses microthreads but twisted is coding the entire web framework using fragment by fragment non bloking code chained with callbacks, errbacks and deferreds running is a main reactor loop over a single thread. Think this should the Async HTTP thing better.