I have a situation where I'm going to be parsing websites. Each site has to have its own "parser" and possibly its own way of dealing with cookies, etc.
I'm trying to work out in my head which would be the better choice.
Choice I:
I can create a multiprocessing function, where the (masterspawn) app gets an input URL and in turn spawns a process/function within the masterspawn app that then handles all the setup/fetching/parsing of the page/URL.
This approach would have one master app running, and it in turn creates multiple instances of the internal function. Should be fast, yes/no?
Choice II:
I could create a "Twisted" kind of server that would essentially do the same thing as Choice I, the difference being that using "Twisted" would also impose some overhead. I'm trying to evaluate Twisted with regard to its being a "server", but I don't need it to perform the fetching of the URL.
Choice III:
I could use Scrapy. I'm inclined not to go this route, as I don't want/need the overhead that Scrapy appears to have. As I stated, each of the targeted URLs needs its own parse function, as well as its own cookie handling...
My goal is basically to have the "architected" solution spread across multiple boxes, where each client box interfaces with a master server that allocates the URLs to be parsed.
Thanks for any comments on this.
-tom
There are two dimensions to this question: concurrency and distribution.
Concurrency: either Twisted or multiprocessing will do the job of concurrently handling fetching/parsing jobs. I'm not sure, though, where your premise of the "Twisted overhead" comes from. On the contrary, the multiprocessing path would incur much more overhead, since a (relatively heavyweight) OS process would have to be spawned. Twisted's way of handling concurrency is much more lightweight.
Distribution: multiprocessing won't distribute your fetch/parse jobs to different boxes. Twisted can do this, e.g. using its AMP protocol-building facilities.
I cannot comment on scrapy, never having used it.
For this particular question I'd go with multiprocessing - it's simple to use and simple to understand. You don't particularly need Twisted, so why take on the extra complication?
One other option you might want to consider: use a message queue. Have the master drop URLs onto a queue (e.g. beanstalkd, resque, 0mq) and have worker processes pick up the URLs and process them. You'll get both concurrency and distribution: you can run workers on as many machines as you want.
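A minimal sketch of that shape, using Redis as the queue backend purely because its API is compact (beanstalkd, resque, or 0mq would play the same role); the queue name and parse_page() are placeholders for your per-site parsers:

import redis

# master: push URLs onto a shared list acting as the queue
r = redis.Redis(host='localhost', port=6379)      # assumed broker location
for url in ['http://example.com/a', 'http://example.com/b']:
    r.lpush('urls', url)

# worker: run as many of these as you like, on any box
def worker():
    while True:
        _, url = r.brpop('urls')    # blocks until a URL is available
        parse_page(url)             # placeholder for your per-site fetch/parse logic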
Related
I'm working on a web application that will receive a request from a user and have to hit a number of external APIs to compose the answer to that request. This could be done directly from the main web thread using something like gevent to fan out the request.
Alternatively, I was thinking, I could put incoming requests into a queue and use workers to distribute the load. The idea would be to try to keep it real time while splitting up the requests amongst several workers. Each of these workers would be querying only one of the many external APIs. The response they receive would then go through a series of transformations, be saved into a DB, be transformed to a common schema and saved in a common DB, and finally be composed into one big response that would be returned through the web request. The web request is most likely going to be blocking all this time, with a user waiting, so keeping the queueing and dequeueing as fast as possible is important.
The external API calls can easily be turned into individual tasks. I think the linking from one API task to a transformation to a DB-saving task could be done using a chain, etc., and the final result combining all results could be returned to the web thread using a chord.
Some questions:
Can this (and should this) be done using celery?
I'm using django. Should I try to use django-celery over plain celery?
Each one of those tasks might spawn off other tasks - such as logging what just
happened or other types of branching off. Is this possible?
Could tasks be returning the data they get - i.e. potentially Kb of data through celery (redis as underlying in this case) or should they write to the DB, and just pass pointers to that data around?
Each task is mostly I/O bound, and I was initially just going to use gevent from the web thread to fan out the requests and skip the whole queuing design, but it turns out that it would be reused for a different component. Trying to keep the whole round trip through the Qs real time will probably require many workers making sure the queues are mostly empty. Or will it? Would running the gevent worker pool help with this?
Do I have to write gevent specific tasks or will using the gevent pool deal with network IO automagically?
Is it possible to assign priority to certain tasks?
What about keeping them in order?
Should I skip celery and just use kombu?
It seems like celery is geared more towards "tasks" that can be deferred and are
not time sensitive. Am I nuts for trying to keep this real time?
What other technologies should I look at?
Update: Trying to hash this out a bit more. I did some reading on Kombu and it seems to be able to do what I'm thinking of, although at a much lower level than celery. Here is a diagram of what I had in mind.
What seems to be possible with raw queues as accessible with Kombu is the ability for a number of workers to subscribe to a broadcast message. The type and number do not need to be known by the publisher if using a queue. Can something similar be achieved using Celery? It seems like if you want to make a chord, you need to know at runtime which tasks are going to be involved in the chord, whereas in this scenario you can simply add listeners to the broadcast and make sure they announce they are in the running to add responses to the final queue.
Update 2: I see there is the ability to broadcast. Can you combine this with a chord? In general, can you combine Celery with raw Kombu? This is starting to sound like a question about smoothies.
I will try to answer as many of the questions as possible.
Can this (and should this) be done using celery?
Yes, you can.
I'm using django. Should I try to use django-celery over plain celery?
Django has good support for Celery, and it will make your life much easier during development.
Each one of those tasks might spawn off other tasks - such as logging
what just happened or other types of branching off. Is this possible?
You can start subtasks from within a task with ignore_result=True, for side effects only.
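A sketch of what that can look like; the broker URL and call_external_api() are placeholders, not part of the original answer:

from celery import Celery
from celery.utils.log import get_task_logger

app = Celery('proj', broker='redis://localhost:6379/0')   # assumed broker URL
logger = get_task_logger(__name__)

@app.task(ignore_result=True)
def log_event(message):
    # side effect only: nothing is written to the result backend
    logger.info(message)

@app.task
def fetch_api(url):
    data = call_external_api(url)           # placeholder for the real API call
    log_event.delay('fetched %s' % url)     # fire-and-forget subtask
    return data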
Could tasks be returning the data they get - i.e. potentially Kb of
data through celery (redis as underlying in this case) or should they
write to the DB, and just pass pointers to that data around?
I would suggest putting the results in the DB and passing IDs around; it will make your broker and workers happy. Less data transfer/pickling, etc.
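For example, a sketch where only the primary key travels through the broker; the ApiResult Django model and the call_external_api()/normalize() helpers are hypothetical:

from celery import shared_task      # binds to whatever Celery app is configured

@shared_task
def fetch_and_store(url):
    payload = call_external_api(url)                 # placeholder for the API call
    obj = ApiResult.objects.create(body=payload)     # hypothetical Django model
    return obj.pk                                    # only the ID goes through the broker

@shared_task
def transform(result_id):
    obj = ApiResult.objects.get(pk=result_id)
    obj.body = normalize(obj.body)                   # placeholder transformation
    obj.save()
    return obj.pk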
Each task is mostly I/O bound, and was initially just going to use
gevent from the web thread to fan out the requests and skip the whole
queuing design, but it turns out that it would be reused for a
different component. Trying to keep the whole round trip through the
Qs real time will probably require many workers making sure the
queues are mostly empty. Or will it? Would running the gevent worker
pool help with this?
Since the process is I/O bound, gevent will definitely help here. However, how high the concurrency should be for a gevent-pooled worker is something I'm looking for an answer to as well.
Do I have to write gevent specific tasks or will using the gevent pool
deal with network IO automagically?
The monkey patching is done automatically when you use the gevent pool. But the libraries that you use should play well with gevent; otherwise, if you're parsing some data with simplejson (which is written in C), that would block other gevent greenlets.
Is it possible to assign priority to certain tasks?
You cannot assign specific priorities to certain tasks, but you can route them to different queues and then have those queues listened to by varying numbers of workers. The more workers a particular queue has, the higher the effective priority of the tasks on that queue.
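A sketch of that routing setup; the queue names, task names, and concurrency values are arbitrary, and the old-style CELERY_ROUTES setting name matches the Celery 3.x era discussed here:

# celeryconfig.py -- send the latency-sensitive task to its own queue
CELERY_ROUTES = {
    'tasks.fetch_api': {'queue': 'realtime'},
    'tasks.log_event': {'queue': 'background'},
}

# then give the queue that matters most more workers, e.g.:
#   celery -A proj worker -Q realtime   -c 20
#   celery -A proj worker -Q background -c 2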
What about keeping them in order?
A chain is one way to maintain order, and a chord is a good way to summarize. Celery takes care of it, so you don't have to worry about it. Even when using the gevent pool, it is still possible to reason about the order of task execution.
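A sketch of those canvas primitives, reusing the hypothetical tasks from above; combine_results and the URLs are placeholders:

from celery import chain, chord

api_urls = ['http://api-one.example/x', 'http://api-two.example/y']   # placeholders

# run fetch -> transform in order for one API
chain(fetch_and_store.s(api_urls[0]), transform.s()).apply_async()

# fan out to several APIs, then combine everything once every branch is done
header = [chain(fetch_and_store.s(u), transform.s()) for u in api_urls]
chord(header)(combine_results.s())      # combine_results is a placeholder task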
Should I skip celery and just use kombu?
You can, if your use case will not change to something more complex over time, and if you are willing to manage your processes through celeryd + supervisord by yourself. Also only if you don't care about the task monitoring that comes with tools such as celerymon, flower, etc.
It seems like celery is geared more towards "tasks" that can be
deferred and are not time sensitive.
Celery supports scheduled tasks as well, if that is what you meant by that statement.
Am I nuts for trying to keep this real time?
I don't think so. As long as your consumers are fast enough, it will be as good as real time.
What other technologies should I look at?
Pertaining to Celery, you should choose your result store wisely. My suggestion would be to use Cassandra; it is good for real-time data (both write- and query-wise). You can also use Redis or MongoDB; they come with their own sets of problems as result stores, but a little tweaking of the configuration can go a long way.
If you mean something completely different from Celery, then you can look into asyncio (Python 3.5) and ZeroMQ for achieving the same. I can't comment more on that, though.
I'm facing a design problem within my project.
PROBLEM
I need to query Solr with all the possible combinations (more or less 20 million) of some parameters extracted from our lists, to test whether they give at least one result. If they don't, that combination is inserted into a blacklist (used for statistical analysis and sitemap creation).
HOW I'M DOING IT NOW
Nested for loops combine the parameters (extracted from Python lists) and pass them to a method (the same one I use in the production environment to query the DB within the website) that tests for zero results. If it's zero, another method inserts the combination into the blacklist.
No threading is involved.
HOW I'D LIKE TO TO THIS
I'd like to put all the combinations into a queue and let thread objects pull them, query, and insert, for better performance.
WHAT PROBLEMS I'M EXPERIENCING
Slowness: being single threaded, it now takes a long time to complete (when and if it completes).
Connection reset by peer [104]: an error thrown by Solr after it has been queried for a while (I increased the pool size, but nothing changes); this is the most recurrent (and annoying) error at the moment.
Python hanging: this I resolved with a timeout decorator (which isn't a correct solution, but at least it helps me get through the whole processing and have quick test output for now; I'll drop it whenever I come up with a smarter solution).
Queue max size: a Queue object can contain up to 32k elements, so it won't fit my numbers.
WHAT I'M USING
Python 2.7
MySQL
Apache Solr
sunburnt (Python interface to Solr)
Linux box
I don't need any code debugging, since I'd rather throw away what I did for a fresh start instead of patching it over and over and over... "Trial and error" is not what I like.
I'd welcome any suggestion that comes to mind for designing this the correct way. Links, websites, and guides are also very welcome, since my experience with this kind of script is building as I work.
Thanks all in advance for your help! If you didn't understand something, just ask; I'll answer/update the post if needed!
EDIT BASED ON SOME ANSWERS (will keep this updated)
I'll probably drop Python threads for the multiprocessing lib: this could solve my performance issues.
Divide-and-conquer based construction method: this should add some logic to my parameter construction, without needing any brute-force approach.
What I still need to know: where can I store my combinations to feed the worker threads? Maybe this is no longer an issue, since the divide-and-conquer approach may let me generate the combinations at runtime and split them between the worker threads.
NB: I won't accept any answer for now, since I'd like to keep this post alive for a while, just to gather more and more ideas (not only for me, but maybe for future reference by others, given its generic nature).
Thanks all again!
Instead of brute force, change to using a divide-and-conquer approach while keeping track of the number of hits for each search. If you subdivide into certain combinations, some of those sets will be empty so you eliminate many subtrees at once. Add missing parameters into remaining searches and repeat until you are done. It takes more bookkeeping but many fewer searches.
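A rough sketch of that idea; count_hits() stands in for a Solr query that returns only the hit count for a partial set of parameters, and the parameter lists are made up:

def blacklist_empty(fixed, remaining, blacklist):
    """Recursively refine parameter combinations, pruning empty subtrees."""
    if count_hits(fixed) == 0:
        # nothing matches this partial combination, so every fuller
        # combination under it is empty too -- blacklist the whole subtree
        blacklist.append(fixed)
        return
    if not remaining:
        return                      # fully specified and non-empty: keep it
    name, values = remaining[0]
    for value in values:
        blacklist_empty(fixed + [(name, value)], remaining[1:], blacklist)

# example parameter space (placeholders)
params = [('color', ['red', 'blue']), ('size', ['s', 'm', 'l'])]
blacklist = []
blacklist_empty([], params, blacklist)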
You can use the stdlib "multiprocessing" module in order to have several subprocesses working on your combinations. This works better than Python's threads and allows each logical CPU core in your configuration to run at the same time.
Here is a minimalist example of how it works:
import random
from multiprocessing import Pool

def check(n):
    # stand-in for the real test: succeeds roughly once in 100000 calls
    if random.randint(0, 100000) == 0:
        return True
    return False

if __name__ == '__main__':
    # the number below should be equal to your number of processor cores:
    p = Pool(4)
    x = any(p.map(check, xrange(1000000)))
    print x
So, this runs one million tests, divided across 4 "worker" processes, with no scaling issues.
However, given the nature of the error messages you are getting, though you don't explicitly say so, you seem to be running an application with a web interface - and you wait for all the processing to finish before rendering a result to the browser. This typically won't work with long-running calculations: you'd better perform all your calculations in a process separate from the server process serving your web interface, and update the web interface via asynchronous requests, using a little JavaScript. That way you will avoid any "connection reset by peer" errors.
I'm working on a simple experiment in Python. I have a "master" process, in charge of all the others, and every single process has a connection via unix socket to the master process. I would like the master process to be able to monitor all of the sockets for a response - but there could theoretically be almost a hundred of them. How would threads impact the memory and performance of the application? What would be the best solution? Thanks a lot!
One hundred simultaneous threads might be pushing the reasonable limits of threading. If you find this is the cleanest way to organize your code, I'd say give it a try, but threading really doesn't scale very far.
What works better is to use a technique like select to wait for one of the sockets to be readable / writable / or to have an error to report. This mechanism lets you go to sleep until something interesting happens, handle as many sockets as have content to handle, and then go back to sleep again, all in a single thread of execution. Removing the multi-threading can often reduce the chances for errors, and this style of programming should get you into the hundreds of connections no trouble. (If you want to go beyond about 100, I'd use the poll functionality instead of select -- constantly rebuilding the list of interesting file descriptors takes time that poll does not require.)
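A minimal sketch of that select() loop for the master process; master_sockets is assumed to be the list of connected unix sockets, and handle_response() is a placeholder:

import select

def serve(master_sockets):
    while True:
        # sleep until at least one child socket has something to say
        readable, _, errored = select.select(master_sockets, [], master_sockets, 1.0)
        for sock in readable:
            data = sock.recv(4096)
            if not data:                    # child closed its end
                master_sockets.remove(sock)
                sock.close()
                continue
            handle_response(sock, data)     # placeholder for your handling
        for sock in errored:
            master_sockets.remove(sock)
            sock.close()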
Something to consider is the Python Twisted Framework. They've gone to some length to provide a consistent way to hook callbacks onto events for this exact sort of programming. (If you're familiar with node.js, it's a bit like that, but Python.) I must admit a slight aversion to Twisted -- I never got very far in their documentation without being utterly baffled -- but a lot of people made it further in the docs than I did. You might find it a better fit than I have.
The easiest way to conduct comparative tests of threads versus processes for socket handling is to use the SocketServer in Python's standard library. You can easily switch approaches (while keeping everything else the same) by inheriting from either ThreadingMixIn or ForkingMixIn. Here is a simple example to get you started.
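Something along these lines, a minimal sketch assuming Python 2 module naming; the echo handler is just a stand-in for your real per-connection logic:

import SocketServer

class EchoHandler(SocketServer.StreamRequestHandler):
    def handle(self):
        # replace with your real per-connection logic
        line = self.rfile.readline()
        self.wfile.write(line)

# swap ThreadingMixIn for ForkingMixIn to compare threads vs processes
class ThreadedServer(SocketServer.ThreadingMixIn, SocketServer.TCPServer):
    pass

if __name__ == '__main__':
    server = ThreadedServer(('localhost', 9999), EchoHandler)
    server.serve_forever()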
Another alternative is a select/poll approach using non-blocking sockets in a single process and a single thread.
If you're interested in software that is already fully developed and highly evolved, consider these high-performance Python based server packages:
The Twisted framework uses the async single process, single thread style.
The Tornado framework is similar (less evolved, less full featured, but easier to understand)
And Gunicorn which is a high-performance forking server.
I want to use Python's multiprocessing to do concurrent processing without using locks (locks to me are the opposite of multiprocessing) because I want to build up multiple reports from different resources at the exact same time during a web request (normally takes about 3 seconds but with multiprocessing I can do it in .5 seconds).
My problem is that, if I expose such a feature to the web and get 10 users pulling the same report at the same time, I suddenly have 60 interpreters open at the same time (which would crash the system). Is this just the common sense result of using multiprocessing, or is there a trick to get around this potential nightmare?
Thanks
If you're really worried about having too many instances, you could think about protecting the call with a Semaphore object. If I understand what you're doing, then you can use the threaded semaphore object:
from threading import Semaphore

sem = Semaphore(10)
with sem:
    make_multiprocessing_call()
I'm assuming that make_multiprocessing_call() will clean up after itself.
This way only 10 "extra" instances of python will ever be opened, if another request comes along it will just have to wait until the previous have completed. Unfortunately this won't be in "Queue" order ... or any order in particular.
Hope that helps
You are barking up the wrong tree if you are trying to use multiprocessing to add concurrency to a network app. You are barking up a completely wrong tree if you're creating processes for each request. multiprocessing is not what you want (at least not as a concurrency model).
There's a good chance you want an asynchronous networking framework like Twisted.
Locks are only ever necessary if you have multiple agents writing to a source. If they are just reading, locks are not needed (and, as you said, defeat the purpose of multiprocessing).
Are you sure that would crash the system? On a web server using CGI, each request spawns a new process, so it's not unusual to see thousands of simultaneous processes (granted, in Python one should use WSGI and avoid this), which do not crash the system.
I suggest you test your theory -- it shouldn't be difficult to manufacture 10 simultaneous accesses -- and see if your server really does crash.
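A quick way to manufacture that load; the URL is a placeholder for your report endpoint:

import threading
import urllib2

def hit():
    urllib2.urlopen('http://localhost:8000/report').read()   # placeholder URL

threads = [threading.Thread(target=hit) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()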
My question is: which python framework should I use to build my server?
Notes:
This server talks HTTP with its clients: GET and POST (via pyAMF).
Clients "submit" "tasks" for processing and then, sometime later, retrieve the associated "task_result".
submit and retrieve might be separated by days - different HTTP connections
The "task" is a lump of XML describing a problem to be solved, and a "task_result" is a lump of XML describing an answer.
When a server gets a "task", it queues it for processing
The server manages this queue and, when tasks get to the top, arranges for them to be processed.
The processing is performed by a long-running (15 mins?) external program (via subprocess) which is fed the task XML and which produces a "task_result" lump of XML, which the server picks up and stores (for later client retrieval).
It serves a couple of basic HTML pages showing the queue and processing status (admin purposes only).
I've experimented with twisted.web, using SQLite as the database and threads to handle the long running processes.
But I can't help feeling that I'm missing a simpler solution. Am I? If you were faced with this, what technology mix would you use?
I'd recommend using an existing message queue. There are many to choose from (see below), and they vary in complexity and robustness.
Also, avoid threads: let your processing tasks run in a different process (why do they have to run in the webserver?)
By using an existing message queue, you only need to worry about producing messages (in your webserver) and consuming them (in your long running tasks). As your system grows you'll be able to scale up by just adding webservers and consumers, and worry less about your queuing infrastructure.
Some popular python implementations of message queues:
http://code.google.com/p/stomper/
http://code.google.com/p/pyactivemq/
http://xph.us/software/beanstalkd/
I'd suggest the following. (Since it's what we're doing.)
A simple WSGI server (wsgiref or werkzeug). The HTTP requests coming in will naturally form a queue. No further queueing needed. You get a request, you spawn the subprocess as a child and wait for it to finish. A simple list of children is about all you need.
I used a modification of the main "serve forever" loop in wsgiref to periodically poll all of the children to see how they're doing.
A simple SQLite database can track request status. Even this may be overkill, because your XML inputs and results can just sit in the file system.
That's it. Queueing and threads don't really enter into it. A single long-running external process is too complex to coordinate. It's simplest if each request is a separate, stand-alone, child process.
If you get immense bursts of requests, you might want a simple governor to prevent creating thousands of children. The governor could be a simple queue, built using a list with append() and pop(). Every request goes in, but only requests that fit within some "max number of children" limit are taken out.
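A rough sketch of that shape, using wsgiref and subprocess; the solver command, the save_task_xml() helper, the paths, and the size limit are all placeholders:

import subprocess
from wsgiref.simple_server import make_server

MAX_CHILDREN = 4
children = []                        # (task_id, Popen) pairs

def app(environ, start_response):
    # reap finished children on every request (a periodic poll works too)
    children[:] = [(tid, p) for tid, p in children if p.poll() is None]
    if environ['REQUEST_METHOD'] == 'POST' and len(children) < MAX_CHILDREN:
        task_id = save_task_xml(environ['wsgi.input'].read())          # placeholder
        proc = subprocess.Popen(['solver', '/tasks/%s.xml' % task_id])  # placeholder command
        children.append((task_id, proc))
        start_response('202 Accepted', [('Content-Type', 'text/plain')])
        return [task_id]
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['%d tasks running' % len(children)]

if __name__ == '__main__':
    make_server('', 8000, app).serve_forever()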
My reaction is to suggest Twisted, but you've already looked at this. Still, I stick by my answer. Without knowing your personal pain points, I can at least share some things that helped me reduce almost all of the deferred madness that arises when you have several dependent, blocking actions you need to perform for a client.
Inline callbacks (lightly documented here: http://twistedmatrix.com/documents/8.2.0/api/twisted.internet.defer.html) provide a means to make long chains of deferreds much more readable (to the point of looking like straight-line code). There is an excellent example of the complexity reduction this affords here: http://blog.mekk.waw.pl/archives/14-Twisted-inlineCallbacks-and-deferredGenerator.html
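A small illustration of the style; getPage is Twisted's asynchronous HTTP fetch from that era, and store_result() is a placeholder for any function returning a Deferred:

from twisted.internet import defer
from twisted.web.client import getPage

@defer.inlineCallbacks
def handle_task(url):
    # reads like straight-line code, but each yield waits on a Deferred
    body = yield getPage(url)
    result = yield store_result(body)   # placeholder: returns a Deferred
    defer.returnValue(result)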
You don't always have to get your bulk processing to integrate nicely with Twisted. Sometimes it is easier to break a large piece of your program off into a stand-alone, easily testable/tweakable/implementable command line tool and have Twisted invoke this tool in another process. Twisted's ProcessProtocol provides a fairly flexible way of launching and interacting with external helper programs. Furthermore, if you suddenly decide you want to cloudify your application, it is not all that big of a deal to use a ProcessProtocol to simply run your bulk processing on a remote server (random EC2 instances perhaps) via ssh, assuming you have the keys setup already.
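And a minimal ProcessProtocol sketch for driving such an external tool; the 'solver' executable, its argument, and handle_result() are made up:

from twisted.internet import protocol, reactor

class SolverProtocol(protocol.ProcessProtocol):
    def __init__(self):
        self.output = []

    def outReceived(self, data):
        self.output.append(data)                # collect the task_result XML

    def processEnded(self, reason):
        handle_result(''.join(self.output))     # placeholder callback

# assumes the reactor is already running, as it would be inside a Twisted app
reactor.spawnProcess(SolverProtocol(), 'solver', ['solver', 'task.xml'])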
You can have a look at Celery.
It seems any python web framework will suit your needs. I work with a similar system on a daily basis and I can tell you, your solution with threads and SQLite for queue storage is about as simple as you're going to get.
Assuming order doesn't matter in your queue, threads should be acceptable. It's important to make sure you don't create race conditions with your queues or, for example, have two of the same job type running simultaneously. If that is a concern, I'd suggest a single-threaded application that does the items in the queue one by one.