I have a conceptual question.
I currently have a programme that runs inside a never-ending loop.
def mycode():
    # Perform login to server and retrieve cookies etc.
    while True:
        # Perform a URL request (with custom headers, cookies etc.)
        # Process the reply
        # Perform URL requests dependent upon the values in the replies
        # Process the reply
I am happy for this part to continue as it is, as the URLs must be called one after the other.
However, the server limits a single account to a limited number of functions, so it would be useful to be able to perform this process with two (or more) different accounts.
My question is: Is this possible to do? I have done a reasonable amount of reading on queues and multithreading, if you nice people could suggest a method with a good (easy to understand) example I would be most appreciative.
Gevent is a performant green-threads implementation; the snippet below is one of its examples.
I'm unsure whether, by doing this for different accounts against the same server, you mean having different worker functions handling the URL processing - in effect running mycode() n times, once per account. Perhaps you could expand on the details.
>>> import gevent
>>> from gevent import socket
>>> urls = ['www.google.com', 'www.example.com', 'www.python.org']
>>> jobs = [gevent.spawn(socket.gethostbyname, url) for url in urls]
>>> gevent.joinall(jobs, timeout=2)
>>> [job.value for job in jobs]
['74.125.79.106', '208.77.188.166', '82.94.164.162']
In addition, you could break the problem up by using something like beanstalkd, which would allow you to run your main process n times, once per account, and put the results on a beanstalk queue for processing by another process.
That saves having to deal with threading, which is always a good thing in non-trivial applications.
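For illustration, a minimal sketch of that hand-off using the beanstalkc client library (the tube name and payload format are made up, and a beanstalkd server is assumed to be running on localhost):

import json
import beanstalkc

# Producer side: each account's main process pushes its results onto a tube.
producer = beanstalkc.Connection(host='localhost', port=11300)
producer.use('results')
producer.put(json.dumps({'account': 'account1', 'reply': '...'}))

# Consumer side: a separate process pulls results off the tube and processes them.
consumer = beanstalkc.Connection(host='localhost', port=11300)
consumer.watch('results')
job = consumer.reserve()
data = json.loads(job.body)
# ... process data ...
job.delete()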
Let's say I make 5 requests via a requests.Session to a server, using a ThreadPoolExecutor:
import concurrent.futures
import requests

session = requests.Session()
executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)

def post(data):
    response = session.post('http://example.com/api/endpoint1', data)
    return response

for data in (data1, data2, data3, data4, data5):
    executor.submit(post, data)
Since we are using the same requests.Session for each request, do we have to wait for the server to acknowledge the first request before we can send the next one?
If I had 5 sessions open concurrently -- one session per thread -- would I be able to send the requests more rapidly by sending each request via its own session?
The maintainer already recommends "one session per thread" so it's certainly doable... but will it improve performance?
Would I be better off using aiohttp and async?
So, first of all: if you are not sure whether a certain object or function is thread-safe, you should assume that it is not. Therefore you should not use a Session object in multiple threads without appropriate locking.
As for performance: always measure. Many libraries tend to do a lot of work under the hood, including opening multiple TCP connections. They can probably be configured to tune performance, so it's very hard to answer the question precisely, especially since we don't know your use case. For example, if you intend to make 5 parallel requests, simply run 5 threads with 5 session objects. Most likely you won't see a difference between libraries (unless you pick a really bad one). On the other hand, if you are looking at hundreds or thousands of concurrent requests, it will matter.
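As an illustration of the one-session-per-thread setup, here is a minimal sketch (the endpoint and payloads are placeholders, not part of the original question):

import concurrent.futures
import requests

payloads = [{'n': i} for i in range(5)]  # placeholder data

def post_with_own_session(data):
    # A fresh Session per call means no session state is shared across threads.
    with requests.Session() as session:
        return session.post('http://example.com/api/endpoint1', data=data)

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    responses = list(executor.map(post_with_own_session, payloads))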
Anyway: always measure it yourself.
My development stack is Django/Python. (that includes Django-REST-framework)
I am currently looking for ways to do multiple distinct API calls.
client.py
import requests

def submit_order(list_of_orders):
    # Submit each element in list_of_orders (ideally in a thread)
    for order in list_of_orders:
        try:
            response = requests.post("http://www.confidential.url.com", data=order)
        except requests.RequestException:
            pass  # retry again
        else:
            if response.status_code != 200:
                pass  # retry again
In the above method I am currently submitting the orders one by one; I want to submit all orders at once. Secondly, I want to retry a submission up to x times if it fails.
I currently do not know how best to achieve this.
I am looking for ways that python libraries or Django application provide rather than re-inventing the wheel.
Thanks
As @Selcuk said, you can try django-celery, which is the recommended approach in my opinion, but you will need to do some configuration and read some manuals.
On the other hand, you can try using multiprocessing like this:
from multiprocessing import Pool

def process_order(order):
    # Handle each order here, doing requests.post and then retrying if necessary
    pass

def submit_order(list_of_orders):
    orders_pool = Pool(len(list_of_orders))
    results = orders_pool.map(process_order, list_of_orders)
    # Do something with the results here
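For example, process_order could look roughly like this, with a bounded retry loop (the URL and retry count are placeholders, not part of the original answer):

import requests

MAX_RETRIES = 3  # the "x" retries from the question; adjust as needed

def process_order(order):
    # Try the POST up to MAX_RETRIES times before giving up.
    for attempt in range(MAX_RETRIES):
        try:
            response = requests.post("http://www.confidential.url.com", data=order)
        except requests.RequestException:
            continue  # network error, try again
        if response.status_code == 200:
            return response  # success
    return None  # all attempts failed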
It will depend on what you need to get done. If the request operations can run in the background and your API user can be notified later, just use django-celery and notify the user accordingly; but if you want a simple approach that reacts immediately, you can use the one I "prototyped" for you.
You should expect some delay in the responses to your requests (as you are doing POST requests), so make sure your POST requests don't grow a lot, because that could affect the experience of the API clients calling your services.
I am new to python, and even newer to twisted. I am trying to use twisted to download a few hundred thousand files but am having trouble trying to add an errback. I'd like to print the bad url if the download fails. I've misspelled one of my urls on purpose in order to throw an error. However, the code I have just hangs and python doesn't finish (it finishes fine if I remove the errback call).
Also, how do I process each file individually? From my understanding, "finish" is called when everything completes. I'd like to gzip each file when it's downloaded so that it's removed from memory.
Here's what I have:
from twisted.internet import reactor, defer
from twisted.web import client

urls = [
'http://www.python.org',
'http://stackfsdfsdfdsoverflow.com', # misspelled on purpose to generate an error
'http://www.twistedmatrix.com',
'http://www.google.com',
'http://launchpad.net',
'http://github.com',
'http://bitbucket.org',
]
def finish(results):
    for result in results:
        print 'GOT PAGE', len(result), 'bytes'
    reactor.stop()

def print_badurls(err):
    print err  # how do I just print the bad url????????
waiting = [client.getPage(url) for url in urls]
defer.gatherResults(waiting).addCallback(finish).addErrback(print_badurls)
reactor.run()
Welcome to Python and Twisted!
There are a few problems with the code you pasted. I'll go through them one at a time.
First, if you do want to download thousands of urls, and will have thousands of items in the urls list, then this line:
waiting = [client.getPage(url) for url in urls]
is going to cause problems. Do you want to try to download every page in the list simultaneously? By default, in general, things you do in Twisted happen concurrently, so this loop starts downloading every URL in the urls list at once. Most likely, this isn't going to work. Your DNS server is going to drop some of the domain lookup requests, your DNS client is going to drop some of the domain lookup responses. The TCP connection attempts to whatever addresses you do get back will compete for whatever network resources are still available, and some of them will time out. The rest of the connections will all trickle along, sharing available bandwidth between dozens or perhaps hundreds of different downloads.
Instead, you probably want to limit the degree of concurrency to perhaps 10 or 20 downloads at a time. I wrote about one approach to this on my blog a while back.
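One common way to cap the number of simultaneous downloads (not necessarily the approach from that blog post) is twisted's DeferredSemaphore; a rough sketch, reusing the urls list and client.getPage from the question:

from twisted.internet import defer
from twisted.web import client

sem = defer.DeferredSemaphore(10)  # at most 10 downloads in flight at once

# sem.run acquires the semaphore, calls getPage, and releases it when the
# download finishes, so only 10 requests are ever active at the same time.
waiting = [sem.run(client.getPage, url) for url in urls]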
Second, gatherResults returns a Deferred that fires as soon as any one Deferred passed to it fires with a failure. So as soon as any one client.getPage(url) fails - perhaps because of one of the problems I mentioned above, or because the domain has expired, or the web server happens to be down, or just because of an unfortunate transient network condition - the Deferred returned by gatherResults will fail. finish will be skipped, and print_badurls will be called with the error describing the single failed getPage call.
To handle failures from individual HTTP requests, add the callbacks and errbacks to the Deferreds returned from the getPage calls. After adding those callbacks and errbacks, you can use defer.gatherResults to wait for all of the downloads and processing of the download results to be complete.
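A rough sketch of that approach, staying with the Python 2 / client.getPage style of the question (the handler names are made up, and urls is the list from the question):

from twisted.internet import reactor, defer
from twisted.web import client

def handle_page(body, url):
    # Called for each page that downloads successfully; gzip/process body here.
    print 'GOT PAGE', url, len(body), 'bytes'

def handle_error(err, url):
    # Called for each page that fails; the url is passed in explicitly.
    print 'FAILED', url, err.getErrorMessage()

def finish(ignored):
    reactor.stop()

waiting = []
for url in urls:
    d = client.getPage(url)
    d.addCallback(handle_page, url)   # extra arguments are passed after the result
    d.addErrback(handle_error, url)
    waiting.append(d)

# The individual errbacks have already handled any failures, so gatherResults
# simply fires once every download has been attempted and processed.
defer.gatherResults(waiting).addCallback(finish)
reactor.run()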
Third, you might want to consider using a higher-level tool for this - scrapy is a web crawling framework (based on Twisted) that provides lots of cool useful helpers for this kind of application.
Possible Duplicate:
Asynchronous HTTP calls in Python
I have a Django view which needs to retrieve search results from multiple web services, blend the results together, and render them. I've never done any multithreading in Django before. What is a modern, efficient, safe way of doing this?
I don't know anything about it yet, but gevent seems like a reasonable option. Should I use that? Does it play well with Django? Should I look elsewhere?
Not sure about gevent. The simplest way is to use threads[*]. Here's a simple example of how to use threads in Python:
# std lib modules. "Batteries included" FTW.
import threading
import time

thread_result = -1

def ThreadWork():
    global thread_result
    thread_result = 1 + 1
    time.sleep(5)  # phew, I'm tired after all that addition!

my_thread = threading.Thread(target=ThreadWork)
my_thread.start()  # This will call ThreadWork in the background.
# In the meantime, you can do other stuff:
y = 2 * 5  # Completely independent calculation.
my_thread.join()  # Wait for the thread to finish doing its thing.
# This should take about 5 seconds, due to time.sleep being called.
print "thread_result * y =", thread_result * y
You can start multiple threads, have each make different web service calls, and join on all of those threads. Once all those join calls have returned, the results are in, and you'll be able to blend them.
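For instance, something along these lines (a sketch only; the service URLs are placeholders, and urllib2 is used to match the Python 2 era of this answer):

import threading
import urllib2

urls = ['http://service-a.example.com/search?q=foo',
        'http://service-b.example.com/search?q=foo']
results = [None] * len(urls)

def fetch(i, url):
    # Each thread writes only to its own slot in results, so no lock is needed here.
    results[i] = urllib2.urlopen(url, timeout=5).read()

threads = [threading.Thread(target=fetch, args=(i, url))
           for i, url in enumerate(urls)]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=10)  # don't let one slow service hang the response forever

# results now holds each service's response (or None where it failed); blend them here.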
more advanced tips: You should call join with a timeout; otherwise, your users might be waiting indefinitely for your app to send them a response. Even better would be for you to make those web service calls before the request arrives at your app; otherwise, the responsiveness of your app is at the mercy of the services that you rely on.
caveat about threading in general: Be careful with data that can be accessed by two (or more) different threads. Access to the same data needs to be "synchronized". The most popular synchronization device is a lock, but there is a plethora of others. threading.Lock implements a lock. If you're not careful about synchronization, you're likely to write a "race condition" into your app. Such bugs are notoriously difficult to debug, because they cannot be reliably reproduced.
In my simple example, thread_result was shared between my_thread and the main thread. I didn't need any locks, because the main thread did not access thread_result until my_thread terminated. If I hadn't called my_thread.join, the result would sometimes be -10 instead of 20. Go ahead and try it yourself.
[*] Python doesn't have true threading in the sense that concurrent threads do not execute simultaneously, even if you have idle cores. However, you still get concurrent execution; when one thread is blocked, other threads can execute.
I just nicely solved this problem using futures, available in 3.2 and backported to earlier versions including 2.x.
In my case I was retrieving results from an internal service and collating them:
from concurrent import futures  # the "futures" backport on 2.x; concurrent.futures on 3.2+
import os
import urllib2
from django.core.urlresolvers import reverse

def _getInfo(request, key):
    return urllib2.urlopen(
        'http://{0[SERVER_NAME]}:{0[SERVER_PORT]}'.format(request.META) +
        reverse('my.internal.view', args=(key,)),
        timeout=30)

# …

with futures.ThreadPoolExecutor(max_workers=os.sysconf('SC_NPROCESSORS_ONLN')) as executor:
    futureCalls = dict([
        (key, executor.submit(_getInfo, request, key))
        for key in myListOfItems
    ])

for key in myListOfItems:
    curInfo = futureCalls[key]
    if curInfo.exception() is not None:
        pass  # "exception calling for info: {0}".format(curInfo.exception())
    else:
        pass  # Handle the result…
gevent will not help you to process the task faster. It is just more efficient than threads when it comes to resource footprint. When running gevent with Django (usually via gunicorn) your web app will be able to handle more concurrent connections than a normal django wsgi app.
But I think this has nothing to do with your problem. What you want to do is handle a huge task in a single Django view, which is usually not a good idea. I personally advise against using threads or gevent's greenlets for this in Django. I see the point for standalone Python scripts, daemons, or other tools, but not for the web. It mostly results in instability and a bigger resource footprint. Instead, I agree with the comments of dokkaebi and Andrew Gorcester. The two comments differ somewhat, though, since it really depends on what your task is about.
If you can split your task into many smaller tasks, you could create multiple views handling these subtasks. These views could return something like JSON and be consumed via AJAX from your frontend. That way you can build the content of your page as it "comes in", and the user does not need to wait until the whole page has loaded.
If your task is one huge chunk, you are better off with a task queue handler. Celery comes to mind. If Celery is overkill, you can use zeroMQ. This basically works as Andrew mentioned above: you schedule the task for processing and poll the backend from your frontend page until the task is finished (usually also via AJAX). You could also use something like long polling here.
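To make the schedule-and-poll idea concrete, here is a minimal sketch with Celery (the task, view, and URL names are made up, and this is not a drop-in implementation):

# tasks.py
from celery import shared_task
import requests

@shared_task
def fetch_results(query):
    # The slow work runs in a Celery worker, not in the Django request cycle.
    return requests.get('http://search.example.com', params={'q': query}).json()

# views.py
from django.http import JsonResponse
from .tasks import fetch_results

def start_search(request):
    async_result = fetch_results.delay(request.GET['q'])  # schedule and return at once
    return JsonResponse({'task_id': async_result.id})

def poll_search(request, task_id):
    async_result = fetch_results.AsyncResult(task_id)
    if async_result.ready():
        return JsonResponse({'done': True, 'data': async_result.get()})
    return JsonResponse({'done': False})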
I need to do some three-step web scraping in Python. I have a couple base pages that I scrape initially, and I need to get a few select links off those pages and retrieve the pages they point to, and repeat that one more time. The trick is I would like to do this all asynchronously, so that every request is fired off as soon as possible, and the whole application isn't blocked on a single request. How would I do this?
Up until this point, I've been doing one-step scraping with eventlet, like this:
import eventlet
import eventlet.green.urllib2

urls = ['http://example.com', '...']

def scrape_page(url):
    """Gets the data from the web page."""
    body = eventlet.green.urllib2.urlopen(url).read()
    # Do something with body
    data = body
    return data

pool = eventlet.GreenPool()
for data in pool.imap(scrape_page, urls):
    pass  # Handle the data...
However, if I extend this technique and include a nested GreenPool.imap loop, it blocks until all the requests in that group are done, meaning the application can't start more requests as needed.
I know I could do this with Twisted or another asynchronous server, but I don't need such a huge library and I would rather use something lightweight. I'm open to suggestions, though.
Here is an idea... but forgive me since I don't know eventlet. I can only provide a rough concept.
Consider your "step 1" pool the producers. Create a queue and have your step 1 workers place any new urls they find into the queue.
Create another pool of workers. Have these workers pull from the queue for urls and process them. If during their process they discover another url, put that into the queue. They will keep feeding themselves with subsequent work.
Technically this approach makes it easy to recurse beyond 1, 2, 3+ steps. As long as the workers find new urls and put them in the queue, the work keeps happening.
Better yet, start out with the original urls in the queue, and just create a single pool that puts new urls to that same queue. Only one pool needed.
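A rough sketch of that single-pool-plus-queue idea in eventlet (link extraction is stubbed out, the numbers are arbitrary, and this is only a concept sketch, not tested code):

import eventlet
from eventlet.green import urllib2

queue = eventlet.Queue()
pool = eventlet.GreenPool(20)          # at most 20 requests in flight
seen = set()

def extract_links(body):
    # Placeholder: pull out whichever "next step" urls you care about.
    return []

def worker(url):
    body = urllib2.urlopen(url).read()
    # ... process body ...
    for link in extract_links(body):
        if link not in seen:
            seen.add(link)
            queue.put(link)            # workers feed the queue themselves

for url in ['http://example.com']:     # the original "step 1" urls
    seen.add(url)
    queue.put(url)

# Keep spawning workers until the queue is drained and nothing is still running.
while not queue.empty() or pool.running():
    try:
        url = queue.get(timeout=1)
    except eventlet.queue.Empty:
        continue
    pool.spawn_n(worker, url)

pool.waitall()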
Post note
Funny enough, after I posted this answer and went to look for what the eventlet 'queue' equivalent was, I immediately found an example showing exactly what I just described:
http://eventlet.net/doc/examples.html#producer-consumer-web-crawler
In that example there are a producer and a fetch method. The producer pulls urls from the queue and spawns threads to fetch them; fetch then puts any new urls back into the queue, and they keep feeding each other.