I would like to know how to optimize a link-checker which I've implemented as a webservice in Python. I already cache responses to a database which expires every 24 hours. Every night I have a cron job which refreshes the cache so the cache is effectively never out-of-date.
Things slow down, of course, if I truncate the cache, or if a page being checked has a lot of links that are not in the cache. I am no computer scientist, so I'd like some advice and concrete help on how to optimize this using either threads or processes.
I thought of optimizing by requesting each url as a background process (pseudocodish):
    import subprocess

    # The part of the code that gets response codes not in the cache...
    responses = {}
    # To begin, start one background curl process per url.
    # Popen returns immediately, so no trailing "&" is needed.
    processes = {}
    for url in urls:
        processes[url] = subprocess.Popen(["curl", "-s", url], stdout=subprocess.PIPE)
    # Now loop through again and collect the responses
    for url in processes:
        stdout, _ = processes[url].communicate()
        responses[url] = stdout
    # Now I have a responses dict which has responses keyed by url
This cuts down the time of the script by at least 1/6 for most of my use-cases as opposed to just looping through each url and waiting for each response, but I am concerned about overloading the server the script is running on. I considered using a queue and having batches of maybe 25 or so at a time.
Would multithreading be a better overall solution? If so, how would I do this using the threading module?
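One way to cap the load is a fixed-size thread pool, which plays the role of the "batches of 25" idea without managing batches by hand. The following is only a minimal sketch using the standard library on Python 3: `fetch_status` and `check_links` are hypothetical names, the timeout is an assumption, and the request is a plain GET like the curl call in the question.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request failed."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # 4xx/5xx responses arrive as exceptions but still carry a status code
        return err.code
    except (URLError, OSError, ValueError):
        return None


def check_links(urls, fetch=fetch_status, max_workers=25):
    """Check urls with at most max_workers requests in flight at once."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input urls
        return dict(zip(urls, pool.map(fetch, urls)))
```

Calling `check_links(uncached_urls)` would return a dict of status codes keyed by url. Lowering `max_workers` trades speed for less load on the machine running the script.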
Related
I have an ANPR (automated number plate reading) system: essentially a few cameras, configured to make HTTP POSTs to locations we specify. Our cocktail of problems is as follows:
Our script needs to send this data onto multiple, occasionally slow places.
The camera locks up while it's POSTing.
So if my script takes 15 seconds to complete (it can), we might miss a read.
Here's the cut-down version of my script at the moment. Apologies for the 3.4 syntax; we've got some old machines on site.
    import asyncio
    from aiohttp import web

    @asyncio.coroutine
    def handle_plate_read(request):
        post_data = yield from request.post()
        print(post_data)
        # slow stuff!
        return web.Response()

    app = web.Application()
    app.router.add_post('/', handle_plate_read)
    web.run_app(app, port=LISTEN_PORT)
This is functional but can I push back the 200 to the camera early (and disconnect it) and carry on processing the data, or otherwise easily defer that step until after the camera connection is handled?
If I understood your question correctly, you can of course carry on processing the data right after the response, or defer the processing until even later.
Here's what a simple solution can look like:
Instead of doing the slow stuff before the response, add the data it needs to a Queue with unlimited size and return the response immediately.
Run a Task in the background that grabs data from the queue and does the slow job. The Task runs concurrently with other coroutines and doesn't block them.
Since the "slow stuff" is probably CPU-bound, you would need to use run_in_executor with a ProcessPoolExecutor to do it in other process(es).
This should work in the basic case.
But you should also think about how it will work under heavy load. For example, if you grab data for the slow stuff quickly but process it much more slowly, your queue will grow and you'll run out of RAM.
In that case it makes sense to store your data in a DB rather than in a queue (you would probably need to create a separate task to store the data in the DB without blocking the response). There are asyncio drivers for many databases.
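The queue-plus-background-task idea might be sketched like this. It is an illustration under several assumptions rather than a drop-in handler: it uses `async`/`await` (Python 3.7+) instead of the 3.4 `yield from` syntax, it leaves aiohttp out so that it runs with the standard library alone, and `slow_stuff`, `worker` and `demo` are hypothetical names. Here `handle_plate_read` takes the posted data directly; in the real handler the same two lines would follow `request.post()`.

```python
import asyncio
import time


def slow_stuff(post_data):
    """Hypothetical stand-in for forwarding a read to slow destinations."""
    time.sleep(0.1)
    return "processed %s" % post_data


async def worker(queue, results):
    """Background task: take reads off the queue and process them in turn."""
    loop = asyncio.get_running_loop()
    while True:
        post_data = await queue.get()
        try:
            # None selects the loop's default thread pool; pass a
            # ProcessPoolExecutor instead if the work is CPU-bound.
            result = await loop.run_in_executor(None, slow_stuff, post_data)
            results.append(result)
        finally:
            queue.task_done()


async def handle_plate_read(post_data, queue):
    """Handler body: enqueue the read and answer straight away."""
    queue.put_nowait(post_data)
    return 200


async def demo():
    queue = asyncio.Queue()  # unbounded by default
    results = []
    task = asyncio.ensure_future(worker(queue, results))
    statuses = [await handle_plate_read(p, queue) for p in ("AB12 CDE", "XY34 ZZZ")]
    await queue.join()  # demo only: wait until the backlog is drained
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return statuses, results
```

Both reads get their status back before any slow work starts, and the worker drains the queue afterwards. The unbounded queue is exactly the part that can exhaust RAM under sustained load.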
I have a particularly large task that takes ~60 seconds to complete. Heroku routers send a timeout error after 30 seconds if nothing is returned, so using a yield statement helps solve that:
    def foo():
        while not isDone:
            print("yield")
            yield " "
            time.sleep(10)

    return Response(foo(), mimetype='text/html')
(or something similar)
And that works all well and good, except in my case, at the end of my very long task, it makes a decision on where to 302 forward next. It's easy enough to set a forwarding location:
    response = Response(foo(), 302, mimetype='text/html')
    response.headers['Location'] = '/bar'
    return response
except that in this example /bar is static, and I need to assign that dynamically, and only at the end of the very long process.
So is there a way to dynamically assign the forwarding location at the end of the very long async process?
Making sure I'm interpreting this correctly: you have a request that generates something and issues a redirect. The generation takes a long time, triggering the Heroku timeout. You want to get around the timeout somehow.
Can you do this with Flask alone, with the constraint of the Heroku routing tier? Sort of. Note, though, that the simple "keep alive" you are doing fails because it results in an invalid HTTP response. Once you start sending anything that is not a header, you can't send a header (i.e., the redirect).
Your two options -
Polling. Launch an async job, pre-calculate the URL for the job result, but have it guarded by a "monitor" of some kind (e.g., check the job and display "In progress", refreshing every 2-5 seconds until it's done). You have a worker dyno that is used to calculate the result. Ordinarily, I'd say "Redis + Python RQ" for a fast start, but if you can't add any new server-side dependencies, a simple database queue table could suffice.
Pushing. Use an add-on like Pusher. No new server-side dependencies, just an account (which has a low-cost entry option). If that's not an option, roll a WebSocket-based solution.
In general, spending some time to do a good async return will pay off in the long run. It's one of the single best performance enhancements to make to any site - return fast, give the user a responsive experience. You do have to spawn the async task in another process or thread, in order to free up request threads for other responses.
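The polling option could look roughly like the sketch below. To stay runnable without Flask it shows only the two functions that the views would wrap, and all names (`long_task`, `start_job`, `poll`, the `/results/...` location) are made up for illustration. The in-memory `jobs` dict is an assumption that only holds within a single process; with a separate worker dyno the job state would have to live in a shared store such as the Redis or database queue mentioned above.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
jobs = {}  # job id -> Future (single-process assumption)


def long_task(payload):
    """Hypothetical long job; the redirect target is only known at the end."""
    time.sleep(0.2)
    return "/results/" + payload


def start_job(payload):
    """First request: start the work and hand back an id immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(long_task, payload)
    return job_id


def poll(job_id):
    """Monitor request: 202 while running, 302 plus Location once done."""
    future = jobs[job_id]
    if not future.done():
        return 202, None
    return 302, future.result()
```

The first view would return the "In progress" page carrying the job id, and the page would call the monitor URL every few seconds until it gets the 302 with the dynamically chosen location.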
I'm writing a small web server using Flask that needs to do the following things:
On the first request, serve the basic page and kick off a long (15-60 second) data processing task. The data processing task queries a second server which I do not control, updates a local database, and then performs some calculations on the results to show in the web page.
The page issues several AJAX requests that all depend on parts of the result from the long task, so I need to wait until the processing is done.
Subsequent requests for the first page would ideally re-use the previous request's result if they come in while the processing task is ongoing (or even shortly thereafter)
I tried using flask-cache (specifically SimpleCache), but ran into an issue as it seems the cache pickles the result, when I'd really rather keep the exact object.
I suppose I could re-write what I'm caching to be pickle-able, and then implement a single worker thread to do the processing.
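For the single-worker-thread idea, one possible shape is a small wrapper that keeps the in-flight `Future` around, so concurrent and slightly later requests share the exact same result object without any pickling. This is only a sketch for a single-process server: `SharedTask`, `get_future` and `reuse_seconds` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SharedTask:
    """Run func at most once at a time and share its result object."""

    def __init__(self, func, reuse_seconds=30.0):
        self._func = func
        self._reuse_seconds = reuse_seconds
        self._executor = ThreadPoolExecutor(max_workers=1)  # single worker
        self._lock = threading.Lock()
        self._future = None
        self._finished_at = None

    def _run(self):
        try:
            return self._func()
        finally:
            # Recorded before the future is marked done
            self._finished_at = time.monotonic()

    def get_future(self):
        """Return the running or recent run's Future, or start a new run."""
        with self._lock:
            fresh = self._future is not None and (
                not self._future.done()
                or time.monotonic() - self._finished_at < self._reuse_seconds
            )
            if not fresh:
                self._future = self._executor.submit(self._run)
            return self._future
```

The first page request would call `get_future()` to kick off the work, and each AJAX handler would call `get_future().result()` to wait for the same run.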
Is there a better way of handling this kind of workflow?
I think the best way to handle long data processing is something like Celery.
Send a request to run the task and receive a task ID.
Periodically send AJAX requests to check the task's progress and receive the result of its execution.
I'm currently running a Python script against multiple web servers. The general task is to find broken (external) links within a CMS. The script runs pretty well so far, but I test around 50 internal projects, each with several hundred subpages. This adds up to several thousand external links I have to check.
For that reason I added multithreading, which improved performance as I had hoped. But here comes the problem: if a page to check contains a list of links to the same server (a bundle of known issues or tasks to do), the script slows down the destination system. I want to slow down neither my own servers nor servers that are not mine.
Currently I run up to 20 threads and then wait 0.5s until a "thread position" is free. To check whether a URL is broken I use urlopen(request) from urllib2 and log every time it throws an HTTPError. Back to the list of multiple URLs to the same server: because of the multithreading, my script will "flood" that web server with up to 20 simultaneous requests.
To give you an idea of the scale this script runs at and how many URLs it has to check: using only 20 threads "slows" down the current script to a 45 min running time for only 4 projects. And this is only checking .. Next step will be to check broken URLs for . Using the current script shows us some peaks with 1000ms response time within server monitoring.
Does anyone have an idea how to improve this script in general? Or is there a much better way to check this large number of URLs? Maybe a counter that pauses the thread if there are 10 requests to a single destination?
Thanks for all suggestions
When I was running a crawler, I had all of my URLs prioritized by domain name. Basically, my queue of URLs to crawl was really a queue of domain names, and each domain name had a list of URLs.
When it came time to get the next URL to crawl, a thread would pull a domain name from the queue and crawl the next URL on that domain's list. When done processing that URL, the thread would put the domain on a delay list and remove from the delay list any domains whose delay had expired.
The delay list was a priority queue ordered by expiration time. That way I could give different delay times to each domain. That allowed me to support the crawl-delay extension to robots.txt. Some domains were ok with me hitting their server once per second. Others wanted a one minute delay between requests.
With this setup, I never hit the same domain with multiple threads concurrently, and I never hit them more often than they requested. My default delay was something like 5 seconds. That seems like a lot, but my crawler was looking at millions of domains, so it was never wanting for stuff to crawl. You could probably reduce your default delay.
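A rough Python sketch of that structure follows: a queue of ready domains, a per-domain URL list, and a delay list kept as a heap ordered by expiration time. The class and method names are invented for illustration, the clock is passed in as `now` to keep the example deterministic, and the class is not thread-safe as written (a real crawler would guard it with a lock).

```python
import heapq
from collections import deque
from urllib.parse import urlparse


class DomainScheduler:
    def __init__(self, default_delay=5.0):
        self.default_delay = default_delay
        self.delays = {}      # per-domain overrides, e.g. from Crawl-delay
        self.urls = {}        # domain -> deque of pending URLs
        self.ready = deque()  # domains that may be crawled right now
        self.delayed = []     # heap of (expiration_time, domain)

    def add(self, url):
        domain = urlparse(url).netloc
        if domain not in self.urls:
            self.urls[domain] = deque()
            self.ready.append(domain)
        self.urls[domain].append(url)

    def next_url(self, now):
        """Return (domain, url) for an idle domain, or None if none is ready."""
        # Move domains whose delay has expired off the delay list
        while self.delayed and self.delayed[0][0] <= now:
            _, domain = heapq.heappop(self.delayed)
            if self.urls[domain]:
                self.ready.append(domain)
            else:
                del self.urls[domain]  # nothing left for this domain
        if not self.ready:
            return None
        domain = self.ready.popleft()
        return domain, self.urls[domain].popleft()

    def done(self, domain, now):
        """Call after crawling a URL: put the domain on the delay list."""
        delay = self.delays.get(domain, self.default_delay)
        heapq.heappush(self.delayed, (now + delay, domain))
```

Because a domain is either in `ready`, in `delayed`, or held by exactly one worker, two threads never hit the same domain at once, and each domain waits out its own delay between requests.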
If you don't want to queue your URLs by domain name, what you can do is maintain a list (perhaps a hash table or the python equivalent) that holds the domain names that are currently being crawled. When you dequeue a URL, you check the domain against the hash table, and put the URL back into the queue if the domain is currently in use. Something like:
    goodUrl = false
    while (!goodUrl)
        url = urlqueue.Dequeue();
        lock domainsInUse
            if domainsInUse.Contains(url.domainName)
                urlqueue.Add(url) // put it back at the end of the queue
            else
                domainsInUse.Add(url.domainName)
                goodUrl = true
That will work, although it's going to be a big CPU pig if the queue contains a lot of URLs from the same domain. For example if you have 20 threads and only 5 different domains represented in the queue, then on average 15 of your threads will be continually spinning, looking for a URL to crawl.
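In Python the "hash table" would be a `set` guarded by a `threading.Lock`. The sketch below is a variation on the pseudocode, not a literal translation: instead of looping until a URL is found, it makes at most one pass over the queue and returns `None`, so a caller can sleep briefly rather than spin. The names `acquire_url` and `release_url` are hypothetical.

```python
import queue
import threading
from urllib.parse import urlparse

url_queue = queue.Queue()
domains_in_use = set()
domains_lock = threading.Lock()


def acquire_url():
    """Return a URL whose domain is idle, or None if there is none right now."""
    for _ in range(url_queue.qsize()):  # at most one pass over the queue
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return None
        domain = urlparse(url).netloc
        with domains_lock:
            if domain not in domains_in_use:
                domains_in_use.add(domain)
                return url
        url_queue.put(url)  # domain busy: put it back at the end of the queue
    return None


def release_url(url):
    """Call when the request for url has finished."""
    with domains_lock:
        domains_in_use.discard(urlparse(url).netloc)
```

A worker thread would call `acquire_url()`, check the link, then call `release_url(url)`; when it gets `None` it can wait a moment before trying again.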
If you only want the status, make a HEAD request instead of using urlopen. This will considerably reduce the load on the server. And of course, limit the number of simultaneous requests.
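    import httplib
    from urlparse import urlparse

    def is_up(url):
        # Only the host and path are needed for the request
        _, host, path, _, _, _ = urlparse(url)
        conn = httplib.HTTPConnection(host)
        # HEAD returns the status line and headers only, not the body
        conn.request('HEAD', path or '/')
        return conn.getresponse().status < 400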
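On Python 3, `httplib` and `urlparse` became `http.client` and `urllib.parse`. A version of the same check for Python 3 might look like the sketch below; the HTTPS handling, the query string, the timeout and the decision to treat connection errors as "down" are additions that are not in the snippet above.

```python
import http.client
from urllib.parse import urlsplit


def is_up(url, timeout=10):
    """Return True if a HEAD request for url gets a status below 400."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn_class = http.client.HTTPSConnection
    else:
        conn_class = http.client.HTTPConnection
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn = conn_class(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status < 400
    except (OSError, http.client.HTTPException):
        # Assumption: an unreachable server counts as a broken link
        return False
    finally:
        conn.close()
```

Some servers reject HEAD (for example with 405), so a link checker may want to retry such URLs with a normal GET before reporting them as broken.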
I need to do some three-step web scraping in Python. I have a couple base pages that I scrape initially, and I need to get a few select links off those pages and retrieve the pages they point to, and repeat that one more time. The trick is I would like to do this all asynchronously, so that every request is fired off as soon as possible, and the whole application isn't blocked on a single request. How would I do this?
Up until this point, I've been doing one-step scraping with eventlet, like this:
    import eventlet
    from eventlet.green import urllib2

    urls = ['http://example.com', '...']

    def scrape_page(url):
        """Gets the data from the web page."""
        body = urllib2.urlopen(url).read()
        data = body  # placeholder: do something with body
        return data

    pool = eventlet.GreenPool()
    for data in pool.imap(scrape_page, urls):
        pass  # Handle the data...
However, if I extend this technique and include a nested GreenPool.imap loop, it blocks until all the requests in that group are done, meaning the application can't start more requests as needed.
I know I could do this with Twisted or another asynchronous server, but I don't need such a huge library and I would rather use something lightweight. I'm open to suggestions, though.
Here is an idea... but forgive me since I don't know eventlet. I can only provide a rough concept.
Consider your "step 1" pool the producers. Create a queue and have your step 1 workers place any new urls they find into the queue.
Create another pool of workers. Have these workers pull from the queue for urls and process them. If during their process they discover another url, put that into the queue. They will keep feeding themselves with subsequent work.
Technically this approach would make it easily recursive beyond 1,2,3+ steps. As long as they find new urls and put them in the queue, the work keeps happening.
Better yet, start out with the original urls in the queue, and just create a single pool that puts new urls to that same queue. Only one pool needed.
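The single-pool version of this idea can be sketched with the standard library's threads and `queue.Queue` standing in for eventlet's green threads and queue. `crawl` and `fetch_links` are hypothetical names: `fetch_links(url)` is assumed to download a page and return the links worth following, and `max_depth=3` stands for the three scraping steps.

```python
import queue
import threading


def crawl(start_urls, fetch_links, max_depth=3, workers=4):
    """Visit start_urls and the links found on them, up to max_depth steps."""
    todo = queue.Queue()
    seen = set(start_urls)
    seen_lock = threading.Lock()
    for url in start_urls:
        todo.put((url, 1))

    def worker():
        while True:
            item = todo.get()
            if item is None:  # sentinel: no more work
                return
            url, depth = item
            try:
                links = fetch_links(url)
            except Exception:
                links = []  # a failed page simply yields no new work
            try:
                if depth < max_depth:
                    for link in links:
                        with seen_lock:
                            is_new = link not in seen
                            seen.add(link)
                        if is_new:
                            todo.put((link, depth + 1))
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    todo.join()  # returns once every queued URL has been processed
    for _ in threads:
        todo.put(None)
    for t in threads:
        t.join()
    return seen
```

New URLs go on the queue as soon as a page has been parsed, so a slow request at one step never holds up the requests that are already possible at the next step.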
Post note
Funny enough, after I posted this answer and went to look for what the eventlet 'queue' equivalent was, I immediately found an example showing exactly what I just described:
http://eventlet.net/doc/examples.html#producer-consumer-web-crawler
In that example there is a producer and fetch method. The producer starts pulling urls from the queue and spawning threads to fetch. fetch then puts any new urls back into the queue and they keep feeding each other.