I'm currently running a Python script against multiple web servers. The general task is to find broken (external) links within a CMS. The script runs pretty well so far, but since I test around 50 internal projects, each with several hundred sub-pages, this ends up being several thousand external links I have to check.
For that reason I added multi-threading, which improves performance as I hoped. But here comes the problem: if a page to check contains a list of links to the same server (a bundle of known issues or tasks to do), it will slow down the destination system. I would like to slow down neither my own servers nor servers that are not mine.
Currently I run up to 20 threads and then wait 0.5s until a "thread position" is free again. To check whether a URL is broken I use urlopen(request) from urllib2 and log every time it throws an HTTPError. Back to the list of multiple URLs pointing to the same server: because of the multi-threading, my script will "flood" that web server with up to 20 simultaneous requests.
Just so you have an idea of the dimensions this script runs at / how many URLs have to be checked: using only 20 threads, the current script already takes 45 minutes for only 4 projects. And this is only checking .. The next step will be to check broken URLs for . Server monitoring shows some peaks of 1000ms response time while the current script is running.
Does anyone have an idea how to improve this script in general? Or is there a much better way to check this large number of URLs? Maybe a counter that pauses a thread if there are already 10 requests to a single destination?
Thanks for all suggestions
When I was running a crawler, I had all of my URLs prioritized by domain name. Basically, my queue of URLs to crawl was really a queue of domain names, and each domain name had a list of URLs.
When it came time to get the next URL to crawl, a thread would pull a domain name from the queue and crawl the next URL on that domain's list. When done processing that URL, the thread would put the domain on a delay list and remove from the delay list any domains whose delay had expired.
The delay list was a priority queue ordered by expiration time. That way I could give different delay times to each domain. That allowed me to support the crawl-delay extension to robots.txt. Some domains were ok with me hitting their server once per second. Others wanted a one minute delay between requests.
With this setup, I never hit the same domain with multiple threads concurrently, and I never hit them more often than they requested. My default delay was something like 5 seconds. That seems like a lot, but my crawler was looking at millions of domains, so it was never wanting for stuff to crawl. You could probably reduce your default delay.
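A minimal sketch of that delay-list bookkeeping in Python (the names are illustrative, not my original crawler's code):

import heapq
import time

ready = []      # FIFO of domains whose next URL may be fetched
delayed = []    # heap of (wake_time, domain) pairs, soonest first

def park_domain(domain, crawl_delay):
    # Put a domain on the delay list until its per-domain delay expires.
    heapq.heappush(delayed, (time.time() + crawl_delay, domain))

def release_expired():
    # Move domains whose delay has expired back onto the ready queue.
    now = time.time()
    while delayed and delayed[0][0] <= now:
        _, domain = heapq.heappop(delayed)
        ready.append(domain)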
If you don't want to queue your URLs by domain name, what you can do is maintain a list (perhaps a hash table or the python equivalent) that holds the domain names that are currently being crawled. When you dequeue a URL, you check the domain against the hash table, and put the URL back into the queue if the domain is currently in use. Something like:
# Assumes `urlqueue` is a Queue of URL records with a `domain_name` attribute,
# `domains_in_use` is a set shared by all workers, and `lock` is a threading.Lock.
good_url = False
while not good_url:
    url = urlqueue.get()
    with lock:
        if url.domain_name in domains_in_use:
            urlqueue.put(url)  # put it back at the end of the queue
        else:
            domains_in_use.add(url.domain_name)
            good_url = True
That will work, although it's going to be a big CPU pig if the queue contains a lot of URLs from the same domain. For example if you have 20 threads and only 5 different domains represented in the queue, then on average 15 of your threads will be continually spinning, looking for a URL to crawl.
If you only want the status, make a HEAD request instead of using urlopen. This will considerably reduce the load on the server. And of course limit the number of simultaneous requests.
import httplib
from urlparse import urlparse

def is_up(url):
    scheme, host, path, _, _, _ = urlparse(url)
    # Use an HTTPS connection when the URL calls for it; default an empty path to "/"
    conn_class = httplib.HTTPSConnection if scheme == 'https' else httplib.HTTPConnection
    conn = conn_class(host)
    conn.request('HEAD', path or '/')
    return conn.getresponse().status < 400
I have a function that is meant to check if a specific HTTP(S) URL is a redirect and if so return the new location (but not recursively). It uses the requests library. It looks like this:
import requests

def resolve_redirect(sent_url, http_session):  # function name is illustrative; the original only showed the body
    try:
        response = http_session.head(sent_url, timeout=(1, 1))
        if response.is_redirect:
            return response.headers["location"]
        return sent_url
    except requests.exceptions.Timeout:
        return sent_url
Here, the URL I am checking is sent_url. For reference, this is how I create the session:
http_session = requests.Session()
http_adapter = requests.adapters.HTTPAdapter(max_retries=0)
http_session.mount("http://", http_adapter)
http_session.mount("https://", http_adapter)
However, one of the requirements of this program is that it must work for dead links. Based on this, I set a connection timeout (and a read timeout for good measure). After playing around with the values, it still takes about 5-10 seconds for the request to fail with this stack trace, no matter what value I choose. (Maybe relevant: in the browser, it gives DNS_PROBE_POSSIBLE.)
Now, my problem is: 5-10 seconds is too long to wait to find out that a link is dead. There are many links that this program needs to check, and I do not want a few dead links to be such a large bottleneck, hence I want to configure this DNS lookup timeout.
I found this post, which seems to be relevant (the OP wants to increase the timeout, I want to decrease it), but the solution does not seem applicable: I do not know the IP addresses that these URLs point to. In addition, this feature request from years ago seems relevant, but it did not help me further.
So far, the best solution seems to be to spin up a coroutine for each link (or batch of links) and absorb the timeout asynchronously.
I am on Windows 10, however this code will be deployed on an Ubuntu server. Both use Python 3.8.
So, how can I best give my HTTP requests a very low DNS resolution timeout in the case that it is being fed a dead link?
So, how can I best give my HTTP requests a very low DNS resolution timeout in the case that it is being fed a dead link?
Separate things.
Use urllib.parse to extract the hostname from the URL, and then use dnspython to resolve that name, with whatever timeout you want.
Then, and only if the resolution was correct, fire up requests to grab the HTTP data.
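A minimal sketch of that two-step approach with dnspython (the helper name and the 0.5s lifetime are assumptions, not part of the original code):

from urllib.parse import urlparse

import dns.exception
import dns.resolver   # dnspython
import requests

def head_if_resolvable(url, dns_timeout=0.5):
    # Step 1: resolve the hostname ourselves, with a short timeout.
    host = urlparse(url).hostname
    resolver = dns.resolver.Resolver()
    resolver.lifetime = dns_timeout   # total time allowed for the resolution
    try:
        resolver.resolve(host, "A")
    except dns.exception.DNSException:
        return None                   # treat as a dead link
    # Step 2: only now issue the actual HTTP request.
    return requests.head(url, timeout=(1, 1))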
@blurfus: in requests, you can only use the timeout parameter on the HTTP call itself; you can't attach it to a session. It is not spelled out explicitly in the documentation, but the code is quite clear on that.
There are many links that this program needs to check,
That is in fact a completely separate problem, and it exists even if all the links are fine; it is just a problem of volume.
The typical solutions fall into two categories:
use asynchronous libraries (they exist for both DNS and HTTP), where your calls do not block: you get the data later, so you are able to do something else in the meantime
use multiprocessing or multithreading to parallelize things and have multiple URLs being tested at the same time by separate instances of your code.
They are not completely mutually exclusive, and you can find a lot of pros and cons for each. Asynchronous code may be more complicated to write and to understand later, so multiprocessing/multithreading is often the first step for a "quick win" (especially if you do not need to share anything between the processes/threads, otherwise it quickly becomes a problem), yet handling everything asynchronously makes the code scale more nicely with the volume.
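For illustration, a rough asynchronous sketch using asyncio with the third-party aiohttp library (one possible choice; the URLs and timeout values are placeholders):

import asyncio
import aiohttp

async def check(session, url):
    try:
        async with session.head(url, timeout=aiohttp.ClientTimeout(total=2)) as resp:
            return url, resp.status
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return url, None   # unreachable or dead

async def check_all(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(check(session, u) for u in urls))

results = asyncio.run(check_all(["https://example.com", "https://this-domain-is-dead.example"]))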
Say I am retrieving a list of URLs from a server using the urllib2 library in Python. I noticed that it takes about 5 seconds to get one page, so it would take a long time to finish all the pages I want to collect.
I am thinking about those 5 seconds: most of that time is spent on the server side, so I am wondering whether I could just start using the threading library. Say 5 threads in this case; then the average time could be dramatically reduced, maybe to 1 or 2 seconds per page (it might make the server a bit busy). How could I choose the number of threads so I get a reasonable speed without pushing the server too hard?
Thanks!
Updated:
I increased the number of threads one by one and monitored the total time (in minutes) spent scraping 100 URLs. It turned out that the total time drops dramatically when you go from 1 to 2 threads and keeps decreasing as you add more threads, but the improvement from threading becomes less and less noticeable. (The total time even bounces back up when you create too many threads.)
I know this is only one specific case for the web server I harvest, but I decided to share it just to show the power of threading, and I hope it will be helpful for somebody one day.
There are a few things you can do. If the URLs are on different domains, then you might just fan out the work to threads, each downloading a page from a different domain.
If your URLs all point to the same server and you do not want to stress it, then you can just retrieve the URLs sequentially. If the server is happy with a couple of parallel requests, you can look into pools of workers. You could start, say, a pool of four workers and add all your URLs to a queue, from which the workers pull new URLs.
Since you also tagged the question "screen-scraping": scrapy is a dedicated scraping framework that can work in parallel.
Python 3 comes with a set of new builtin concurrency primitives under concurrent.futures.
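For example, a sketch of a four-worker pool using concurrent.futures (the fetch function and the URL list are placeholders):

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    return url, urlopen(url).read()

urls = ['http://example.com/page1', 'http://example.com/page2']  # placeholder URLs

# Four workers pull URLs from the executor's internal queue.
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, body in pool.map(fetch, urls):
        print(url, len(body), 'bytes')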
Here is a caveat. I have encountered a number of servers powered by somewhat "elderly" releases of IIS. They often will not service a request if there is not a one second delay between requests.
I would like to know how to optimize a link checker which I've implemented as a web service in Python. I already cache responses in a database, and the cache expires every 24 hours. Every night I have a cron job that refreshes the cache, so the cache is effectively never out of date.
Things slow down, of course, if I truncate the cache, or if a page being checked has a lot of links that are not in the cache. I am no computer scientist, so I'd like some advice and some concrete help on how to optimize this using either threads or processes.
I thought of optimizing by requesting each url as a background process (pseudocodish):
# The part of the code that gets response codes not in the cache...
from subprocess import Popen, PIPE

responses = {}

# To begin, create a dict mapping each url to a background curl process
processes = {}
for url in urls:
    # -s: silent, -o /dev/null: discard the body, -w '%{http_code}': print only the status code
    processes[url] = Popen(['curl', '-s', '-o', '/dev/null', '-w', '%{http_code}', url],
                           stdout=PIPE)

# Now loop through again and collect the responses
for url in processes:
    stdout, _ = processes[url].communicate()
    responses[url] = stdout.strip()

# Now I have a responses dict with the response codes keyed by url
This cuts down the time of the script by at least 1/6 for most of my use-cases as opposed to just looping through each url and waiting for each response, but I am concerned about overloading the server the script is running on. I considered using a queue and having batches of maybe 25 or so at a time.
Would multithreading be a better overall solution? If so, how would I do this using the threading module?
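A rough sketch of the queue idea with a fixed pool of worker threads (the pool size of 25 and the check_url helper are placeholders, not part of my current code):

import threading
from queue import Queue

def worker(q, responses):
    while True:
        url = q.get()
        try:
            responses[url] = check_url(url)   # check_url() is assumed to exist
        finally:
            q.task_done()

q = Queue()
responses = {}
for _ in range(25):
    t = threading.Thread(target=worker, args=(q, responses))
    t.daemon = True
    t.start()

for url in urls:
    q.put(url)
q.join()   # block until every queued URL has been processed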
I am new to python, and even newer to twisted. I am trying to use twisted to download a few hundred thousand files but am having trouble trying to add an errback. I'd like to print the bad url if the download fails. I've misspelled one of my urls on purpose in order to throw an error. However, the code I have just hangs and python doesn't finish (it finishes fine if I remove the errback call).
Also, how do I process each file individually? From my understanding, "finish" is called when everything completes. I'd like to gzip each file when it's downloaded so that it's removed from memory.
Here's what I have:
from twisted.internet import defer, reactor
from twisted.web import client

urls = [
'http://www.python.org',
'http://stackfsdfsdfdsoverflow.com', # misspelled on purpose to generate an error
'http://www.twistedmatrix.com',
'http://www.google.com',
'http://launchpad.net',
'http://github.com',
'http://bitbucket.org',
]
def finish(results):
for result in results:
print 'GOT PAGE', len(result), 'bytes'
reactor.stop()
def print_badurls(err):
print err # how do I just print the bad url????????
waiting = [client.getPage(url) for url in urls]
defer.gatherResults(waiting).addCallback(finish).addErrback(print_badurls)
reactor.run()
Welcome to Python and Twisted!
There are a few problems with the code you pasted. I'll go through them one at a time.
First, if you do want to download thousands of urls, and will have thousands of items in the urls list, then this line:
waiting = [client.getPage(url) for url in urls]
is going to cause problems. Do you want to try to download every page in the list simultaneously? By default, in general, things you do in Twisted happen concurrently, so this loop starts downloading every URL in the urls list at once. Most likely, this isn't going to work. Your DNS server is going to drop some of the domain lookup requests, your DNS client is going to drop some of the domain lookup responses. The TCP connection attempts to whatever addresses you do get back will compete for whatever network resources are still available, and some of them will time out. The rest of the connections will all trickle along, sharing available bandwidth between dozens or perhaps hundreds of different downloads.
Instead, you probably want to limit the degree of concurrency to perhaps 10 or 20 downloads at a time. I wrote about one approach to this on my blog a while back.
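One straightforward way to cap the concurrency (a sketch, not necessarily the approach from that blog post) is a DeferredSemaphore, reusing the urls list from your code:

from twisted.internet import defer
from twisted.web import client

# Allow at most 10 getPage calls to be outstanding at any time.
semaphore = defer.DeferredSemaphore(10)
waiting = [semaphore.run(client.getPage, url) for url in urls]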
Second, gatherResults returns a Deferred that fires as soon as any one Deferred passed to it fires with a failure. So as soon as any one client.getPage(url) fails - perhaps because of one of the problems I mentioned above, or perhaps because the domain has expired, or the web server happens to be down, or just because of an unfortunate transient network condition - the Deferred returned by gatherResults will fail. finish will be skipped and print_badurls will be called with the error describing the single failed getPage call.
To handle failures from individual HTTP requests, add the callbacks and errbacks to the Deferreds returned from the getPage calls. After adding those callbacks and errbacks, you can use defer.gatherResults to wait for all of the downloads and processing of the download results to be complete.
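A sketch of what that could look like, building on the code from your question (the handler names are illustrative):

def on_page(page, url):
    print 'GOT PAGE', url, len(page), 'bytes'

def on_error(err, url):
    print 'FAILED', url, err.getErrorMessage()

waiting = []
for url in urls:
    d = client.getPage(url)
    d.addCallbacks(on_page, on_error, callbackArgs=(url,), errbackArgs=(url,))
    waiting.append(d)

# Fires only after every download has either succeeded or been handled by on_error.
defer.gatherResults(waiting).addCallback(lambda ignored: reactor.stop())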
Third, you might want to consider using a higher-level tool for this - scrapy is a web crawling framework (based on Twisted) that provides lots of cool useful helpers for this kind of application.
I need to do some three-step web scraping in Python. I have a couple base pages that I scrape initially, and I need to get a few select links off those pages and retrieve the pages they point to, and repeat that one more time. The trick is I would like to do this all asynchronously, so that every request is fired off as soon as possible, and the whole application isn't blocked on a single request. How would I do this?
Up until this point, I've been doing one-step scraping with eventlet, like this:
import eventlet
from eventlet.green import urllib2

urls = ['http://example.com', '...']

def scrape_page(url):
    """Gets the data from the web page."""
    body = urllib2.urlopen(url).read()
    # Do something with body
    return data

pool = eventlet.GreenPool()
for data in pool.imap(scrape_page, urls):
    # Handle the data...
    pass
However, if I extend this technique and include a nested GreenPool.imap loop, it blocks until all the requests in that group are done, meaning the application can't start more requests as needed.
I know I could do this with Twisted or another asynchronous server, but I don't need such a huge library and I would rather use something lightweight. I'm open to suggestions, though.
Here is an idea... but forgive me since I don't know eventlet. I can only provide a rough concept.
Consider your "step 1" pool the producers. Create a queue and have your step 1 workers place any new urls they find into the queue.
Create another pool of workers. Have these workers pull from the queue for urls and process them. If during their process they discover another url, put that into the queue. They will keep feeding themselves with subsequent work.
Technically this approach would make it easily recursive beyond 1,2,3+ steps. As long as they find new urls and put them in the queue, the work keeps happening.
Better yet, start out with the original urls in the queue, and just create a single pool that puts new urls to that same queue. Only one pool needed.
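A rough sketch of that single-pool idea in eventlet (extract_links and start_urls are assumed to exist; this is a concept, not tested code):

import eventlet
from eventlet.green import urllib2

queue = eventlet.Queue()
pool = eventlet.GreenPool(10)
seen = set()

def fetch(url):
    body = urllib2.urlopen(url).read()
    for link in extract_links(body):      # extract_links() is assumed to exist
        if link not in seen:
            seen.add(link)
            queue.put(link)               # workers feed the same queue they are fed from

for url in start_urls:                    # start_urls is assumed to exist
    seen.add(url)
    queue.put(url)

# Keep dispatching until the queue is drained and no worker is still running.
while not queue.empty() or pool.running():
    try:
        url = queue.get(timeout=1)
    except eventlet.queue.Empty:
        continue
    pool.spawn_n(fetch, url)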
Post note
Funny enough, after I posted this answer and went to look for what the eventlet 'queue' equivalent was, I immediately found an example showing exactly what I just described:
http://eventlet.net/doc/examples.html#producer-consumer-web-crawler
In that example there are a producer and a fetch method. The producer pulls urls from the queue and spawns green threads to fetch them; fetch then puts any new urls back into the queue, and they keep feeding each other.