There is a website I'm scraping that will sometimes return a 200, but not have any text in response.body (raises an AttributeError when I try to parse it with Selector).
Is there a simple way to check to make sure the body includes text, and if not, retry the request until it does? Here is some pseudocode to outline what I'm trying to do.
def check_response(response):
    if response.body != '':
        return response
    else:
        return Request(copy_of_response.request,
                       callback=check_response)
Basically, is there a way I can repeat a request with the exact same properties (method, url, payload, cookies, etc.)?
Follow the EAFP principle:
Easier to ask for forgiveness than permission. This common Python
coding style assumes the existence of valid keys or attributes and
catches exceptions if the assumption proves false. This clean and fast
style is characterized by the presence of many try and except
statements. The technique contrasts with the LBYL style common to many
other languages such as C.
Handle an exception and yield a Request to the current url with dont_filter=True:
dont_filter (boolean) – indicates that this request should not be
filtered by the scheduler. This is used when you want to perform an
identical request multiple times, to ignore the duplicates filter. Use
it with care, or you will get into crawling loops. Default to False.
def parse(self, response):
    try:
        # parsing logic here
        ...
    except AttributeError:
        yield Request(response.url, callback=self.parse, dont_filter=True)
You can also make a copy of the current request (not tested):
new_request = response.request.copy()
new_request.dont_filter = True
yield new_request
Or, make a new request using replace():
new_request = response.request.replace(dont_filter=True)
yield new_request
How about calling the actual _retry() method from the retry middleware, so it acts as a normal retry with all its logic that takes settings into account?
In settings:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scraper.middlewares.retry.RetryMiddleware': 550,
}
Then your retry middleware could be like:
from scrapy.downloadermiddlewares.retry import RetryMiddleware as BaseRetryMiddleware


class RetryMiddleware(BaseRetryMiddleware):

    def process_response(self, request, response, spider):
        # inject retry method so the request can be retried on conditions
        # decided by the spider itself, even on 200 responses
        if not hasattr(spider, '_retry'):
            spider._retry = self._retry
        return super(RetryMiddleware, self).process_response(request, response, spider)
And then in your success response callback you can call, for example:
yield self._retry(response.request, ValueError, self)
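For illustration, here is a rough sketch of how a spider callback could use that injected _retry; the callback name, the emptiness check, and the 'empty body' reason string are illustrative, not part of the original answer:
def parse_item(self, response):
    if not response.body:
        # retry through the middleware's standard logic (RETRY_TIMES, stats, priority adjust)
        retry_request = self._retry(response.request, 'empty body', self)
        if retry_request:
            yield retry_request
        return
    # normal parsing logic here
    ...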
From Scrapy 2.5.0 there is a new method get_retry_request().
It's pretty easy; here is the example from the Scrapy docs:
from scrapy.downloadermiddlewares.retry import get_retry_request

def parse(self, response):
    if not response.text:
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='empty',
        )
        return new_request_or_none
In your existing code, you can simply pass dont_filter=True:
def check_response(response):
    if response.body != '':
        return response
    else:
        return Request(copy_of_response.request,
                       callback=check_response, dont_filter=True)
I've created a script using scrapy to fetch some fields from a webpage. The url of the landing page and the urls of inner pages get redirected very often, so I created a middleware to handle that redirection. However, when I came across this post, I could understand that I need to return request in process_request() after replacing the redirected url with the original one.
This meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302, 307, 429]} is always in place when the requests are sent from the spider.
As not all the requests are being redirected, I tried to replace the redirected urls within the _retry() method.
def process_request(self, request, spider):
    request.headers['User-Agent'] = self.ua.random

def process_exception(self, request, exception, spider):
    return self._retry(request, spider)

def _retry(self, request, spider):
    request.dont_filter = True
    if request.meta.get('redirect_urls'):
        redirect_url = request.meta['redirect_urls'][0]
        redirected = request.replace(url=redirect_url)
        redirected.dont_filter = True
        return redirected
    return request

def process_response(self, request, response, spider):
    if response.status in [301, 302, 307, 429]:
        return self._retry(request, spider)
    return response
Question: How can I send requests after replacing redirected url with original one using middleware?
Edit:
I'm putting this at the beginning of the answer because it's a quicker one-shot solution that might work for you.
Scrapy 2.5 introduced get_retry_request, which allows you to retry requests from a spider callback.
From the docs:
Returns a new Request object to retry the specified request, or None if retries of the specified request have been exhausted.
So you could do something like:
from scrapy.downloadermiddlewares.retry import get_retry_request

def parse(self, response):
    if response.status in [301, 302, 307, 429]:
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='tried to redirect',
            max_retry_times=10,
        )
        if new_request_or_none:
            yield new_request_or_none
        else:
            # exhausted all retries
            ...
But then again, you should make sure you only retry on 3xx status codes when the website throws them to indicate some non-permanent incident, like redirecting to a maintenance page. As for status 429, see my recommendation below about using a delay.
Edit 2:
On Twisted versions older than 21.7.0, the coroutine async_sleep implementation using deferLater probably won't work. Use this instead:
from twisted.internet import defer, reactor

async def async_sleep(delay, return_value=None):
    deferred = defer.Deferred()
    reactor.callLater(delay, deferred.callback, return_value)
    return await deferred
Original answer:
If I understood it correctly, you just want to retry the original request whenever a redirection occurs, right?
In that case, you can force a retry on requests that would otherwise be redirected, by using this RedirectMiddleware:
# middlewares.py
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware


class CustomRedirectMiddleware(RedirectMiddleware):
    """
    Modifies RedirectMiddleware to set response status to 503 on redirects.
    Make sure this appears in the DOWNLOADER_MIDDLEWARES setting with a lower
    priority (higher number) than RetryMiddleware (or whatever downloader
    middleware is responsible for retrying on status 503).
    """

    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307, 308):  # 429 is already in Scrapy's default retry list
            return response.replace(status=503)  # Now this response is RetryMiddleware's problem
        return super().process_response(request, response, spider)
However, retrying on every occurrence of these status codes may lead to other problems. So you might want to add some additional condition in the if, like checking the existence of some header that could indicate site maintenance or something like that.
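For illustration, here is a rough sketch of such a condition: only redirects whose Location header looks like a maintenance page are converted into retryable 503s, while everything else is redirected normally. The 'maintenance' substring check is an assumption about how the target site signals temporary downtime:
# middlewares.py (illustrative variant)
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware


class MaintenanceAwareRedirectMiddleware(RedirectMiddleware):

    def process_response(self, request, response, spider):
        location = response.headers.get(b'Location', b'').decode('latin-1')
        if response.status in (301, 302, 303, 307, 308) and 'maintenance' in location:
            # Only redirects to the maintenance page become retryable 503s;
            # all other redirects are handled normally by RedirectMiddleware.
            return response.replace(status=503)
        return super().process_response(request, response, spider)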
While we are at it, since you included status code 429 in your list, I assume you may be getting some "Too Many Requests" responses. You should probably make your spider wait some time before retrying on this specific case. That can be achieved with the following RetryMiddleware:
# middlewares.py
from twisted.internet import task, reactor
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


async def async_sleep(delay, callable=None, *args, **kw):
    return await task.deferLater(reactor, delay, callable, *args, **kw)


class TooManyRequestsRetryMiddleware(RetryMiddleware):
    """
    Modifies RetryMiddleware to delay retries on status 429.
    """

    DEFAULT_DELAY = 10  # Delay in seconds. Tune this to your needs
    MAX_DELAY = 60  # Sometimes, RETRY-AFTER has absurd values

    async def process_response(self, request, response, spider):
        """
        Like RetryMiddleware.process_response, but, if response status is 429,
        retry the request only after waiting at most self.MAX_DELAY seconds.
        Respect the Retry-After header if it's less than self.MAX_DELAY.
        If Retry-After is absent/invalid, wait only self.DEFAULT_DELAY seconds.
        """
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            if response.status == 429:
                retry_after = response.headers.get('retry-after')
                try:
                    retry_after = int(retry_after)
                except (ValueError, TypeError):
                    delay = self.DEFAULT_DELAY
                else:
                    delay = min(self.MAX_DELAY, retry_after)
                spider.logger.info(f'Retrying {request} in {delay} seconds.')
                spider.crawler.engine.pause()
                await async_sleep(delay)
                spider.crawler.engine.unpause()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response
Don't forget to tell Scrapy to use these middlewares by editing DOWNLOADER_MIDDLEWARES in your project's settings.py:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project_name.middlewares.TooManyRequestsRetryMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'your_project_name.middlewares.CustomRedirectMiddleware': 600,
}
I've created a script using scrapy to parse the content from a website. The script is doing fine. However, I want that spider to retry when the url being used in the spider gets redirected (leading to some captcha page) and which is why I created a retry middleware.
I tried to understand why the or response part is in place within process_response(), in the line return self._retry(request, reason, spider) or response, as I want this very method to retry, not to return the response within that block.
This is my current approach:
def _retry(self, request, spider):
    check_url = request.url
    r = request.copy()
    r.dont_filter = True
    return r

def process_response(self, request, response, spider):
    if ("some_redirected_url" in response.url) and (response.status in RETRY_HTTP_CODES):
        return self._retry(request, spider) or response
    return response
In this case return x or y is a nice little short cut for
if x:
    return x
else:
    return y
In the standard RetryMiddleware the _retry method has two branches
if retries <= retry_times:
    ...
    return retryreq
else:
    ...
The else branch doesn't return anything, and if the method reaches the end without returning then None is returned implicitly. This means that the
return self._retry(request, reason, spider) or response
line evaluates to
return None or response
and as bool(None) is False, response will be returned in this case. If on the other hand the retry_times hasn't been exceeded, _retry will return retryreq which will evaluate True and that will be returned from process_response instead.
In your code _retry always returns a Response and so the or response part will never be reached.
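For illustration, here is a rough sketch of how the asker's _retry could be adjusted to return None after a few attempts, so that the `or response` fallback actually kicks in; the meta key and the limit of 3 are arbitrary choices for the sketch:
def _retry(self, request, spider):
    # give up after 3 attempts so None is returned and
    # `or response` in process_response falls back to the response
    retries = request.meta.get('custom_retry_times', 0) + 1
    if retries > 3:
        return None
    r = request.copy()
    r.meta['custom_retry_times'] = retries
    r.dont_filter = True
    return r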
@tomjn has your middleware question covered, but as an alternative approach to retrying on those 302 responses, you could simply tell Scrapy to stop redirecting 302s and also add them to the list of codes that trigger the RetryMiddleware. E.g.:
from scrapy.utils.project import get_project_settings

RETRY_HTTP_CODES = get_project_settings().get("RETRY_HTTP_CODES", [])


class MySpider(CrawlSpider):
    # ...

    # do not redirect on this one
    handle_httpstatus_list = [302]

    # Add "302" to the retry codes list
    custom_settings = {"RETRY_HTTP_CODES": RETRY_HTTP_CODES + [302]}
Thus you wouldn't need to have a custom middleware for that.
I am using scrapy 1.1 to scrape a website. The site requires periodic relogin. I can tell when this is needed because a 302 redirection occurs when login is required. Based on http://sangaline.com/post/advanced-web-scraping-tutorial/, I have subclassed the RedirectMiddleware, making the Location HTTP header available in the spider under:
request.meta['redirect_urls']
My problem is that after logging in, I have set up a function to loop through 100 pages to scrape. Let's say after 15 pages I see that I have to log back in (based on the contents of request.meta['redirect_urls']). My code looks like:
def test1(self, response):
    ......
    for row in empties:  # 100 records
        d = object_as_dict(row)
        AA
        yield Request(url=myurl, headers=self.headers, callback=self.parse_lookup,
                      meta={'d': d}, dont_filter=True)

def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        print str(response.meta['redirect_urls'])
        BB
    d = response.meta['d']
So as you can see, I get 'notified' of the need to relogin in parse_lookup at BB, but need to feed this information back to cancel the loop creating requests in test1 (AA). How can I make the information in parse_lookup available in the prior callback function?
Why not use a DownloaderMiddleware?
You could write a DownloaderMiddleware like so:
Edit: I have edited the original code to address a second problem the OP had in the comments.
from scrapy.http import Request


class CustomMiddleware:

    def process_response(self, request, response, spider):
        if 'redirect_urls' in response.meta:
            # assuming your spider has a method for handling the login
            original_url = response.meta["redirect_urls"][0]
            return Request(url="login_url",
                           callback=spider.login,
                           meta={"original_url": original_url})
        return response
So you "intercept" the response before it goes to the parse_lookup and relogin/fix what is wrong and yield new requests...
Like Tomáš Linhart said the requests are asynchronous so I don't know if you could run into problems by "reloging in" several times in a row, as multiple requests might be redirected at the same time.
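For illustration, here is a hypothetical sketch of the login method the middleware above assumes the spider has (plus a follow-up callback). The form field names, the credentials, and the after_relogin name are placeholders; these would be methods on your spider:
def login(self, response):
    return FormRequest.from_response(
        response,
        formdata={'username': 'user', 'password': 'pass'},
        callback=self.after_relogin,
        meta={'original_url': response.meta['original_url']},
    )

def after_relogin(self, response):
    # once logged back in, re-request the page that originally redirected
    yield Request(
        url=response.meta['original_url'],
        callback=self.parse_lookup,
        dont_filter=True,
    )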
Remember to add the middleware to your settings:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 542,
    'myproject.middlewares.CustomMiddleware': 543,
}
You can't achieve what you want because Scrapy uses asynchronous processing.
In theory you could use the approach partially suggested in a comment by @Paulo Scardine, i.e. raise an exception in parse_lookup. For it to be useful, you would then have to write your own spider middleware and handle this exception in its process_spider_exception method to log back in and retry the failed requests.
But I think a better and simpler approach would be to do the same once you detect the need to log in, i.e. in parse_lookup. I'm not sure exactly how CONCURRENT_REQUESTS_PER_DOMAIN works, but setting it to 1 might let you process one request at a time, so there should be no failing requests as you always log back in when you need to.
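For reference, a minimal settings sketch of that idea (the value of 1 is the point of the suggestion; everything else is standard Scrapy settings):
# settings.py -- process at most one request per domain at a time,
# so a needed re-login can finish before the next page is fetched
CONCURRENT_REQUESTS_PER_DOMAIN = 1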
Don't iterate over the 100 items and create requests for all of them. Instead, just create a request for the first item, process it in your callback function, yield the item, and only after that's done create the request for the second item and yield it. With this approach, you can check for the location header in your callback and either make the request for the next item or login and repeat the current item request.
For example:
def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        # It's a redirect
        yield Request(url=your_login_url, callback=self.parse_login_response,
                      meta={'current_item_url': response.request.url})
    else:
        # It's a normal response
        item = YourItem()
        ...  # Extract your item fields from the response
        yield item
        next_item_url = ...  # Extract the next page URL from the response
        yield Request(url=next_item_url, callback=self.parse_lookup)
This assumes that you can get the next item URL from the current item page, otherwise just put the list of URLs in the first request's META dict and pass it along.
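For illustration, here is a rough sketch of that meta-based alternative: carry the pending URLs along in meta and pop the next one after each item is done. start_requests, the example URLs, and the 'pending_urls' key are illustrative names, and Request/YourItem are assumed to be imported as in the snippet above:
def start_requests(self):
    pending = ['https://example.com/item/1', 'https://example.com/item/2']
    yield Request(pending[0], callback=self.parse_lookup,
                  meta={'pending_urls': pending[1:]})

def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        # redirected: handle the login as in the snippet above
        ...
        return
    item = YourItem()
    # ... extract item fields from the response ...
    yield item
    pending = response.meta['pending_urls']
    if pending:
        yield Request(pending[0], callback=self.parse_lookup,
                      meta={'pending_urls': pending[1:]})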
I think it would be better not to fire all 100 requests at once; instead, you should try to "serialize" the requests. For example, you could add all your empties to the request's meta and pop them out as necessary, or keep the empties as a field of your spider.
Another alternative would be to use the scrapy-inline-requests package to accomplish what you want, but you should probably extend your middleware to perform the login.
My code is included below and is really not much more than a slightly tweaked version of the example lifted from Scrapy's documentation. The code works as-is, but there is a gap in the logic I am not understanding between the login and how the request is passed through subsequent requests.
According to the documentation, a request object returns a response object. This response object is passed as the first argument to a callback function. This I get. This is the way authentication can be handled and subsequent requests made using the user credentials.
What I am not understanding is how the response object makes it to the next request call following authentication. In my code below, the parse method returns a result object created when authenticating using the FormRequest method. Since the FormRequest has a callback to the after_login method, the after_login method is called with the response from the FormRequest as the first parameter.
The after_login method checks to make sure there are no errors, then makes another request through a yield statement. What I do not understand is how the response passed in as an argument to the after_login method is making it to the Request following the yield. How does this happen?
The primary reason why I am interested is I need to make two requests per iterated value in the after_login method, and I cannot figure out how the responses are being handled by the scraper to then understand how to modify the code. Thank you in advance for your time and explanations.
# import Scrapy modules
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy import log

# import custom item from item module
from scrapy_spage.items import ReachItem


class AwSpider(BaseSpider):
    name = 'spage'
    allowed_domains = ['webpage.org']
    start_urls = ('https://www.webpage.org/',)

    def parse(self, response):
        credentials = {'username': 'user',
                       'password': 'pass'}
        return [FormRequest.from_response(response,
                                          formdata=credentials,
                                          callback=self.after_login)]

    def after_login(self, response):
        # check to ensure login succeeded
        if 'Login failed' in response.body:
            # log error
            self.log('Login failed', level=log.ERROR)
            # exit method
            return
        else:
            # for every integer from one to 5000, 1100 to 1110 for testing...
            for reach_id in xrange(1100, 1110):
                # call make requests, use format to create four digit string for each reach
                yield Request('https://www.webpage.org/content/River/detail/id/{0:0>4}/'.format(reach_id),
                              callback=self.scrape_page)

    def scrape_page(self, response):
        # create selector object instance to parse response
        sel = Selector(response)
        # create item object instance
        reach_item = ReachItem()
        # get attribute
        reach_item['attribute'] = sel.xpath('//body/text()').extract()
        # other selectors...
        # return the reach item
        return reach_item
how the response passed in as an argument to the after_login method is making it to the Request following the yield.
if I understand your question, the answer is that it doesn't
the mechanism is simple:
for x in spider.function():
    if x is a request:
        http call this request and wait for a response asynchronously
    if x is an item:
        send it to pipelines etc...
upon getting a response:
    request.callback(response)
as you can see, there is no limit to the number of requests the function can yield so you can:
for reach_id in xrange(x, y):
    yield Request(url=url1, callback=callback1)
    yield Request(url=url2, callback=callback2)
hope this helps
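Since the asker mentioned needing two requests per iterated value, here is a rough sketch of how after_login could yield both; the summary URL pattern and the scrape_detail callback are hypothetical, only the detail URL comes from the question's code:
def after_login(self, response):
    if 'Login failed' in response.body:
        self.log('Login failed', level=log.ERROR)
        return
    for reach_id in xrange(1100, 1110):
        # first request per reach, as in the original code
        yield Request('https://www.webpage.org/content/River/detail/id/{0:0>4}/'.format(reach_id),
                      callback=self.scrape_page)
        # second request per reach (URL pattern and callback are placeholders)
        yield Request('https://www.webpage.org/content/River/summary/id/{0:0>4}/'.format(reach_id),
                      callback=self.scrape_detail)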
I am trying to define a custom downloader middleware in Scrapy to ignore all requests to a particular URL (these requests are redirected from other URLs, so I can't filter them out when I generate the requests in the first place).
I have the following code, the idea of which is to catch this at the response processing stage (as I'm not exactly sure how requests redirecting to other requests works), check the URL, and, if it matches the one I'm trying to filter out, return an IgnoreRequest exception; if not, return the response as usual so that it can continue to be processed.
from scrapy.exceptions import IgnoreRequest
from scrapy import log


class CustomDownloaderMiddleware:

    def process_response(request, response, spider):
        log.msg("In Middleware " + response.url, level=log.WARNING)
        if response.url == "http://www.achurchnearyou.com//":
            return IgnoreRequest()
        else:
            return response
and I add this to the dict of middlewares:
DOWNLOADER_MIDDLEWARES = {
    'acny.middlewares.CustomDownloaderMiddleware': 650,
}
with a value of 650, which should - I think - make it run directly after the RedirectMiddleware.
However, when I run the crawler, I get an error saying:
ERROR: Error downloading <GET http://www.achurchnearyou.com/venue.php?V=00001>: process_response() got multiple values for keyword argument 'request'
This error is occurring on the very first page crawled, and I can't work out why it is occurring - I think I've followed what the manual said to do. What am I doing wrong?
I've found the solution to my own problem - it was a silly mistake with creating the class and method in Python. The code above needs to be:
from scrapy.exceptions import IgnoreRequest
from scrapy import log


class CustomDownloaderMiddleware(object):

    def process_response(self, request, response, spider):
        log.msg("In Middleware " + response.url, level=log.WARNING)
        if response.url == "http://www.achurchnearyou.com//":
            raise IgnoreRequest()
        else:
            return response
That is, there needs to be a self parameter for the method as the first parameter, and the class needs to inherit from object.
If you know which requests are redirected to the problematic ones, how about something like:
def parse_requests(self, response):
    ....
    meta = {'handle_httpstatus_list': [301, 302]}
    callback = 'process_redirects'
    yield Request(url, callback=callback, meta=meta, ...)

def process_redirects(self, response):
    url = response.headers['location']
    if url is no good:
        return
    else:
        ...
This way you avoid downloading useless responses.
And you can always define your own custom redirect middleware.
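For completeness, here is a rough sketch of that custom redirect middleware option: let Scrapy's RedirectMiddleware do its normal work, then drop the request if the redirect target is the unwanted URL. The class name is illustrative and the import paths assume a reasonably recent Scrapy; the URL is the one from the question:
from scrapy.http import Request
from scrapy.exceptions import IgnoreRequest
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware


class DropRedirectMiddleware(RedirectMiddleware):

    def process_response(self, request, response, spider):
        result = super().process_response(request, response, spider)
        # RedirectMiddleware returns a new Request when it decides to follow a redirect
        if isinstance(result, Request) and result.url == "http://www.achurchnearyou.com//":
            raise IgnoreRequest()
        return result
It would replace the built-in RedirectMiddleware in DOWNLOADER_MIDDLEWARES, the same way as in the settings snippets above.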