I'm new to Scrapy and I need to pause a spider after it receives an error response (like 407 or 429).
Also, I want to do this without using time.sleep(), relying on middlewares or extensions instead.
Here is my middleware:
from scrapy import signals
from pydispatch import dispatcher

class Handle429:

    def __init__(self):
        dispatcher.connect(self.item_scraped, signal=signals.item_scraped)

    def item_scraped(self, item, spider, response):
        if response.status == 429:
            print("THIS IS 429 RESPONSE")
            #
            # here stop spider for 10 minutes and then continue
            #
I read about self.crawler.engine.pause(), but how can I implement it in my middleware, and how can I set a custom pause time?
Or is there another way to do this? Thanks.
I have solved my problem. First of all, a middleware can implement default methods such as process_response or process_request.
In settings.py:
HTTPERROR_ALLOWED_CODES = [404]
Then I changed my middleware class:
from twisted.internet import reactor
from twisted.internet.defer import Deferred

# replaces class Handle429
class HandleErrorResponse:

    def __init__(self):
        self.time_pause = 1800

    def process_response(self, request, response, spider):
        # this method is called for every response before it reaches the spider
        pass
Then I found code that helps me pause the spider without time.sleep():
# in HandleErrorResponse
def process_response(self, request, response, spider):
    print(response.status)
    if response.status == 404:
        d = Deferred()
        # schedule the response to be delivered after time_pause seconds
        reactor.callLater(self.time_pause, d.callback, response)
        return d  # returning the Deferred is what actually delays the response
    return response
And it works.
I can't fully explain reactor.callLater() in detail, but as I understand it, it schedules the callback to run later on Scrapy's Twisted event loop without blocking it, and the response is only handed to the spider once the Deferred fires.
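For illustration, here is a minimal, self-contained Twisted sketch (independent of Scrapy, with made-up function names) showing that reactor.callLater() only schedules a function to run later and does not block the event loop:

from twisted.internet import reactor

def delayed_hello():
    # runs roughly 3 seconds after being scheduled
    print("hello after the delay")
    reactor.stop()

print("scheduling the callback")
reactor.callLater(3, delayed_hello)  # returns immediately; nothing blocks here
print("the reactor has not even started yet")
reactor.run()  # the callback fires about 3 seconds into the run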
I've created a script using Scrapy to fetch some fields from a webpage. The url of the landing page and the urls of inner pages get redirected very often, so I created a middleware to handle that redirection. However, when I came across this post, I understood that I need to return a request in process_request() after replacing the redirected url with the original one.
The meta={'dont_redirect': True, "handle_httpstatus_list": [301, 302, 307, 429]} is always in place when the requests are sent from the spider.
Since not all of the requests are redirected, I tried to replace the redirected urls within the _retry() method.
def process_request(self, request, spider):
    request.headers['User-Agent'] = self.ua.random

def process_exception(self, request, exception, spider):
    return self._retry(request, spider)

def _retry(self, request, spider):
    request.dont_filter = True
    if request.meta.get('redirect_urls'):
        redirect_url = request.meta['redirect_urls'][0]
        redirected = request.replace(url=redirect_url)
        redirected.dont_filter = True
        return redirected
    return request

def process_response(self, request, response, spider):
    if response.status in [301, 302, 307, 429]:
        return self._retry(request, spider)
    return response
Question: How can I send requests after replacing the redirected url with the original one, using a middleware?
Edit:
I'm putting this at the beginning of the answer because it's a quicker one-shot solution that might work for you.
Scrapy 2.5 introduced get_retry_request, which allows you to retry requests from a spider callback.
From the docs:
Returns a new Request object to retry the specified request, or None if retries of the specified request have been exhausted.
So you could do something like:
from scrapy.downloadermiddlewares.retry import get_retry_request  # at the top of your spider module

def parse(self, response):
    if response.status in [301, 302, 307, 429]:
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='tried to redirect',
            max_retry_times=10,
        )
        if new_request_or_none:
            yield new_request_or_none
        else:
            # exhausted all retries
            ...
But then again, you should make sure you only retry on 3xx status codes when the website uses them to indicate some non-permanent condition, like redirecting to a maintenance page. As for status 429, see my recommendation below about using a delay.
Edit 2:
On Twisted versions older than 21.7.0, the coroutine async_sleep implementation using deferLater probably won't work. Use this instead:
from twisted.internet import defer, reactor

async def async_sleep(delay, return_value=None):
    deferred = defer.Deferred()
    reactor.callLater(delay, deferred.callback, return_value)
    return await deferred
Original answer:
If I understood it correctly, you just want to retry the original request whenever a redirection occurs, right?
In that case, you can force a retry on requests that would otherwise be redirected, by using this RedirectMiddleware:
# middlewares.py
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware


class CustomRedirectMiddleware(RedirectMiddleware):
    """
    Modifies RedirectMiddleware to set response status to 503 on redirects.
    Make sure this appears in the DOWNLOADER_MIDDLEWARES setting with a lower priority (higher number) than RetryMiddleware
    (or whatever the downloader middleware responsible for retrying on status 503 is called).
    """

    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307, 308):  # 429 is already in Scrapy's default retry list
            return response.replace(status=503)  # Now this response is RetryMiddleware's problem
        return super().process_response(request, response, spider)
However, retrying on every occurrence of these status codes may lead to other problems, so you might want to add an additional condition to the if, like checking for the presence of some header that could indicate site maintenance or something like that.
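For example, a hypothetical variation (the class name, the 'maintenance' marker and the header handling are assumptions for illustration, not part of the original answer) could inspect the redirect target before forcing a retry:

# middlewares.py
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware


class MaintenanceAwareRedirectMiddleware(RedirectMiddleware):
    """Only force a retry when the redirect target looks like a maintenance page."""

    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307, 308):
            location = response.headers.get('Location', b'').decode(errors='replace')
            if 'maintenance' in location:  # assumed marker; adjust to your target site
                return response.replace(status=503)  # hand it over to RetryMiddleware
        return super().process_response(request, response, spider)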
While we are at it, since you included status code 429 in your list, I assume you may be getting some "Too Many Requests" responses. You should probably make your spider wait some time before retrying in that specific case. That can be achieved with the following RetryMiddleware:
# middlewares.py
from twisted.internet import task, reactor
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


async def async_sleep(delay, callable=None, *args, **kw):
    return await task.deferLater(reactor, delay, callable, *args, **kw)


class TooManyRequestsRetryMiddleware(RetryMiddleware):
    """
    Modifies RetryMiddleware to delay retries on status 429.
    """

    DEFAULT_DELAY = 10  # Delay in seconds. Tune this to your needs
    MAX_DELAY = 60  # Sometimes, RETRY-AFTER has absurd values

    async def process_response(self, request, response, spider):
        """
        Like RetryMiddleware.process_response, but, if response status is 429,
        retry the request only after waiting at most self.MAX_DELAY seconds.
        Respect the Retry-After header if it's less than self.MAX_DELAY.
        If Retry-After is absent/invalid, wait only self.DEFAULT_DELAY seconds.
        """
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            if response.status == 429:
                retry_after = response.headers.get('retry-after')
                try:
                    # header values are bytes; decode before parsing
                    retry_after = int(retry_after.decode())
                except (ValueError, TypeError, AttributeError):
                    delay = self.DEFAULT_DELAY
                else:
                    delay = min(self.MAX_DELAY, retry_after)
                spider.logger.info(f'Retrying {request} in {delay} seconds.')
                spider.crawler.engine.pause()
                await async_sleep(delay)
                spider.crawler.engine.unpause()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response
Don't forget to tell Scrapy to use these middlewares by editing DOWNLOADER_MIDDLEWARES in your project's settings.py:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project_name.middlewares.TooManyRequestsRetryMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'your_project_name.middlewares.CustomRedirectMiddleware': 600,
}
As I understand it, Scrapy is single threaded but async on the network side. I am working on something which requires an API call to an external resource from within the item pipeline. Is there any way to make the HTTP request without blocking the pipeline and slowing down Scrapy's crawling?
Thanks
You can do it by scheduling a request directly on the crawler engine via crawler.engine.crawl(request, spider). To do that, however, you first need to expose the crawler in your pipeline:
import scrapy
from scrapy.exceptions import DropItem


class MyPipeline(object):

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        if item['some_extra_field']:  # check if we already did below
            return item
        url = 'some_url'
        req = scrapy.Request(url, self.parse_item, meta={'item': item})
        self.crawler.engine.crawl(req, spider)
        raise DropItem()  # we will get this item next time

    def parse_item(self, response):
        item = response.meta['item']
        item['some_extra_field'] = '...'
        return item
I'm trying to get Scrapy to grab a URL from a message queue and then scrape that URL. I have the loop going just fine and grabbing the URL from the queue, but once it has a url it never enters the parse() method; it just continues to loop (and sometimes the url comes back around even though I've deleted it from the queue...).
While it's running in terminal, if I CTRL+C and force it to end, it enters the parse() method and crawls the page, then ends. I'm not sure what's wrong here.
import time

import boto.sqs
from scrapy import Spider


class my_Spider(Spider):
    name = "my_spider"
    allowed_domains = ['domain.com']

    def __init__(self):
        super(my_Spider, self).__init__()
        self.url = None

    def start_requests(self):
        while True:
            # Crawl the url from queue
            yield self.make_requests_from_url(self._pop_queue())

    def _pop_queue(self):
        # Grab the url from queue
        return self.queue()

    def queue(self):
        url = None
        while url is None:
            conf = {
                "sqs-access-key": "",
                "sqs-secret-key": "",
                "sqs-queue-name": "crawler",
                "sqs-region": "us-east-1",
                "sqs-path": "sqssend"
            }
            # Connect to AWS
            conn = boto.sqs.connect_to_region(
                conf.get('sqs-region'),
                aws_access_key_id=conf.get('sqs-access-key'),
                aws_secret_access_key=conf.get('sqs-secret-key')
            )
            q = conn.get_queue(conf.get('sqs-queue-name'))
            message = conn.receive_message(q)
            # Didn't get a message back, wait.
            if not message:
                time.sleep(10)
                url = None
            else:
                url = message
        if url is not None:
            message = url[0]
            message_body = str(message.get_body())
            message.delete()
            self.url = message_body
            return self.url

    def parse(self, response):
        ...
        yield item
Updated from comments:
def start_requests(self):
    while True:
        # Crawl the url from queue
        queue = self._pop_queue()
        self.logger.error(queue)
        if queue is None:
            time.sleep(10)
            continue
        url = queue
        if url:
            yield self.make_requests_from_url(url)
I removed the while url is None: loop, but I still get the same problem.
Would I be right to assume that if this works:
import scrapy
import random


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]

    def __init__(self):
        super(ExampleSpider, self).__init__()
        self.url = None

    def start_requests(self):
        while True:
            # Crawl the url from queue
            yield self.make_requests_from_url(self._pop_queue())

    def _pop_queue(self):
        # Grab the url from queue
        return self.queue()

    def queue(self):
        return 'http://www.example.com/?{}'.format(random.randint(0, 100000))

    def parse(self, response):
        print("Successfully parsed!")
Then your code should work as well, unless:
There's a problem with allowed_domains and your queue actually returns URLs outside it
There's a problem with your queue() function and/or the data it produces e.g. it returns arrays or it blocks indefinitely or something like that
Note also that the boto library is blocking and not Twisted/asynchronous. In order not to block Scrapy while using it, you would have to use a Twisted-compatible library like txsqs. Alternatively, you might want to run the boto calls in a separate thread with deferToThread, as sketched below.
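A minimal sketch of the deferToThread approach, assuming the same boto connection and queue objects as in the question (the helper names here are made up for the example):

from twisted.internet.threads import deferToThread


def fetch_messages_blocking(conn, queue):
    # Blocking boto call; safe to run in a worker thread.
    return conn.receive_message(queue)


def fetch_messages_async(conn, queue):
    # Returns a Deferred that fires with the messages once the worker thread
    # finishes, so the Twisted reactor (and Scrapy) keeps running meanwhile.
    return deferToThread(fetch_messages_blocking, conn, queue)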
After your follow-up question on the Scrapy list, I believe you have to understand that your code is quite far from functional, which makes this as much a generic Boto/SQS question as a Scrapy question. Anyway - here's an average but functional solution.
I've created an AWS SQS queue and gave it some overly broad permissions.
Now I'm able to submit messages to the queue with the AWS CLI like this:
$ aws --region eu-west-1 sqs send-message --queue-url "https://sqs.eu-west-1.amazonaws.com/123412341234/my_queue" --message-body 'url:https://stackoverflow.com'
For some weird reason, I think that when I was setting --message-body to a bare URL it was actually downloading the page and sending the result as the message body(!). Not sure - I don't have time to confirm this, but it's interesting. Anyway.
Here's a proper-ish spider. As I said before, boto is a blocking API, which is bad. In this implementation I call its API just once from start_requests() and then only when the spider is idle, in the spider_idle() callback. At that point, because the spider is idle, the fact that boto blocks doesn't pose much of a problem. While pulling URLs from SQS, I pull as many as possible in the while loop (you could put a limit there if you don't want to consume, e.g., more than 500 at a time) in order to call the blocking API as rarely as possible.
Notice also the call to conn.delete_message_batch(), which actually removes messages from the queue (otherwise they just stay there forever), and queue.set_message_class(boto.sqs.message.RawMessage), which avoids boto trying to base64-decode message bodies that were sent as plain text.
Overall this might be an OK solution for your level of requirements.
from scrapy import Spider, Request
from scrapy import signals
import boto.sqs
from scrapy.exceptions import DontCloseSpider


class CPU_Z(Spider):
    name = "cpuz"
    allowed_domains = ['valid.x86.fr']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CPU_Z, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def __init__(self, *args, **kwargs):
        super(CPU_Z, self).__init__(*args, **kwargs)
        conf = {
            "sqs-access-key": "AK????????????????",
            "sqs-secret-key": "AB????????????????????????????????",
            "sqs-queue-name": "my_queue",
            "sqs-region": "eu-west-1",
        }
        self.conn = boto.sqs.connect_to_region(
            conf.get('sqs-region'),
            aws_access_key_id=conf.get('sqs-access-key'),
            aws_secret_access_key=conf.get('sqs-secret-key')
        )
        self.queue = self.conn.get_queue(conf.get('sqs-queue-name'))
        assert self.queue
        self.queue.set_message_class(boto.sqs.message.RawMessage)

    def _get_some_urls_from_sqs(self):
        while True:
            messages = self.conn.receive_message(self.queue, number_messages=10)
            if not messages:
                break
            for message in messages:
                body = message.get_body()
                if body[:4] == 'url:':
                    url = body[4:]
                    yield self.make_requests_from_url(url)
            self.conn.delete_message_batch(self.queue, messages)

    def spider_idle(self, spider):
        for request in self._get_some_urls_from_sqs():
            self.crawler.engine.crawl(request, self)
        raise DontCloseSpider()

    def start_requests(self):
        for request in self._get_some_urls_from_sqs():
            yield request

    def parse(self, response):
        yield {
            "freq_clock": response.url
        }
Is it possible to delay the retry of a particular Scrapy Request? I have a middleware which needs to defer the request of a page until a later time. I know how to do a basic deferral (put it at the end of the queue), and also how to delay all requests (global settings), but I want to delay just this one individual request. This matters most near the end of the queue, because with the simple deferral it immediately becomes the next request again.
Method 1
One way would be to add a middleware to your Spider (source, linked):
# File: middlewares.py
from twisted.internet import reactor
from twisted.internet.defer import Deferred


class DelayedRequestsMiddleware(object):

    def process_request(self, request, spider):
        delay_s = request.meta.get('delay_request_by', None)
        if not delay_s:
            return
        deferred = Deferred()
        reactor.callLater(delay_s, deferred.callback, None)
        return deferred
Which you could later use in your Spider like this:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {'middlewares.DelayedRequestsMiddleware': 123},
    }

    def start_requests(self):
        # This request will have itself delayed by 5 seconds
        yield scrapy.Request(url='http://quotes.toscrape.com/page/1/',
                             meta={'delay_request_by': 5})
        # This request will not be delayed
        yield scrapy.Request(url='http://quotes.toscrape.com/page/2/')

    def parse(self, response):
        ...  # Process results here
Method 2
You could do this with a custom Retry Middleware (source); you just need to override the process_response method of the current Retry Middleware:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # Your delay code here, for example sleep(10) or polling your server until it is alive
            return self._retry(request, reason, spider) or response
        return response
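Note that a plain sleep(10) in that spot would block Scrapy's whole reactor. A non-blocking variant, assuming Scrapy 2.x (which accepts coroutine middleware methods) and a fixed 10-second delay picked purely for illustration, could look roughly like this:

from twisted.internet import task, reactor
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class DelayedRetryMiddleware(RetryMiddleware):
    # Hypothetical name; same idea as CustomRetryMiddleware above,
    # but with a non-blocking pause before the retry.
    async def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            await task.deferLater(reactor, 10, lambda: None)  # wait without blocking
            return self._retry(request, reason, spider) or response
        return response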
Then enable it instead of the default RetryMiddleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewarefilepath.CustomRetryMiddleware': 550,
}
A solution that uses twisted.reactor.callLater() is here:
https://github.com/ArturGaspar/scrapy-delayed-requests
The sleep() method suspends execution for the given number of seconds. The argument may be a floating point number to indicate a more precise sleep time.
To use it, import the time module in your spider:
import time
Then you can add the sleep call where you need the delay:
time.sleep(5)
I am trying to define a custom downloader middleware in Scrapy to ignore all requests to a particular URL (these requests are redirected from other URLs, so I can't filter them out when I generate the requests in the first place).
I have the following code. The idea is to catch this at the response processing stage (as I'm not exactly sure how requests redirecting to other requests work), check the URL, and, if it matches the one I'm trying to filter out, return an IgnoreRequest exception; if not, return the response as usual so that it continues to be processed.
from scrapy.exceptions import IgnoreRequest
from scrapy import log


class CustomDownloaderMiddleware:

    def process_response(request, response, spider):
        log.msg("In Middleware " + response.url, level=log.WARNING)
        if response.url == "http://www.achurchnearyou.com//":
            return IgnoreRequest()
        else:
            return response
and I add this to the dict of middlewares:
DOWNLOADER_MIDDLEWARES = {
    'acny.middlewares.CustomDownloaderMiddleware': 650
}
with a value of 650, which should - I think - make it run directly after the RedirectMiddleware.
However, when I run the crawler, I get an error saying:
ERROR: Error downloading <GET http://www.achurchnearyou.com/venue.php?V=00001>: process_response() got multiple values for keyword argument 'request'
This error is occurring on the very first page crawled, and I can't work out why it is occurring - I think I've followed what the manual said to do. What am I doing wrong?
I've found the solution to my own problem - it was a silly mistake in how I defined the class and method in Python. The code above needs to be:
from scrapy.exceptions import IgnoreRequest
from scrapy import log


class CustomDownloaderMiddleware(object):

    def process_response(self, request, response, spider):
        log.msg("In Middleware " + response.url, level=log.WARNING)
        if response.url == "http://www.achurchnearyou.com//":
            raise IgnoreRequest()
        else:
            return response
That is, the method needs self as its first parameter, the class needs to inherit from object, and the IgnoreRequest needs to be raised rather than returned.
If you know which requests are redirected to the problematic ones, how about something like:
def parse_requests(self, response):
    ....
    meta = {'handle_httpstatus_list': [301, 302]}
    callback = self.process_redirects
    yield Request(url, callback=callback, meta=meta, ...)

def process_redirects(self, response):
    url = response.headers['location']
    if url is no good:
        return
    else:
        ...
This way you avoid downloading useless responses.
And you can always define your own custom redirect middleware.
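For instance, a minimal sketch of such a middleware (the class name and the hard-coded URL are assumptions taken from the question, not an established recipe) could subclass RedirectMiddleware and drop any request whose redirect target is the unwanted URL:

from scrapy import Request
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware
from scrapy.exceptions import IgnoreRequest


class SkipUnwantedRedirectsMiddleware(RedirectMiddleware):
    """Drop requests that a redirect would otherwise send to the unwanted URL."""

    UNWANTED_URL = "http://www.achurchnearyou.com//"  # the URL from the question

    def process_response(self, request, response, spider):
        result = super().process_response(request, response, spider)
        # RedirectMiddleware returns a new Request when it decides to follow a redirect
        if isinstance(result, Request) and result.url == self.UNWANTED_URL:
            raise IgnoreRequest()
        return result

If you go this route, register it in DOWNLOADER_MIDDLEWARES in place of the built-in scrapy.downloadermiddlewares.redirect.RedirectMiddleware.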