My code is included below and is really not much more than a slightly tweaked version of the example lifted from Scrapy's documentation. The code works as-is, but there is a gap in the logic, between the login and how the response is passed through to subsequent requests, that I am not understanding.
According to the documentation, a request object returns a response object. This response object is passed as the first argument to a callback function. This I get. This is the way authentication can be handled and subsequent requests made using the user credentials.
What I am not understanding is how the response object makes it to the next request call following authentication. In my code below, the parse method returns a FormRequest created to authenticate. Since the FormRequest has a callback to the after_login method, after_login is called with the response from the FormRequest as its first parameter.
The after_login method checks to make sure there are no errors, then makes another request through a yield statement. What I do not understand is how the response passed in as an argument to the after_login method is making it to the Request following the yield. How does this happen?
The primary reason I am interested is that I need to make two requests per iterated value in the after_login method, and I cannot figure out how the responses are being handled by the scraper, and therefore how to modify the code. Thank you in advance for your time and explanations.
# import Scrapy modules
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy import log

# import custom item from item module
from scrapy_spage.items import ReachItem


class AwSpider(BaseSpider):
    name = 'spage'
    allowed_domains = ['webpage.org']
    start_urls = ('https://www.webpage.org/',)

    def parse(self, response):
        credentials = {'username': 'user',
                       'password': 'pass'}
        return [FormRequest.from_response(response,
                                          formdata=credentials,
                                          callback=self.after_login)]

    def after_login(self, response):
        # check to ensure login succeeded
        if 'Login failed' in response.body:
            # log error
            self.log('Login failed', level=log.ERROR)
            # exit method
            return
        else:
            # for every integer from one to 5000, 1100 to 1110 for testing...
            for reach_id in xrange(1100, 1110):
                # call make requests, use format to create four digit string for each reach
                yield Request('https://www.webpage.org/content/River/detail/id/{0:0>4}/'.format(reach_id),
                              callback=self.scrape_page)

    def scrape_page(self, response):
        # create selector object instance to parse response
        sel = Selector(response)
        # create item object instance
        reach_item = ReachItem()
        # get attribute
        reach_item['attribute'] = sel.xpath('//body/text()').extract()
        # other selectors...
        # return the reach item
        return reach_item
"how the response passed in as an argument to the after_login method is making it to the Request following the yield"
If I understand your question, the answer is that it doesn't.
The mechanism is simple:
for x in spider.function():
    if x is a request:
        make the HTTP call for this request and wait for the response asynchronously
    if x is an item:
        send it to pipelines etc...

upon getting a response:
    request.callback(response)
as you can see, there is no limit to the number of requests the function can yield so you can:
for reach_id in xrange(x, y):
    yield Request(url=url1, callback=callback1)
    yield Request(url=url2, callback=callback2)
hope this helps
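Applied to the spider in the question, a minimal sketch of after_login yielding two requests per reach_id could look like the following; the second URL pattern and the scrape_flow_page callback are hypothetical placeholders, not taken from the original code:

def after_login(self, response):
    if 'Login failed' in response.body:
        self.log('Login failed', level=log.ERROR)
        return
    for reach_id in xrange(1100, 1110):
        # first request per reach: the detail page, as in the original spider
        yield Request('https://www.webpage.org/content/River/detail/id/{0:0>4}/'.format(reach_id),
                      callback=self.scrape_page)
        # second request per reach: a hypothetical related page with its own callback
        yield Request('https://www.webpage.org/content/River/flow/id/{0:0>4}/'.format(reach_id),
                      callback=self.scrape_flow_page)

Each yielded request is scheduled independently and its callback is invoked only when its own response comes back.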
I'm trying to iterate through a list of URLs returned from the callback passed to a scrapy Request, but I'm getting the following error:
TypeError: 'Request' object is not iterable
The following works. I can see all the extracted URLs flood the terminal:
import scrapy


class PLSpider(scrapy.Spider):
    name = 'pl'
    start_urls = ['https://example.com']

    def genres(self, resp):
        for genre in resp.css('div.sub-menus a'):
            yield {
                'genre': genre.css('::text').extract_first(),
                'url': genre.css('::attr(href)').extract_first()
            }

    def extractSamplePackURLs(self, resp):
        return {
            'packs': resp.css('h4.product-title a::attr(href)').extract()
        }

    def extractPackData(self, resp):
        return {
            'title': resp.css('h1.product-title::text'),
            'description': resp.css('div.single-product-description p').extract_first()
        }

    def parse(self, resp):
        for genre in self.genres(resp):
            samplePacks = scrapy.Request(genre['url'], callback=self.extractSamplePackURLs)
            yield samplePacks
But if I replace the yield samplePacks line with:
def parse(self, resp):
    for genre in self.genres(resp):
        samplePacks = scrapy.Request(genre['url'], callback=self.extractSamplePackURLs)
        for pack in samplePacks:
            yield pack
... I get the error I posted above.
Why is this and how can I loop through the returned value of the callback?
Yielding Request objects in scrapy.Spider callbacks only tells the Scrapy framework to enqueue HTTP requests; it yields Request objects, nothing more. It does not download them immediately, nor does it give control back once they are downloaded, i.e. after the yield you still don't have the result. Request objects are not promises, futures, or Deferreds, and Scrapy is not designed the same way as various async frameworks.
These Request objects will eventually get processed by the framework's downloader, and the response body from each HTTP request will be passed to the associated callback.
This is the basis of Scrapy's asynchronous programming pattern.
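For reference, a hedged sketch of how this pipeline is usually expressed, reusing the selectors from the question; the parseSamplePackURLs callback and the use of meta to carry the genre along are illustrative, not from the original post. Each callback yields the follow-up requests itself instead of trying to iterate over a Request:

def parse(self, resp):
    for genre in self.genres(resp):
        # request the genre page; its callback will yield the pack requests
        yield scrapy.Request(resp.urljoin(genre['url']),
                             callback=self.parseSamplePackURLs,
                             meta={'genre': genre['genre']})

def parseSamplePackURLs(self, resp):
    # runs only once the genre page has actually been downloaded
    for href in resp.css('h4.product-title a::attr(href)').extract():
        yield scrapy.Request(resp.urljoin(href),
                             callback=self.extractPackData,
                             meta={'genre': resp.meta.get('genre')})

def extractPackData(self, resp):
    yield {
        'genre': resp.meta.get('genre'),
        'title': resp.css('h1.product-title::text').extract_first(),
        'description': resp.css('div.single-product-description p').extract_first(),
    }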
If you want to do something more "procedural-like" in which yield request(...) gets you the HTTP response the next time you have control, you can have a look at https://github.com/rmax/scrapy-inline-requests/.
I am using Scrapy 1.1 to scrape a website. The site requires periodic re-login; I can tell when this is needed because a 302 redirect occurs when login is required. Based on http://sangaline.com/post/advanced-web-scraping-tutorial/, I have subclassed the RedirectMiddleware, making the Location HTTP header available in the spider under:
request.meta['redirect_urls']
My problem is that after logging in, I have set up a function to loop through 100 pages to scrape. Let's say after 15 pages I see that I have to log back in (based on the contents of request.meta['redirect_urls']). My code looks like:
def test1(self, response):
    ......
    for row in empties:  # 100 records
        d = object_as_dict(row)
        # AA
        yield Request(url=myurl, headers=self.headers, callback=self.parse_lookup,
                      meta={'d': d}, dont_filter=True)

def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        print str(response.meta['redirect_urls'])
        # BB
    d = response.meta['d']
So as you can see, I get 'notified' of the need to re-login in parse_lookup at BB, but I need to feed this information back to cancel the loop creating requests in test1 (AA). How can I make the information in parse_lookup available in the prior callback function?
Why not use a DownloaderMiddleware?
You could write a DownloaderMiddleware like so:
Edit: I have edited the original code to address a second problem the OP had in the comments.
from scrapy.http import Request


class CustomDownloaderMiddleware(object):

    def process_response(self, request, response, spider):
        if 'redirect_urls' in response.meta:
            # assuming your spider has a method for handling the login
            original_url = response.meta["redirect_urls"][0]
            return Request(url="login_url",
                           callback=spider.login,
                           meta={"original_url": original_url})
        return response
So you "intercept" the response before it goes to the parse_lookup and relogin/fix what is wrong and yield new requests...
Like Tomáš Linhart said the requests are asynchronous so I don't know if you could run into problems by "reloging in" several times in a row, as multiple requests might be redirected at the same time.
Remember to add the middleware to your settings:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 542,
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}
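The middleware above assumes the spider has a login callback. A minimal sketch of what such spider methods could look like follows; the form field names, the after_relogin callback, and the credentials are hypothetical placeholders, not code from the question:

from scrapy.http import Request, FormRequest

# methods on your spider
def login(self, response):
    # re-submit the login form; the field names here are hypothetical
    return FormRequest.from_response(
        response,
        formdata={'username': 'user', 'password': 'pass'},
        callback=self.after_relogin,
        meta={'original_url': response.meta['original_url']})

def after_relogin(self, response):
    # once logged in again, retry the request that was originally redirected
    yield Request(url=response.meta['original_url'],
                  callback=self.parse_lookup,
                  dont_filter=True)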
You can't achieve what you want because Scrapy uses asynchronous processing.
In theory you could use the approach partially suggested in a comment by @Paulo Scardine, i.e. raise an exception in parse_lookup. For it to be useful, you would then have to write your own spider middleware and handle this exception in its process_spider_exception method to log back in and retry the failed requests.
But I think a better and simpler approach would be to log back in and retry as soon as you detect the need, i.e. in parse_lookup. I'm not sure exactly how CONCURRENT_REQUESTS_PER_DOMAIN works, but setting it to 1 might let you process one request at a time, so there should be no failing requests, as you always log back in when you need to.
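A rough sketch of that idea, assuming CONCURRENT_REQUESTS_PER_DOMAIN = 1 in settings.py and a hypothetical login URL and credentials (neither appears in the question):

def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        # we were bounced to the login page: log back in, then retry this request
        yield FormRequest('https://example.com/login',  # hypothetical login URL
                          formdata={'username': 'user', 'password': 'pass'},
                          callback=self.after_relogin,
                          meta={'retry_url': response.meta['redirect_urls'][0],
                                'd': response.meta['d']},
                          dont_filter=True)
        return
    d = response.meta['d']
    # ... normal parsing ...

def after_relogin(self, response):
    # retry the request that triggered the redirect
    yield Request(url=response.meta['retry_url'],
                  callback=self.parse_lookup,
                  meta={'d': response.meta['d']},
                  dont_filter=True)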
Don't iterate over the 100 items and create requests for all of them up front. Instead, create a request for the first item only, process it in your callback function, yield the item, and only after that is done create the request for the second item and yield it. With this approach, you can check for the Location header in your callback and either make the request for the next item, or log in and repeat the current item's request.
For example:
def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        # It's a redirect
        yield Request(url=your_login_url,
                      callback=self.parse_login_response,
                      meta={'current_item_url': response.request.url})
    else:
        # It's a normal response
        item = YourItem()
        ...  # Extract your item fields from the response
        yield item
        next_item_url = ...  # Extract the next page URL from the response
        yield Request(url=next_item_url, callback=self.parse_lookup)
This assumes that you can get the next item URL from the current item page, otherwise just put the list of URLs in the first request's META dict and pass it along.
I think it would be better not to fire all 100 requests at once; instead you should try to "serialize" the requests. For example, you could put all your empties in the request's meta and pop them off as necessary, or keep the empties as a field of your spider, as sketched below.
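A minimal sketch of the meta-based variant, reusing names from the question's fragment (empties, object_as_dict, myurl); the structure itself is illustrative:

def test1(self, response):
    pending = [object_as_dict(row) for row in empties]  # 100 records
    # schedule only the first record; the rest travel along in meta
    yield self.make_lookup_request(pending)

def make_lookup_request(self, pending):
    return Request(url=myurl, headers=self.headers, callback=self.parse_lookup,
                   meta={'pending': pending}, dont_filter=True)

def parse_lookup(self, response):
    pending = response.meta['pending']
    if 'redirect_urls' in response.meta:
        # need to re-login; handle it here and retry the same record afterwards
        ...
        return
    d = pending[0]
    # ... parse the current record using d ...
    if pending[1:]:
        # only now schedule the next record
        yield self.make_lookup_request(pending[1:])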
Another alternative would be to use the scrapy-inline-requests package to accomplish what you want, but you should probably extend your middleware to perform the login.
I want to asynchronously query a database for keys, then make requests to several urls for each key.
I have a function that returns a Deferred from the database whose value is the key for several requests. Ideally, I would call this function and return a generator of Deferreds from start_requests.
@inlineCallbacks
def get_request_deferred(self):
    d = yield engine.execute(select([table]))  # async
    d.addCallback(make_url)
    d.addCallback(Request)
    return d

def start_requests(self):
    ????
But attempting this in several ways raises
builtins.AttributeError: 'Deferred' object has no attribute 'dont_filter'
which I take to mean that start_requests must return Request objects, not Deferreds whose values are Request objects. The same seems to be true of spider middleware's process_start_requests().
Alternatively, I can make initial requests to, say, http://localhost/ and change them to the real URL once the key is available from the database, via downloader middleware's process_request(). However, process_request can only return a single Request object; it cannot yield Requests to multiple pages using the key. Attempting yield Request(url) raises:
AssertionError: Middleware myDownloaderMiddleware.process_request
must return None, Response or Request, got generator
What is the cleanest solution to:
- get the key asynchronously from the database, and
- for each key, generate several requests?
You haven't provided a use case where asynchronous database queries are a necessity. I'm assuming you cannot begin to scrape your URLs unless you query the database first? If that's the case, you're better off just doing the query synchronously, iterating over the results, extracting what you need, and then yielding Request objects. It makes little sense to query a database asynchronously and then just sit around waiting for the query to finish.
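A minimal sketch of that synchronous approach, assuming the engine and select([table]) from the question, and a hypothetical make_urls helper that builds several URLs per key:

def start_requests(self):
    # run the query synchronously at startup and build requests from the rows
    result = engine.execute(select([table]))  # blocking, which is fine here
    for row in result:
        key = row[0]
        for url in make_urls(key):  # hypothetical helper: several URLs per key
            yield scrapy.Request(url, callback=self.parse, meta={'key': key})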
You can let the callback for the Deferred object pass the urls to a generator of some sort. The generator will then convert any received urls into scrapy Request objects and yield them. Below is an example using the code you linked (not tested):
import scrapy
from Queue import Queue
from pdb import set_trace as st
from twisted.internet.defer import Deferred, inlineCallbacks


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def __init__(self):
        self.urls = Queue()
        self.stop = False
        self.requests = self.request_generator()
        self.deferred = self.deferred_generator()

    def deferred_generator(self):
        d = Deferred()
        d.addCallback(self.deferred_callback)
        yield d

    def request_generator(self):
        while not self.stop:
            url = self.urls.get()
            yield scrapy.Request(url=url, callback=self.parse)

    def start_requests(self):
        # start_requests must return an iterable of Requests
        return [self.requests.next()]

    def parse(self, response):
        st()
        # when you need to parse the next url from the callback
        yield self.requests.next()

    def deferred_callback(self, url):
        self.urls.put(url)
        if no_more_urls():
            self.stop = True
Don't forget to stop the request generator when you're done.
There is a website I'm scraping that will sometimes return a 200, but not have any text in response.body (raises an AttributeError when I try to parse it with Selector).
Is there a simple way to check to make sure the body includes text, and if not, retry the request until it does? Here is some pseudocode to outline what I'm trying to do.
def check_response(response):
    if response.body != '':
        return response
    else:
        return Request(copy_of_response.request,
                       callback=check_response)
Basically, is there a way I can repeat a request with the exact same properties (method, url, payload, cookies, etc.)?
Follow the EAFP principle:
Easier to ask for forgiveness than permission. This common Python coding style assumes the existence of valid keys or attributes and catches exceptions if the assumption proves false. This clean and fast style is characterized by the presence of many try and except statements. The technique contrasts with the LBYL style common to many other languages such as C.
Handle an exception and yield a Request to the current url with dont_filter=True:
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
def parse(self, response):
    try:
        # parsing logic here
        ...
    except AttributeError:
        yield Request(response.url, callback=self.parse, dont_filter=True)
You can also make a copy of the current request (not tested):
new_request = response.request.copy()
new_request.dont_filter = True
yield new_request
Or, make a new request using replace():
new_request = response.request.replace(dont_filter=True)
yield new_request
How about calling the actual _retry() method from the retry middleware, so it acts as a normal retry with all its logic that takes settings into account?
In settings:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scraper.middlewares.retry.RetryMiddleware': 550,
}
Then your retry middleware could be like:
from scrapy.downloadermiddlewares.retry import RetryMiddleware as BaseRetryMiddleware


class RetryMiddleware(BaseRetryMiddleware):

    def process_response(self, request, response, spider):
        # inject the retry method so a request can be retried, based on
        # conditions from the spider itself, even on 200 responses
        if not hasattr(spider, '_retry'):
            spider._retry = self._retry
        return super(RetryMiddleware, self).process_response(request, response, spider)
And then, in your success-response callback, you can call, for example:
yield self._retry(response.request, ValueError, self)
From Scrapy 2.5.0 there is a new method get_retry_request().
It's pretty easy; here is the example from the Scrapy docs:
from scrapy.downloadermiddlewares.retry import get_retry_request

def parse(self, response):
    if not response.text:
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='empty',
        )
        return new_request_or_none
In your existing code, you can simply allow duplicates by passing dont_filter=True:
def check_response(response):
    if response.body != '':
        return response
    else:
        return Request(copy_of_response.request,
                       callback=check_response, dont_filter=True)
I am trying to define a custom downloader middleware in Scrapy to ignore all requests to a particular URL (these requests are redirected from other URLs, so I can't filter them out when I generate the requests in the first place).
I have the following code. The idea is to catch this at the response-processing stage (as I'm not exactly sure how requests redirecting to other requests work), check the URL, and, if it matches the one I'm trying to filter out, return an IgnoreRequest exception; if not, return the response as usual so that it can continue to be processed.
from scrapy.exceptions import IgnoreRequest
from scrapy import log


class CustomDownloaderMiddleware:

    def process_response(request, response, spider):
        log.msg("In Middleware " + response.url, level=log.WARNING)
        if response.url == "http://www.achurchnearyou.com//":
            return IgnoreRequest()
        else:
            return response
and I add this to the dict of middlewares:
DOWNLOADER_MIDDLEWARES = {
    'acny.middlewares.CustomDownloaderMiddleware': 650
}
with a value of 650, which should - I think - make it run directly after the RedirectMiddleware.
However, when I run the crawler, I get an error saying:
ERROR: Error downloading <GET http://www.achurchnearyou.com/venue.php?V=00001>: process_response() got multiple values for keyword argument 'request'
This error is occurring on the very first page crawled, and I can't work out why it is occurring - I think I've followed what the manual said to do. What am I doing wrong?
I've found the solution to my own problem - it was a silly mistake with creating the class and method in Python. The code above needs to be:
from scrapy.exceptions import IgnoreRequest
from scrapy import log


class CustomDownloaderMiddleware(object):

    def process_response(self, request, response, spider):
        log.msg("In Middleware " + response.url, level=log.WARNING)
        if response.url == "http://www.achurchnearyou.com//":
            raise IgnoreRequest()
        else:
            return response
That is, the method needs self as its first parameter, and the class needs to inherit from object (note that the fixed code also raises IgnoreRequest rather than returning it).
If you know which requests are redirected to the problematic ones, how about something like:
def parse_requests(self, response):
    ....

    meta = {'handle_httpstatus_list': [301, 302]}
    callback = self.process_redirects
    yield Request(url, callback=callback, meta=meta, ...)

def process_redirects(self, response):
    url = response.headers['location']
    if url is no good:
        return
    else:
        ...
This way you avoid downloading useless responses.
And you can always define your own custom redirect middleware.
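For that last option, here is a rough sketch of a custom redirect middleware, subclassing the built-in RedirectMiddleware and overriding its private _redirect() hook to drop redirects whose target is the unwanted URL. Treat it as an illustration under those assumptions rather than a drop-in solution, since _redirect() is an internal method:

from scrapy.downloadermiddlewares.redirect import RedirectMiddleware
from scrapy.exceptions import IgnoreRequest


class FilteringRedirectMiddleware(RedirectMiddleware):

    def _redirect(self, redirected, request, spider, reason):
        # drop any redirect whose target is the URL we never want to fetch
        if redirected.url == "http://www.achurchnearyou.com//":
            raise IgnoreRequest("redirected to an ignored URL")
        return super(FilteringRedirectMiddleware, self)._redirect(
            redirected, request, spider, reason)

Registered in DOWNLOADER_MIDDLEWARES in place of the default RedirectMiddleware (e.g. 'acny.middlewares.FilteringRedirectMiddleware': 600 with the built-in one set to None), this stops the redirected request before it is ever downloaded.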