Understanding callbacks in Scrapy - python

I am new to Python and Scrapy and have not used callback functions before. However, I am using one now in the code below. The first request will be executed and its response will be sent to the callback function given as the second argument:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
I am unable to understand the following things:
How is the item populated?
Does the request.meta line execute before the response.meta line in parse_page2?
Where does the item returned from parse_page2 go?
What is the need for the return request statement in parse_page1? I thought the extracted items needed to be returned from here.

Read the docs:
For spiders, the scraping cycle goes through something like this:
You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests. The first requests to perform are obtained by calling the start_requests() method, which (by default) generates Requests for the URLs specified in start_urls, with the parse method as the callback function for those Requests.
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and their responses handled by the specified callback.
In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.
Finally, the items returned from the spider will typically be persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
Answers:
How is the 'item' populated? Does the request.meta line execute before the response.meta line in parse_page2?
Spiders are managed by the Scrapy engine. It first makes requests from the URLs specified in start_urls and passes them to a downloader. When downloading finishes, the callback specified in the request is called. If the callback returns another request, the same thing is repeated. If the callback returns an Item, the item is passed to a pipeline to save the scraped data.
Where does the item returned from parse_page2 go?
What is the need for the return request statement in parse_page1? I thought the extracted items needed to be returned from here.
As stated in the docs, each callback (both parse_page1 and parse_page2) can return either a Request or an Item (or an iterable of them). parse_page1 returns a Request, not an Item, because additional info needs to be scraped from an additional URL. The second callback, parse_page2, returns an item, because all the info has been scraped and is ready to be passed to a pipeline.
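As a rough sketch of that rule (the spider name, selectors, and dict-item here are illustrative, not from the question), a single callback can yield both kinds of objects and Scrapy routes each one appropriately:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        # An item (here a plain dict) goes to the item pipelines / feed exports.
        yield {"page_title": response.css("title::text").get()}
        # A Request goes back to the scheduler; its response will later be
        # passed to the callback named here.
        for href in response.css("a::attr(href)").getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)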

Yes, Scrapy uses a Twisted reactor to call spider functions, so a single loop on a single thread ensures that ordering.
The caller of a spider function expects to get either item(s) or request(s) in return; requests are put in a queue for future processing and items are sent to the configured pipelines.
Saving an item (or any other data) in request meta makes sense only if it is needed for further processing once the response arrives; otherwise it is obviously better to simply return it from parse_page1 and avoid the extra HTTP request.
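For contrast, a minimal sketch of the "no second page needed" case the last point describes (the callback name parse_single_page is made up; MyItem is the item class from the question):

def parse_single_page(self, response):
    # Hypothetical variant: everything we need is on this page, so there is
    # no reason to stash the item in request.meta. Returning it sends it
    # straight to the item pipelines.
    item = MyItem()
    item['main_url'] = response.url
    return item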

In Scrapy: understanding how items and requests work between callbacks
eLRuLL's answer is wonderful.
I want to add the part about how the item is passed along. First, we should be clear that a callback function only runs once the response to its request has been downloaded.
The code given in the Scrapy docs doesn't declare the URL or the request for page1, so let's set the URL of page1 to "http://www.example.com.html".
parse_page1 is the callback of
scrapy.Request("http://www.example.com.html", callback=parse_page1)
parse_page2 is the callback of
scrapy.Request("http://www.example.com/some_page.html", callback=parse_page2)
When the response of page1 is downloaded, parse_page1 is called to generate the request for page2:
item['main_url'] = response.url  # send "http://www.example.com.html" to item
request = scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2)
request.meta['item'] = item  # store item in request.meta
After the response of page2 is downloaded, parse_page2 is called to return an item:
item = response.meta['item']
# response.meta is equal to request.meta, so here
# item['main_url'] == "http://www.example.com.html"
item['other_url'] = response.url  # response.url == "http://www.example.com/some_page.html"
return item  # finally, we get an item recording the URLs of page1 and page2
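Putting the two callbacks together, a self-contained sketch of the whole chain could look like this (the spider name and the use of a plain dict as the item are assumptions; the URLs are the placeholders from the walkthrough above):

import scrapy

class PageChainSpider(scrapy.Spider):
    name = "page_chain"

    def start_requests(self):
        # page1, as in the walkthrough above
        yield scrapy.Request("http://www.example.com.html", callback=self.parse_page1)

    def parse_page1(self, response):
        item = {'main_url': response.url}
        request = scrapy.Request("http://www.example.com/some_page.html",
                                 callback=self.parse_page2)
        request.meta['item'] = item   # carry the partial item along
        return request

    def parse_page2(self, response):
        item = response.meta['item']
        item['other_url'] = response.url
        return item                   # complete item -> pipelines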


How to understand callback function in scrapy.Request?

I am reading Web Scraping with Python, 2nd Ed., and wanted to use the Scrapy module to crawl information from a webpage.
I got the following information from the documentation: https://docs.scrapy.org/en/latest/topics/request-response.html
callback (callable) – the function that will be called with the response of this request (once it’s downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
My understanding is that:
we pass in a URL and get a resp back, like we do with the requests module:
resp = requests.get(url)
and then pass resp in for data parsing:
parse(resp)
The problem is:
I don't see where resp is passed in.
Why do we need to put the self keyword before parse in the callback argument?
The self keyword is never used inside the parse function, so why bother putting it as the first parameter?
Can we extract the URL from the response parameter, like url = response.url, or should it be url = self.url?
class ArticleSpider(scrapy.Spider):
    name = 'article'

    def start_requests(self):
        urls = [
            'http://en.wikipedia.org/wiki/Python_'
            '%28programming_language%29',
            'https://en.wikipedia.org/wiki/Functional_programming',
            'https://en.wikipedia.org/wiki/Monty_Python']
        return [scrapy.Request(url=url, callback=self.parse) for url in urls]

    def parse(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        print('URL is: {}'.format(url))
        print('Title is: {}'.format(title))
It seems like you are missing a few concepts related to Python classes and OOP. It would be a good idea to read the Python docs, or at the very least this question.
Here is how Scrapy works: you instantiate a Request object and yield it to the Scrapy scheduler.
yield scrapy.Request(url=url) #or use return like you did
Scrapy will handle the request, download the HTML, and return everything it got back from that request to a callback function. If you didn't set a callback in your request (like in my example above), it will call a default method called parse.
parse is a method (a.k.a. a function) of your object. You wrote it in your code above, and even if you hadn't, it would still be there, since your class inherits all of its methods from its parent class:
class ArticleSpider(scrapy.Spider):  # <<<<<<<< here
    name = 'article'
So a TL;DR of your questions:
1 - You didn't see it because it happens in the parent class.
2 - You need to use self. so Python knows you are referencing a method of the spider instance.
3 - The self parameter is the instance itself, and it is used by Python.
4 - response is an independent object that your parse method receives as an argument, so you can access its attributes like response.url or response.headers.
You can find information about self here - https://docs.python.org/3/tutorial/classes.html
About this question:
can we extract URL from response parameter like this: url = response.url or should be url = self.url
You should use response.url to get the URL of the page you are currently crawling/parsing.
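Tying those answers together, a minimal sketch (the spider name is made up): a Request created without a callback falls back to self.parse, and the downloaded page is reached only through the response argument.

import scrapy

class MontySpider(scrapy.Spider):
    name = 'monty'

    def start_requests(self):
        # No callback is given here, so Scrapy will call self.parse with the response.
        yield scrapy.Request(url='https://en.wikipedia.org/wiki/Monty_Python')

    def parse(self, response):
        # The engine passes the downloaded response in; self is just the spider instance.
        print('URL is: {}'.format(response.url))
        print('Title is: {}'.format(response.css('h1::text').extract_first()))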

scrapy.Request doesn't enter the download middleware, it returns Request instead of response

I'm using scrapy.Spider to scrape, and I want to make a request inside the callback function reached from start_requests, but that request doesn't work: it should return a response, but it only returns a Request.
Following the debug breakpoints, I found that in class Request(object_ref) the request only finishes its initialization; it never goes into request = next(slot.start_requests) as expected to actually start requesting, and so only the Request is returned.
Here is my code in brief:
class ProjSpider(scrapy.Spider):
    name = 'Proj'
    allowed_domains = ['mashable.com']

    def start_requests(self):
        # pages
        pages = 10
        for i in range(1, pages):
            url = "https://mashable.com/channeldatafeed/Tech/new/page/" + str(i)
            yield scrapy.Request(url, callback=self.parse_mashable)
The Request above works fine. The following is where it goes wrong:
def parse_mashable(self, response):
    item = Item()
    json2parse = response.text
    json_response = json.loads(json2parse)
    d = json_response['dataFeed']  # a list containing dicts, in which there is a url for the detailed article
    for data in d:
        item_url = data['url']  # the url for the detailed article
        item_response = self.get_response_mashable(item_url)
        # here I want to parse the item_response to get detail
        item['content'] = item_response.xpath("//body").get
        yield item

def get_response_mashable(self, url):
    response = scrapy.Request(url)
    # using self.parser. I've also defined my own parser and yield an item
    # but the problem is it never got to the callback
    return response  # tried yield also but failed
This is where the Request doesn't work. The URL is in allowed_domains and it's not a duplicate. I'm guessing it's because of Scrapy's asynchronous Request mechanism, but how could that affect the request in self.parse_mashable? By then the Request in start_requests has already finished.
I managed to make the second request with Python's requests-html, but I still couldn't figure out why this doesn't work.
So could anyone help point out where I'm going wrong? Thanks in advance!
Scrapy doesn't really expect you to do this the way you're trying to, so it doesn't have a simple way to do it.
What you should be doing instead is passing the data you've scraped from the original page to the new callback using the request's meta dict.
For details, check Passing additional data to callback functions.
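A hedged sketch of that approach for the spider above: instead of calling get_response_mashable synchronously, yield a second Request and carry the partly filled item in meta. These two methods would replace parse_mashable and get_response_mashable inside ProjSpider; the parse_article callback name and the dict-item fields are assumptions, and import json / import scrapy are assumed at module level as in the original.

def parse_mashable(self, response):
    json_response = json.loads(response.text)
    for data in json_response['dataFeed']:
        item = {'url': data['url']}        # partially filled item
        yield scrapy.Request(
            data['url'],
            callback=self.parse_article,   # hypothetical second callback
            meta={'item': item},
        )

def parse_article(self, response):
    item = response.meta['item']
    item['content'] = response.xpath('//body').get()
    yield item                             # now complete -> pipelines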

Access response from spider in items pipeline in scrapy

I have a spider like this:
class ProductsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://example.com/url'
    ]

    def parse(self, response):
And I have a pipeline class like this
class ProductsDataPipeline(object):
    """ Item pipeline for products data crawler """

    def process_item(self, item, spider):
        return item
But I want to access the response argument of the parse function inside the pipeline's process_item function, without setting it as an attribute on the item object. Is that possible?
No, it's not possible.
Responses are not forwarded to pipelines. You either have to store the response in the item or use some external storage to hold the response and fetch it in the pipeline. The second option is much better and avoids many of the problems that can result from storing a response in an item (e.g. memory problems). For example, you could save the response to some form of storage in the parse callback, save a reference to that storage in an item field, and fetch the response from storage in the pipeline.
But it really depends on what you are trying to do: the response is also available in the spider middleware method process_spider_output, so perhaps you can use that instead of processing the item in a pipeline.
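As an illustration of that last route, a rough sketch of a spider middleware that sees the response next to everything the callback yields (the class name and the field it sets are assumptions, and it would still need to be enabled in SPIDER_MIDDLEWARES):

class AttachResponseUrlMiddleware:
    """Spider middleware sketch: the response is available alongside each output."""

    def process_spider_output(self, response, result, spider):
        for obj in result:
            if isinstance(obj, dict):      # crude test for "item-like" output
                obj['response_url'] = response.url
            yield obj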

Scrapy multiple spiders

I have defined two spiders which do the following:
Spider A:
Visits the home page.
Extracts all the links from the page and stores them in a text file.
This is necessary since the home page has a More Results button which produces further links to different products.
Spider B:
Opens the text file.
Crawls the individual pages and saves the information.
I am trying to combine the two and make a crawl-spider.
The URL structure of the home page is similar to:
http://www.example.com
The URL structure of the individual pages is similar to:
http://www.example.com/Home/Detail?id=some-random-number
The text file contains the list of such URLs which are to be scraped by the second spider.
My question:
How do I combine the two spiders so as to make a single spider which does the complete scraping?
From the Scrapy documentation:
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
So what you actually need to do is, in the parse method where you extract the links, yield a new request for each link, like:
yield self.make_requests_from_url("http://www.example.com/Home/Detail?id=some-random-number")
self.make_requests_from_url is already implemented in Spider.
An example:
class MySpider(Spider):
    name = "my_spider"

    def parse(self, response):
        try:
            user_name = Selector(text=response.body).xpath('//*[@id="ft"]/a/@href').extract()[0]
            yield self.make_requests_from_url("https://example.com/" + user_name)
            yield MyItem(user_name)
        except Exception as e:
            pass
You can handle the other requests using a different parsing function. Do it by returning a Request object and specifying the callback explicitly (the self.make_requests_from_url function calls the parse function by default):
Request(url=url,callback=self.parse_user_page)
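For completeness, a sketch of what that explicitly named callback could look like (parse_user_page and the selector used here are assumptions):

def parse_user_page(self, response):
    # Called only for the Requests that name it as their callback.
    yield {
        'user_name': response.css('h1::text').get(),
        'profile_url': response.url,
    }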

Scrapy: Item details on 3 different pages

I have to scrape something where part of the information is on one page, there's a link on that page to a page containing more information, and then there's another URL where the third piece of information is available.
How do I go about setting up my callbacks in order to have all this information together? Will I have to use a database in this case or can it still be exported to CSV?
The first thing to say is that you have the right idea - callbacks are the solution. I have seen some use of urllib or similar to fetch dependent pages, but it's far preferable to fully leverage the Scrapy download mechanism than employ some synchronous call from another library.
See this example from the Scrapy docs on the issue:
http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # parse response and populate item as required
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    # parse response and populate item as required
    item['other_url'] = response.url
    return item
Is your third piece of data on a page linked from the first page or the second page?
If from the second page, you can just extend the mechanism above and have parse_page2 return a request with a callback to a new parse_page3 (see the sketch after these notes).
If from the first page, you could have parse_page1 populate a request.meta['link3_url'] property from which parse_page2 can construct the subsequent request url.
NB - these 'secondary' and 'tertiary' urls should not be discoverable from the normal crawling process (start_urls and rules), but should be constructed from the response (using XPath etc) in parse_page1/parse_page2.
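A sketch of the "linked from the second page" case, extending the docs example with a hypothetical parse_page3 and a placeholder URL:

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    # Instead of returning the item here, chain one more request.
    request = Request("http://www.example.com/third_page.html",
                      callback=self.parse_page3)
    request.meta['item'] = item
    return request

def parse_page3(self, response):
    item = response.meta['item']
    item['third_url'] = response.url
    return item  # the item now carries data from all three pages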
The crawling, callback structures, pipelines and item construction are all independent of the export of data, so CSV will be applicable.
