Access response from spider in items pipeline in scrapy - python

I have a spider like this:
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://example.com/url'
    ]

    def parse(self, response):
        ...
And I have a pipeline class like this
class ProductsDataPipeline(object):
    """ Item pipeline for products data crawler """

    def process_item(self, item, spider):
        return item
But I want to get the response argument of the parse function inside the process_item function, without setting it as an attribute on the item object. Is that possible?

No, it's not possible.
Responses are not forwarded to pipelines. You either have to store the response in the item or use some external storage to hold the response and fetch it in the pipeline. The second option is much better and avoids many of the problems that can result from storing the response in the item (e.g. memory problems): you save the response to some form of storage in the parse callback, put a reference to that storage in an item field, and fetch the response from storage in the pipeline.
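For example, here is a minimal sketch of that second option, assuming the raw body is written to a temporary file and that a ProductItem with a response_path field exists (both are illustrative assumptions, not part of the original code):

import os
import tempfile

import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ['http://example.com/url']

    def parse(self, response):
        # write the raw response body to a temporary file (hypothetical storage choice)
        fd, path = tempfile.mkstemp(suffix='.html')
        with os.fdopen(fd, 'wb') as f:
            f.write(response.body)
        item = ProductItem()          # hypothetical item with a 'response_path' field
        item['response_path'] = path  # store only a reference, not the response itself
        yield item

class ProductsDataPipeline(object):
    def process_item(self, item, spider):
        # fetch the stored body back once the item reaches the pipeline
        with open(item['response_path'], 'rb') as f:
            body = f.read()
        # ... work with body here, then clean up the temporary file
        os.remove(item['response_path'])
        return item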
But it really depends on what you are trying to do: the response is available in the spider middleware hook process_spider_output, so perhaps you can use that instead of processing the item in a pipeline.
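For example, a rough sketch of such a spider middleware (the class name is an assumption; you would enable it through the SPIDER_MIDDLEWARES setting):

import scrapy

class ResponseAwareMiddleware(object):

    def process_spider_output(self, response, result, spider):
        # 'result' is everything the spider callback yielded for this response,
        # so items and the response that produced them are available together here
        for element in result:
            if isinstance(element, (dict, scrapy.Item)):
                spider.logger.debug("got item from %s", response.url)
                # ... process the item together with 'response' here
            yield element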

Related

scrapy.Request doesn't enter the download middleware, it returns Request instead of response

I'm using scrapy.Spider to scrape, and I want to issue another request inside a callback of the requests yielded from start_requests, but that inner request doesn't work: it should return a response, yet it only returns a Request.
Stepping through with a debugger, I found that in class Request(object_ref) the request only finishes its initialization; it never reaches request = next(slot.start_requests) as expected to actually start requesting, so it only ever returns a Request.
Here is my code in brief:
class ProjSpider(scrapy.Spider):
    name = 'Proj'
    allowed_domains = ['mashable.com']

    def start_requests(self):
        # pages
        pages = 10
        for i in range(1, pages):
            url = "https://mashable.com/channeldatafeed/Tech/new/page/" + str(i)
            yield scrapy.Request(url, callback=self.parse_mashable)
The requests above work fine. And the following is:
def parse_mashable(self, response):
    item = Item()
    json2parse = response.text
    json_response = json.loads(json2parse)
    d = json_response['dataFeed']  # a list containing dicts, in which there is a url for the detailed article
    for data in d:
        item_url = data['url']  # the url for the detailed article
        item_response = self.get_response_mashable(item_url)
        # here I want to parse the item_response to get the detail
        item['content'] = item_response.xpath("//body").get
        yield item

def get_response_mashable(self, url):
    response = scrapy.Request(url)
    # using self.parser. I've also defined my own parser and yield an item
    # but the problem is it never got to the callback
    return response  # tried yield also but failed
This is where the Request doesn't work. The url is in allowed_domains, and it's not a duplicate url. I'm guessing it's because of Scrapy's asynchronous handling of Request, but how could that affect the request in self.parse_mashable? By then the Request from start_requests has already finished.
I managed to do the second request with the requests-html library, but I still couldn't figure out why the Scrapy one fails.
So could anyone help point out where I'm going wrong? Thanks in advance!
Scrapy doesn't really expect you to do this the way you're trying to, so it doesn't have a simple way to do it.
What you should be doing instead is passing the data you've scraped from the original page to the new callback using the request's meta dict.
For details, check Passing additional data to callback functions.
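For example, a minimal sketch of that approach for this spider, assuming the item has 'url' and 'content' fields and that the article body is really what you want from //body (both are assumptions):

def parse_mashable(self, response):
    json_response = json.loads(response.text)
    for data in json_response['dataFeed']:
        item = Item()
        item['url'] = data['url']
        # yield a new Request and let Scrapy schedule it asynchronously,
        # carrying the partly filled item along in meta
        yield scrapy.Request(data['url'],
                             callback=self.parse_article,
                             meta={'item': item})

def parse_article(self, response):
    item = response.meta['item']
    item['content'] = response.xpath("//body").get()
    yield item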

Load Item fields with ItemLoader across multiple responses

This is a follow-up question to the accepted answer to Scrapy: populate items with item loaders over multiple pages. I want to use ItemLoader to collect values from multiple requests into a single Item. The accepted answer suggests that the loaded Item.load_item() should be passed to the next request via the request's meta field.
However, I would like to apply output_processors to all collected values of a single field when returning the loaded object at the end of the crawl.
Questions
What would be the best way to achieve this?
Can I pass the ItemLoader instance over meta to the next request without loading it, and then just replace the selector or response elements in the ItemLoader when adding the values or xpaths from the next response?
Example:
def parse(self, response):
    loader = TheLoader(item=TestItems(), response=response)
    loader.add_xpath('title1', '//*[@id="firstHeading"]/text()')
    request = Request(
        "https://en.wikipedia.org/wiki/2016_Rugby_Championship",
        callback=self.parsePage1,
        meta={'loader': loader},
        dont_filter=True
    )
    yield request

def parsePage1(self, response):
    loader = response.meta['loader']
    loader.response = response
    loader.add_xpath('title1', '//*[@id="firstHeading"]/text()')
    return loader.load_item()
Ignore the context of the actual websites.
Yes, you can just pass the ItemLoader instance.
If I recall correctly from an IRC or GitHub chat long ago, there may be some potential issues with this approach, such as increased memory usage or leaks from reference handling, because by binding ItemLoader instances to requests you carry object references to those instances (and processors?) around, potentially for a long time depending on the order of your download queues.
So keep that in mind and perhaps beware of using this style on large crawls, or do some memory debugging to be certain.
However, I have used this method extensively myself in the past (and would still do so when using ItemLoaders), and I haven't seen any problems with it.
Here is how I do that:
import scrapy
from myproject.loader import ItemLoader

class TheLoader(ItemLoader):
    pass

class SomeSpider(scrapy.Spider):
    [...]

    def parse(self, response):
        loader = TheLoader(item=TestItems(), response=response)
        loader.add_xpath('title1', '//*[@id="firstHeading"]/text()')
        request = Request("https://en.wikipedia.org/wiki/2016_Rugby_Championship",
                          callback=self.parsePage1,
                          dont_filter=True)
        request.meta['loader'] = loader
        yield request

    def parsePage1(self, response):
        loader = response.meta['loader']
        # rebind the ItemLoader to a new Selector instance:
        # loader.reset(selector=response.selector, response=response)
        # skipping the selector will default to response.selector, like ItemLoader
        loader.reset(response=response)
        loader.add_xpath('title1', '//*[@id="firstHeading"]/text()')
        return loader.load_item()
This requires using a customized ItemLoader class, which can be found in my scrapy scrapyard,
but the relevant part of the class is here:
from scrapy.loader import ItemLoader as ScrapyItemLoader

class ItemLoader(ScrapyItemLoader):
    """ Extended Loader
    for Selector resetting.
    """

    def reset(self, selector=None, response=None):
        if response is not None:
            if selector is None:
                selector = self.default_selector_class(response)
            self.selector = selector
            self.context.update(selector=selector, response=response)
        elif selector is not None:
            self.selector = selector
            self.context.update(selector=selector)

Understanding callbacks in Scrapy

I am new to Python and Scrapy, and I have not used callback functions before; however, I am trying to now with the code below. The first request will be executed and its response will be sent to the callback function specified as the second argument:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
I am unable to understand the following things:
How is the item populated?
Does the request.meta line execute before the response.meta line in parse_page2?
Where is the returned item from parse_page2 going?
What is the need for the return request statement in parse_page1? I thought the extracted items needed to be returned from here.
Read the docs:
For spiders, the scraping cycle goes through something like this:
You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests. The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in start_urls, with the parse method as callback function for those Requests.
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and their responses handled by the specified callback.
In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.
Finally, the items returned from the spider will typically be persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
Answers:
How is the item populated? Does the request.meta line execute before the response.meta line in parse_page2?
Spiders are managed by the Scrapy engine. It first makes requests from the URLs specified in start_urls and passes them to a downloader. When the download finishes, the callback specified in the request is called. If the callback returns another request, the same thing is repeated. If the callback returns an Item, the item is passed to a pipeline to save the scraped data.
Where is the returned item from parse_page2 going?
What is the need for the return request statement in parse_page1? I thought the extracted items needed to be returned from here.
As stated in the docs, each callback (both parse_page1 and parse_page2) can return either a Request or an Item (or an iterable of them). parse_page1 returns a Request, not the Item, because additional info needs to be scraped from an additional URL. The second callback, parse_page2, returns an item, because all the info has been scraped and is ready to be passed to a pipeline.
Yes, Scrapy uses a Twisted reactor to call spider functions, so a single loop with a single thread ensures this ordering.
The caller of the spider function expects to get either item(s) or request(s) in return; requests are put in a queue for future processing and items are sent to the configured pipelines.
Saving an item (or any other data) in request meta makes sense only if it is needed for further processing upon getting a response; otherwise it is obviously better to simply return it from parse_page1 and avoid the extra HTTP request.
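For instance, a short sketch of that last point (the needs_more_data flag is purely illustrative, not part of the original code):

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    if not needs_more_data:                # hypothetical condition
        return item                        # nothing left to fetch: return the item directly
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item            # more data needed: carry the item to the next callback
    return request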
Regarding the question "in scrapy: understanding how do items and requests work between callbacks": eLRuLL's answer is wonderful.
I want to add the part about how the item travels. First, we should be clear that a callback function is only called once the response to its request has been downloaded.
In the code the Scrapy docs give, the url and request of page1 are not declared. Let's set the url of page1 to "http://www.example.com.html".
parse_page1 is the callback of
scrapy.Request("http://www.example.com.html", callback=parse_page1)
parse_page2 is the callback of
scrapy.Request("http://www.example.com/some_page.html", callback=parse_page2)
When the response of page1 is downloaded, parse_page1 is called to generate the request for page2:
item['main_url'] = response.url  # send "http://www.example.com.html" to the item
request = scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2)
request.meta['item'] = item  # store the item in request.meta
After the response of page2 is downloaded, parse_page2 is called to return an item:
item = response.meta['item']
# response.meta is the same dict as request.meta, so here
# item['main_url'] == "http://www.example.com.html"
item['other_url'] = response.url  # response.url == "http://www.example.com/some_page.html"
return item  # finally, we get an item recording the urls of page1 and page2

Scrapy: Item details on 3 different pages

I have to scrape something where part of the information is on one page, there is a link on that page leading to more information, and then there is another url where the third piece of information is available.
How do I go about setting up my callbacks in order to have all this information together? Will I have to use a database in this case or can it still be exported to CSV?
The first thing to say is that you have the right idea - callbacks are the solution. I have seen some use of urllib or similar to fetch dependent pages, but it's far preferable to fully leverage the Scrapy download mechanism than employ some synchronous call from another library.
See this example from the Scrapy docs on the issue:
http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # parse response and populate item as required
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    # parse response and populate item as required
    item['other_url'] = response.url
    return item
Is your third piece of data on a page linked from the first page or the second page?
If from the second page, you can just extend the mechanism above and have parse_page2 return a request with a callback to a new parse_page3.
If from the first page, you could have parse_page1 populate a request.meta['link3_url'] property from which parse_page2 can construct the subsequent request url.
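A rough sketch of the second-page case, extending the example above (the xpath for the third link and the extra item fields are assumptions):

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    # extract the third page's url from this response (xpath is illustrative)
    link3_url = response.xpath('//a[@class="details"]/@href').get()
    request = Request(response.urljoin(link3_url),
                      callback=self.parse_page3)
    request.meta['item'] = item
    return request

def parse_page3(self, response):
    item = response.meta['item']
    item['third_url'] = response.url
    # populate the remaining fields, then return the finished item
    return item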
NB - these 'secondary' and 'tertiary' urls should not be discoverable from the normal crawling process (start_urls and rules), but should be constructed from the response (using XPath etc) in parse_page1/parse_page2.
The crawling, callback structures, pipelines and item construction are all independent of the export of data, so CSV will be applicable.

handle redirected response with proper parser

I am crawling a site with scrapy. The parse method first extracts all the category links and then dispatches a request for each with a callback to parse_category.
The problem is that if a category has only one product, the site redirects to that product's page, and my parse_category fails to recognize this page.
Now how do I parse that redirected category page with the product page parser?
Here is an example.
parse finds 3 category pages.
http://example.com/products/samsung
http://example.com/products/dell
http://example.com/products/apple
parse_category requests all those pages. Each returns an HTML page with a list of products. But apple has one single product, the iMac 27", so it redirects to http://example.com/products/apple/imac_27. This is a product page, and the category parser fails to parse it.
I need the product parsing method parse_product to be called in this scenario. How do I do that?
I could add some logic to my parse_category method and call parse_product from there, but I don't want that; I want Scrapy to do it. But yes, I'll supply url patterns or any other necessary info.
Here is the code.
class ExampleSpider(BaseSpider):
    name = u'example.com'
    allowed_domains = [u'www.example.com']
    start_urls = [u'http://www.example.com/category.aspx']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        anchors = hxs.select('/xpath')
        for anchor in anchors:
            yield Request(urljoin(get_base_url(response), anchor),
                          callback=self.parse_category)

    def parse_category(self, response):
        hxs = HtmlXPathSelector(response)
        products = hxs.select(products_xpath).extract()
        for url in products:
            yield Request(url, callback=self.parse_product)

    def parse_product(self, response):
        # product parsing ...
        pass
You can opt to write a downloader middleware which implements the process_response method. Whenever the response is for a product URL instead of a category, create a copy of the Request object and change its callback to your product parser.
In the end, return the new Request object from the middleware. Note: You might need to set dont_filter to True for the new Request to ensure the DupeFilter doesn't filter the Request.
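A minimal sketch of such a downloader middleware, assuming product pages can be recognised by a URL pattern like /products/<brand>/<product> (the regex, the class name, and the DOWNLOADER_MIDDLEWARES entry you would add are all assumptions):

import re

class RedirectedProductMiddleware(object):
    """Reroute category requests that were redirected to a product page."""

    product_url_re = re.compile(r'/products/[^/]+/[^/]+$')  # hypothetical pattern

    def process_response(self, request, response, spider):
        if (request.callback == getattr(spider, 'parse_category', None)
                and self.product_url_re.search(response.url)):
            # the category request was redirected to a product page:
            # re-issue it with the product callback; dont_filter keeps
            # the DupeFilter from dropping the already-seen url
            return request.replace(url=response.url,
                                   callback=spider.parse_product,
                                   dont_filter=True)
        return response

Returning a Request from process_response makes Scrapy reschedule it, so the redirected page is eventually handled by parse_product.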
