I use the following code in my spider:
def parse_item(self, response):
    item = MyItem()
    item['price'] = [i for i in self.get_usd_price(response)]
    return item

def get_usd_price(self, response):
    yield FormRequest(
        'url',
        formdata={'key': 'value'},
        callback=self.get_currency
    )

def get_currency(self, response):
    self.log('lalalalala')
The problem is that I can't reach my get_currency callback. In my logger I can see that the price item takes the value [<POST url>]. What am I doing wrong? I tried adding dont_filter to the FormRequest and changing the FormRequest to a plain GET Request.
Update
I've also tried GHajba's suggestion (so far without success):
def parse_item(self, response):
    item = MyItem()
    self.get_usd_price(response, item)
    return item

def get_usd_price(self, response, item):
    request = FormRequest(
        'url',
        formdata={'key': 'value'},
        callback=self.get_currency
    )
    request.meta['item'] = item
    yield request

def get_currency(self, response):
    self.log('lalalalala')
    item = response.meta['item']
    item['price'] = 123
    return item
This is not how Scrapy works: you can only yield a request or an item from each callback method, and you can't pass the response around this way. If you want to update the item with the price information and then yield it, you should do something like:
def parse_item(self, response):
    item = MyItem()
    # populate the item with this response data
    yield FormRequest(
        'url',
        formdata={'key': 'value'},
        callback=self.get_currency,
        meta={'item': item}
    )

def get_currency(self, response):
    self.log('lalalalala')
    item = response.meta['item']
    item['price'] = 123  # get your price from the response body
    # keep populating the item with this response data
    yield item
So remember that for passing information between requests, you need to use the meta parameter.
Your problem is that you assign the values of the generator created by get_usd_price to your item. You can solve this by changing the method and how you call it.
You have to yield the FormRequest, but you mustn't use the return value of the call to have an effect in Scrapy. Just call the function get_usd_price without assigning its result to item['price']:
    self.get_usd_price(response, item)
You have to provide item to your function because Scrapy works asynchronously, so you cannot be sure when the FormRequest will be executed. Then you pass the item along as a meta parameter of the FormRequest, so you can access the item in the get_currency function and yield the item there.
You can read more about meta in the docs: http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta
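For reference, here is a minimal sketch of how that meta-passing pattern looks when wired up end to end (the placeholder 'url' and formdata come from the question). One caveat, hedged: Scrapy only schedules requests that a callback actually returns or yields, so the plain call self.get_usd_price(response, item) in the update above never sends the FormRequest; the sketch below forwards the generator's requests from parse_item instead.

# from scrapy.http import FormRequest

def parse_item(self, response):
    item = MyItem()
    # populate item from this response, then hand off to the price request;
    # yielding the requests produced by get_usd_price gets them scheduled
    for request in self.get_usd_price(response, item):
        yield request

def get_usd_price(self, response, item):
    yield FormRequest(
        'url',
        formdata={'key': 'value'},
        callback=self.get_currency,
        meta={'item': item}
    )

def get_currency(self, response):
    item = response.meta['item']
    item['price'] = 123  # parse the actual USD price out of this response
    yield item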
The original code:
class HomepageSpider(BaseSpider):
    name = 'homepage_spider'

    def start_requests(self):
        ...

    def parse(self, response):
        # harvest some data from response
        item = ...
        yield scrapy.Request(
            "https://detail-page",
            callback=self.parse_details,
            cb_kwargs={"item": item}
        )

    def parse_details(self, response, item):
        # harvest details
        ...
        yield item
This is the standard way to follow links on a page. However, it has a flaw: if there is an HTTP error (e.g. 503) or a connection error when following the 2nd URL, parse_details is never called and yield item is never executed, so all the data is lost.
Changed code:
class HomepageSpider(BaseSpider):
    name = 'homepage_spider'

    def start_requests(self):
        ...

    def parse(self, response):
        # harvest some data from response
        item = ...
        yield scrapy.Request(
            "https://detail-page",
            callback=self.parse_details,
            cb_kwargs={"item": item}
        )
        yield item

    def parse_details(self, response, item):
        # harvest details
        ...
The changed code does not work: it seems yield item is executed immediately, before parse_details is run (perhaps due to the Twisted framework; this behavior is different from what one would expect with the asyncio library), and so the item is always yielded with incomplete data.
How do I make sure that yield item is executed only after all links have been followed, regardless of success or failure? Is something like
res1 = scrapy.Request(...)
res2 = scrapy.Request(...)
yield scrapy.join([res1, res2]) # block until both urls are followed?
yield item
possible?
You can send the failed requests to an errback function (whenever an error happens) and yield the item from there.
from scrapy.spidermiddlewares.httperror import HttpError

class HomepageSpider(BaseSpider):
    name = 'homepage_spider'

    def start_requests(self):
        ...

    def parse(self, response):
        # harvest some data from response
        item = ...
        yield scrapy.Request(
            "https://detail-page",
            callback=self.parse_details,
            meta={"item": item},
            errback=self.my_handle_error
        )

    def parse_details(self, response):
        item = response.meta['item']
        # harvest details
        ...
        yield item
    def my_handle_error(self, failure, item):
        response = failure.value.response
        print(f"Error on {response.url}")
        # you can do much deeper error checking here to see what type of
        # failure it was: DNS lookup error, timeout error, HTTP error, ...
        yield item
Second edit, to yield the item:
yield scrapy.Request(
    "https://detail-page",
    callback=self.parse_details,
    cb_kwargs={"item": item},
    errback=lambda failure, item=item: self.my_handle_error(failure, item)
)

def my_handle_error(self, failure, item):
    yield item
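Putting the two edits together, a hedged sketch of the whole spider (the callback names and placeholder detail URL come from the question; scrapy.Spider is used instead of the deprecated BaseSpider): the item reaches the success path via cb_kwargs and is bound into the errback via the lambda, so it gets yielded exactly once either way.

import scrapy

class HomepageSpider(scrapy.Spider):
    name = 'homepage_spider'

    def parse(self, response):
        item = {}  # harvest some data from this response
        yield scrapy.Request(
            "https://detail-page",
            callback=self.parse_details,
            cb_kwargs={"item": item},
            errback=lambda failure, item=item: self.my_handle_error(failure, item)
        )

    def parse_details(self, response, item):
        # harvest the details, then emit the completed item
        yield item

    def my_handle_error(self, failure, item):
        # the detail request failed (DNS lookup, timeout, non-2xx response, ...);
        # emit the item with only the homepage data so nothing is lost
        yield item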
My spider looks like this:
class ScrapeMovies(scrapy.Spider):

    start_urls = [
        'https://www.trekearth.com/members/page1.htm?sort_by=md'
    ]

    def parse(self, response):
        for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
            item = loopitem()
            website = row.xpath('./td[2]//a/@href/text()').extract_first()
            item['name'] = row.xpath('./td[2]//a/text()').extract_first()
            yield item

        # This part is responsible for scraping all of the pages on a start url, commented out for convenience
        # next_page = response.xpath('//div[@class="page-nav-btm"]/ul/li[last()]/a/@href').extract_first()
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, callback=self.parse)
What it does as of now: it scrapes the table (see the starting url). I want it to then go to the link (the member's name column), extract some information from that page (a link is e.g. https://www.trekearth.com/members/monareng/), and then return this as an item.
How should I approach this?
If anything is unclear please do not hesitate to ask for clarification.
EDIT:
Now my code looks as follows (however, it still does not work):
class ScrapeMovies(scrapy.Spider):
    name = 'final'

    start_urls = [
        'https://www.trekearth.com/members/page1.htm?sort_by=md'
    ]

    def parse(self, response):
        for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
            item = FinalItem()
            website = row.xpath('./td[2]//a/@href/text()').extract_first()
            item['name'] = row.xpath('./td[2]//a/text()').extract_first()
            request = scrapy.Request(website,
                                     callback=self.parse_page2)
            request.meta['item'] = item
            return request

    def parse_page2(self, response):
        item = response.meta['item']
        item['other_url'] = response.url
        item['groups'] = response.xpath('//div[@class="groups-btm"]/ul/li/text()').extract_first()
        return item
Use the meta field to pass the item forward to the next callback:
def parse_page1(self, response):
    item = MyItem(main_url=response.url)
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
UPD: to process all rows, use yield in your loop:
for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
    item = FinalItem()
    website = row.xpath('./td[2]//a/@href/text()').extract_first()
    item['name'] = row.xpath('./td[2]//a/text()').extract_first()
    request = scrapy.Request(website,
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request
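For completeness, a hedged sketch of a full parse() with that change applied. Two assumptions beyond the answer above: the link is taken from the @href attribute itself ('@href/text()' selects the text of an attribute node, which is always empty), and the href may be relative, hence response.urljoin().

def parse(self, response):
    for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
        item = FinalItem()
        item['name'] = row.xpath('./td[2]//a/text()').extract_first()
        website = row.xpath('./td[2]//a/@href').extract_first()
        if website:
            # follow the member page and finish the item in parse_page2
            request = scrapy.Request(response.urljoin(website),
                                     callback=self.parse_page2)
            request.meta['item'] = item
            yield request
        else:
            yield item  # no member link in this row; emit what we have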
I am new to scrapy and I've come across a complicated case.
I have to make 3 get requests in order to make Product items.
product_url
category_url
stock_url
First, I need a request to product_url and a request to category_url to fill out the fields of Product items. I then need to refer to stock_url's response to determine whether to save or discard the created items.
Here's what I'm doing right now:
In my spider,
def start_requests(self):
    product_url = 'https://www.someurl.com/product?'
    item = ProductItem()
    yield scrapy.Request(product_url, self.parse_products, meta={'item': item})

def parse_products(self, response):
    # fill out 1/2 of the fields of ProductItem
    item = response.meta['item']
    item['name'] = response.xpath(...)
    item['id'] = response.xpath(...)

    category_url = 'https://www.someurl.com/category?'
    yield scrapy.Request(category_url, self.parse_products2, meta={'item': item})

def parse_products2(self, response):
    # fill out the rest of the fields of ProductItem
    item = response.meta['item']
    item['img_url'] = response.xpath(...)
    item['link_url'] = response.xpath(...)

    stock_url = 'https://www.someurl.com/stock?'
    yield scrapy.Request(stock_url, self.parse_final, meta={'item': item})

def parse_final(self, response):
    item = response.meta['item']
    for prod in response:  # pseudocode: iterate over the products in the stock response
        if prod.id == item['id'] and not prod.in_stock:
            pass  # drop item
Question: I was told before to handle the item-dropping logic in the pipeline, but whether I drop an item or not depends on making another GET request. Should I still move this logic to the pipelines? Is this possible without inheriting from scrapy.Spider?
Moving the item dropping logic to the pipeline is probably the best design.
You can use the (undocumented) Scrapy engine API to download requests from inside a pipeline. Example, assuming the stock info for all items can be accessed from a single URL:
import scrapy
from scrapy.exceptions import DropItem
from twisted.internet.defer import inlineCallbacks

class StockPipeline(object):

    @inlineCallbacks
    def open_spider(self, spider):
        req = scrapy.Request(stock_url)
        response = yield spider.crawler.engine.download(req, spider)

        # extract the stock info from the response
        self.stock_info = response.text

    def process_item(self, item, spider):
        # check if the item should be dropped
        if item['id'] not in self.stock_info:
            raise DropItem
        return item
If there is a separate, per-item url for stock info, you'd simply do the downloading in process_item() instead.
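A hedged sketch of that per-item variant, reusing the same undocumented engine.download() call as above; the exact stock URL pattern and the in-stock check are assumptions for illustration only.

import scrapy
from scrapy.exceptions import DropItem
from twisted.internet.defer import inlineCallbacks, returnValue

class StockPipeline(object):

    @inlineCallbacks
    def process_item(self, item, spider):
        # hypothetical per-item stock URL built from the item's id
        req = scrapy.Request('https://www.someurl.com/stock?id=%s' % item['id'])
        response = yield spider.crawler.engine.download(req, spider)

        # decide from the stock response whether to keep the item
        if item['id'] not in response.text:
            raise DropItem('product %s is out of stock' % item['id'])
        returnValue(item)

process_item() is allowed to return a Deferred, which is exactly what the inlineCallbacks decorator produces, so Scrapy waits for the download to finish before passing the item on.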
I'm trying to use parse as a central hub to gather dicts from the other parse functions, since I need to use multiple URLs and domains, but I can't seem to figure out how to get all the data back into one dictionary. Here is an example of what I'm trying to do:
def parse(self, response):
    request_1 = scrapy.Request(domain_1_url, callback=self.parse_2)
    request_2 = scrapy.Request(domain_2_url, callback=self.parse_3)

    # unpack the two dicts from the 2 requests and return them as 1 to the pipeline
    yield {**request_1, **request_2}  # what I want to do; doesn't work because they are requests, not dicts

def parse_2(self, response):
    yield {'someKey': 'some value'}

def parse_3(self, response):
    yield {'someOtherKey': 'some more values'}
Is there some way I can achieve this? Or is there a better way to do it, such as passing everything to the pipeline and having it handle combining the data instead?
EDIT:
I thought about passing an Item object to the other requests, thinking that modifying it in the other requests would also modify the one in the main parse because of Python's "call-by-object" semantics:
def parse(self, response):
    item = CustomItem()
    request_1 = scrapy.Request(domain_1_url, callback=self.parse_2)
    request_1.meta['item'] = item
    request_2 = scrapy.Request(domain_2_url, callback=self.parse_3)
    request_2.meta['item'] = item
    print(item)  # still empty
    yield item

def parse_2(self, response):
    item = response.meta['item']
    item['someKey'] = 'some value'
    yield item

def parse_3(self, response):
    item = response.meta['item']
    item['someOtherKey'] = 'some more values'
    yield item
Not quite sure why this doesn't work, but the item object in parse() is still empty after creating the two requests.
Credits to @TomášLinhart for the solution in the comments above.
What I ended up doing is chaining the 3 parses instead of using one as a central hub. I really dislike this solution, since parse_2 and parse_3 have no relationship with one another and in my situation they each crawl a different domain, so if anyone has a more elegant solution please feel free to answer.
def parse(self, response):
    item = CustomItem()
    request_1 = scrapy.Request(domain_1_url, callback=self.parse_2)
    request_1.meta['item'] = item
    yield request_1

def parse_2(self, response):
    item = response.meta['item']
    item['someKey'] = 'some value'

    request_2 = scrapy.Request(domain_2_url, callback=self.parse_3)
    request_2.meta['item'] = item
    yield request_2

def parse_3(self, response):
    item = response.meta['item']
    item['someOtherKey'] = 'some more values'
    yield item
I am building a simple(ish) parser in Scrapy and I am blissfully ignorant when it comes to Scrapy and Python :-) In the file item.py I have a definition of thisItem(), which I assign to item in the code below. All worked rather swimmingly: parse uses a callback to get to parse_dir_contents... But then I realized I needed to scrape an extra bit of data and created another function, parse_other_content. How do I get what is already in item into parse_other_content?
import scrapy
from this-site.items import *
import re
import json

class DmozSpider(scrapy.Spider):
    name = "ABB"
    allowed_domains = ["this-site.com.au"]
    start_urls = [
        "https://www.this-site.com.au?page=1",
        "https://www.this-site.com.au?page=2",
    ]

    def parse(self, response):
        for href in response.xpath('//h3/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//h1[@itemprop="name"]'):
            item = thisItem()
            item['title'] = sel.xpath('text()').extract()
            item['rate'] = response.xpath('//div[@class="rate"]/div/span/text()').extract()
            so = re.search(r'\d+', response.url)
            propID = so.group()
            item['propid'] = propID
            item['link'] = response.url
            yield scrapy.Request("https://www.this-site.com.au/something?listing_id=" + propID, callback=self.parse_other_content)
            #yield item

    def parse_other_content(self, response):
        sel = json.loads(response.body)
        item['rate_detail'] = sel["this"][0]["that"]
        yield item
I know I am missing something simple here, but I can't seem to figure it out.
Per the scrapy documentation (http://doc.scrapy.org/en/1.0/topics/request-response.html#topics-request-response-ref-request-callback-arguments):
In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.
In your case I would do something like this:
def parse_dir_contents(self, response):
    for sel in response.xpath('//h1[@itemprop="name"]'):
        item = thisItem()
        ...
        request = scrapy.Request("https://www.this-site.com.au/something?listing_id=" + propID, callback=self.parse_other_content)
        request.meta['item'] = item
        yield request

def parse_other_content(self, response):
    item = response.meta['item']
    # do something with the item
    return item
According to Steve (see the comments), you can also pass a dictionary of meta data as a keyword argument to the Request constructor, like so:
def parse_dir_contents(self, response):
    for sel in response.xpath('//h1[@itemprop="name"]'):
        item = thisItem()
        ...
        request = scrapy.Request("https://www.this-site.com.au/something?listing_id=" + propID, callback=self.parse_other_content, meta={'item': item})
        yield request
You can either allow item to be visible to parse_other_content() by changing it to self.item, or sending it as a parameter to the function. (The first one might be easier.)
For the first solution just add self. to any reference to the item variable. This makes it visible to the entire class.
def parse_dir_contents(self, response):
    for sel in response.xpath('//h1[@itemprop="name"]'):
        self.item = thisItem()
        self.item['title'] = sel.xpath('text()').extract()
        self.item['rate'] = response.xpath('//div[@class="rate"]/div/span/text()').extract()
        so = re.search(r'\d+', response.url)
        propID = so.group()
        self.item['propid'] = propID
        self.item['link'] = response.url
        yield scrapy.Request("https://www.this-site.com.au/something?listing_id=" + propID, callback=self.parse_other_content)
        #yield item

def parse_other_content(self, response):
    sel = json.loads(response.body)
    self.item['rate_detail'] = sel["this"][0]["that"]
    yield self.item