I don't understand how requests work. I don't know why scrapy.Request(next_page, callback=self.parse2) doesn't behave as I expected. pitem is supposed to come from the response of this request, but I get KeyError: 'pitem'. I don't understand why the first request works but request2 does not.
MY CODE:
spider.py
...
def parse(self, response):
...
request = scrapy.Request(link, callback=self.parse2)
request.meta['item'] = item
yield request
...
def parse2(self, response):
item = response.meta['item']
pitem = response.meta['pitem']
...
pitem['field'].append(self.whatever)
if next_page is not None:
request2 = scrapy.Request(next_page, callback=self.parse2)
request2.meta['pitem'] = item
yield request2
else:
yield pitem
self.whatever = []
request = scrapy.Request(link, callback=self.parse2)
request.meta['item'] = item
yield request
This defines a meta key named item, but never one named pitem. So when parse2 runs and reads pitem = response.meta['pitem'], it cannot find pitem in the request's meta data.
One possible workaround is pitem = response.meta.get('pitem'), which returns None instead of raising when the key is missing, but whether that helps depends very much on your use case.
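The more direct fix is simply to use the same meta key on every request in the chain. A minimal sketch of what that could look like, reusing the placeholder names from the question (link, next_page, 'field'; the item is a plain dict stand-in here):
def parse(self, response):
    item = {'field': []}                  # stand-in for your Item class
    request = scrapy.Request(link, callback=self.parse2)
    request.meta['item'] = item           # always store it under the same key...
    yield request

def parse2(self, response):
    item = response.meta['item']          # ...and read it back under that key, so no KeyError
    item['field'].append('value')
    # next_page extracted from the response as in your code
    if next_page is not None:
        request2 = scrapy.Request(next_page, callback=self.parse2)
        request2.meta['item'] = item      # keep the key consistent when paginating
        yield request2
    else:
        yield item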
Related
The original code:
class HomepageSpider(BaseSpider):
name = 'homepage_spider'
def start_requests(self):
...
def parse(self, response):
# harvest some data from response
item = ...
yield scrapy.Request(
"https://detail-page",
callback=self.parse_details,
cb_kwargs={"item": item}
)
def parse_details(self, response, item):
# harvest details
...
yield item
This is the standard way to follow links on a page. However, it has a flaw: if there is an HTTP error (e.g. 503) or a connection error while following the second URL, parse_details is never called, yield item is never executed, and so all the data is lost.
Changed code:
class HomepageSpider(BaseSpider):
name = 'homepage_spider'
def start_requests(self):
...
def parse(self, response):
# harvest some data from response
item = ...
yield scrapy.Request(
"https://detail-page",
callback=self.parse_details,
cb_kwargs={"item": item}
)
yield item
def parse_details(self, response, item):
# harvest details
...
The changed code does not work: it seems yield item is executed immediately, before parse_details has run (perhaps due to the Twisted framework; this behaviour differs from what you would expect with the asyncio library), so the item is always yielded with incomplete data.
How can I make sure yield item runs only after all links have been followed, regardless of success or failure? Is something like
res1 = scrapy.Request(...)
res2 = scrapy.Request(...)
yield scrapy.join([res1, res2]) # block until both urls are followed?
yield item
possible?
You can route failed requests to an errback function (whenever an error happens) and yield the item from there.
from scrapy.spidermiddlewares.httperror import HttpError
class HomepageSpider(BaseSpider):
name = 'homepage_spider'
def start_requests(self):
...
def parse(self, response):
# harvest some data from response
item = ...
yield scrapy.Request(
"https://detail-page",
callback=self.parse_details,
meta={"item": item},
errback=self.my_handle_error
)
def parse_details(self, response):
item = response.meta['item']
# harvest details
...
yield item
def my_handle_error(self, failure):
    # the errback only receives the failure; recover the item from the request that failed
    item = failure.request.meta['item']
    self.logger.error(f"Error on {failure.request.url}")
    # you can do much deeper error checking here to see what type of failure it was:
    # DNSLookupError, TimeoutError, HttpError, ...
    yield item
Second edit, passing the item with cb_kwargs and binding it into the errback with a lambda:
yield scrapy.Request(
"https://detail-page",
callback=self.parse_details,
cb_kwargs={"item": item},
errback=lambda failure, item=item: self.my_handle_error(failure, item)
)
def my_handle_error(self, failure, item):
yield item
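Alternatively, instead of the lambda, the errback can recover the item from the request that failed, since Scrapy attaches that request to the failure. A sketch under that assumption, using the HttpError import shown earlier (works with the cb_kwargs version above; with meta, read failure.request.meta instead):
def my_handle_error(self, failure):
    request = failure.request                    # the Request that triggered the errback
    item = request.cb_kwargs.get('item')         # or request.meta.get('item') if you used meta
    if failure.check(HttpError):
        self.logger.error("HTTP error %s on %s",
                          failure.value.response.status, failure.value.response.url)
    else:
        self.logger.error("Request failed: %s", request.url)
    if item is not None:
        yield item                               # still export whatever was harvested so far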
CODE
spider.py
...
def parse(self, response):
for one_item in response.xpath('path1'):
item = ProjectItem()
request = scrapy.Request(one_item.xpath('path2'), callback=self.parse2)
request.meta['item'] = item
yield request
property = []
def parse2(self, response):
    item = response.meta['item']
    for x in response.xpath('path3'):
self.property.append('path4')
next_page = response.xpath('path5')
if next_page is not None:
request2 = scrapy.Request(next_page, callback=self.parse2)
request2.meta['item'] = item
yield request2
else:
item['field'] = self.property
self.property = []
yield item
The problem is that when the spider crawls to next_page, some self.property lists get assigned to the wrong items. I don't know how to fix it.
self.property is a class attribute that is shared among all calls to parse2, and you cannot control the order in which parse2 is called.
To solve that you need to pass the list along inside the request's meta or as an item attribute (both variants are shown below):
def parse(self, response):
for one_item in response.xpath('path1'):
item = ProjectItem()
item['field'] = []
request = scrapy.Request(one_item.xpath('path2').get(), callback=self.parse2)
request.meta['item'] = item
yield request
def parse2(self, response):
    item = response.meta['item']
    for x in response.xpath('path3'):
        item['field'].append('path4')
    next_page = response.xpath('path5').get()  # take a plain string so the None check below works
if next_page is not None:
request2 = scrapy.Request(next_page, callback=self.parse2)
request2.meta['item'] = item
yield request2
else:
yield item
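The other option mentioned above, passing the list through meta instead of storing it on the item, looks roughly like this (the 'path*' selectors are placeholders from the question; .get() is used so plain strings are extracted):
def parse(self, response):
    for one_item in response.xpath('path1'):
        item = ProjectItem()
        request = scrapy.Request(one_item.xpath('path2').get(), callback=self.parse2)
        request.meta['item'] = item
        request.meta['collected'] = []          # a fresh accumulator per item, travelling with the request
        yield request

def parse2(self, response):
    item = response.meta['item']
    collected = response.meta['collected']
    for x in response.xpath('path3'):
        collected.append(x.get())
    next_page = response.xpath('path5').get()
    if next_page is not None:
        yield scrapy.Request(next_page, callback=self.parse2,
                             meta={'item': item, 'collected': collected})
    else:
        item['field'] = collected
        yield item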
My spider looks like this:
class ScrapeMovies(scrapy.Spider):
start_urls = [
'https://www.trekearth.com/members/page1.htm?sort_by=md'
]
def parse(self, response):
for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
item = loopitem()
website = row.xpath('./td[2]//a/@href/text()').extract_first()
item['name'] = row.xpath('./td[2]//a/text()').extract_first()
yield item
# This part is responsible for scraping all of the pages on a start url, commented out for convenience
# next_page=response.xpath('//div[@class="page-nav-btm"]/ul/li[last()]/a/@href').extract_first()
# if next_page is not None:
# next_page=response.urljoin(next_page)
# yield scrapy.Request(next_page, callback=self.parse)
What it does as of now: it scrapes the table (see the starting URL). I want it to then follow the link in the member's name column, extract some information from that page (e.g. https://www.trekearth.com/members/monareng/), and return that as an item.
How should I approach this?
If anything is unclear please do not hesitate to ask for clarification.
EDIT:
now my code looks as follows (however it still does not work):
class ScrapeMovies(scrapy.Spider):
name='final'
start_urls = [
'https://www.trekearth.com/members/page1.htm?sort_by=md'
]
def parse(self, response):
for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
item = FinalItem()
website = row.xpath('./td[2]//a/@href/text()').extract_first()
item['name'] = row.xpath('./td[2]//a/text()').extract_first()
request = scrapy.Request(website,
callback=self.parse_page2)
request.meta['item'] = item
return request
def parse_page2(self, response):
item = response.meta['item']
item['other_url'] = response.url
item['groups'] = response.xpath('//div[@class="groups-btm"]/ul/li/text()').extract_first()
return item
Use the meta field to pass the item forward to the next callback:
def parse_page1(self, response):
item = MyItem(main_url=response.url)
request = scrapy.Request("http://www.example.com/some_page.html",
callback=self.parse_page2)
request.meta['item'] = item
return request
def parse_page2(self, response):
item = response.meta['item']
item['other_url'] = response.url
return item
UPD: to process all rows, use yield in your loop:
for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
item = FinalItem()
website = row.xpath('./td[2]//a/@href/text()').extract_first()
item['name'] = row.xpath('./td[2]//a/text()').extract_first()
request = scrapy.Request(website,
callback=self.parse_page2)
request.meta['item'] = item
yield request
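On Scrapy 1.7 and newer, cb_kwargs is a slightly cleaner alternative to meta for this. A sketch of the same loop using it (the href expression is simplified to return the attribute value, and urljoin is added to handle relative links, both assumptions on my part):
def parse(self, response):
    for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
        item = FinalItem()
        item['name'] = row.xpath('./td[2]//a/text()').extract_first()
        website = response.urljoin(row.xpath('./td[2]//a/@href').extract_first())
        yield scrapy.Request(website, callback=self.parse_page2,
                             cb_kwargs={'item': item})

def parse_page2(self, response, item):
    item['other_url'] = response.url
    yield item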
I need to make two requests to different URLs and put that information into the same item. I have tried this method, but the results are written in different rows. The callbacks return items. I have tried many methods but none seems to work.
def parse_companies(self, response):
data = json.loads(response.body)
if data:
item = ThalamusItem()
for company in data:
comp_id = company["id"]
url = self.request_details_URL + str(comp_id) + ".json"
request = Request(url, callback=self.parse_company_details)
request.meta['item'] = item
yield request
url2 = self.request_contacts + str(comp_id)
yield Request(url2, callback=self.parse_company_contacts, meta={'item': item})
Since Scrapy is asynchronous you need to chain your requests manually. For transferring data between requests you can use Request's meta attribute:
def parse(self, response):
item = dict()
item['name'] = 'foobar'
yield scrapy.Request('http://someurl.com', self.parse2,
                     meta={'item': item})
def parse2(self, response):
print(response.meta['item'])
# {'name': 'foobar'}
In your case you end up with a split chain when you should have one continuous chain.
Your code should look something like this:
def parse_companies(self, response):
data = json.loads(response.body)
if not data:
return
for company in data:
item = ThalamusItem()
comp_id = company["id"]
url = self.request_details_URL + str(comp_id) + ".json"
url2 = self.request_contacts + str(comp_id)
request = Request(url, callback=self.parse_details,
meta={'url2': url2, 'item': item})
yield request
def parse_details(self, response):
item = response.meta['item']
url2 = response.meta['url2']
item['details'] = '' # add details
yield Request(url2, callback=self.parse_contacts, meta={'item': item})
def parse_contacts(self, response):
item = response.meta['item']
item['contacts'] = '' # add details
yield item
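If an item ever needs data from more than two URLs, the same chaining idea generalizes: keep a list of the URLs still to visit in meta and pop one per callback. A sketch only, reusing the attributes and classes from the question and a hypothetical parse_next callback:
def parse_companies(self, response):
    data = json.loads(response.body)
    for company in data:
        item = ThalamusItem()
        comp_id = company["id"]
        pending = [self.request_details_URL + str(comp_id) + ".json",
                   self.request_contacts + str(comp_id)]
        yield Request(pending[0], callback=self.parse_next,
                      meta={'item': item, 'pending': pending[1:]})

def parse_next(self, response):
    item = response.meta['item']
    pending = response.meta['pending']
    # harvest this response into the item here, then either continue the chain or finish
    if pending:
        yield Request(pending[0], callback=self.parse_next,
                      meta={'item': item, 'pending': pending[1:]})
    else:
        yield item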
I use the following code in my spider:
def parse_item(self, response):
item = MyItem()
item['price'] = [i for i in self.get_usd_price(response)]
return item
def get_usd_price(self, response):
yield FormRequest(
'url',
formdata={'key': 'value'},
callback=self.get_currency
)
def get_currency(self, response):
self.log('lalalalala')
The problem is that I can't reach my get_currency callback. In my logger I see that the price field takes the value [<POST url>]. What am I doing wrong? I tried adding dont_filter to the FormRequest and changing the FormRequest to a simple GET Request.
Update
I've also tried GHajba's suggestion (so far without success):
def parse_item(self, response):
item = MyItem()
self.get_usd_price(response, item)
return item
def get_usd_price(self, response, item):
request = FormRequest(
'url',
formdata={'key': 'value'},
callback=self.get_currency
)
request.meta['item'] = item
yield request
def get_currency(self, response):
self.log('lalalalala')
item = response.meta['item']
item['price'] = 123
return item
This is not how Scrapy works. In any callback you can only yield requests or items; you can't get at another request's response this way. If you want to update the item with the price information and then yield it, you should do something like:
def parse_item(self, response):
item = MyItem()
# populate the item with this response data
yield FormRequest(
'url',
formdata={'key': 'value'},
callback=self.get_currency, meta={'item':item}
)
def get_currency(self, response):
self.log('lalalalala')
item = response.meta['item']
item['price'] = 123 # get your price from the response body.
# keep populating the item with this response data
yield item
So remember that for passing information between requests, you need to use the meta parameter.
Your problem is that you assign the values of the generator created by get_usd_price to your item. You can solve this by changing the method and how you call it.
You have to yield the FormRequest, but you must not use its return value to have an effect in Scrapy. Just call the function get_usd_price without assigning its result to item['price']:
self.get_usd_price(response, item)
You have to pass item to the function because Scrapy works asynchronously, so you cannot be sure when the FormRequest will be executed. You then pass the item along as a meta parameter of the FormRequest, so that you can access it in the get_currency function and yield the item there.
You can read more about meta in the docs: http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta
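For completeness, a sketch of how the helper-generator variant can be wired up so that the FormRequest actually reaches the engine: the request it yields must itself be yielded from parse_item (for example with yield from), and the item is yielded only in get_currency. 'url' and the form data are placeholders from the question:
def parse_item(self, response):
    item = MyItem()
    # populate item from this response, then delegate to the helper generator;
    # yield from forwards the FormRequest it produces to the engine
    yield from self.get_usd_price(response, item)

def get_usd_price(self, response, item):
    yield FormRequest(
        'url',
        formdata={'key': 'value'},
        callback=self.get_currency,
        meta={'item': item},
    )

def get_currency(self, response):
    self.log('lalalalala')
    item = response.meta['item']
    item['price'] = 123  # parse the real price from the response body here
    yield item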