Scrapy multiple requests and fill single item - python

I need to make two requests to different URLs and put the information from both into the same item. I have tried this method, but the results are written to different rows. Both callbacks return the item. I have tried many approaches, but none seems to work.
def parse_companies(self, response):
    data = json.loads(response.body)
    if data:
        item = ThalamusItem()
        for company in data:
            comp_id = company["id"]
            url = self.request_details_URL + str(comp_id) + ".json"
            request = Request(url, callback=self.parse_company_details)
            request.meta['item'] = item
            yield request
            url2 = self.request_contacts + str(comp_id)
            yield Request(url2, callback=self.parse_company_contacts, meta={'item': item})

Since Scrapy is asynchronous, you need to chain your requests manually. To transfer data between requests, you can use the Request's meta attribute:
def parse(self, response):
    item = dict()
    item['name'] = 'foobar'
    yield Request('http://someurl.com', self.parse2,
                  meta={'item': item})

def parse2(self, response):
    print(response.meta['item'])
    # {'name': 'foobar'}
In your case you end up with a split chain when you should have one continuous chain.
Your code should look something like this:
def parse_companies(self, response):
    data = json.loads(response.body)
    if not data:
        return
    for company in data:
        item = ThalamusItem()  # one fresh item per company
        comp_id = company["id"]
        url = self.request_details_URL + str(comp_id) + ".json"
        url2 = self.request_contacts + str(comp_id)
        request = Request(url, callback=self.parse_details,
                          meta={'url2': url2, 'item': item})
        yield request

def parse_details(self, response):
    item = response.meta['item']
    url2 = response.meta['url2']
    item['details'] = ''  # add details
    yield Request(url2, callback=self.parse_contacts, meta={'item': item})

def parse_contacts(self, response):
    item = response.meta['item']
    item['contacts'] = ''  # add contacts
    yield item
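To see why one continuous chain yields exactly one item per company, here is a minimal sketch that does not use Scrapy at all. FakeRequest, FakeResponse and run are hypothetical stand-ins I made up for illustration, and run processes pending requests in random order to imitate responses arriving out of order. It only models the control flow, not real networking.

```python
import random


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: a URL, a callback, and meta."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class FakeResponse:
    """Hypothetical stand-in for a response; exposes the request's meta."""
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta


def parse_companies(company_ids):
    # Takes plain ids instead of a JSON response to keep the sketch small.
    for comp_id in company_ids:
        item = {'id': comp_id}  # one fresh item per company
        yield FakeRequest(f'details/{comp_id}', parse_details,
                          meta={'item': item, 'url2': f'contacts/{comp_id}'})


def parse_details(response):
    item = response.meta['item']
    item['details'] = response.url
    # Continue the chain: the same item travels on to the contacts request.
    yield FakeRequest(response.meta['url2'], parse_contacts, meta={'item': item})


def parse_contacts(response):
    item = response.meta['item']
    item['contacts'] = response.url
    yield item  # the only place a finished item is emitted


def run(start_requests, seed=0):
    """Process pending requests in random order and collect emitted items."""
    rng = random.Random(seed)
    pending, items = list(start_requests), []
    while pending:
        request = pending.pop(rng.randrange(len(pending)))
        for result in request.callback(FakeResponse(request)):
            if isinstance(result, FakeRequest):
                pending.append(result)
            else:
                items.append(result)
    return items


items = run(parse_companies([1, 2, 3]))
```

Whatever order the requests are handled in, each item is emitted once and carries both its details and its contacts, because only the last callback in the chain yields it.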

Related

Scrapy callback asynchronous

def parse(self, response):
    category_names = []
    category_urls = []
    for item in response.css("#zg_browseRoot ul li"):
        category_url = item.css("a").css(self.CSS_URL).extract()
        category_name = item.css("a").css(self.CSS_TEXT).extract()
        category_url = [
            self.parse_url(category_url, 4) for category_url in category_url
        ]
        (category_url,) = category_url
        (category_name,) = category_name
        category_names.append(category_name)
        category_urls.append(category_url)
    for c_name, url in zip(category_names, category_urls):
        self.c_name = [c_name]
        yield scrapy.Request(url, callback=self.parse_categories)

def parse_url(self, url, number):
    parse = urlparse(url)
    split = parse.path.split("/")[:number]
    return f'{self.BASE_URL}{"/".join(split)}'

def parse_categories(self, response):
    sub_names = []
    sub_urls = []
    for item in response.css("#zg_browseRoot ul ul li"):
        sub_name = item.css("a").css(self.CSS_TEXT).extract()
        sub_url = item.css("a").css(self.CSS_URL).extract()
        sub_url = [self.parse_url(sub_url, 5) for sub_url in sub_url]
        (sub_url,) = sub_url
        (sub_name,) = sub_name
        sub_names.append(sub_name)
        sub_urls.append(sub_url)
    for sub_name, url in zip(sub_names, sub_urls):
        self.sub_name = [sub_name]
        # print("{}: {}, {}".format(url, self.sub_name, self.c_name))
        yield scrapy.Request(url, callback=self.parse_subcategories)

def parse_subcategories(self, response):
    url = self.parse_url(response.request.url, 5)
    print(f"{self.c_name}, {self.sub_name}, {url}")
Hello everyone,
I'm having an issue with my Scrapy approach. I'm trying to scrape a page that has categories and subcategories, which in turn contain items. I want to include the category and subcategory with each scraped item.
The problem is that Scrapy's callbacks are asynchronous, so zipping the URLs with the names doesn't seem to work: the for loop is processed first, the URLs are stored in a generator, and the names are left behind. Can anyone help me work around this?
Thanks in advance,
Daniel.
You can pass arbitrary data along with a request by using the cb_kwargs parameter. You can read about the details in the Scrapy documentation.
Here is a simplified example:
def parse(self, response):
    rows = response.xpath('//div[@id="some-element"]')
    for row in rows:
        request_url = row.xpath('a/@href').get()
        category = row.xpath('a/text()').get()
        yield Request(
            url=request_url,
            callback=self.parse_category,
            cb_kwargs={'category': category}
        )

def parse_category(self, response, category):  # Notice the category arg in the func
    # Process the response here and build the item
    item = {'category': category}
    yield item
The data inserted in cb_kwargs is passed as keyword arguments to the callback function, so each key in the dict must match the name of an argument in the method definition.
cb_kwargs was introduced in Scrapy v1.7. If you are using an older version, you should use the meta parameter instead. You can read about it in the Scrapy documentation; note that its use is slightly different.
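The key-must-match-the-argument rule is just ordinary Python keyword unpacking. Here is a small sketch in plain Python, with no Scrapy involved. call_callback is a hypothetical helper that I am assuming mirrors how the dict is unpacked into the callback, and the URL is a placeholder.

```python
def parse_category(response, category):
    """Callback that expects a keyword argument named 'category'."""
    return {'url': response, 'category': category}


def call_callback(callback, response, cb_kwargs):
    # Assumption: the dict is unpacked into keyword arguments like this.
    return callback(response, **cb_kwargs)


# Key matches the parameter name: the value arrives as 'category'.
ok = call_callback(parse_category, 'https://example.com/books',
                   {'category': 'Books'})

# Key does not match any parameter: Python raises TypeError.
try:
    call_callback(parse_category, 'https://example.com/books',
                  {'cat': 'Books'})
    mismatch_error = None
except TypeError as exc:
    mismatch_error = exc
```

So if you rename the argument in the callback, rename the key in cb_kwargs too, otherwise the callback fails when it is invoked.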

request to the same parse method

I don't understand how requests work. I don't know why scrapy.Request(next_page, callback=self.parse2) doesn't behave as I expected. pitem should come from the response to this request, but I get KeyError: 'pitem'. I don't know why the first request works but request2 does not.
MY CODE:
spider.py
...
def parse(self, response):
    ...
    request = scrapy.Request(link, callback=self.parse2)
    request.meta['item'] = item
    yield request
...
def parse2(self, response):
    item = response.meta['item']
    pitem = response.meta['pitem']
    ...
    pitem['field'].append(self.whatever)
    if next_page is not None:
        request2 = scrapy.Request(next_page, callback=self.parse2)
        request2.meta['pitem'] = item
        yield request2
    else:
        yield pitem
        self.whatever = []

request = scrapy.Request(link, callback=self.parse2)
request.meta['item'] = item
yield request
This defines a meta key named item, but not one named pitem. So when parse2 is called for that first request and runs pitem = response.meta['pitem'], it cannot find pitem in the request's meta data.
One possible solution is to use pitem = response.meta.get('pitem'), which returns None if pitem is missing, but whether that is appropriate depends very much on your use case.
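Since meta is a plain dict, the difference between the two lookups can be shown without Scrapy. This is only a sketch: read_meta is a hypothetical helper, and the two dicts stand in for what the first request and the follow-up request would carry.

```python
def read_meta(meta):
    """Return (item, pitem), tolerating requests that carry only one key."""
    item = meta.get('item')    # None when the key is absent
    pitem = meta.get('pitem')  # None when the key is absent
    return item, pitem


from_parse = {'item': {'field': []}}        # set by the first request
from_parse2 = {'pitem': {'field': ['a']}}   # set by the follow-up request

# Direct indexing fails on the first request, which has no 'pitem' key.
try:
    from_parse['pitem']
    raised = False
except KeyError:
    raised = True

first = read_meta(from_parse)
second = read_meta(from_parse2)
```

Note that the follow-up request in the question sets only pitem, so response.meta['item'] would fail there in the same way. Using one key name for both requests avoids the problem entirely.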

assign properties to wrong items in Scrapy

CODE
spider.py
...
def parse(self, response):
    for one_item in response.xpath('path1'):
        item = ProjectItem()
        request = scrapy.Request(one_item.xpath('path2'), callback=self.parse2)
        request.meta['item'] = item
        yield request

property = []

def parse2(self, response):
    item = response.meta['item']
    for x in response.xpath('path3'):
        self.property.append('path4')
    next_page = response.xpath('path5')
    if next_page is not None:
        request2 = scrapy.Request(next_page, callback=self.parse2)
        request2.meta['item'] = item
        yield request2
    else:
        item['field'] = self.property
        self.property = []
        yield item
The problem is that when the spider crawls to next_page, some values in self.property are assigned to the wrong items. I don't know how to fix it.
self.property is a class attribute shared among all calls to parse2, and you can't control the order in which those calls happen.
To solve that, you need to pass the property list inside meta or as an item attribute:
def parse(self, response):
    for one_item in response.xpath('path1'):
        item = ProjectItem()
        item['field'] = []  # each item owns its own list
        request = scrapy.Request(one_item.xpath('path2').get(), callback=self.parse2)
        request.meta['item'] = item
        yield request

def parse2(self, response):
    item = response.meta['item']
    for x in response.xpath('path3'):
        item['field'].append('path4')
    # .get() returns None when nothing matches; xpath() alone never does
    next_page = response.xpath('path5').get()
    if next_page is not None:
        request2 = scrapy.Request(next_page, callback=self.parse2)
        request2.meta['item'] = item
        yield request2
    else:
        yield item
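To make the mix-up concrete, here is a rough simulation in plain Python, not real Scrapy code. Each generator stands in for the chain of parse2 calls belonging to one item, and interleave is a hypothetical helper that advances the chains in round-robin order, which is one of the orders Scrapy might process responses in. All names are made up for illustration.

```python
from itertools import zip_longest


class SharedStateSpider:
    """Buggy pattern: one list on the spider shared by every item."""
    def __init__(self):
        self.property = []

    def crawl_item(self, name, pages):
        for page in pages:
            self.property.append(f'{name}:{page}')
            yield None  # hand control back, as if waiting for the next page
        item = {'name': name, 'field': self.property}
        self.property = []
        yield item


def crawl_item_fixed(name, pages):
    """Fixed pattern: the list lives on the item that travels with the request."""
    item = {'name': name, 'field': []}
    for page in pages:
        item['field'].append(f'{name}:{page}')
        yield None
    yield item


def interleave(*crawls):
    """Advance several crawls in round-robin order and collect finished items."""
    items = []
    for step in zip_longest(*crawls):
        items.extend(result for result in step if result is not None)
    return items


shared = SharedStateSpider()
buggy = interleave(shared.crawl_item('a', [1, 2]),
                   shared.crawl_item('b', [1, 2]))
fixed = interleave(crawl_item_fixed('a', [1, 2]),
                   crawl_item_fixed('b', [1, 2]))
```

In the shared version, item a ends up with the values of both items and item b ends up with none. In the fixed version each item only ever sees its own values, whatever the order.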

Multiple pages per item - using scraped links

My spider looks like this:
class ScrapeMovies(scrapy.Spider):
    start_urls = [
        'https://www.trekearth.com/members/page1.htm?sort_by=md'
    ]

    def parse(self, response):
        for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
            item = loopitem()
            website = row.xpath('./td[2]//a/@href/text()').extract_first()
            item['name'] = row.xpath('./td[2]//a/text()').extract_first()
            yield item
        # This part is responsible for scraping all of the pages on a start url; commented out for convenience
        # next_page = response.xpath('//div[@class="page-nav-btm"]/ul/li[last()]/a/@href').extract_first()
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, callback=self.parse)
As of now, it scrapes the table (see the start URL). I want it to then follow the link in the member's name column, extract some information from that page (the link is e.g. https://www.trekearth.com/members/monareng/), and then return all of this as one item.
How should I approach this?
If anything is unclear, please do not hesitate to ask for clarification.
EDIT:
Now my code looks as follows (however, it still does not work):
class ScrapeMovies(scrapy.Spider):
    name = 'final'
    start_urls = [
        'https://www.trekearth.com/members/page1.htm?sort_by=md'
    ]

    def parse(self, response):
        for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
            item = FinalItem()
            website = row.xpath('./td[2]//a/@href/text()').extract_first()
            item['name'] = row.xpath('./td[2]//a/text()').extract_first()
            request = scrapy.Request(website,
                                     callback=self.parse_page2)
            request.meta['item'] = item
            return request

    def parse_page2(self, response):
        item = response.meta['item']
        item['other_url'] = response.url
        item['groups'] = response.xpath('//div[@class="groups-btm"]/ul/li/text()').extract_first()
        return item
Use the meta field to pass the item forward to the next callback:
def parse_page1(self, response):
    item = MyItem(main_url=response.url)
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
UPD: to process all rows, use yield in your loop:
for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
    item = FinalItem()
    # Select the attribute itself; @href/text() matches nothing and gives None
    website = row.xpath('./td[2]//a/@href').extract_first()
    item['name'] = row.xpath('./td[2]//a/text()').extract_first()
    # urljoin turns a relative link into an absolute URL
    request = scrapy.Request(response.urljoin(website),
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request
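The difference between return and yield inside the loop is plain Python behaviour, so it can be checked without Scrapy. In this sketch the dicts are only stand-ins for Request objects and the row values are made up.

```python
def parse_with_return(rows):
    for row in rows:
        request = {'url': row}   # stand-in for building a Request
        return request           # exits the function on the first row


def parse_with_yield(rows):
    for row in rows:
        request = {'url': row}   # stand-in for building a Request
        yield request            # produces one request per row


rows = ['/members/a/', '/members/b/', '/members/c/']
only_first = parse_with_return(rows)
all_requests = list(parse_with_yield(rows))
```

With return, only the first row ever produces a request. With yield, the callback becomes a generator and every row gets its own request.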

Scrapy: Passing item between methods

Suppose I have a BookItem and need to add information to it in both the parse phase and the detail phase:
def parse(self, response):
    data = json.loads(response)
    for book in data['result']:
        item = BookItem()
        item['id'] = book['id']
        url = book['url']
        yield Request(url, callback=self.detail)

def detail(self, response):
    hxs = HtmlXPathSelector(response)
    item['price'] = ......
    # I want to continue the same book item as from the for loop above
Using the code as is would lead to an undefined item in the detail phase. How can I pass the item to detail? detail(self, response, item) doesn't seem to work.
There is an argument named meta for Request:
yield Request(url, callback=self.detail, meta={'item': item})
Then, in the detail function, access it this way:
item = response.meta['item']
See the jobs topic in the Scrapy documentation for more details.
iMom0's approach still works, but as of Scrapy 1.7, the recommended approach is to pass user-defined information through cb_kwargs and leave meta for middlewares, extensions, etc.:
def parse(self, response):
    ....
    yield Request(url, callback=self.detail, cb_kwargs={'item': item})

def detail(self, response, item):
    item['price'] = ......
You could also pass the individual key-values into the cb_kwargs argument and then only instantiate the BookItem instance in the final callback (detail in this case):
def parse(self, response):
    data = json.loads(response.text)  # parse the body text, not the response object
    for book in data['result']:
        yield Request(book['url'],
                      callback=self.detail,
                      cb_kwargs=dict(id_=book['id'],
                                     url=book['url']))

def detail(self, response, id_, url):
    hxs = HtmlXPathSelector(response)
    item = BookItem()
    item['id'] = id_
    item['url'] = url
    item['price'] = ......
You can define a variable in the __init__ method:
class MySpider(BaseSpider):
    ...
    def __init__(self):
        self.item = None

    def parse(self, response):
        data = json.loads(response)
        for book in data['result']:
            self.item = BookItem()
            self.item['id'] = book['id']
            url = book['url']
            yield Request(url, callback=self.detail)

    def detail(self, response):
        hxs = HtmlXPathSelector(response)
        self.item['price'] = ....
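One thing to watch for with a variable stored on the spider: there is only one self.item, while parse creates many books. The sketch below is plain Python rather than Scrapy, with made-up names and URLs, and it assumes the detail callbacks run after parse has finished its loop, which is common but not guaranteed.

```python
class SharedItemSpider:
    """Mimics storing the current item on the spider instead of the request."""
    def __init__(self):
        self.item = None

    def parse(self, books):
        for book in books:
            self.item = {'id': book['id']}  # overwrites the previous book
            yield book['url']               # stand-in for a Request to the detail page

    def detail(self, url):
        self.item['price'] = f'price from {url}'
        return self.item


spider = SharedItemSpider()
books = [{'id': 1, 'url': '/b/1'}, {'id': 2, 'url': '/b/2'}]
urls = list(spider.parse(books))  # all requests are created first
results = [spider.detail(url) for url in urls]
```

Under that assumption every detail call writes into the last book created, so the earlier books are lost. Passing the item through meta or cb_kwargs keeps one item per request.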
