I am trying to scrape a website using Scrapy. An example link is: Here.
I am able to get some data using CSS selectors, but I also need to fetch all image URLs of each item. An item can have multiple colours, and when you click on another colour the browser actually fetches the images from a different URL. So I need to generate manual requests (one per colour) and attach "meta" so that the image URLs from those other pages are stored in a SINGLE ITEM FIELD.
Here is my Scrapy code:
def get_image_urls(self, response):
    item = response.meta['item']
    if 'image_urls' in item:
        urls = item['image_urls']
    else:
        urls = []
    urls.extend(response.css('.product-image-link::attr(href)').extract())
    item['image_urls'] = urls
    next_url = response.css('.va-color .emptyswatch a::attr(href)').extract()
    #print(item['image_urls'])
    yield Request(next_url[0], callback=self.get_image_urls, meta={'item': item})

def parse(self, response):
    output = JulesProduct()
    output['name'] = self.get_name(response)
    # Now get the recursive img urls
    response.meta['item'] = output
    self.get_image_urls(response)
    return output
Ideally, the returned output object should contain all of the required data. My question is: why am I not getting output['image_urls']? When I uncomment the print statement in get_image_urls, I see 3 crawled URLs and 3 print statements with the URLs appended to one another, but I need that data back in the parse function. I'm not sure I've described my issue clearly; can anybody help?
Your parse method is returning the output before the get_image_urls requests are done.
You should only yield or return your final item at the end of your recursive logic. Something like this should work:
def parse(self, response):
    output = JulesProduct()
    output['name'] = self.get_name(response)
    yield Request(response.url, callback=self.get_image_urls, meta={'item': output}, dont_filter=True)

def get_image_urls(self, response):
    item = response.meta['item']
    if 'image_urls' in item:
        urls = item['image_urls']
    else:
        urls = []
    urls.extend(response.css('.product-image-link::attr(href)').extract())
    item['image_urls'] = urls
    next_url = response.css('.va-color .emptyswatch a::attr(href)').extract()
    if len(next_url) > 0:
        yield Request(next_url[0], callback=self.get_image_urls, meta={'item': item})
    else:
        yield item
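Another way to structure the same logic, sketched under the assumption that every colour swatch link can already be read from the first product page (the selectors are the ones from the question): collect the list once and carry the remaining URLs through meta until it is exhausted, yielding the item only at the end.

def parse(self, response):
    output = JulesProduct()
    output['name'] = self.get_name(response)
    output['image_urls'] = response.css('.product-image-link::attr(href)').extract()
    # Assumption: all colour swatch URLs are reachable from the first page
    colour_urls = response.css('.va-color .emptyswatch a::attr(href)').extract()
    if colour_urls:
        yield Request(colour_urls[0], callback=self.get_image_urls,
                      meta={'item': output, 'pending': colour_urls[1:]})
    else:
        yield output

def get_image_urls(self, response):
    item = response.meta['item']
    pending = response.meta['pending']
    item['image_urls'].extend(response.css('.product-image-link::attr(href)').extract())
    if pending:
        # More colour pages to visit: pass the shrinking list along
        yield Request(pending[0], callback=self.get_image_urls,
                      meta={'item': item, 'pending': pending[1:]})
    else:
        yield item

Either way, the item is yielded exactly once, from the last callback in the chain.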
Goal: I want to retrieve order performance data published on a specific e-commerce site. Since the data for each order is spread across multiple pages, I would like to extract the information from each page and finally combine it into a single item or record.
I have looked through the official documentation and other similar Q&As and found some useful material.
From that information I got the idea that this might be achievable using cb_kwargs (a minimal illustration of that pattern is included after the links below).
However, I could not understand what is wrong with the following code.
- python - Interpreting callbacks and cb_kwargs with scrapy - Stack Overflow
- python - Multiple pages per item in Scrapy (https://stackoverflow.com/questions/22201876/multiple-pages-per-item-in-scrapy?noredirect=1&lq=1)
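For reference, the basic cb_kwargs pattern those links describe looks roughly like this (the spider name and URLs here are purely illustrative, not part of my project):

import scrapy

class CbKwargsExampleSpider(scrapy.Spider):
    name = 'cb_kwargs_example'
    start_urls = ['https://example.com/']

    def parse(self, response):
        # Extra keyword arguments travel to the callback via cb_kwargs
        yield scrapy.Request(
            'https://example.com/detail',
            callback=self.parse_detail,
            cb_kwargs={'main_url': response.url},
        )

    def parse_detail(self, response, main_url):
        # Each cb_kwargs entry arrives as a named parameter of the callback
        yield {'main_url': main_url, 'detail_url': response.url}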
The program runs, but the CSV output is empty.
The order results page contains information on 30 items per page.
I would like to first retrieve the signup date of each item, which is listed only on the first page, then move to each product page from there to retrieve the details, and then store that information one item at a time.
I am a beginner who started writing Python three months ago, so I may have gaps in my basic understanding of classes and the like; I would appreciate it if you could point these out during the discussion.
The official Scrapy documentation is quite unfriendly to beginners and I am having a hard time with it.
def parse_firstpage_item(self, response):
    request = scrapy.Request(
        url=response.url,
        callback=self.parse_productpage_item,
        cb_kwargs=dict(product_URL='//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a'))
    loader = ItemLoader(item=BuymaResearchtoolItem(), response=response)
    loader.add_xpath("Conversion_date", '//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()')
    yield loader.load_item()

def parse_productpage_item(self, response, product_URL):
    loader = ItemLoader(item=BuymaResearchtoolItem(), response=response)
    loader.add_xpath("brand_name", 'normalize-space(//*[@id="s_brand"]/dd/a/text())')
    loader.add_value("page_URL", response.url)
    loader.add_xpath("inquire", '//*[@id="tabmenu_inqcnt"]/text()')
    yield loader.load_item()
class MyLinkExtractor(LinkExtractor):

    def extract_links(self, response):
        base_url = get_base_url(response)
        if self.restrict_xpaths:
            docs = [
                subdoc
                for x in self.restrict_xpaths
                for subdoc in response.xpath(x)
            ]
        else:
            docs = [response.selector]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            all_links.extend(self._process_links(links))
        logging.info('=' * 100)
        logging.info(all_links)
        logging.info(f'total links len: {len(all_links)}')
        logging.info('=' * 100)
        return all_links
class AllSaledataSpider(CrawlSpider):
    name = 'all_salesdata'
    allowed_domains = ['www.buyma.com']
    # start_urls = ['https://www.buyma.com/buyer/9887867/sales_1.html']

    rules = (
        Rule(MyLinkExtractor(
            restrict_xpaths='//*[@class="buyeritem_name"]/a'), callback='parse_firstpage_item', follow=False),
        Rule(LinkExtractor(restrict_xpaths='//DIV[@class="pager"]/DIV/A[contains(text(),"次")]'), follow=False),
    )

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)]
            # if lnk not in seen]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)
    def start_requests(self):
        with open('/Users/morni/buyma_researchtool/buyma_researchtool/AllshoppersURL.csv', 'r', encoding='utf-8') as f:
            reader = csv.reader(f)
            header = next(reader)
            for row in reader:
                yield scrapy.Request(url=str(row[2])[:-5] + '/sales_1.html')

        for row in self.reader:
            for n in range(1, 300):
                url = f'{self.base_page}{row}/sales_{n}.html'
                yield scrapy.Request(
                    url=url,
                    callback=self.parse_firstpage_item,
                    errback=self.errback_httpbin,
                    dont_filter=True
                )
    def parse_firstpage_item(self, response):
        loader = ItemLoader(item=BuymaResearchtoolItem(), response=response)
        loader.add_xpath("Conversion_date", '//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()')
        loader.add_xpath("product_name", '//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/text()')
        loader.add_value("product_URL", '//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/@href')
        item = loader.load_item()
        yield scrapy.Request(
            url=response.urljoin(item['product_URL']),
            callback=self.parse_productpage_item,
            cb_kwargs={'item': item},
        )

    def parse_productpage_item(self, response, item):
        loader = ItemLoader(item=item, response=response)
        loader.add_xpath("brand_name", 'normalize-space(//*[@id="s_brand"]/dd/a/text())')
        〜
        yield loader.load_item()
You need to request each page in turn and pass your current item to the callback:
def parse_first_page(self, response):
    loader = ItemLoader(item=BuymaResearchtoolItem(), response=response)
    loader.add_xpath("brand_name", 'normalize-space(//*[@id="s_brand"]/dd/a/text())')
    loader.add_value("page_URL", response.url)
    loader.add_xpath("inquire", '//*[@id="tabmenu_inqcnt"]/text()')
    item = loader.load_item()
    yield scrapy.Request(
        url=second_page_url,
        callback=self.parse_second_page,
        cb_kwargs={'item': item},
    )

def parse_second_page(self, response, item):
    loader = ItemLoader(item=item, response=response)
    loader.add_xpath("Conversion_date", '//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()')
    item = loader.load_item()
    yield scrapy.Request(
        url=third_page_url,
        callback=self.parse_third_page,
        cb_kwargs={'item': item},
    )

def parse_third_page(self, response, item):
    loader = ItemLoader(item=item, response=response)
    loader.add_value('ThirdUrl', response.url)
    yield loader.load_item()
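Here second_page_url and third_page_url are placeholders; in practice each one would be extracted from the current response (or from the partially built item) before yielding the request. An illustrative example, with an assumed XPath:

# Illustrative only: resolve the next page's link from the current response
next_href = response.xpath('//*[@id="buyeritemtable"]//a/@href').extract_first()
second_page_url = response.urljoin(next_href)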
My spider looks like this:
class ScrapeMovies(scrapy.Spider):
    start_urls = [
        'https://www.trekearth.com/members/page1.htm?sort_by=md'
    ]

    def parse(self, response):
        for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
            item = loopitem()
            website = row.xpath('./td[2]//a/@href/text()').extract_first()
            item['name'] = row.xpath('./td[2]//a/text()').extract_first()
            yield item

        # This part is responsible for scraping all of the pages on a start url; commented out for convenience
        # next_page = response.xpath('//div[@class="page-nav-btm"]/ul/li[last()]/a/@href').extract_first()
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, callback=self.parse)
As of now it scrapes the table (see the starting URL). I want it to then follow the link in the member's name column (e.g. https://www.trekearth.com/members/monareng/), extract some information from that page, and return it as part of the item.
How should I approach this?
If anything is unclear please do not hesitate to ask for clarification.
EDIT:
Now my code looks as follows (however it still does not work):
class ScrapeMovies(scrapy.Spider):
    name = 'final'
    start_urls = [
        'https://www.trekearth.com/members/page1.htm?sort_by=md'
    ]

    def parse(self, response):
        for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
            item = FinalItem()
            website = row.xpath('./td[2]//a/@href/text()').extract_first()
            item['name'] = row.xpath('./td[2]//a/text()').extract_first()
            request = scrapy.Request(website,
                                     callback=self.parse_page2)
            request.meta['item'] = item
            return request

    def parse_page2(self, response):
        item = response.meta['item']
        item['other_url'] = response.url
        item['groups'] = response.xpath('//div[@class="groups-btm"]/ul/li/text()').extract_first()
        return item
Use the meta field to pass the item forward to the next callback:
def parse_page1(self, response):
    item = MyItem(main_url=response.url)
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
UPD: to process all rows, use yield in your loop:
for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
    item = FinalItem()
    website = row.xpath('./td[2]//a/@href/text()').extract_first()
    item['name'] = row.xpath('./td[2]//a/text()').extract_first()
    request = scrapy.Request(website,
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request
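A separate detail in the question's code, not covered above: @href already selects the attribute's value, so appending /text() to it matches nothing and website comes back as None, which makes scrapy.Request raise an error. Extracting the attribute directly (and resolving it against the response, since the profile links may be relative) avoids that:

website = row.xpath('./td[2]//a/@href').extract_first()
if website:
    # Resolve a possibly relative profile link before requesting it
    request = scrapy.Request(response.urljoin(website),
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request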
I am trying to parse a site, an e-store. I parse a page of products that are loaded with AJAX, get the URLs of these products, and then parse additional info for each product by following those parsed URLs.
My script gets the list of the first 4 items on the page and their URLs, makes the requests, and parses the additional info, but then it does not return to the loop, and so the spider closes.
Could somebody help me solve this? I'm pretty new to this kind of thing, and I ask here when I'm totally stuck.
Here is my code:
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http.request import Request
from scrapy_sokos.items import SokosItem


class SokosSpider(Spider):
    name = "sokos"
    allowed_domains = ["sokos.fi"]
    base_url = "http://www.sokos.fi/fi/SearchDisplay?searchTermScope=&searchType=&filterTerm=&orderBy=8&maxPrice=&showResultsPage=true&beginIndex=%s&langId=-11&sType=SimpleSearch&metaData=&pageSize=4&manufacturer=&resultCatEntryType=&catalogId=10051&pageView=image&searchTerm=&minPrice=&urlLangId=-11&categoryId=295401&storeId=10151"
    start_urls = [
        "http://www.sokos.fi/fi/SearchDisplay?searchTermScope=&searchType=&filterTerm=&orderBy=8&maxPrice=&showResultsPage=true&beginIndex=0&langId=-11&sType=SimpleSearch&metaData=&pageSize=4&manufacturer=&resultCatEntryType=&catalogId=10051&pageView=image&searchTerm=&minPrice=&urlLangId=-11&categoryId=295401&storeId=10151",
    ]
    for i in range(0, 8, 4):
        start_urls.append((base_url) % str(i))

    def parse(self, response):
        products = Selector(response).xpath('//div[@class="product-listing product-grid"]/article[@class="product product-thumbnail"]')
        for product in products:
            item = SokosItem()
            item['url'] = product.xpath('//div[@class="content"]/a[@class="image"]/@href').extract()[0]
            yield Request(url=item['url'], meta={'item': item}, callback=self.parse_additional_info)

    def parse_additional_info(self, response):
        item = response.meta['item']
        item['name'] = Selector(response).xpath('//h1[@class="productTitle"]/text()').extract()[0].strip()
        item['description'] = Selector(response).xpath('//div[@id="kuvaus"]/p/text()').extract()[0]
        euro = Selector(response).xpath('//strong[@class="special-price"]/span[@class="euros"]/text()').extract()[0]
        cent = Selector(response).xpath('//strong[@class="special-price"]/span[@class="cents"]/text()').extract()[0]
        item['price'] = '.'.join(euro + cent)
        item['number'] = Selector(response).xpath('//@data-productid').extract()[0]
        yield item
The AJAX requests you are simulating are being caught by Scrapy's duplicate URL filter.
Set dont_filter to True when yielding a Request:
yield Request(url=item['url'],
              meta={'item': item},
              callback=self.parse_additional_info,
              dont_filter=True)
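As an aside (this is an observation about the markup, not something the fix above depends on): the product URL is taken with an absolute //div[...] XPath inside the loop, so every iteration can return the very first product's link, which is also what makes the duplicate filter fire in the first place. Anchoring the expression to the current product node keeps the requests distinct:

for product in products:
    item = SokosItem()
    # './/' keeps the XPath relative to the current <article> node
    item['url'] = product.xpath('.//div[@class="content"]/a[@class="image"]/@href').extract_first()
    if item['url']:
        yield Request(url=item['url'], meta={'item': item},
                      callback=self.parse_additional_info, dont_filter=True)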
I'm a junior in Python and I have a question about spiders.
I have caught some URLs and put them in a list object. I would then like to use those URLs to continue crawling with Scrapy. Is it possible to change the URL dynamically and keep crawling? Or can someone give me an idea of how to do this in Scrapy? Thanks a lot.
def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//tr/td/span[@class="artist-lists"]')
    items = []
    for site in sites:
        item = Website()
        title = site.xpath('a/text()').extract()
        link = site.xpath('a/@href').extract()
        desc = site.xpath('text()').extract()
        item['title'] = title[0].encode('big5')
        item['link'] = link[0]
        self.get_userLink(item['link'])
        item['desc'] = desc
        # items.append(item)
    # return items

def get_userLink(self, link):
    # start_urls = [link]
    self.parse(link)
    sel = Selector(link)
    sites = sel.xpath('//table/tr/td/b')
    print sites
    # for site in sites:
    #     print site.xpath('a/@href').extract() + "\n"
    #     print site.xpath('a/text()').extract() + "\n"
You can yield a Request for the parsed URL and have another callback function handle it:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    url = hxs.xpath('//..../@href').extract()
    if url:
        # check whether the url is correct or not
        yield Request(url, callback=self.parse_sub)

def parse_sub(self, response):
    # ** Some scrape operation here **
    pass
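If the goal is to keep the data from the list page together with whatever the followed URL yields, the same item-in-meta pattern shown in the answers above applies here too; a minimal sketch, assuming a Website item with title, link and desc fields as in the question:

def parse(self, response):
    for site in response.xpath('//tr/td/span[@class="artist-lists"]'):
        item = Website()
        item['title'] = site.xpath('a/text()').extract_first()
        item['link'] = site.xpath('a/@href').extract_first()
        if item['link']:
            # Carry the partially filled item to the user-page callback
            yield Request(item['link'], callback=self.parse_user,
                          meta={'item': item})

def parse_user(self, response):
    item = response.meta['item']
    # Illustrative XPath, based on the question's get_userLink
    item['desc'] = response.xpath('//table/tr/td/b/a/text()').extract()
    yield item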
I'm trying to populate an item using an ItemLoader, parsing data from multiple pages. But as far as I can see, I can't change the selector that I set when I initialized the ItemLoader. The documentation says this about the selector attribute:
selector
The Selector object to extract data from. It’s either the selector given in the constructor or one created from the response given in the constructor using the default_selector_class. This attribute is meant to be read-only.
Here's example code:
def parse(self, response):
    sel = Selector(response)
    videos = sel.xpath('//div[@class="video"]')
    for video in videos:
        loader = ItemLoader(VideoItem(), videos)
        loader.add_xpath('original_title', './/u/text()')
        loader.add_xpath('original_id', './/a[@class="hRotator"]/@href', re=r'movies/(\d+)/.+\.html')
        try:
            url = video.xpath('.//a[@class="hRotator"]/@href').extract()[0]
            request = Request(url,
                              callback=self.parse_video_page)
        except IndexError:
            pass
        request.meta['loader'] = loader
        yield request
    pages = sel.xpath('//div[@class="pager"]//a/@href').extract()
    for page in pages:
        url = urlparse.urljoin('http://www.mysite.com/', page)
        request = Request(url, callback=self.parse)
        yield request

def parse_video_page(self, response):
    loader = response.meta['loader']
    sel = Selector(response)
    loader.add_xpath('original_description', '//*[@id="videoInfo"]//td[@class="desc"]/h2/text()')
    loader.add_xpath('duration', '//*[@id="video-info"]/div[2]/text()')
    loader.add_xpath('tags', '//*[@id="tags"]//a/text()')
    item = loader.load_item()
    return item
As of now, I can't scrape the info from the second page.
Answering your question directly: to change the selector for an ItemLoader, you can assign a new Selector object to the loader.selector attribute.
def parse_video_page(self, response):
    loader = response.meta['loader']
    sel = Selector(response)
    loader.selector = sel
    loader.add_xpath(
        'original_description',
        '//*[@id="videoInfo"]//td[@class="desc"]/h2/text()'
    )
    # ...
But this way of working with loader objects is unexpected and therefore unsupported: a library update can break this code or produce hard-to-track bugs. Also, passing the loader through the request meta is a bad idea, because the loader object holds a reference to the response object, and this can cause memory problems in some situations.
A much more correct way of collecting item fields across several callbacks would be as follows (note the comments):
def parse(self, response):
    sel = Selector(response)
    videos = sel.xpath('//div[@class="video"]')
    for video in videos:
        try:
            url = video.xpath('.//a[@class="hRotator"]/@href').extract()[0]
        except IndexError:
            continue
        loader = ItemLoader(VideoItem(), selector=video)
        loader.add_xpath('original_title', './/u/text()')
        loader.add_xpath(
            'original_id',
            './/a[@class="hRotator"]/@href',
            re=r'movies/(\d+)/.+\.html'
        )
        item = loader.load_item()
        yield Request(
            urlparse.urljoin(response.url, url),
            callback=self.parse_video_page,
            # Note: item passed to the meta dict, not loader itself
            meta={'item': item}
        )
    pages = sel.xpath('//div[@class="pager"]//a/@href').extract()
    for page in pages:
        url = urlparse.urljoin('http://www.mysite.com/', page)
        yield Request(url, callback=self.parse)

def parse_video_page(self, response):
    item = response.meta['item']
    # Note: new loader object created,
    # item from response.meta is passed to the constructor
    loader = ItemLoader(item, response=response)
    loader.add_xpath(
        'original_description',
        '//*[@id="videoInfo"]//td[@class="desc"]/h2/text()'
    )
    loader.add_xpath(
        'duration',
        '//*[@id="video-info"]/div[2]/text()'
    )
    loader.add_xpath('tags', '//*[@id="tags"]//a/text()')
    return loader.load_item()