I tried to scrape the driver names from this site with a scrapy-splash CrawlSpider, but I constantly run into errors. After searching for ways to solve the problem, I came across GitHub and just copied the latest code:
start_urls = ['http://www.huananzhi.com/html/1/184/185/index.html']

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, callback=self.parse_item, args={'wait': 0.5}, meta={'real_url': url})
def _requests_to_follow(self, response):
    if not isinstance(
            response,
            (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = self._build_request(n, link)
            yield rule.process_request(r, response)

def use_splash(self, request, response):
    request.meta.update(splash={
        'args': {
            'wait': 15,
        },
        'endpoint': 'render.html',
    })
    return request
linkRule = LinkExtractor(restrict_xpaths='//article /div[1]/div[1]/div[2]/a[1]')
itemRule = Rule(linkRule, callback='parse_item', follow=True, process_request='use_splash')

rules = (
    itemRule,
)
def parse_item(self, response):
    item = HuananzhiItem()
    item['name'] = response.xpath("//div[@class='tab-content']//div[1]/h2/text()").get()
    yield item
It didn't work, so I tried using scrapy.Spider instead:
def start_requests(self):
    url = 'http://www.huananzhi.com/html/1/184/185/index.html'
    yield SplashRequest(url=url, callback=self.parse)

def parse(self, response):
    links = response.xpath('//article/div[1]/div[1]/div[2]/a[1]/@href')
    for link in links:
        yield SplashRequest(url=link, callback=self.parse_item)
    next_page = response.xpath('//section//li[4]//a[1]')
    yield from response.follow(next_page, self.parse)

def parse_item(self, response):
    item = HuananzhiItem()
    item['name'] = response.xpath("//div[@class='tab-content']//div[1]/h2/text()").get()
    yield item
I also use scrapy-user-agent.
Can anyone tell me how to get the item? Sorry for such a basic question, I'm a beginner.
Thanks
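A minimal sketch of how the plain scrapy.Spider attempt could be made to work, assuming the XPaths from the question match the page and that HuananzhiItem lives in the project's items module (the class name, spider name, and import path below are placeholders). The key changes are extracting the href strings with .getall() and joining them against the response URL before wrapping them in SplashRequests:

import scrapy
from scrapy_splash import SplashRequest
from ..items import HuananzhiItem  # adjust to wherever HuananzhiItem is actually defined

class HuananzhiSpider(scrapy.Spider):
    name = 'huananzhi'  # placeholder spider name

    def start_requests(self):
        url = 'http://www.huananzhi.com/html/1/184/185/index.html'
        yield SplashRequest(url=url, callback=self.parse, args={'wait': 0.5})

    def parse(self, response):
        # extract href strings (not selectors) so they can be passed to SplashRequest
        for link in response.xpath('//article/div[1]/div[1]/div[2]/a[1]/@href').getall():
            yield SplashRequest(url=response.urljoin(link), callback=self.parse_item, args={'wait': 0.5})
        # pagination XPath taken from the question; adjust if the markup differs
        next_page = response.xpath('//section//li[4]//a[1]/@href').get()
        if next_page:
            yield SplashRequest(url=response.urljoin(next_page), callback=self.parse, args={'wait': 0.5})

    def parse_item(self, response):
        item = HuananzhiItem()
        item['name'] = response.xpath("//div[@class='tab-content']//div[1]/h2/text()").get()
        yield item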
I tried to get some data from a React-based website, but when I use CrawlSpider I can't parse the other pages.
For example, I can parse my first URL with Splash, but the other URLs are parsed normally, without dynamic content.
This is my code:
class PageSpider(CrawlSpider):
    host = 'hooshmandsazeh.com'
    protocol = 'https'
    root_domain = 'hooshmandsazeh.com'
    name = 'page'
    allowed_domains = [host]
    #start_urls = [f'{protocol}://{host}',]

    def start_requests(self):
        url = f'{self.protocol}://{self.host}'
        yield SplashRequest(url, dont_process_response=True, args={'wait': 1}, meta={'real_url': url})

    custom_settings = {
        #'DEPTH_LIMIT': 9,
    }

    rules = (
        # Rule(LinkExtractor(allow=('node_\d+\.htm',)), follow=True),
        Rule(LinkExtractor(allow=(host), deny=('\.webp', '\.js', '\.css', '\.jpg', '\.png'), unique=True),
             callback='parse',
             follow=True,
             process_request='splash_request'
             ),
    )

    def splash_request(self, request):
        request.meta['real_url'] = request.url
        print("Aliii", request.meta['real_url'])
        return request

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        newresponse = response.replace(url=response.meta.get('real_url'))
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(newresponse)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def parse(self, response):
        if len(LinkExtractor(deny=self.host).extract_links(response)) > 0:
            loader = ItemLoader(item=PageLevelItem(), response=response)
            loader.add_value('page_source_url', response.url)
            yield loader.load_item()
The code below worked for me:
def splash_request(self, request):
    # request = request.replace(url=RENDER_HTML_URL + request.url)
    request.meta['real_url'] = request.url
    return SplashRequest(request.meta['real_url'], dont_process_response=True, args={'wait': 0}, meta={'real_url': request.meta['real_url']})
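The reason this helps: CrawlSpider passes every followed link through the rule's process_request callable, so returning a SplashRequest there (instead of the original plain request) means the follow-up pages are rendered by Splash as well, not only the start URL.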
From this webpage I am trying to get the kind of links where the different products are located. There are six categories with a More info button; when I traverse them recursively, I usually reach the target pages. This is one such product listings page I wish to get.
Please note that some of these pages have both product listings and More info buttons, which is why I fail to capture the product listings pages accurately.
My current spider looks like the following (it fails to grab many of the product listings pages):
import scrapy

class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, callback=self.parse)

    def parse(self, response):
        link_list = []
        for item in response.css(".match-height a.more-info::attr(href)").getall():
            if not "/detail/" in item:
                inner_page_link = response.urljoin(item)
                link_list.append(inner_page_link)
                yield {"target_url": inner_page_link}
        for new_link in link_list:
            yield scrapy.Request(new_link, callback=self.parse)
Expected output (a few randomly picked examples):
https://www.norgren.com/de/en/list/directional-control-valves/in-line-and-manifold-valves
https://www.norgren.com/de/en/list/pressure-switches/electro-mechanical-pressure-switches
https://www.norgren.com/de/en/list/pressure-switches/electronic-pressure-switches
https://www.norgren.com/de/en/list/directional-control-valves/sub-base-valves
https://www.norgren.com/de/en/list/directional-control-valves/non-return-valves
https://www.norgren.com/de/en/list/directional-control-valves/valve-islands
https://www.norgren.com/de/en/list/air-preparation/combination-units-frl
How can I get all the product listings pages from the six categories?
import scrapy

class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url)

    def parse(self, response):
        # check if there are items in the page
        if response.xpath('//div[contains(@class, "item-list")]//div[@class="buttons"]/div[@class="more-information"]/a/@href'):
            yield scrapy.Request(url=response.url, callback=self.get_links, dont_filter=True)
        # follow "more info" buttons
        for url in response.xpath('//a[text()="More info"]/@href').getall():
            yield response.follow(url)

    def get_links(self, response):
        yield {"target_url": response.url}
        next_page = response.xpath('//a[@class="next-button"]/@href').get()
        if next_page:
            yield response.follow(url=next_page, callback=self.get_links)
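Note the dont_filter=True on the request back to response.url: that URL has already passed through the duplicate filter, so without it the extra request to get_links would be dropped.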
Maybe filter only the pages that have at least one link to a detail page? Here is an example of how to identify whether a page meets the criteria you are searching for:
import scrapy

class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, callback=self.parse)

    def parse(self, response):
        link_list = []
        more_info_items = response.css(
            ".match-height a.more-info::attr(href)").getall()
        detail_items = [item for item in more_info_items if '/detail/' in item]
        if len(detail_items) > 0:
            print(f'This is a link you are searching for: {response.url}')
        for item in more_info_items:
            if not "/detail/" in item:
                inner_page_link = response.urljoin(item)
                link_list.append(inner_page_link)
                yield {"target_url": inner_page_link}
        for new_link in link_list:
            yield scrapy.Request(new_link, callback=self.parse)
I only printed the link to the console, but you can figure out how to log it wherever you need.
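If you want those pages in the crawl output rather than only on the console, a small variation could log them through the spider's logger and yield them as items (the listing_url field name here is just an example):

def parse(self, response):
    more_info_items = response.css(".match-height a.more-info::attr(href)").getall()
    detail_items = [item for item in more_info_items if '/detail/' in item]
    if detail_items:
        # log via Scrapy's per-spider logger and emit the page itself as an item
        self.logger.info('Product listing page found: %s', response.url)
        yield {"listing_url": response.url}
    for item in more_info_items:
        if "/detail/" not in item:
            yield scrapy.Request(response.urljoin(item), callback=self.parse)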
I am trying to get some data from the website, but my spider is not crawling to the next page even though there is a proper pagination link.
import scrapy

class NspiderSpider(scrapy.Spider):
    name = "nspider"
    allowed_domains = ["elimelechlab.yale.edu/"]
    start_urls = ["https://elimelechlab.yale.edu/pub"]

    def parse(self, response):
        title = response.xpath(
            '//*[@class="views-field views-field-title"]/span/text()'
        ).extract()
        doi_link = response.xpath(
            '//*[@class="views-field views-field-field-doi-link"]//a[1]/@href'
        ).extract()
        yield {"paper_title": title, "doi_link": doi_link}

        next_page = response.xpath(
            '//*[@title="Go to next page"]/@href'
        ).extract_first()  # extracting next page link
        if next_page:
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)
PS: I don't want to use LinkExtractor.
Any help would be appreciated.
Nothing is wrong with your next_page logic; the code just isn't reaching it because the yield for the item is at the same indentation level. Try the following approach:
import scrapy

class NspiderSpider(scrapy.Spider):
    name = "nspider"
    allowed_domains = ["elimelechlab.yale.edu"]
    start_urls = ["https://elimelechlab.yale.edu/pub"]

    def parse(self, response):
        for view in response.css('div.views-row'):
            yield {
                'paper_title': view.css('div.views-field-title span.field-content::text').get(),
                'doi_link': view.css('div.views-field-field-doi-link div.field-content a::attr(href)').get()
            }

        next_page = response.xpath(
            '//*[@title="Go to next page"]/@href'
        ).extract_first()  # extracting next page link
        if next_page:
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)
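Also worth noting: allowed_domains here is "elimelechlab.yale.edu" without the trailing slash. allowed_domains expects bare domain names, and an entry with a trailing slash can cause the offsite middleware to filter out the follow-up page requests.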
I am trying to parse a domain whose contents are as follows:
Page 1 - contains links to 10 articles
Page 2 - contains links to 10 articles
Page 3 - contains links to 10 articles and so on...
My job is to parse all the articles on all pages.
My idea: parse all the pages, store the links to all the articles in a list, then iterate over the list and parse each link.
So far I have been able to iterate through the pages, parse them, and collect the links to the articles. I am stuck on how to start parsing this list.
My code so far:
import scrapy

class DhoniSpider(scrapy.Spider):
    name = "test"
    start_urls = [
        "https://www.news18.com/cricketnext/newstopics/ms-dhoni.html"
    ]
    count = 0

    def __init__(self, *a, **kw):
        super(DhoniSpider, self).__init__(*a, **kw)
        self.headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        self.seed_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        DhoniSpider.count += 1
        if DhoniSpider.count > 2:
            # there are many pages, this is just to stop parsing after 2 pages
            return
        for ul in response.css('div.t_newswrap'):
            ref_links = ul.css('div.t_videos_box a.t_videosimg::attr(href)').getall()
            self.seed_urls.extend(ref_links)

        next_page = response.css('ul.pagination li a.nxt::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, headers=self.headers, callback=self.parse)

    def iterate_urls(self):
        for link in self.seed_urls:
            link = response.urljoin(link)
            yield scrapy.Request(link, headers=self.headers, callback=self.parse_page)

    def parse_page(self, response):
        print("called")
How do I iterate over my self.seed_urls list and parse those links? From where should I call my iterate_urls function?
Usually in cases like this there is no need for a separate function like your iterate_urls:
def parse(self, response):
    DhoniSpider.count += 1
    if DhoniSpider.count > 2:
        # there are many pages, this is just to stop parsing after 2 pages
        return
    for ul in response.css('div.t_newswrap'):
        for ref_link in ul.css('div.t_videos_box a.t_videosimg::attr(href)').getall():
            yield scrapy.Request(response.urljoin(ref_link), headers=self.headers, callback=self.parse_page, priority=5)

    next_page = response.css('ul.pagination li a.nxt::attr(href)').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, headers=self.headers, callback=self.parse)

def parse_page(self, response):
    print("called")
You don't have to collect the links into a list; you can just yield a scrapy.Request right after you parse them. So instead of self.seed_urls.extend(ref_links), you can modify your function like this:
def iterate_urls(self, response, seed_urls):
    for link in seed_urls:
        link = response.urljoin(link)
        yield scrapy.Request(link, headers=self.headers, callback=self.parse_page)
and call it from parse:
    ...
    for ul in response.css('div.t_newswrap'):
        ref_links = ul.css('div.t_videos_box a.t_videosimg::attr(href)').getall()
        yield from self.iterate_urls(response, ref_links)
    ...
I need to scrape all of the items, but only one item is scraped.
My code was working fine before, but when I transferred it to another project with the same code, this happened and I don't know why.
I need to get all of the items according to the page size in the start URL.
Here's my code:
class HmSalesitemSpider(scrapy.Spider):
    name = 'HM_salesitem'
    allowed_domains = ['www2.hm.com']
    start_urls = ['https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=3002']

    def parse(self, response):
        for product_item in response.css('li.product-item'):
            url = "https://www2.hm.com/" + product_item.css('a::attr(href)').extract_first()
        yield scrapy.Request(url=url, callback=self.parse_subpage)

    def parse_subpage(self, response):
        item = {
            'title': response.xpath("normalize-space(.//h1[contains(@class, 'primary') and contains(@class, 'product-item-headline')]/text())").extract_first(),
            'sale-price': response.xpath("normalize-space(.//span[@class='price-value']/text())").extract_first(),
            'regular-price': response.xpath('//script[contains(text(), "whitePrice")]/text()').re_first("'whitePrice'\s?:\s?'([^']+)'"),
            'photo-url': response.css('div.product-detail-main-image-container img::attr(src)').extract_first(),
            'description': response.css('p.pdp-description-text::text').extract_first()
        }
        yield item
Please help. Thank you.
It seems you have a problem with indentation. Move the request yield inside the for loop:
def parse(self, response):
    for product_item in response.css('li.product-item'):
        url = "https://www2.hm.com/" + product_item.css('a::attr(href)').get()
        yield scrapy.Request(url=url, callback=self.parse_subpage)
Or here is a slightly cleaner version:
def parse(self, response):
    for link in response.css('li.product-item a::attr(href)').extract():
        yield response.follow(link, self.parse_subpage)
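Since response.follow resolves relative hrefs against response.url, the manual "https://www2.hm.com/" prefix from the original code is not needed here.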