Scrapy: skip hrefs w/ missing scheme - python

So I'm trying to crawl the popular.ebay.com page, and I get the error Missing scheme in request url: #mainContent for the # anchor links.
The following is my code:
def parse_links(self, response):
    hxs = HtmlXPathSelector(response)
    links = hxs.select('//a')
    #domain = 'http://popular.ebay.com/'
    for link in links:
        anchor_text = ''.join(link.select('./text()').extract())
        title = ''.join(link.select('./@title').extract())
        url = ''.join(link.select('./@href').extract())
        meta = {'title': title, 'anchor_text': anchor_text}
        yield Request(url, callback=self.parse_page, meta=meta)
I can't just add the base url to #mainContent, because that produces doubled URLs for the links that already have the full url scheme. I end up getting urls like this: http://popular.ebay.comhttp://www.ebay.com/sch/i.html?_nkw=grande+mansion
def parse_links(self, response):
    hxs = HtmlXPathSelector(response)
    links = hxs.select('//a')
    #domain = 'http://popular.ebay.com/'
    for link in links:
        anchor_text = ''.join(link.select('./text()').extract())
        title = ''.join(link.select('./@title').extract())
        url = ''.join(link.select('./@href').extract())
        meta = {'title': title, 'anchor_text': anchor_text}
        yield Request(response.url, callback=self.parse_page, meta=meta)
The links I want to get look like this: Antique Chairs | but I get the error because of links like this: <a id="gh-hdn-stm" class="gh-acc-a" href="#mainContent">Skip to main content</a>
How would I go about adding the base url to only the hash anchor links, or ignoring links without the base url in them? For a simple solution, I've tried setting the rule deny=(#mainContent) and restrict_xpaths, but the crawler still spits out the same error.

The error Missing scheme in request url: #mainContent is caused by requesting a url without a scheme (the "http://" part of the url).
#mainContent is an internal link, referring to an HTML element with the id "mainContent". You probably don't want to follow these links, as they only point to a different part of the current page you're on.
I'd suggest looking at this part of the documentation http://doc.scrapy.org/en/latest/topics/link-extractors.html#scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor. You can tell Scrapy to follow links which conform to a certain format and restrict what part of the page it will fetch links from. Take note of the "restrict_xpaths" and "allow" parameters.
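For instance, a minimal CrawlSpider sketch along those lines might look like this (the spider name, the allow pattern, and the callback are illustrative, not taken from the question). The link extractor resolves relative hrefs against the page URL, so fragment-only links like #mainContent never produce a schemeless Request:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class PopularEbaySpider(CrawlSpider):
    name = 'popular_ebay'
    allowed_domains = ['ebay.com']
    start_urls = ['http://popular.ebay.com/']

    rules = (
        # only keep links whose resolved URL matches the allow pattern
        Rule(SgmlLinkExtractor(allow=(r'ebay\.com', )),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        pass  # your existing item parsing goes here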
Hope this helps :)

In your for loop:
    meta = {'anchor_text': anchor_text}
    url = link.select('./@href').extract()[0]
    if '#' not in url:  # or: if url[0] != '#'
        yield Request(url, callback=self.parse_page, meta=meta)
This will avoid yielding #foobar as a URL. You could add the base url to the #foobar in an else statement, but since that would just point back to a page Scrapy has already scraped, I don't see much point in it.
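If you did want to resolve those fragment links anyway, a sketch of that else branch might look like this (assuming urlparse is imported; as noted, it just re-requests the current page):
    if '#' not in url:
        yield Request(url, callback=self.parse_page, meta=meta)
    else:
        # resolves "#foobar" against the page URL, e.g. http://popular.ebay.com/#foobar
        yield Request(urlparse.urljoin(response.url, url), callback=self.parse_page, meta=meta)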

I found links other than #mainContent that were missing the scheme, so using @Robin's logic I made sure the url contained the base url before calling parse_page.
for link in links:
    anchor_text = ''.join(link.select('./text()').extract())
    title = ''.join(link.select('./@title').extract())
    url = ''.join(link.select('./@href').extract())
    meta = {'title': title, 'anchor_text': anchor_text}
    if domain in url:
        yield Request(url, callback=self.parse_page, meta=meta)
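As a side note on the double-URL concern from the question: urlparse.urljoin leaves absolute hrefs untouched and only resolves relative ones, so joining with it (rather than string concatenation) is a safe alternative to the domain check. A quick sketch of its behaviour:
import urlparse

base = 'http://popular.ebay.com/'
# an absolute href passes through unchanged
urlparse.urljoin(base, 'http://www.ebay.com/sch/i.html?_nkw=grande+mansion')
# -> 'http://www.ebay.com/sch/i.html?_nkw=grande+mansion'
# a fragment-only href is resolved against the base
urlparse.urljoin(base, '#mainContent')
# -> 'http://popular.ebay.com/#mainContent'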

Related

How to get links inside links from webpage in python?

How can I go to a link, get its sub-links, and then again get their sub-links? For example,
I want to go to
"https://stackoverflow.com"
then extract its links, e.g.
['https://stackoverflow.com/questions/ask', 'https://stackoverflow.com/?tab=bounties']
and then go to each of those sub-links and extract their sub-links in turn.
I would recommend using Scrapy for this. With Scrapy, you create a spider object which then is run by the Scrapy module.
First, to get all the links on a page, you can create a Selector object and find all of the hyperlink objects using the XPath:
hxs = scrapy.Selector(response)
urls = hxs.xpath('*//a/@href').extract()
Since hxs.xpath returns an iterable list of results, you can iterate over them directly without storing them in a variable. Also, each URL found should be passed back into this function using the callback argument, allowing it to recursively find all the links within each URL found:
hxs = scrapy.Selector(response)
for url in hxs.xpath('*//a/@href').extract():
    yield scrapy.http.Request(url=url, callback=self.parse)
Each path found might not contain the original URL, so that check has to be made:
if not (url.startswith('http://') or url.startswith('https://')):
    url = "https://stackoverflow.com/" + url
Finally, each URL can be passed to a different function to be parsed; in this case it's just printed:
self.handle(url)
All of this put together in a full Spider object looks like this:
import scrapy

class StackSpider(scrapy.Spider):
    name = "stackoverflow.com"
    # limit the scope to stackoverflow
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "https://stackoverflow.com/",
    ]

    def parse(self, response):
        hxs = scrapy.Selector(response)
        # extract all links from the page
        for url in hxs.xpath('*//a/@href').extract():
            # make it a valid url
            if not (url.startswith('http://') or url.startswith('https://')):
                url = "https://stackoverflow.com/" + url
            # process the url
            self.handle(url)
            # recursively parse each url
            yield scrapy.http.Request(url=url, callback=self.parse)

    def handle(self, url):
        print(url)
And the spider would be run like this:
$ scrapy runspider spider.py > urls.txt
Also, keep in mind that running this code will get you rate limited by Stack Overflow. You might want to find a different target for testing, ideally a site that you're hosting yourself.
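As a hedged sketch (the values are illustrative, not tuned), throttling can be configured on the spider itself:
class StackSpider(scrapy.Spider):
    name = "stackoverflow.com"
    # politeness settings; adjust for the site you actually crawl
    custom_settings = {
        'DOWNLOAD_DELAY': 1.0,         # wait about a second between requests
        'AUTOTHROTTLE_ENABLED': True,  # back off automatically when responses slow down
        'ROBOTSTXT_OBEY': True,        # respect the site's robots.txt
    }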

How to scrape 2 web page with same domain on scrapy using python?

Hi guys, I am very new to scraping data and I have only tried the basics. My problem is that I have two web pages on the same domain that I need to scrape.
My logic is:
First page: www.sample.com/view-all.html
*This page lists all the items, and I need to get the href attribute of every item.
Second page: www.sample.com/productpage.52689.html
*This is the link that comes from the first page, so 52689 needs to change dynamically depending on the link provided by the first page.
I need to get all the data, like title, description, etc., from the second page.
What I am thinking of is a for loop, but it's not working on my end. I've searched on Google but no one has the same problem as mine. Please help me.
import scrapy

class SalesItemSpider(scrapy.Spider):
    name = 'sales_item'
    allowed_domains = ['www.sample.com']
    start_urls = ['www.sample.com/view-all.html', 'www.sample.com/productpage.00001.html']

    def parse(self, response):
        for product_item in response.css('li.product-item'):
            item = {
                'URL': product_item.css('a::attr(href)').extract_first(),
            }
            yield item
Inside parse you can yield a Request() with the url and a function's name to scrape this url in a different function:
def parse(self, response):
    for product_item in response.css('li.product-item'):
        url = product_item.css('a::attr(href)').extract_first()
        # it will send `www.sample.com/productpage.52689.html` to `parse_subpage`
        yield scrapy.Request(url=url, callback=self.parse_subpage)

def parse_subpage(self, response):
    # here you parse www.sample.com/productpage.52689.html
    item = {
        'title': ...,
        'description': ...
    }
    yield item
Look for Request in the Scrapy documentation and its tutorial.
There is also
response.follow(url, callback=self.parse_subpage)
which will automatically add www.sample.com to URLs, so you don't have to do it on your own as in
Request(url = "www.sample.com/" + url, callback=self.parse_subpage)
See A shortcut for creating Requests
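For example, a sketch of parse() from the answer above rewritten with response.follow (available since Scrapy 1.4), reusing the CSS selector from the question:
def parse(self, response):
    for product_item in response.css('li.product-item'):
        # follow() joins a relative href with the current page URL for you
        yield response.follow(product_item.css('a::attr(href)').extract_first(),
                              callback=self.parse_subpage)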
If you're interested in scraping, then you should read docs.scrapy.org from the first page to the last.

Scrapy XHR Pagination on TripAdvisor

Although I've seen several similar questions here regarding this, none seem to precisely define the process for achieving this task. I borrowed largely from the Scrapy script located here but since it is over a year old I had to make adjustments to the xpath references.
My current code looks as such:
import scrapy
from tripadvisor.items import TripadvisorItem

class TrSpider(scrapy.Spider):
    name = 'trspider'
    start_urls = [
        'https://www.tripadvisor.com/Hotels-g29217-Island_of_Hawaii_Hawaii-Hotels.html'
    ]

    def parse(self, response):
        for href in response.xpath('//div[@class="listing_title"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)
        next_page = response.xpath('//div[@class="unified pagination standard_pagination"]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_hotel(self, response):
        for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_review)
        next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_hotel)

    def parse_review(self, response):
        item = TripadvisorItem()
        item['headline'] = response.xpath('translate(//div[@class="quote"]/text(),"!"," ")').extract()[0][1:-1]
        item['review'] = response.xpath('translate(//div[@class="entry"]/p,"\n"," ")').extract()[0]
        item['bubbles'] = response.xpath('//span[contains(@class,"ui_bubble_rating")]/@alt').extract()[0]
        item['date'] = response.xpath('normalize-space(//span[contains(@class,"ratingDate")]/@content)').extract()[0]
        item['hotel'] = response.xpath('normalize-space(//span[@class="altHeadInline"]/a/text())').extract()[0]
        return item
When running the spider in its current form, I scrape the first page of reviews for each hotel listed on the start_urls page but the pagination doesn't flip to the next page of reviews. From what I suspect, this is because of this line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
Since these pages load dynamically, there is no existing href for the next page on the current page. Investigating further I've read that these requests are sending a POST request using XHR. By exploring the "Network" tab in Firefox "Inspect" I can see both a Request URL and Form Data that might be needed to flip the page according to other posts on SO regarding the same topic.
However, it seems that the other posts refer to a static URL starting point when trying to pass a FormRequest using Scrapy. With TripAdvisor, the URL will always change based on the name of the hotel we're looking at, so I'm not sure how to choose a URL when using FormRequest to submit the form data: reqNum=1&changeSet=REVIEW_LIST (this form data also never seems to change from page to page).
Alternatively, there doesn't appear to be a way to extract the URL shown in the "Network" tab's "Request URL". These pages do have URLs that change from page to page but the way TripAdvisor is set up, I cannot seem to extract them from the source code. The review pages change by incrementing the part of the URL that is -orXX- where "XX" is a number. For example:
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or10-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or15-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
So, my question is whether or not it is possible to paginate using the XHR request/form data or do I need to manually build a list of URLs for each hotel that adds the -orXX-?
Well I ended up discovering an xpath that apparently allowed pagination of the reviews, but it's funny because every time I checked the underlying HTML the href link never changed from referring to /Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html even if I was on page 10 for example. It seems the "-orXX-" part of the link always increments the XX by 5 so I'm not sure why this works.
All I did was change the line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
to:
next_page = response.xpath('//link[@rel="next"]/@href')
and now have >41K extracted reviews. I would love to get others' opinions on handling this problem in other situations.
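For reference, a minimal sketch of that pagination step, assuming the same parse_hotel callback as above:
next_page = response.xpath('//link[@rel="next"]/@href').extract_first()
if next_page:
    # urljoin guards against the href being relative
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse_hotel)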

Scrapy Request object is not always followed

I am creating a crawler with Scrapy.
My spider must go to a start page which contains a list of links and a link to the next page.
Then, it must follow each link, get info from that page, and return to the main page.
Finally, once the spider has followed every link on the page, it goes to the next page and begins again.
class jiwire(CrawlSpider):
    name = "example"
    allowed_domains = ["example.ndd"]
    start_urls = ["page.example.ndd"]
    rules = (Rule(SgmlLinkExtractor(allow=("next-page\.htm", ), restrict_xpaths=('//div[@class="paging"]',)), callback="parse_items", follow=True),)

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//td[@class="desc"]')
        for link in links:
            link = title.select("h3/a/@href").extract()
            request = Request("http://v4.jiwire.com/" + str(name), callback=self.parse_sub)
            return(request)

    def parse_sub(self, response):
        hxs = HtmlXPathSelector(response)
        name = hxs.select('//div[@id="content"]/div[@class="header"]/h2/text()').extract()
        print name
To explain my code: I defined a rule to follow the next pages.
To follow each link on the current page, I create a Request object with the extracted link and return this object.
Normally, for each request returned, I should see "print name" output from the parse_sub function.
But only ONE link is followed (not all of them), and I don't understand why.
It crawls the link fine and the request object is created fine, but it enters parse_sub only once per page.
Can you help me?
Thanks a lot
I am back! My problem came from my return statement.
The solution:
for link in links:
    link = title.select("h3/a/@href").extract()
    request = Request(link, callback=self.parse_hotspot)
    yield request
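For context, a sketch of the whole callback with yield in place; the loop variable is used directly here, on the assumption that the title and name variables in the posted snippets were paste slips:
def parse_items(self, response):
    hxs = HtmlXPathSelector(response)
    # 'return' would exit after building the first Request;
    # 'yield' turns the callback into a generator, so every link gets scheduled
    for link in hxs.select('//td[@class="desc"]'):
        href = link.select('h3/a/@href').extract()[0]
        yield Request("http://v4.jiwire.com/" + href, callback=self.parse_sub)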

Combining base url with resultant href in scrapy

Below is my spider code:
import urlparse

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

class Blurb2Spider(BaseSpider):
    name = "blurb2"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        yield self.make_requests_from_url("http://www.domain.com/bookstore/new")

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
        for i in urls:
            yield Request(urlparse.urljoin('www.domain.com/', i[1:]), callback=self.parse_url)

    def parse_url(self, response):
        hxs = HtmlXPathSelector(response)
        print response, '------->'
Here I am trying to combine the href link with the base link, but I am getting the following error:
exceptions.ValueError: Missing scheme in request url: www.domain.com//bookstore/detail/3271993?alt=Something+I+Had+To+Do
Can anyone let me know why I am getting this error and how to join the base url with the href link and yield a request?
An alternative solution, if you don't want to use urlparse:
response.urljoin(i[1:])
This solution goes even a step further: here Scrapy works out the domain base for joining. And as you can see, you don't have to provide the obvious http://www.example.com for joining.
This makes your code reusable in the future if you want to change the domain you are crawling.
It is because you didn't add the scheme, e.g. http://, in your base url.
Try: urlparse.urljoin('http://www.domain.com/', i[1:])
Or, even easier: urlparse.urljoin(response.url, i[1:]), as urlparse.urljoin will sort out the base URL itself.
The best way to follow a link in Scrapy is to use response.follow(); Scrapy will handle the rest.
more info
Quote from docs:
Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin.
Also, you can pass an <a> element directly as an argument.
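A small sketch of how parse() from the question could use it (same XPath, same parse_url callback):
def parse(self, response):
    for a in response.xpath('//div[@class="bookListingBookTitle"]/a'):
        # the <a> selector is passed directly; relative hrefs are
        # resolved against response.url, so no urljoin is needed
        yield response.follow(a, callback=self.parse_url)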
