Confusion using Xpath when scraping websites with Scrapy - python

I'm having trouble understanding which part of the XPath to select when trying to scrape certain elements of a website. In this case, I am trying to scrape all the websites linked in this article, for example anchors like this one:
data-track="Body Text Link: External" href="http://www.uspreventiveservicestaskforce.org/Page/Document/RecommendationStatementFinal/brca-related-cancer-risk-assessment-genetic-counseling-and-genetic-testing">
My spider runs, but it doesn't scrape anything!
My code is below:
import scrapy
from scrapy.selector import Selector
from nymag.items import nymagItem

class nymagSpider(scrapy.Spider):
    name = 'nymag'
    allowed_domains = ['http://wwww.nymag.com']
    start_urls = ["http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html"]

    def parse(self, response):
        # I'm pretty sure the below line is the issue
        links = Selector(response).xpath(//*[@id="primary"]/main/article/div/span)
        for link in links:
            item = nymagItem()
            # This might also be wrong - am trying to extract the href section
            item['link'] = question.xpath('a/@href').extract()
            yield item

There is an easier way. Get all the a elements having data-track and href attributes:
In [1]: for link in response.xpath("//div[@id = 'primary']/main/article//a[@data-track and @href]"):
   ...:     print link.xpath("@href").extract()[0]
   ...:
//nymag.com/tags/healthcare/
//nymag.com/author/Susan%20Rinkunas/
http://twitter.com/sueonthetown
http://www.facebook.com/sharer/sharer.php?u=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dfb-share-thecut
https://twitter.com/share?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dtwitter-share-thecut&via=TheCut
https://plus.google.com/share?url=http%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html
http://pinterest.com/pin/create/button/?url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dpinterest-share-thecut&description=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&media=http:%2F%2Fpixel.nymag.com%2Fimgs%2Ffashion%2Fdaily%2F2015%2F09%2F08%2F08-angelina-jolie.w750.h750.2x.jpg
whatsapp://send?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0A%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html&mid=whatsapp
mailto:?subject=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&body=I%20saw%20this%20on%20The%20Cut%20and%20thought%20you%20might%20be%20interested...%0A%0AShould%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0AIt's%20not%20a%20crystal%20ball.%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Demailshare%5Fthecut
...
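If you only want the external links from the article body (the anchors with data-track="Body Text Link: External" shown in the question), a minimal sketch of the full spider might look like this, assuming the nymagItem from the question has a link field; the narrower XPath predicate is the only real change:

import scrapy
from nymag.items import nymagItem

class nymagSpider(scrapy.Spider):
    name = 'nymag'
    allowed_domains = ['nymag.com']  # note: just the domain, no scheme
    start_urls = ["http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html"]

    def parse(self, response):
        # Only anchors tagged as external body-text links
        links = response.xpath("//div[@id='primary']/main/article//a[@data-track='Body Text Link: External']")
        for link in links:
            item = nymagItem()
            item['link'] = link.xpath('@href').extract_first()
            yield item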

Related

How to follow a hyper-reference in scrapy if the href attribute contains a hash symbol

In my web-scraping project I have to scrape the football match data from https://www.national-football-teams.com/country/67/2018/France.html
In order to navigate to the match data from the above URL, I have to follow the "Matches" link, whose href contains a hash fragment.
The standard scrapy mechanism of following the links:
href = response.xpath("//a[contains(@href,'matches')]/@href").extract_first()
href = response.urljoin(href)
will produce a link that will not lead to the matches data:
https://www.national-football-teams.com/matches.html
I would appreciate any help. Since I am a newbie to web scraping and anything to do with web development, specific advice and/or a minimal working example would be highly appreciated.
For completeness, here is the complete code of my Scrapy spider:
import scrapy

class NationalFootballTeams(scrapy.Spider):
    name = "nft"
    start_urls = ['https://www.national-football-teams.com/continent/1/Europe.html']

    def parse(self, response):
        for country in response.xpath("//div[@class='row country-teams']/div[1]/ul/li/a"):
            cntry = country.xpath("text()").extract_first().strip()
            if cntry == 'France':
                href = country.xpath("@href").extract_first()
                yield response.follow(href, self.parse_country)

    def parse_country(self, response):
        href = response.xpath("//a[contains(@href,'matches')]/@href").extract_first()
        href = response.urljoin(href)
        print href
        yield scrapy.Request(url=href, callback=self.parse_matches)

    def parse_matches(self, response):
        print response.xpath("//tr[@class='win']").extract()
When clicking that link, no new page or even new data is loaded; it's already in the HTML, just hidden. Clicking the link calls some JavaScript that hides the current tab and shows the new one. So to get to the data, you don't need to follow any link, just use a different XPath query: the match data is under //div[@id='matches'].
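For illustration, a minimal sketch of parse_country that reads the hidden tab directly instead of following the link; the tr[@class='win'] rows come from the question, and yielding a plain dict is just an assumption:

def parse_country(self, response):
    # The match data is already in the page, hidden behind the "Matches" tab
    for row in response.xpath("//div[@id='matches']//tr[@class='win']"):
        # Pull out whatever cells you need; the field name below is illustrative
        yield {'match_row': row.extract()}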

Parsing version number from developer website using scrapy in python

I'm attempting to create a spider that scrapes the websites of third-party software in order to create a repository of current version numbers. Here is my attempt at a script to get the current Firefox version number from the site's HTML. I am using Python 2.7.
import scrapy
import html2text
from scrapy.selector import HtmlXPathSelector

class MozillaSpider(scrapy.Spider):
    name = 'mozilla'
    allowed_domains = ['mozilla.com']
    start_urls = ['https://www.mozilla.org/en-US/firefox/notes/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        version = hxs.select('//html[@id="data-latest-firefox"]/text()').extract()[0]
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        print(converter.handle(version))
Your xpath expression tries to select the html element with an id of data-latest-firefox and then extract the text inside it. Such an element doesn't exist, so you get an empty list.
What you want instead is to extract the value of the data-latest-firefox attribute of the html element. You can do that by using:
>>> response.xpath('//html/@data-latest-firefox').get()
'59.0.2'
Your xpath is wrong:
//html[@id="data-latest-firefox"]/text()
You are trying to select the html tag with an id equal to data-latest-firefox and to extract its text. There is no html tag with that id, so it won't return anything. What you need is:
'/html/@data-latest-firefox'
That means: select the html tag and retrieve its data-latest-firefox attribute.
Also, you can simplify your parse method:
def parse(self, response):
    version = response.xpath('/html/@data-latest-firefox').extract_first()
    print(version)
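Putting it together, a minimal sketch of the whole spider with that expression; allowed_domains is set to mozilla.org here to match the start URL, and yielding a dict is an assumption (the question only printed the value):

import scrapy

class MozillaSpider(scrapy.Spider):
    name = 'mozilla'
    allowed_domains = ['mozilla.org']
    start_urls = ['https://www.mozilla.org/en-US/firefox/notes/']

    def parse(self, response):
        # The version number lives in an attribute of the <html> element itself
        version = response.xpath('/html/@data-latest-firefox').extract_first()
        yield {'firefox_version': version}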

Xpath selector in python Scrapy

I am currently learning how to use XPath to scrape websites in combination with Python Scrapy. Right now I am stuck at the following:
I am looking at a Dutch website, http://www.ah.nl/producten/bakkerij/brood, where I want to scrape the names of the products.
Eventually I want a CSV file with the names of all these breads. If I inspect the elements, I can see where these names are defined:
I need to find the right XPath to extract "AH Tijgerbrood bruin heel". So what I thought I should do in my spider is the following:
import scrapy
from stack.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "ah"
    allowed_domains = ["ah.nl"]
    start_urls = ['http://www.ah.nl/producten/bakkerij/brood']

    def parse(self, response):
        for sel in response.xpath('//div[@class="product__description small-7 medium-12"]'):
            item = DmozItem()
            item['title'] = sel.xpath('h1/text()').extract()
            yield item
Now, if I crawl with this spider, I don't get any results. I have no clue what I am missing here.
You would have to use Selenium for this task, since all the elements are loaded via JavaScript:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.ah.nl/producten/bakkerij/brood")
# Put an arbitrarily large number; you can tone it down. This is to allow the webpage to load.
driver.implicitly_wait(40)
elements = driver.find_elements_by_xpath('//*[local-name()= "div" and @class="product__description small-7 medium-12"]//*[local-name()="h1"]')
for elem in elements:
    print elem.text
title = response.xpath('//div[@class="product__description small-7 medium-12"]/h1/text()').extract()[0]
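If you want to keep the Selenium-rendered page inside a Scrapy spider, here is a hedged sketch of one way to wire the two together, reusing the DmozItem from the question; the spider name and the init/closed wiring are assumptions, not a definitive integration:

import scrapy
from selenium import webdriver
from stack.items import DmozItem

class AhSeleniumSpider(scrapy.Spider):
    name = "ah_selenium"
    start_urls = ['http://www.ah.nl/producten/bakkerij/brood']

    def __init__(self, *args, **kwargs):
        super(AhSeleniumSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()

    def parse(self, response):
        # Let Selenium render the JavaScript before reading the product names
        self.driver.get(response.url)
        self.driver.implicitly_wait(40)
        elements = self.driver.find_elements_by_xpath(
            '//div[@class="product__description small-7 medium-12"]//h1')
        for elem in elements:
            item = DmozItem()
            item['title'] = elem.text
            yield item

    def closed(self, reason):
        # Shut the browser down when the spider finishes
        self.driver.quit()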

Scrapy: Extract links and text

I am new to Scrapy and I am trying to scrape the Ikea website, starting from the basic page with the list of locations given here.
My items.py file is given below:
import scrapy

class IkeaItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
And the spider is given below:
import scrapy
from ikea.items import IkeaItem

class IkeaSpider(scrapy.Spider):
    name = 'ikea'
    allowed_domains = ['http://www.ikea.com/']
    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item
On running the spider I am not getting any useful output; the JSON file output is something like:
[[{"link": [], "name": []}
The output that I am looking for is the name of the location and the link. I am getting nothing.
Where am I going wrong?
There is a simple mistake inside the xpath expressions for the item fields. The loop is already going over the a tags, so you don't need to specify a again in the inner xpath expressions. In other words, currently you are searching for a tags inside the a tags inside the td inside the tr, which obviously results in nothing.
Replace a/text() with text() and a/@href with @href.
(tested - works for me)
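For reference, a sketch of the corrected parse method with the relative expressions:

def parse(self, response):
    for sel in response.xpath('//tr/td/a'):
        item = IkeaItem()
        # The loop variable already points at an <a> element, so use relative paths
        item['name'] = sel.xpath('text()').extract()
        item['link'] = sel.xpath('@href').extract()
        yield item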
Or use this:
item['name'] = sel.xpath('//a/text()').extract()
item['link'] = sel.xpath('//a/@href').extract()

scrape through website with href references

I am using scrapy, and I want to scrape through www.rentler.com. I have gone to the website and searched for the city that I am interested in, and here is the link of that search result:
https://www.rentler.com/search?Location=millcreek&MaxPrice=
Now, all of the listings that I am interested in are contained on that page, and I want to recursively step through them, one by one.
Each listing is listed under:
<body>/<div id="wrap">/<div class="container search-res">/<ul class="search-results"><li class="result">
Each result has an <a class="search-result-link" href="/listing/288910">.
I know that I need to create a rule for the CrawlSpider and have it look at that href and append it to the URL. That way it could go to each page and grab the data that I am interested in.
I think I need something like this:
rules = (Rule(SgmlLinkExtractor(allow="not sure what to insert here, but this is where I think I need the href appending"), callback='parse_item', follow=True),)
UPDATE
Thank you for the input. Here is what I now have; it seems to run but does not scrape anything:
import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from KSL.items import KSLitem

class KSL(CrawlSpider):
    name = "ksl"
    allowed_domains = ["https://www.rentler.com"]
    start_urls = ["https://www.rentler.com/ksl/listing/index/?sid=17403849&nid=651&ad=452978"]
    regex_pattern = '<a href="listing/(.*?) class="search-result-link">'

    def parse_item(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        sites = re.findall(regex_pattern, "https://www.rentler.com/search?location=millcreek&MaxPrice=")
        for site in sites:
            item = KSLitem()
            item['price'] = site.select('//div[@class="price"]/text()').extract()
            item['address'] = site.select('//div[@class="address"]/text()').extract()
            item['stats'] = site.select('//ul[@class="basic-stats"]/li/div[@class="count"]/text()').extract()
            item['description'] = site.select('//div[@class="description"]/div/p/text()').extract()
            items.append(item)
        return items
Thoughts?
If you need to scrape data out of HTML files, which is the case here, I would recommend using BeautifulSoup; it's very easy to install and use:
from bs4 import BeautifulSoup

bs = BeautifulSoup(html)
for link in bs.find_all('a'):
    if link.has_attr('href'):
        print link.attrs['href']
This little script would get every href attribute that is inside an a HTML tag.
Edit: Fully functional script:
I tested this on my computer and the result was as expected. BeautifulSoup needs plain HTML, and you can scrape what you need out of it; take a look at this code:
import requests
from bs4 import BeautifulSoup

html = requests.get(
    'https://www.rentler.com/search?Location=millcreek&MaxPrice=').text
bs = BeautifulSoup(html)

possible_links = bs.find_all('a')
for link in possible_links:
    if link.has_attr('href'):
        print link.attrs['href']
That only shows you how to scrape the href values out of the HTML page you are trying to scrape. Of course you can use it inside Scrapy; as I told you, BeautifulSoup only needs plain HTML, which is why I use requests.get(url).text, and you can scrape out of that. So Scrapy can pass that plain HTML to BeautifulSoup just as well.
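If you do want to stay inside Scrapy, here is a minimal sketch of handing the downloaded HTML to BeautifulSoup in the callback; the spider name and yielding a plain dict are illustrative assumptions:

import scrapy
from bs4 import BeautifulSoup

class RentlerSoupSpider(scrapy.Spider):
    name = 'rentler_soup'
    start_urls = ['https://www.rentler.com/search?Location=millcreek&MaxPrice=']

    def parse(self, response):
        # Scrapy already downloaded the plain HTML; BeautifulSoup can parse it directly
        bs = BeautifulSoup(response.body)
        for link in bs.find_all('a'):
            if link.has_attr('href'):
                yield {'href': link.attrs['href']}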
Edit 2
Ok, look, I don't think you need Scrapy at all. If the previous script gets you all the links that you want to take data from, you only need to do something like this:
Supposing I have a valid list of URLs I want to get specific data from, say price, acres, address... You could build this with the previous script: instead of printing the URLs to the screen, append them to a list, keeping only the ones that start with /listing/. That way you have a valid list of URLs.
for url in valid_urls:
    bs = BeautifulSoup(requests.get(url).text)
    price = bs.find('span', {'class': 'amount'}).text
    print price
You only need to look at the source code and you'll get the idea of how to scrape the data you need from every single url.
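For illustration, a short sketch of building that valid_urls list from the earlier script; the base URL used for joining the relative links is an assumption:

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin  # Python 2; use urllib.parse on Python 3

base_url = 'https://www.rentler.com'  # assumed base for building absolute URLs
html = requests.get('https://www.rentler.com/search?Location=millcreek&MaxPrice=').text
bs = BeautifulSoup(html)

valid_urls = []
for link in bs.find_all('a'):
    # Keep only the listing links and turn them into absolute URLs
    if link.has_attr('href') and link.attrs['href'].startswith('/listing/'):
        valid_urls.append(urljoin(base_url, link.attrs['href']))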
You can use a regular expression to find all the rental home IDs in the links. From there, you can use those IDs to scrape each listing page instead.
import re

regex_pattern = '<a href="/listing/(.*?)" class="search-result-link">'
rental_home_ids = re.findall(regex_pattern, SOURCE_OF_THE_RENTLER_PAGE)
for rental_id in rental_home_ids:
    # Process the data from the page here.
    print rental_id
EDIT:
Here's a self-contained version of the code. It prints all the listing IDs. You can use it as-is.
import re
import urllib

url_to_scrape = "https://www.rentler.com/search?Location=millcreek&MaxPrice="
page_source = urllib.urlopen(url_to_scrape).read()

regex_pattern = '<a href="/listing/(.*?)" class="search-result-link">'
rental_home_ids = re.findall(regex_pattern, page_source)
for rental_id in rental_home_ids:
    # Process the data from the page here.
    print rental_id
