XPath selectors in Python Scrapy

I am currently learning how to use XPath to scrape websites with Python Scrapy, and right now I am stuck on the following:
I am looking at a Dutch website, http://www.ah.nl/producten/bakkerij/brood, where I want to scrape the names of the products.
Eventually I want a CSV file with the names of all these breads. When I inspect the elements, I can see where the names are defined.
I need to find the right XPath to extract "AH Tijgerbrood bruin heel". So I thought my spider should do the following:
import scrapy
from stack.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "ah"
    allowed_domains = ["ah.nl"]
    start_urls = ['http://www.ah.nl/producten/bakkerij/brood']

    def parse(self, response):
        for sel in response.xpath('//div[@class="product__description small-7 medium-12"]'):
            item = DmozItem()
            item['title'] = sel.xpath('h1/text()').extract()
            yield item
Now, if I crawl with this spider, I don't get any results. I have no clue what I am missing here.

You would have to use Selenium for this task, since the elements are loaded via JavaScript:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.ah.nl/producten/bakkerij/brood")
# An arbitrarily large timeout (you can tone it down); this allows the page to load.
driver.implicitly_wait(40)
elements = driver.find_elements_by_xpath('//*[local-name()="div" and @class="product__description small-7 medium-12"]//*[local-name()="h1"]')
for elem in elements:
    print(elem.text)

title = response.xpath('//div[@class="product__description small-7 medium-12"]/h1/text()').extract()[0]

Related

Parsing version number from developer website using scrapy in python

I'm attempting to create a spider that scrapes the websites of third-party software in order to build a repository of current version numbers. Here is my attempt at a script to get the current Firefox version number from the site's source. I am using Python 2.7.
import scrapy
import html2text
from scrapy.selector import HtmlXPathSelector

class MozillaSpider(scrapy.Spider):
    name = 'mozilla'
    allowed_domains = ['mozilla.com']
    start_urls = ['https://www.mozilla.org/en-US/firefox/notes/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        version = hxs.select('//html[@id="data-latest-firefox"]/text()').extract()[0]
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        print(converter.handle(version))
Your XPath expression tries to select an html element with an id of data-latest-firefox, and then extract the text inside it. No such element exists, so you get an empty list.
What you want instead is to extract the value of the data-latest-firefox attribute of the html element. You can do that with:
>>> response.xpath('//html/@data-latest-firefox').get()
'59.0.2'
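The distinction between an element's text and one of its attributes can be sketched with the standard library's ElementTree, which supports a subset of XPath (the sample markup below is made up; Scrapy's selectors use full XPath via lxml):

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for the page: the version lives in an attribute
# of the <html> tag, not in its text content.
root = ET.fromstring('<html data-latest-firefox="59.0.2"><body /></html>')

# There is no element whose id is "data-latest-firefox", so this finds nothing:
print(root.findall(".//*[@id='data-latest-firefox']"))  # []

# What we actually want is the attribute value on the root element itself:
print(root.get('data-latest-firefox'))  # 59.0.2
```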
Your XPath is wrong:
//html[@id="data-latest-firefox"]/text()
You are trying to select an html tag with an id equal to data-latest-firefox and to extract its text. There is no html tag with that id, so it returns nothing. What you need is:
'/html/@data-latest-firefox'
That is: select the html tag and retrieve its data-latest-firefox attribute.
Also, you can simplify your parse method:
def parse(self, response):
    version = response.xpath('/html/@data-latest-firefox').extract_first()
    print(version)

How to iterate over divs in Scrapy?

It is probably a very trivial question, but I am new to Scrapy. I've tried to find a solution to my problem but I just can't see what is wrong with this code.
My goal is to scrape all of the opera shows from the given website. The data for every show is inside one div with the class "row-fluid row-performance ". I am trying to iterate over those divs to retrieve the data, but it doesn't work: I get the content of the first div in every iteration (19 copies of the same show instead of different items).
import scrapy
from ..items import ShowItem

class OperaSpider(scrapy.Spider):
    name = "opera"
    allowed_domains = ["http://www.opera.krakow.pl"]
    start_urls = [
        "http://www.opera.krakow.pl/pl/repertuar/na-afiszu/listopad"
    ]

    def parse(self, response):
        divs = response.xpath('//div[@class="row-fluid row-performance "]')
        for div in divs:
            item = ShowItem()
            item['title'] = div.xpath('//h2[@class="item-title"]/a/text()').extract()
            item['time'] = div.xpath('//div[@class="item-time vertical-center"]/div[@class="vcentered"]/text()').extract()
            item['date'] = div.xpath('//div[@class="item-date vertical-center"]/div[@class="vcentered"]/text()').extract()
            yield item
Try changing the XPaths inside the for loop to start with .// — that is, just put a dot in front of the double slash, so each expression is relative to div rather than to the whole page. You can also try using extract_first() instead of extract() and see if that gives you better results.
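The effect of anchoring the search to each div can be sketched with the standard library's ElementTree (made-up markup mirroring the structure described in the question; ElementTree only supports relative paths, which happens to match the fix):

```python
import xml.etree.ElementTree as ET

# Two performance blocks, each with its own title.
html = """
<body>
  <div class="row-performance"><h2 class="item-title">Carmen</h2></div>
  <div class="row-performance"><h2 class="item-title">Tosca</h2></div>
</body>
"""
root = ET.fromstring(html)

titles = []
for div in root.findall(".//div[@class='row-performance']"):
    # The leading "." scopes the search to this div, so each iteration
    # yields that div's own title, not the first title on the page.
    titles.append(div.find(".//h2[@class='item-title']").text)

print(titles)  # ['Carmen', 'Tosca']
```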

My Scrapy is not Scraping anything (blank csv file)

I'm trying to scrape the top 100 T20 batsmen from the ICC site; however, the CSV file I'm getting is blank. There are no errors in my code (at least none that I know about).
Here is my items file:
import scrapy

class DmozItem(scrapy.Item):
    Ranking = scrapy.Field()
    Rating = scrapy.Field()
    Name = scrapy.Field()
    Nationality = scrapy.Field()
    Carer_Best_Rating = scrapy.Field()
And my dmoz_spider file:
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "espn"
    allowed_domains = ["relianceiccrankings.com"]
    start_urls = ["http://www.relianceiccrankings.com/ranking/t20/batting/"]

    def parse(self, response):
        # sel = response.selector
        # for tr in sel.css("table.top100table>tbody>tr"):
        for tr in response.xpath('//table[@class="top100table"]/tr'):
            item = DmozItem()
            item['Ranking'] = tr.xpath('//td[@class="top100id"]/text()').extract_first()
            item['Rating'] = tr.xpath('//td[@class="top100rating"]/text()').extract_first()
            item['Name'] = tr.xpath('td[@class="top100name"]/a/text()').extract_first()
            item['Nationality'] = tr.xpath('//td[@class="top100nation"]/text()').extract_first()
            item['Carer_Best_Rating'] = tr.xpath('//td[@class="top100cbr"]/text()').extract_first()
            yield item
what is wrong with my code?
The website you're trying to scrape has a frame in it, and that frame is what you actually want to scrape.
start_urls = [
    "http://www.relianceiccrankings.com/ranking/t20/batting/"
]
This is the correct URL
There is also a lot more going wrong:
To select elements you should use the response itself; you don't need to assign response.selector to a variable, just select straight from response.xpath('//foo/bar').
Your CSS selector for the table is wrong: top100table is a class rather than an id, so it should be .top100table and not #top100table.
Or just use the XPath for it:
response.xpath("//table[@class='top100table']/tr")
tbody isn't part of the HTML source; it only appears when you inspect the page with a modern browser.
The extract() method always returns a list rather than the elements themselves, so you need to take the first element found, like this:
item['Ranking'] = tr.xpath('td[@class="top100id"]/a/text()').extract_first()
Hope this helps, have fun scraping!
To answer your ranking problem: the XPath for Ranking starts with '//...', which means "from the start of the page". You need it to be relative to tr instead. Simply remove the '//' from every XPath in the for loop:
item['Ranking'] = tr.xpath('td[@class="top100id"]/text()').extract_first()
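The per-row, relative-lookup pattern can be sketched with the standard library's ElementTree against a made-up table of the same shape (the class names match the question; the row data is invented for illustration — in the real spider you would keep Scrapy's tr.xpath('td[...]') calls):

```python
import xml.etree.ElementTree as ET

# Made-up fragment with the same shape as the rankings table.
html = """
<table class="top100table">
  <tr><td class="top100id">1</td><td class="top100name"><a>Player A</a></td></tr>
  <tr><td class="top100id">2</td><td class="top100name"><a>Player B</a></td></tr>
</table>
"""
root = ET.fromstring(html)

items = []
for tr in root.findall('./tr'):
    # Relative lookups: each row only sees its own cells.
    items.append({
        'Ranking': tr.find("td[@class='top100id']").text,
        'Name': tr.find("td[@class='top100name']/a").text,
    })

print(items)
```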

Confusion using Xpath when scraping websites with Scrapy

I'm having trouble understanding which part of the XPath to select when trying to scrape certain elements of a website. In this case, I am trying to scrape all the websites linked in this article; for example, this section of the markup:
data-track="Body Text Link: External" href="http://www.uspreventiveservicestaskforce.org/Page/Document/RecommendationStatementFinal/brca-related-cancer-risk-assessment-genetic-counseling-and-genetic-testing">
My spider runs, but it doesn't scrape anything!
My code is below:
import scrapy
from scrapy.selector import Selector
from nymag.items import nymagItem

class nymagSpider(scrapy.Spider):
    name = 'nymag'
    allowed_domains = ['http://wwww.nymag.com']
    start_urls = ["http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html"]

    def parse(self, response):
        # I'm pretty sure the below line is the issue
        links = Selector(response).xpath('//*[@id="primary"]/main/article/div/span')
        for link in links:
            item = nymagItem()
            # This might also be wrong - am trying to extract the href section
            item['link'] = link.xpath('a/@href').extract()
            yield item
There is an easier way: get all the a elements having both data-track and href attributes:
In [1]: for link in response.xpath("//div[@id='primary']/main/article//a[@data-track and @href]"):
   ...:     print link.xpath("@href").extract()[0]
   ...:
//nymag.com/tags/healthcare/
//nymag.com/author/Susan%20Rinkunas/
http://twitter.com/sueonthetown
http://www.facebook.com/sharer/sharer.php?u=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dfb-share-thecut
https://twitter.com/share?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dtwitter-share-thecut&via=TheCut
https://plus.google.com/share?url=http%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html
http://pinterest.com/pin/create/button/?url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dpinterest-share-thecut&description=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&media=http:%2F%2Fpixel.nymag.com%2Fimgs%2Ffashion%2Fdaily%2F2015%2F09%2F08%2F08-angelina-jolie.w750.h750.2x.jpg
whatsapp://send?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0A%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html&mid=whatsapp
mailto:?subject=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&body=I%20saw%20this%20on%20The%20Cut%20and%20thought%20you%20might%20be%20interested...%0A%0AShould%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0AIt's%20not%20a%20crystal%20ball.%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Demailshare%5Fthecut
...
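The [@data-track and @href] predicate keeps only anchors carrying both attributes. A rough stand-in with the standard library's ElementTree (which has no and operator, but chained presence predicates express the same filter; the sample links below are made up):

```python
import xml.etree.ElementTree as ET

html = """
<article>
  <a data-track="Body Text Link: External" href="http://example.org/a">a</a>
  <a href="http://example.org/b">no data-track</a>
  <a data-track="Body Text Link: External">no href</a>
</article>
"""
root = ET.fromstring(html)

# Chained [@attr] predicates require each attribute to be present.
links = [a.get('href') for a in root.findall('.//a[@data-track][@href]')]
print(links)  # ['http://example.org/a']
```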

Scrapy: Extract links and text

I am new to Scrapy and I am trying to scrape the Ikea website. The basic page with the list of locations is given here.
My items.py file is given below:
import scrapy

class IkeaItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
And the spider is given below:
import scrapy
from ikea.items import IkeaItem

class IkeaSpider(scrapy.Spider):
    name = 'ikea'
    allowed_domains = ['http://www.ikea.com/']
    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item
On running the spider I am not getting any output; the JSON file contains something like:
[[{"link": [], "name": []}
The output I am looking for is the name of each location and its link, but I am getting nothing. Where am I going wrong?
There is a simple mistake inside the XPath expressions for the item fields. The loop is already going over the a tags, so you don't need to specify a again in the inner XPath expressions; currently you are searching for a tags inside the a tags inside the td inside the tr, which obviously matches nothing.
Replace a/text() with text() and a/@href with @href.
(tested - works for me)
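The off-by-one level is easy to reproduce with the standard library's ElementTree (made-up markup shaped like the location table; Scrapy's selectors behave analogously):

```python
import xml.etree.ElementTree as ET

html = """
<table>
  <tr><td><a href="/us">United States</a></td></tr>
  <tr><td><a href="/de">Germany</a></td></tr>
</table>
"""
root = ET.fromstring(html)

# Each sel already *is* the <a> element, so looking for another
# <a> inside it finds nothing:
bad = [sel.find('a') for sel in root.findall('.//tr/td/a')]
print(bad)  # [None, None]

# The fix: read the text and attribute off sel directly.
good = [(sel.text, sel.get('href')) for sel in root.findall('.//tr/td/a')]
print(good)  # [('United States', '/us'), ('Germany', '/de')]
```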
Use this:
item['name'] = sel.xpath('//a/text()').extract()
item['link'] = sel.xpath('//a/@href').extract()