My Scrapy is not Scraping anything (blank csv file) - python

I'm trying to scrape the top 100 T20 batsmen from the ICC site, but the CSV file I get is blank. There are no errors in my code (at least none that I know of).
Here is my items file:
import scrapy

class DmozItem(scrapy.Item):
    Ranking = scrapy.Field()
    Rating = scrapy.Field()
    Name = scrapy.Field()
    Nationality = scrapy.Field()
    Carer_Best_Rating = scrapy.Field()
and here is my dmoz_spider file:
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "espn"
    allowed_domains = ["relianceiccrankings.com"]
    start_urls = ["http://www.relianceiccrankings.com/ranking/t20/batting/"]

    def parse(self, response):
        #sel = response.selector
        #for tr in sel.css("table.top100table>tbody>tr"):
        for tr in response.xpath('//table[@class="top100table"]/tr'):
            item = DmozItem()
            item['Ranking'] = tr.xpath('//td[@class="top100id"]/text()').extract_first()
            item['Rating'] = tr.xpath('//td[@class="top100rating"]/text()').extract_first()
            item['Name'] = tr.xpath('td[@class="top100name"]/a/text()').extract_first()
            item['Nationality'] = tr.xpath('//td[@class="top100nation"]/text()').extract_first()
            item['Carer_Best_Rating'] = tr.xpath('//td[@class="top100cbr"]/text()').extract_first()
            yield item
What is wrong with my code?

The website you're trying to scrape has a frame in it, and the frame is the page you actually want to scrape.
start_urls = [
    "http://www.relianceiccrankings.com/ranking/t20/batting/"
]
This is the correct URL.
Also, there is quite a bit more going wrong:
To select elements you can use the response itself; you don't need to initialize a variable with response.selector, just select straight from response.xpath('//foo/bar').
Your CSS selector for the table is wrong: top100table is a class rather than an id, so it should be .top100table and not #top100table. Here, just use the XPath for it:
response.xpath("//table[@class='top100table']/tr")
tbody isn't part of the HTML source; it only appears when you inspect with a modern browser, so leave it out of your selectors.
The extract() method always returns a list rather than the element itself, so to take the first element found use extract_first(), like this:
item['Ranking'] = tr.xpath('td[@class="top100id"]/a/text()').extract_first()
Hope this helps, have fun scraping!

To answer your ranking problem: the XPath for Ranking starts with '//...', which means 'from the root of the page', so every row picks up the first match in the whole document. You need it to be relative to tr instead. Simply remove the '//' from every XPath in the for loop:
item['Ranking'] = tr.xpath('td[@class="top100id"]/text()').extract_first()
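Putting both answers together, the whole callback would look something like this (a sketch; whether a cell's text sits directly in the td or inside an a child can vary per column, so adjust the inner paths as needed):
def parse(self, response):
    for tr in response.xpath('//table[@class="top100table"]/tr'):
        item = DmozItem()
        # no leading '//': each path is relative to the current row
        item['Ranking'] = tr.xpath('td[@class="top100id"]/text()').extract_first()
        item['Rating'] = tr.xpath('td[@class="top100rating"]/text()').extract_first()
        item['Name'] = tr.xpath('td[@class="top100name"]/a/text()').extract_first()
        item['Nationality'] = tr.xpath('td[@class="top100nation"]/text()').extract_first()
        item['Carer_Best_Rating'] = tr.xpath('td[@class="top100cbr"]/text()').extract_first()
        yield item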

Related

Scrapy spider not scraping the data correctly

I am trying to scrape data about the circulars from my college's website using Scrapy for a project, but my spider is not scraping the data properly. There are a lot of blank elements, and I am also unable to scrape the 'href' attributes of the circulars for some reason. I am assuming that my CSS selectors are wrong, but I am unable to figure out what exactly I am doing wrong. I copied my CSS selectors using the 'Selector Gadget' Chrome extension. I am still learning Scrapy, so it would be great if you could explain what I was doing wrong.
The website I am scraping data from is: https://www.imsnsit.org/imsnsit/notifications.php
My code is:
import scrapy
from ..items import CircularItem

class CircularSpider(scrapy.Spider):
    name = "circular"
    start_urls = [
        "https://www.imsnsit.org/imsnsit/notifications.php"
    ]

    def parse(self, response):
        items = CircularItem()
        all = response.css('tr~ tr+ tr font')
        for x in all:
            cirName = x.css('a font::text').extract()
            cirLink = x.css('.list-data-focus a').attrib['href'].extract()
            date = x.css('tr~ tr+ tr td::text').extract()
            items["Name"] = cirName
            items["href"] = cirLink
            items["Date"] = date
            yield items
I modified your parse callback and changed the CSS selectors into XPath. Also, try to learn XPath selectors; they are very powerful and easy to use.
Generally, it is a bad idea to copy CSS or XPath expressions from automatic selector tools, because in some cases they give you incorrect results, or just one element without a general path.
First of all, I select all tr elements. If you look carefully, some of the tr rows are blank, used as separators. You can filter them out by trying to select the date: if it is None, just skip the row. Finally, you can select cirName and cirLink.
Also, the markup of the given website is not good and it is really hard to write proper selectors; the elements don't have many attributes like class or id. This is the solution I came up with; I know it is not perfect.
def parse(self, response):
    items = CircularItem()
    all = response.xpath('//tr')  # select all table rows
    for x in all:
        date = x.xpath('.//td/font[@size="3"]/text()').get()  # filter rows by date
        if not date:
            continue  # blank separator row
        cirName = x.xpath('.//a/font/text()').get()
        cirLink = x.xpath('.//a[@title="NOTICES / CIRCULARS"]/@href').get()
        items["Name"] = cirName
        items["href"] = cirLink
        items["Date"] = date
        yield items
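If you want to sanity-check selectors like these before running the whole spider, scrapy shell is handy; for example (assuming the page loads without a session):
scrapy shell "https://www.imsnsit.org/imsnsit/notifications.php"
>>> response.xpath('//tr//td/font[@size="3"]/text()').get()
>>> response.xpath('//a[@title="NOTICES / CIRCULARS"]/@href').get()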

How to iterate over divs in Scrapy?

It is probably a very trivial question, but I am new to Scrapy. I've tried to find a solution to my problem but I just can't see what is wrong with this code.
My goal is to scrape all of the opera shows from the given website. The data for every show is inside one div with the class "row-fluid row-performance ". I am trying to iterate over these divs to retrieve the data, but it doesn't work: it gives me the content of the first div on each iteration (I am getting the same show 19 times, instead of different items).
import scrapy
from ..items import ShowItem

class OperaSpider(scrapy.Spider):
    name = "opera"
    allowed_domains = ["http://www.opera.krakow.pl"]
    start_urls = [
        "http://www.opera.krakow.pl/pl/repertuar/na-afiszu/listopad"
    ]

    def parse(self, response):
        divs = response.xpath('//div[@class="row-fluid row-performance "]')
        for div in divs:
            item = ShowItem()
            item['title'] = div.xpath('//h2[@class="item-title"]/a/text()').extract()
            item['time'] = div.xpath('//div[@class="item-time vertical-center"]/div[@class="vcentered"]/text()').extract()
            item['date'] = div.xpath('//div[@class="item-date vertical-center"]/div[@class="vcentered"]/text()').extract()
            yield item
Try changing the XPaths inside the for loop to start with './/'. That is, just put a dot in front of the double slash, which makes each expression relative to div instead of the whole document. You can also try using extract_first() instead of extract() and see if that gives you better results.
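For illustration, the loop with dot-prefixed XPaths would look like this (a sketch reusing the selectors from the question):
def parse(self, response):
    divs = response.xpath('//div[@class="row-fluid row-performance "]')
    for div in divs:
        item = ShowItem()
        # './/' anchors each lookup to the current div rather than the document root
        item['title'] = div.xpath('.//h2[@class="item-title"]/a/text()').extract_first()
        item['time'] = div.xpath('.//div[@class="item-time vertical-center"]/div[@class="vcentered"]/text()').extract_first()
        item['date'] = div.xpath('.//div[@class="item-date vertical-center"]/div[@class="vcentered"]/text()').extract_first()
        yield item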

Confusion using Xpath when scraping websites with Scrapy

I'm having trouble understanding which part of the XPath to select when trying to scrape certain elements of a website. In this case, I am trying to scrape all the websites that are linked in this article; for example, this section of the markup:
data-track="Body Text Link: External" href="http://www.uspreventiveservicestaskforce.org/Page/Document/RecommendationStatementFinal/brca-related-cancer-risk-assessment-genetic-counseling-and-genetic-testing">
My spider works but it doesn't scrape anything!
My code is below:
import scrapy
from scrapy.selector import Selector
from nymag.items import nymagItem

class nymagSpider(scrapy.Spider):
    name = 'nymag'
    allowed_domains = ['http://wwww.nymag.com']
    start_urls = ["http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html"]

    def parse(self, response):
        #I'm pretty sure the below line is the issue
        links = Selector(response).xpath('//*[@id="primary"]/main/article/div/span')
        for link in links:
            item = nymagItem()
            #This might also be wrong - am trying to extract the href section
            item['link'] = question.xpath('a/@href').extract()
            yield item
There is an easier way. Get all the a elements having data-track and href attributes:
In [1]: for link in response.xpath("//div[@id = 'primary']/main/article//a[@data-track and @href]"):
   ...:     print(link.xpath("@href").extract()[0])
   ...:
//nymag.com/tags/healthcare/
//nymag.com/author/Susan%20Rinkunas/
http://twitter.com/sueonthetown
http://www.facebook.com/sharer/sharer.php?u=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dfb-share-thecut
https://twitter.com/share?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dtwitter-share-thecut&via=TheCut
https://plus.google.com/share?url=http%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html
http://pinterest.com/pin/create/button/?url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dpinterest-share-thecut&description=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&media=http:%2F%2Fpixel.nymag.com%2Fimgs%2Ffashion%2Fdaily%2F2015%2F09%2F08%2F08-angelina-jolie.w750.h750.2x.jpg
whatsapp://send?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0A%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html&mid=whatsapp
mailto:?subject=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&body=I%20saw%20this%20on%20The%20Cut%20and%20thought%20you%20might%20be%20interested...%0A%0AShould%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0AIt's%20not%20a%20crystal%20ball.%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Demailshare%5Fthecut
...
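Folded back into the spider callback, that selection might look like this (a sketch; it reuses the question's nymagItem and assumes the item has a link field):
def parse(self, response):
    for link in response.xpath("//div[@id='primary']/main/article//a[@data-track and @href]"):
        item = nymagItem()
        item['link'] = link.xpath('@href').extract_first()
        yield item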

Xpath selector in python Scrapy

I am currently learning how to use XPath to scrape websites in combination with Python Scrapy, and right now I am stuck on the following:
I am looking at a Dutch website, http://www.ah.nl/producten/bakkerij/brood, where I want to scrape the names of the products.
So eventually I want a CSV file with the names of all these breads. If I inspect the elements, I can see where these names are defined:
I need to find the right XPath to extract "AH Tijgerbrood bruin heel". So what I thought I should do in my spider is the following:
import scrapy
from stack.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "ah"
    allowed_domains = ["ah.nl"]
    start_urls = ['http://www.ah.nl/producten/bakkerij/brood']

    def parse(self, response):
        for sel in response.xpath('//div[@class="product__description small-7 medium-12"]'):
            item = DmozItem()
            item['title'] = sel.xpath('h1/text()').extract()
            yield item
Now, if I crawl with this spider, I don't get any results. I have no clue what I am missing here.
You would have to use Selenium for this task, since all the elements are loaded with JavaScript:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.ah.nl/producten/bakkerij/brood")
# put an arbitrarily large number; you can tone it down, this is to allow the webpage to load
driver.implicitly_wait(40)
elements = driver.find_elements_by_xpath('//*[local-name()="div" and @class="product__description small-7 medium-12"]//*[local-name()="h1"]')
for elem in elements:
    print(elem.text)
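A side note: on Selenium 4 and later the find_elements_by_xpath helper has been removed, so the equivalent lookup goes through the By locator (a sketch assuming a current Selenium install):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://www.ah.nl/producten/bakkerij/brood")
driver.implicitly_wait(40)
# same XPath as above, passed through By.XPATH
elements = driver.find_elements(By.XPATH, '//*[local-name()="div" and @class="product__description small-7 medium-12"]//*[local-name()="h1"]')
for elem in elements:
    print(elem.text)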
title = response.xpath('//div[@class="product__description small-7 medium-12"]/h1/text()').extract()[0]

Scrapy: Extract links and text

I am new to Scrapy and I am trying to scrape the Ikea website. The basic page with the list of locations is given here.
My items.py file is given below:
import scrapy

class IkeaItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
And the spider is given below:
import scrapy
from ikea.items import IkeaItem

class IkeaSpider(scrapy.Spider):
    name = 'ikea'
    allowed_domains = ['http://www.ikea.com/']
    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item
On running the file I am not getting any output. The JSON file output is something like:
[[{"link": [], "name": []}
The output that I am looking for is the name of the location and the link, but I am getting nothing.
Where am I going wrong?
There is a simple mistake inside the xpath expressions for the item fields. The loop is already going over the a tags, so you don't need to specify a again in the inner xpath expressions. In other words, you are currently searching for a tags inside the a tags inside the td inside the tr, which obviously results in nothing.
Replace a/text() with text() and a/@href with @href.
(tested - works for me)
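For reference, here is the loop with those two substitutions applied (a sketch):
def parse(self, response):
    for sel in response.xpath('//tr/td/a'):
        item = IkeaItem()
        # sel is already an <a> element, so select text and href relative to it
        item['name'] = sel.xpath('text()').extract()
        item['link'] = sel.xpath('@href').extract()
        yield item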
use this....
item['name'] = sel.xpath('//a/text()').extract()
item['link'] = sel.xpath('//a/@href').extract()
