It is probably a very trivial question, but I am new to Scrapy. I've tried to find a solution for my problem, but I just can't see what is wrong with this code.
My goal is to scrape all of the opera shows from the given website. The data for every show is inside one div with the class "row-fluid row-performance ". I am trying to iterate over these divs to retrieve the data, but it doesn't work: each iteration gives me the content of the first div (I get the same show 19 times instead of different items).
import scrapy
from ..items import ShowItem

class OperaSpider(scrapy.Spider):
    name = "opera"
    allowed_domains = ["http://www.opera.krakow.pl"]
    start_urls = [
        "http://www.opera.krakow.pl/pl/repertuar/na-afiszu/listopad"
    ]

    def parse(self, response):
        divs = response.xpath('//div[@class="row-fluid row-performance "]')
        for div in divs:
            item = ShowItem()
            item['title'] = div.xpath('//h2[@class="item-title"]/a/text()').extract()
            item['time'] = div.xpath('//div[@class="item-time vertical-center"]/div[@class="vcentered"]/text()').extract()
            item['date'] = div.xpath('//div[@class="item-date vertical-center"]/div[@class="vcentered"]/text()').extract()
            yield item
Try changing the XPaths inside the for loop so they start with .//. That is, put a dot in front of the double slash: an XPath that starts with // searches the whole document every time, while .// searches relative to the current selector. You can also try using extract_first() instead of extract() and see if that gives you better results.
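For reference, here is a minimal sketch of the corrected loop, assuming the same field names and markup as in the question:

def parse(self, response):
    divs = response.xpath('//div[@class="row-fluid row-performance "]')
    for div in divs:
        item = ShowItem()
        # The leading dot makes each XPath relative to the current div,
        # so every iteration reads its own show instead of the first match on the page.
        item['title'] = div.xpath('.//h2[@class="item-title"]/a/text()').extract_first()
        item['time'] = div.xpath('.//div[@class="item-time vertical-center"]/div[@class="vcentered"]/text()').extract_first()
        item['date'] = div.xpath('.//div[@class="item-date vertical-center"]/div[@class="vcentered"]/text()').extract_first()
        yield item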
I am trying to scrape the data about the circulars from my college's website using Scrapy for a project, but my spider is not scraping the data properly. There are a lot of blank elements, and I am also unable to scrape the 'href' attributes of the circulars for some reason. I am assuming that my CSS selectors are wrong, but I am unable to figure out what exactly I am doing wrong. I copied my CSS selectors using the 'Selector Gadget' Chrome extension. I am still learning Scrapy, so it would be great if you could explain what I was doing wrong.
The website I am scraping data from is: https://www.imsnsit.org/imsnsit/notifications.php
My code is:
import scrapy
from ..items import CircularItem

class CircularSpider(scrapy.Spider):
    name = "circular"
    start_urls = [
        "https://www.imsnsit.org/imsnsit/notifications.php"
    ]

    def parse(self, response):
        items = CircularItem()
        all = response.css('tr~ tr+ tr font')
        for x in all:
            cirName = x.css('a font::text').extract()
            cirLink = x.css('.list-data-focus a').attrib['href'].extract()
            date = x.css('tr~ tr+ tr td::text').extract()
            items["Name"] = cirName
            items["href"] = cirLink
            items["Date"] = date
            yield items
I modified your parse callback function and changed the CSS selectors into XPath. Also, try to learn XPath selectors; they are very powerful and easy to use.
Generally, it is a bad idea to copy CSS or XPath expressions from automatic selector tools, because in some cases they give you incorrect results, or just one element without a general path.
First of all, I select all tr elements. If you look carefully, some of the tr rows are blank and used only as separators. You can filter them out by trying to select the date: if it is None, you can just skip the row. Finally, you can select cirName and cirLink.
Also, the markup of the given website is not good and it is really hard to write proper selectors; the elements don't have many attributes like class or id. This is the solution I came up with; I know it is not perfect.
def parse(self, response):
    rows = response.xpath('//tr')  # select all table rows
    for x in rows:
        date = x.xpath('.//td/font[@size="3"]/text()').get()  # filter rows by date
        if not date:
            continue  # blank separator row, skip it
        cirName = x.xpath('.//a/font/text()').get()
        cirLink = x.xpath('.//a[@title="NOTICES / CIRCULARS"]/@href').get()
        items = CircularItem()  # create a fresh item per row instead of mutating one shared instance
        items["Name"] = cirName
        items["href"] = cirLink
        items["Date"] = date
        yield items
I am scraping a webpage from Wikipedia (particularly this one) using a Python library called Scrapy. Here is the original code:
import scrapy
from wikipedia.items import WikipediaItem

class MySpider(scrapy.Spider):
    name = "wiki"
    allowed_domains = ["en.wikipedia.org/"]
    start_urls = [
        'https://en.wikipedia.org/wiki/Category:2013_films',
    ]

    def parse(self, response):
        titles = response.xpath('//div[@id="mw-pages"]//li')
        items = []
        for title in titles:
            item = WikipediaItem()
            item["title"] = title.xpath("a/text()").extract()
            item["url"] = title.xpath("a/@href").extract()
            items.append(item)
        return items
Then in the terminal, I ran scrapy crawl wiki -o wiki.json -t json to output the data to a JSON file. While the code worked, the links assigned to the "url" keys were all relative (e.g. {"url": ["/wiki/9_Full_Moons"], "title": ["9 Full Moons"]}).
Instead of /wiki/9_Full_Moons, I needed http://en.wikipedia.org/wiki/9_Full_Moons. So I modified the above code to import urljoin from the urlparse library. I also modified my for loop to look like this:
for title in titles:
    item = WikipediaItem()
    url = title.xpath("a/@href").extract()
    item["title"] = title.xpath("a/text()").extract()
    item["url"] = urljoin("http://en.wikipedia.org", url[0])
    items.append(item)
return items
I believed this was the correct approach, since the data assigned to the url key is enclosed in brackets (which would entail a list, right?), so to get the string inside it I typed url[0]. However, this time I got an IndexError:
IndexError: list index out of range
Can someone help explain where I went wrong?
So after modeling the code on the example given in the documentation here, I was able to get it to work:
def parse(self, response):
    for text in response.xpath('//div[@id="mw-pages"]//li/a/text()').extract():
        yield WikipediaItem(title=text)
    for href in response.xpath('//div[@id="mw-pages"]//li/a/@href').extract():
        link = urljoin("http://en.wikipedia.org", href)
        yield WikipediaItem(url=link)
If anyone needs further clarification on how the Items class works, the documentation is here.
Furthermore, although the code works, it won't pair each title with its respective link. So it will give you
TITLE, TITLE, TITLE, LINK, LINK, LINK
instead of
TITLE, LINK, TITLE, LINK, TITLE, LINK
(the latter probably being the more desirable result), but that's for another question. If anyone has a proposed solution that works better than mine, I'll be more than happy to hear it! Thanks.
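For the pairing problem, a minimal sketch that iterates over the li elements and fills both fields in one item, so each title stays with its link (the selectors are the ones used above; treat the exact paths as assumptions):

def parse(self, response):
    for li in response.xpath('//div[@id="mw-pages"]//li'):
        title = li.xpath('a/text()').extract_first()
        href = li.xpath('a/@href').extract_first()
        if title and href:
            # One item per list entry keeps title and url together.
            yield WikipediaItem(title=title, url=urljoin("http://en.wikipedia.org", href))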
I think you can just concatenate the two strings instead of using urljoin. Try this:
for title in titles:
    item = WikipediaItem()
    item["title"] = title.xpath("a/text()").extract()
    item["url"] = "http://en.wikipedia.org" + title.xpath("a/@href").extract()[0]
    items.append(item)
return items
On your first iteration of the code with relative links, you used the xpath method: item["url"] = title.xpath("a/@href").extract()
The object returned is (I assume) a list of strings, so indexing it would be valid.
In the new iteration, you used the select method: url = title.select("a/@href").extract(). Then you treated the returned object as an indexable list, with url[0]. Check what the select method returns; maybe it's a list, like in the previous example.
P.S.: IPython is your friend.
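For what it's worth, the IndexError usually means extract() returned an empty list for some li (for instance, one without a direct a child), so url[0] has nothing to index. A defensive sketch, assuming the same loop structure as above:

for title in titles:
    href = title.xpath("a/@href").extract_first()  # returns None instead of raising when nothing matches
    if href is None:
        continue  # skip list entries without a link
    item = WikipediaItem()
    item["title"] = title.xpath("a/text()").extract_first()
    item["url"] = urljoin("http://en.wikipedia.org", href)
    items.append(item)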
I'm trying to scrape the top 100 T20 batsmen from the ICC site; however, the CSV file I'm getting is blank. There are no errors in my code (at least none that I know about).
Here is my items file:
import scrapy

class DmozItem(scrapy.Item):
    Ranking = scrapy.Field()
    Rating = scrapy.Field()
    Name = scrapy.Field()
    Nationality = scrapy.Field()
    Carer_Best_Rating = scrapy.Field()
And here is my dmoz_spider file:
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "espn"
    allowed_domains = ["relianceiccrankings.com"]
    start_urls = ["http://www.relianceiccrankings.com/ranking/t20/batting/"]

    def parse(self, response):
        # sel = response.selector
        # for tr in sel.css("table.top100table>tbody>tr"):
        for tr in response.xpath('//table[@class="top100table"]/tr'):
            item = DmozItem()
            item['Ranking'] = tr.xpath('//td[@class="top100id"]/text()').extract_first()
            item['Rating'] = tr.xpath('//td[@class="top100rating"]/text()').extract_first()
            item['Name'] = tr.xpath('td[@class="top100name"]/a/text()').extract_first()
            item['Nationality'] = tr.xpath('//td[@class="top100nation"]/text()').extract_first()
            item['Carer_Best_Rating'] = tr.xpath('//td[@class="top100cbr"]/text()').extract_first()
            yield item
What is wrong with my code?
The website you're trying to scrape has a frame in it, and the frame is what you actually want to scrape.
start_urls = [
    "http://www.relianceiccrankings.com/ranking/t20/batting/"
]
This is the correct URL.
Also, there is a lot more going wrong:
To select elements you should use the response itself; you don't need to initiate a variable with response.selector, just select straight from response.xpath('//foo/bar').
Your CSS selector for the table is wrong: top100table is a class rather than an id, therefore it should be .top100table and not #top100table.
Here, just have the XPath for it:
response.xpath("//table[@class='top100table']/tr")
tbody isn't part of the HTML source; it only appears when you inspect the page with a modern browser.
The extract() method always returns a list rather than the element itself, so you need to extract the first element you find, like this:
item['Ranking'] = tr.xpath('td[@class="top100id"]/a/text()').extract_first()
Hope this helps, have fun scraping!
To answer your ranking problem: the XPath for Ranking starts with '//...', which means 'from the start of the page'. You need it to be relative to tr instead. Simply remove the '//' from every XPath in the for loop.
item['Ranking'] = tr.xpath('td[@class="top100id"]/text()').extract_first()
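Putting both fixes together, a sketch of the full loop with relative XPaths (the td class names are taken from the question; whether a given cell needs an extra a step, as top100id does above, is an assumption about the markup):

def parse(self, response):
    for tr in response.xpath('//table[@class="top100table"]/tr'):
        item = DmozItem()
        # Every path is relative to the current tr, so each row yields its own values.
        item['Ranking'] = tr.xpath('td[@class="top100id"]/text()').extract_first()
        item['Rating'] = tr.xpath('td[@class="top100rating"]/text()').extract_first()
        item['Name'] = tr.xpath('td[@class="top100name"]/a/text()').extract_first()
        item['Nationality'] = tr.xpath('td[@class="top100nation"]/text()').extract_first()
        item['Carer_Best_Rating'] = tr.xpath('td[@class="top100cbr"]/text()').extract_first()
        yield item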
Right now I am learning how to use XPath to scrape websites in combination with Python Scrapy, and I am stuck on the following:
I am looking at a Dutch website, http://www.ah.nl/producten/bakkerij/brood, where I want to scrape the names of the products.
Eventually I want a CSV file with the names of all these breads. If I inspect the elements, I can see where these names are defined.
I need to find the right XPath to extract "AH Tijgerbrood bruin heel". So what I thought I should do in my spider is the following:
import scrapy
from stack.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "ah"
    allowed_domains = ["ah.nl"]
    start_urls = ['http://www.ah.nl/producten/bakkerij/brood']

    def parse(self, response):
        for sel in response.xpath('//div[@class="product__description small-7 medium-12"]'):
            item = DmozItem()
            item['title'] = sel.xpath('h1/text()').extract()
            yield item
Now, if I crawl with this spider, I don't get any results. I have no clue what I am missing here.
You would have to use Selenium for this task, since all the elements are loaded by JavaScript:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.ah.nl/producten/bakkerij/brood")
# Put an arbitrarily large number (you can tone it down); this allows the webpage to load.
driver.implicitly_wait(40)
elements = driver.find_elements_by_xpath('//*[local-name()="div" and @class="product__description small-7 medium-12"]//*[local-name()="h1"]')
for elem in elements:
    print(elem.text)
title = response.xpath('//div[@class="product__description small-7 medium-12"]/h1/text()').extract()[0]
I am new to Scrapy and I am trying to scrape the Ikea website, starting from the basic page with the list of locations given here.
My items.py file is given below:
import scrapy

class IkeaItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
And the spider is given below:
import scrapy
from ikea.items import IkeaItem

class IkeaSpider(scrapy.Spider):
    name = 'ikea'
    allowed_domains = ['http://www.ikea.com/']
    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item
On running the spider I am not getting any output. The JSON file output is something like:
[[{"link": [], "name": []}
The output that I am looking for is the name of each location and its link, but I am getting nothing.
Where am I going wrong?
There is a simple mistake inside the XPath expressions for the item fields. The loop is already going over the a tags, so you don't need to specify a again in the inner XPath expressions. In other words, you are currently searching for a tags inside the a tags inside the td inside the tr, which obviously results in nothing.
Replace a/text() with text() and a/@href with @href.
(Tested: works for me.)
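For completeness, a minimal sketch of the corrected loop with those two substitutions applied:

def parse(self, response):
    for sel in response.xpath('//tr/td/a'):
        item = IkeaItem()
        # sel is already an <a> element, so the inner paths are relative to it.
        item['name'] = sel.xpath('text()').extract()
        item['link'] = sel.xpath('@href').extract()
        yield item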
Use this:
item['name'] = sel.xpath('//a/text()').extract()
item['link'] = sel.xpath('//a/@href').extract()