I am new to Scrapy and I am trying to scrape a page of the Ikea website: the basic page with the list of locations, as given here.
My items.py file is given below:
import scrapy

class IkeaItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
And the spider is given below:
import scrapy
from ikea.items import IkeaItem

class IkeaSpider(scrapy.Spider):
    name = 'ikea'
    allowed_domains = ['http://www.ikea.com/']
    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item
On running the spider I am not getting any output. The JSON file output is something like:
[[{"link": [], "name": []}
The output that I am looking for is the name of each location and its link, but I am getting nothing.
Where am I going wrong?
There is a simple mistake inside the XPath expressions for the item fields. The loop is already iterating over the a tags, so you don't need to specify a again in the inner XPath expressions. In other words, you are currently searching for a tags inside the a tags inside the td inside the tr, which obviously matches nothing.
Replace a/text() with text() and a/@href with @href.
(tested - works for me)
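For reference, here is a minimal sketch of the corrected loop (same spider as above, only the inner expressions changed):

def parse(self, response):
    for sel in response.xpath('//tr/td/a'):
        item = IkeaItem()
        # the queries are now relative to the <a> node the loop is positioned on
        item['name'] = sel.xpath('text()').extract()
        item['link'] = sel.xpath('@href').extract()
        yield item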
Or use this:

item['name'] = sel.xpath('//a/text()').extract()
item['link'] = sel.xpath('//a/@href').extract()
It is probably a very trivial question, but I am new to Scrapy. I've tried to find a solution to my problem but I just can't see what is wrong with this code.
My goal is to scrape all of the opera shows from the given website. The data for every show is inside one div with the class "row-fluid row-performance ". I am trying to iterate over those divs to retrieve the data, but it doesn't work: I get the content of the first div on every iteration (19 copies of the same show instead of different items).
import scrapy
from ..items import ShowItem

class OperaSpider(scrapy.Spider):
    name = "opera"
    allowed_domains = ["http://www.opera.krakow.pl"]
    start_urls = [
        "http://www.opera.krakow.pl/pl/repertuar/na-afiszu/listopad"
    ]

    def parse(self, response):
        divs = response.xpath('//div[@class="row-fluid row-performance "]')
        for div in divs:
            item = ShowItem()
            item['title'] = div.xpath('//h2[@class="item-title"]/a/text()').extract()
            item['time'] = div.xpath('//div[@class="item-time vertical-center"]/div[@class="vcentered"]/text()').extract()
            item['date'] = div.xpath('//div[@class="item-date vertical-center"]/div[@class="vcentered"]/text()').extract()
            yield item
Try changing the XPaths inside the for loop to start with .//. That is, just put a dot in front of the double slash; this makes each query relative to the current div instead of searching the whole page again. You can also try using extract_first() instead of extract() and see if that gives you better results.
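As a sketch (untested against the live page), the loop with relative XPaths would look like this:

for div in divs:
    item = ShowItem()
    # the leading .// makes each query relative to the current div
    item['title'] = div.xpath('.//h2[@class="item-title"]/a/text()').extract_first()
    item['time'] = div.xpath('.//div[@class="item-time vertical-center"]/div[@class="vcentered"]/text()').extract_first()
    item['date'] = div.xpath('.//div[@class="item-date vertical-center"]/div[@class="vcentered"]/text()').extract_first()
    yield item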
I am scraping a webpage from Wikipedia (particularly this one) using a Python library called Scrapy. Here is the original code:
import scrapy
from wikipedia.items import WikipediaItem

class MySpider(scrapy.Spider):
    name = "wiki"
    allowed_domains = ["en.wikipedia.org/"]
    start_urls = [
        'https://en.wikipedia.org/wiki/Category:2013_films',
    ]

    def parse(self, response):
        titles = response.xpath('//div[@id="mw-pages"]//li')
        items = []
        for title in titles:
            item = WikipediaItem()
            item["title"] = title.xpath("a/text()").extract()
            item["url"] = title.xpath("a/@href").extract()
            items.append(item)
        return items
Then in the terminal, I ran scrapy crawl wiki -o wiki.json -t json to output the data to a JSON file. While the code worked, the links assigned to the "url" keys were all relative links (i.e.: {"url": ["/wiki/9_Full_Moons"], "title": ["9 Full Moons"]}).
Instead of /wiki/9_Full_Moons, I needed http://en.wikipedia.org/wiki/9_Full_Moons. So I modified the above code to import urljoin from the urlparse module, and changed my for loop to look like this instead:
for title in titles:
    item = WikipediaItem()
    url = title.xpath("a/@href").extract()
    item["title"] = title.xpath("a/text()").extract()
    item["url"] = urljoin("http://en.wikipedia.org", url[0])
    items.append(item)
return items
I believed this was the correct approach, since the data assigned to the url key is enclosed in brackets (which would mean a list, right?), so to get the string inside it I indexed it with url[0]. However, this time I got an IndexError:
IndexError: list index out of range
Can someone help explain where I went wrong?
After mirroring the code on the example given in the documentation here, I was able to get the code to work:
def parse(self, response):
    for text in response.xpath('//div[@id="mw-pages"]//li/a/text()').extract():
        yield WikipediaItem(title=text)
    for href in response.xpath('//div[@id="mw-pages"]//li/a/@href').extract():
        link = urljoin("http://en.wikipedia.org", href)
        yield WikipediaItem(url=link)
If anyone needs further clarification on how the Items class works, the documentation is here.
Furthermore, although the code works, it won't pair the title with its respective link. So it will give you
TITLE, TITLE, TITLE, LINK, LINK, LINK
instead of
TITLE, LINK, TITLE, LINK, TITLE, LINK
(the latter being probably the more desired result) — but that's for another question. If anyone has a proposed solution that works better than mine, I'll be more than happy to listen to your answers! Thanks.
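One way to pair them (a sketch, untested against the live page) is to go back to iterating over the li elements and build each item from both fields at once, skipping entries that lack a link:

def parse(self, response):
    for li in response.xpath('//div[@id="mw-pages"]//li'):
        title = li.xpath('a/text()').extract_first()
        href = li.xpath('a/@href').extract_first()
        if title and href:
            # emit one item with the title and its absolute URL together
            yield WikipediaItem(title=title, url=urljoin("http://en.wikipedia.org", href))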
I think you can just concatenate the two strings instead of using urljoin. Try this:
for title in titles:
    item = WikipediaItem()
    item["title"] = title.xpath("a/text()").extract()
    item["url"] = "http://en.wikipedia.org" + title.xpath("a/@href").extract()[0]
    items.append(item)
return items
On your first iteration of the code with relative links, you used the xpath method: item["url"] = title.xpath("a/@href").extract()
The object returned is (I assume) a list of strings, so indexing it would be valid.
In the new iteration, you used the select method: url = title.select("a/@href").extract() Then you treated the returned object as an indexable sequence, with url[0]. Check what the select method returns; maybe it's a list, like in the previous example, but empty for some elements.
P.S.: IPython is your friend.
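For example, a quick guard against the empty-list case might look like this (a sketch using the same loop variables as the question):

hrefs = title.xpath("a/@href").extract()
# some li elements may contain no link, so check before indexing
url = hrefs[0] if hrefs else None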
I'm trying to scrape the top 100 T20 batsmen from the ICC site; however, the CSV file I'm getting is blank. There are no errors in my code (at least none that I know of).
Here is my items file:
import scrapy

class DmozItem(scrapy.Item):
    Ranking = scrapy.Field()
    Rating = scrapy.Field()
    Name = scrapy.Field()
    Nationality = scrapy.Field()
    Carer_Best_Rating = scrapy.Field()
And the dmoz_spider file:
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "espn"
    allowed_domains = ["relianceiccrankings.com"]
    start_urls = ["http://www.relianceiccrankings.com/ranking/t20/batting/"]

    def parse(self, response):
        #sel = response.selector
        #for tr in sel.css("table.top100table>tbody>tr"):
        for tr in response.xpath('//table[@class="top100table"]/tr'):
            item = DmozItem()
            item['Ranking'] = tr.xpath('//td[@class="top100id"]/text()').extract_first()
            item['Rating'] = tr.xpath('//td[@class="top100rating"]/text()').extract_first()
            item['Name'] = tr.xpath('td[@class="top100name"]/a/text()').extract_first()
            item['Nationality'] = tr.xpath('//td[@class="top100nation"]/text()').extract_first()
            item['Carer_Best_Rating'] = tr.xpath('//td[@class="top100cbr"]/text()').extract_first()
            yield item
What is wrong with my code?
The website you're trying to scrape has a frame in it, and the frame is the page you actually want to scrape.

start_urls = [
    "http://www.relianceiccrankings.com/ranking/t20/batting/"
]

This is the correct URL.
There is also a lot more wrong going on:
To select elements you should use the response itself; you don't need to initiate a variable with response.selector, just select straight from it, e.g. response.xpath("//foo/bar").
Your CSS selector for the table is wrong: top100table is a class rather than an id, therefore it should be .top100table and not #top100table. Here, just take the XPath for it:

response.xpath("//table[@class='top100table']/tr")

tbody isn't part of the HTML code; it only appears when you inspect the page with a modern browser.
The extract() method always returns a list rather than the element itself, so you need to extract the first element you find, like this:

item['Ranking'] = tr.xpath('td[@class="top100id"]/a/text()').extract_first()
Hope this helps, have fun scraping!
To answer your ranking problem: the XPath for Ranking starts with '//...', which means 'from the start of the page'. You need it to be relative to tr instead. Simply remove the '//' from every XPath in the for loop.

item['Ranking'] = tr.xpath('td[@class="top100id"]/text()').extract_first()
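Putting the two answers together, the corrected loop might look like this (a sketch, assuming the td classes from the question; add /a before text() wherever the cell content sits inside a link, as the first answer notes for the ranking column):

def parse(self, response):
    for tr in response.xpath('//table[@class="top100table"]/tr'):
        item = DmozItem()
        # every path is relative to the current tr: no leading //
        item['Ranking'] = tr.xpath('td[@class="top100id"]/a/text()').extract_first()
        item['Rating'] = tr.xpath('td[@class="top100rating"]/text()').extract_first()
        item['Name'] = tr.xpath('td[@class="top100name"]/a/text()').extract_first()
        item['Nationality'] = tr.xpath('td[@class="top100nation"]/text()').extract_first()
        item['Carer_Best_Rating'] = tr.xpath('td[@class="top100cbr"]/text()').extract_first()
        yield item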
I'm having trouble understanding which part of the XPath to select when trying to scrape certain elements of a website. In this case, I am trying to scrape all the websites that are linked in this article (for example, this fragment of the markup:
data-track="Body Text Link: External" href="http://www.uspreventiveservicestaskforce.org/Page/Document/RecommendationStatementFinal/brca-related-cancer-risk-assessment-genetic-counseling-and-genetic-testing">
My spider works but it doesn't scrape anything!
My code is below:
import scrapy
from scrapy.selector import Selector
from nymag.items import nymagItem

class nymagSpider(scrapy.Spider):
    name = 'nymag'
    allowed_domains = ['http://wwww.nymag.com']
    start_urls = ["http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html"]

    def parse(self, response):
        #I'm pretty sure the below line is the issue
        links = Selector(response).xpath(//*[@id="primary"]/main/article/div/span)
        for link in links:
            item = nymagItem()
            #This might also be wrong - am trying to extract the href section
            item['link'] = question.xpath('a/@href').extract()
            yield item
There is an easier way: get all the a elements that have both data-track and href attributes:

In [1]: for link in response.xpath("//div[@id = 'primary']/main/article//a[@data-track and @href]"):
   ...:     print link.xpath("@href").extract()[0]
   ...:
//nymag.com/tags/healthcare/
//nymag.com/author/Susan%20Rinkunas/
http://twitter.com/sueonthetown
http://www.facebook.com/sharer/sharer.php?u=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dfb-share-thecut
https://twitter.com/share?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dtwitter-share-thecut&via=TheCut
https://plus.google.com/share?url=http%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html
http://pinterest.com/pin/create/button/?url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dpinterest-share-thecut&description=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&media=http:%2F%2Fpixel.nymag.com%2Fimgs%2Ffashion%2Fdaily%2F2015%2F09%2F08%2F08-angelina-jolie.w750.h750.2x.jpg
whatsapp://send?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0A%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html&mid=whatsapp
mailto:?subject=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&body=I%20saw%20this%20on%20The%20Cut%20and%20thought%20you%20might%20be%20interested...%0A%0AShould%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0AIt's%20not%20a%20crystal%20ball.%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Demailshare%5Fthecut
...
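Adapted into the spider's parse method, the same selector might look like this (a sketch reusing the item class from the question):

def parse(self, response):
    for link in response.xpath("//div[@id='primary']/main/article//a[@data-track and @href]"):
        item = nymagItem()
        item['link'] = link.xpath('@href').extract_first()
        yield item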
Right now I am learning how to use XPath to scrape websites in combination with Python Scrapy, and I am stuck on the following.
I am looking at a Dutch website, http://www.ah.nl/producten/bakkerij/brood, where I want to scrape the names of the products. Eventually I want a CSV file with the names of all these breads. If I inspect the elements, I can see where these names are defined.
I need to find the right XPath to extract "AH Tijgerbrood bruin heel". So I thought I should do the following in my spider:
import scrapy
from stack.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "ah"
    allowed_domains = ["ah.nl"]
    start_urls = ['http://www.ah.nl/producten/bakkerij/brood']

    def parse(self, response):
        for sel in response.xpath('//div[@class="product__description small-7 medium-12"]'):
            item = DmozItem()
            item['title'] = sel.xpath('h1/text()').extract()
            yield item
Now, if I crawl with this spider, I don't get any results. I have no clue what I am missing here.
You would have to use Selenium for this task, since all the elements are loaded via JavaScript:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.ah.nl/producten/bakkerij/brood")
# put an arbitrarily large number; you can tone it down, this is to allow the webpage to load
driver.implicitly_wait(40)
elements = driver.find_elements_by_xpath('//*[local-name()="div" and @class="product__description small-7 medium-12"]//*[local-name()="h1"]')
for elem in elements:
    print elem.text
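Since the goal was a CSV file of the names, the results could then be written out with the standard csv module (a sketch in Python 2 to match the code above; the file name is arbitrary):

import csv

with open('breads.csv', 'wb') as f:
    writer = csv.writer(f)
    for elem in elements:
        # one product name per row; encode since elem.text is unicode in Python 2
        writer.writerow([elem.text.encode('utf-8')])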
Alternatively, you can try pulling the first title directly with a single expression:

title = response.xpath('//div[@class="product__description small-7 medium-12"]/h1/text()').extract()[0]

(Note this will still return nothing if the content is rendered by JavaScript, as the other answer points out.)