Using Scrapy to parse table page and extract data from underlying links - python

I am trying to scrape the underlying data on the table in the following pages: https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries
What I want to do is access the underlying link for each row, and capture:
The ID tag (e.g. QDE001),
The name
The reason for listing / additional information
Other linked entities
This is what I have, but it does not seem to be working. I keep getting a NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__)). I believe the XPaths I have defined are OK; I am not sure what I am missing.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class UNSCItem(scrapy.Item):
    name = scrapy.Field()
    uid = scrapy.Field()
    link = scrapy.Field()
    reason = scrapy.Field()
    add_info = scrapy.Field()


class UNSC(scrapy.Spider):
    name = "UNSC"

    start_urls = [
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=0',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=1',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=2',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=3',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=4',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=5',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=6',
    ]

    rules = Rule(LinkExtractor(allow=('/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries/',)), callback='data_extract')

    def data_extract(self, response):
        item = UNSCItem()
        name = response.xpath('//*[@id="content"]/article/div[3]/div//text()').extract()
        uid = response.xpath('//*[@id="content"]/article/div[2]/div/div//text()').extract()
        reason = response.xpath('//*[@id="content"]/article/div[6]/div[2]/div//text()').extract()
        add_info = response.xpath('//*[@id="content"]/article/div[7]//text()').extract()
        related = response.xpath('//*[@id="content"]/article/div[8]/div[2]//text()').extract()
        yield item

Try the approach below. It should fetch all the IDs and corresponding names from all six pages. I suppose you can manage the rest of the fields yourself.
Just run it as it is:
import scrapy

class UNSC(scrapy.Spider):
    name = "UNSC"

    start_urls = ['https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page={}'.format(page) for page in range(0, 7)]

    def parse(self, response):
        for item in response.xpath('//*[contains(@class,"views-table")]//tbody//tr'):
            idnum = item.xpath('.//*[contains(@class,"views-field-field-reference-number")]/text()').extract()[-1].strip()
            name = item.xpath('.//*[contains(@class,"views-field-title")]//span[@dir="ltr"]/text()').extract()[-1].strip()
            yield {'ID': idnum, 'Name': name}
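If you also need the reason for listing and the additional information, one option is to follow each row's link to the detail page from the same spider. Below is a minimal, untested sketch (inside the same UNSC class): the XPath for the row's link is an assumption, and the detail-page XPaths are simply the ones from the question, so they may need adjusting. response.follow requires Scrapy 1.4 or newer.

    def parse(self, response):
        for row in response.xpath('//*[contains(@class,"views-table")]//tbody//tr'):
            idnum = row.xpath('.//*[contains(@class,"views-field-field-reference-number")]/text()').extract()[-1].strip()
            name = row.xpath('.//*[contains(@class,"views-field-title")]//span[@dir="ltr"]/text()').extract()[-1].strip()
            # assumption: the title cell of each row links to the detail page
            link = row.xpath('.//*[contains(@class,"views-field-title")]//a/@href').extract_first()
            if link:
                # carry the fields scraped so far over to the detail-page callback
                yield response.follow(link, callback=self.parse_detail,
                                      meta={'uid': idnum, 'name': name})

    def parse_detail(self, response):
        item = UNSCItem()
        item['uid'] = response.meta['uid']
        item['name'] = response.meta['name']
        # XPaths taken from the question; adjust if the detail-page layout differs
        item['reason'] = response.xpath('//*[@id="content"]/article/div[6]/div[2]/div//text()').extract()
        item['add_info'] = response.xpath('//*[@id="content"]/article/div[7]//text()').extract()
        yield item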

Related

Scrapy list of links

I am building a spider with Scrapy. I want to access every item in a list and then scrape all the data inside each link, but when I run the spider it doesn't scrape the data. What am I missing?
import scrapy
from ..items import JobscraperItem
from scrapy.linkextractors import LinkExtractor

class JobscraperSpider(scrapy.Spider):
    name = 'jobspider'
    start_urls = ['https://cccc/bolsa/ofertas?oferta=&lugar=&categoria=']

    def parse(self, response):
        job_detail = response.xpath('//div[@class="list"]/div/a')
        yield from response.follow_all(job_detail, self.parse_jobspider)

    def parse(self, response):
        items = JobscraperItem()
        job_title = response.xpath('//h1/text()').extract()
        company = response.xpath('//h2/b/text()').extract()
        company_url = response.xpath('//div[@class="pull-left"]/a/text()').extract()
        description = response.xpath('//div[@class="aviso"]/text()').extract()
        salary = response.xpath('//div[@id="aviso"]/p[1]/text()').extract()
        city = response.xpath('//div[@id="aviso"]/p[2]/text()').extract()
        district = response.xpath('//div[@id="aviso"]/p[5]/text()').extract()
        publication_date = response.xpath('//div[@id="publicado"]/text()').extract()
        apply = response.xpath('//p[@class="text-center"]/b/text()').extract()
        job_type = response.xpath('//div[@id="resumen"]/p[3]/text()').extract()
        items['job_title'] = job_title
        items['company'] = company
        items['company_url'] = company_url
        items['description'] = description
        items['salary'] = salary
        items['city'] = city
        items['district'] = district
        items['publication_date'] = publication_date
        items['apply'] = apply
        items['job_type'] = job_type
        yield items
From what I can see, one of the issues is that you are creating two functions called parse(). Since you are using self.parse_jobspider in your first parse function, I'm guessing that your second parse function is named incorrectly.
Also, are you sure that the URL in start_urls is correct? https://cccc/bolsa/ofertas?oferta=&lugar=&categoria= doesn't lead anywhere, which would also explain why no data is being scraped.
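For example, a minimal sketch of that rename (keeping the rest of the question's code unchanged) could look like this; parse_jobspider is the name the first parse() already refers to:

    def parse(self, response):
        job_detail = response.xpath('//div[@class="list"]/div/a')
        yield from response.follow_all(job_detail, callback=self.parse_jobspider)

    def parse_jobspider(self, response):
        items = JobscraperItem()
        items['job_title'] = response.xpath('//h1/text()').extract()
        # ... assign the remaining fields exactly as in the original second parse()
        yield items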
I resolved this by adding the following rule, so the spider accesses every link and scrapes the data inside:
rules = (
    Rule(LinkExtractor(allow=('/bolsa/166',)), follow=True, callback='parse_item'),
)
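Note that the rules attribute is only honoured by CrawlSpider; a plain scrapy.Spider subclass ignores it. So, presumably, the spider also needs to inherit from CrawlSpider for this rule to take effect, roughly:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class JobscraperSpider(CrawlSpider):  # CrawlSpider instead of scrapy.Spider
    ...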

Scraping multiple tables and storing each table header as rows in csv

I'm trying to scrape multiple tables which have a table name stored under an h3 tag. There are columns of data I can scrape with no problem, and when I feed the next URL I can append this data to the CSV file.
The problem I can't solve is getting the table header and storing it relative to each row of the table. The reason is that when the next table is fed in, I need to know which table each row belongs to. Is it possible to use a len loop on, say, 'Round' to establish the table length and then write the table header to each row? Is it possible to do this with item exports?
Here's my code
spider.py
from bigcrawler.items import BigcrawlerItem
from scrapy import Spider, Request, Selector
from scrapy.selector import Selector
from bigcrawler.items import MatchStatItemLoader

class CrawlbotSpider(Spider):
    name = 'bigcrawler'
    allowed_domains = ['www.matchstat.com']
    start_urls = [
        'https://matchstat.com/tennis/tournaments/w/Taipei/2015',
        'https://matchstat.com/tennis/tournaments/w/Hong%20Kong/2017',
    ]

    def parse_header(self, response):
        hxs = Selector(response)
        for tb in hxs.css('tr.match'):
            heading = tb.xpath('//*[@id="AWS"]/div/h3/text()').extract()[0]
            for td in tb.xpath(".//tr[contains(@class, 'match')]/td[contains(@class, 'round')]/text()"):
                il = BigcrawlerItem(selector=td)
                il.add_value('event_title', heading)
                yield il.load_item()

    def parse(self, response):
        for row in response.css('tr.match'):
            il = MatchStatItemLoader(selector=row)
            il.add_css('round', '.round::text')
            il.add_css('event1', '.event-name a::text')
            il.add_css('player_1', '.player-name:nth-child(2) a::text')
            il.add_css('player_2', '.player-name:nth-child(3) a::text')
            il.add_css('player_1_odds', '.odds-td.odds-0 [payout]::text')
            il.add_css('player_2_odds', '.odds-td.odds-1 [payout]::text')
            il.add_css('h_2_h', 'a.h2h::text')
            yield il.load_item()
items.py
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose
from operator import methodcaller
from scrapy import Spider, Request, Selector

class BigcrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    event_title = scrapy.Field()
    round = scrapy.Field()
    event1 = scrapy.Field()
    player_1 = scrapy.Field()
    player_2 = scrapy.Field()
    player_1_odds = scrapy.Field()
    player_2_odds = scrapy.Field()
    h_2_h = scrapy.Field()

class MatchStatItemLoader(ItemLoader):
    default_item_class = BigcrawlerItem
    default_input_processor = MapCompose(methodcaller('strip'))
    default_output_processor = TakeFirst()
If there's only one of those headings, you don't need to be relative to the current node; try this:
il.add_xpath('event_title', '//*[@id="AWS"]//h3/text()')
But if you need it to be relative to the current node, you could also do this:
il.add_xpath('event_title', './ancestor::*[@id="AWS"]//h3/text()')
I would suggest not using the Items class at all, and using the start_requests method instead of start_urls, as they are really confusing. See the fully working code below. Also notice the match_heading variable.
class CrawlbotSpider(Spider):
    name = 'bigcrawler'
    allowed_domains = ['www.matchstat.com']

    def start_requests(self):
        match_urls = [
            'https://matchstat.com/tennis/tournaments/w/Taipei/2015',
            'https://matchstat.com/tennis/tournaments/w/Hong%20Kong/2017',
        ]
        for url in match_urls:
            yield Request(url=url, callback=self.parse_matches)

    def parse_matches(self, response):
        match_heading = response.xpath('//*[@id="AWS"]/div/h3/text()').extract_first()
        for row in response.css('tr.match'):
            match = {}
            match['heading'] = match_heading
            match['round'] = row.css(".round::text").extract_first()
            match['event1'] = row.css(".event-name a::text").extract_first()
            match['player_1'] = row.css(".player-name:nth-child(2) a::text").extract_first()
            match['player_2'] = row.css(".player-name:nth-child(3) a::text").extract_first()
            match['player_1_odds'] = row.css(".odds-td.odds-0 [payout]::text").extract_first()
            match['player_2_odds'] = row.css(".odds-td.odds-1 [payout]::text").extract_first()
            match['h_2_h'] = row.css("a.h2h::text").extract_first()
            yield match
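Since the spider now yields plain dicts with the heading repeated on every row, Scrapy's built-in feed exports can write them straight to CSV, for example:

scrapy crawl bigcrawler -o matches.csv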

Scrapy neither shows any error nor fetches any data

I tried to parse the product name and price from a site using Scrapy. However, when I run my Scrapy code it neither shows any error nor fetches any data. What I'm doing wrong is beyond my capability to find out. I hope someone can take a look into it.
"items.py" includes:
import scrapy

class SephoraItem(scrapy.Item):
    Name = scrapy.Field()
    Price = scrapy.Field()
The spider file named "sephorasp.py" contains:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SephoraspSpider(CrawlSpider):
    name = "sephorasp"
    allowed_domains = ['sephora.ae']
    start_urls = ["https://www.sephora.ae/en/stores/"]

    rules = [
        Rule(LinkExtractor(restrict_xpaths='//li[@class="level0 nav-1 active first touch-dd parent"]')),
        Rule(LinkExtractor(restrict_xpaths='//li[@class="level2 nav-1-1-1 active first"]'),
             callback="parse_item")
    ]

    def parse_item(self, response):
        page = response.xpath('//div[@class="product-info"]')
        for titles in page:
            Product = titles.xpath('.//a[@title]/text()').extract()
            Rate = titles.xpath('.//span[@class="price"]/text()').extract()
            yield {'Name': Product, 'Price': Rate}
Here is the link to the log:
"https://www.dropbox.com/s/8xktgh7lvj4uhbh/output.log?dl=0"
It works when I play around with BaseSpider:
from scrapy.spider import BaseSpider
from scrapy.http.request import Request

class SephoraspSpider(BaseSpider):
    name = "sephorasp"
    allowed_domains = ['sephora.ae']
    start_urls = [
        "https://www.sephora.ae/en/travel-size/make-up",
        "https://www.sephora.ae/en/perfume/women-perfume",
        "https://www.sephora.ae/en/makeup/eye/eyeshadow",
        "https://www.sephora.ae/en/skincare/moisturizers",
        "https://www.sephora.ae/en/gifts/palettes"
    ]

    def pro(self, response):
        item_links = response.xpath('//a[contains(@class,"level0")]/@href').extract()
        for a in item_links:
            yield Request(a, callback=self.end)

    def end(self, response):
        item_link = response.xpath('//a[@class="level2"]/@href').extract()
        for b in item_link:
            yield Request(b, callback=self.parse)

    def parse(self, response):
        page = response.xpath('//div[@class="product-info"]')
        for titles in page:
            Product = titles.xpath('.//a[@title]/text()').extract()
            Rate = titles.xpath('.//span[@class="price"]/text()').extract()
            yield {'Name': Product, 'Price': Rate}
Your XPaths are heavily flawed.
Rule(LinkExtractor(restrict_xpaths='//li[@class="level0 nav-1 active first touch-dd parent"]')),
Rule(LinkExtractor(restrict_xpaths='//li[@class="level2 nav-1-1-1 active first"]'),
You are matching whole class strings, which can change at any point, and the order might be different in Scrapy. Just pick one class; it's most likely unique enough:
Rule(LinkExtractor(restrict_xpaths='//li[contains(@class,"level0")]')),
Rule(LinkExtractor(restrict_xpaths='//li[contains(@class,"level2")]')),

Scrapy Spider cannot Extract contents of web page using xpath

I have a Scrapy spider and I am using XPath selectors to extract the contents of the page. Kindly check where I am going wrong:
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from medicalproject.items import MedicalprojectItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy import Request

class MySpider(CrawlSpider):
    name = "medical"
    allowed_domains = ["yananow.org"]
    start_urls = ["http://yananow.org/query_stories.php"]

    rules = (
        Rule(SgmlLinkExtractor(allow=[r'display_story.php\?\id\=\d+']), callback='parse_page', follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.xpath('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr/td')
        items = []
        for title in titles:
            item = MedicalprojectItem()
            item["patient_name"] = title.xpath("/html/body/div/table/tbody/tr[2]/td/table/tbody/tr/td/img[1]/text()").extract()
            item["stories"] = title.xpath("/html/body/div/table/tbody/tr[2]/td/table/tbody/tr/td/div/font/p/text()").extract()
            items.append(item)
        return items
There are a lot of issues with your code, so here is a different approach.
I opted against a CrawlSpider to have more control over the scraping process, especially for grabbing the name from the query page and the story from a detail page.
I tried to simplify the XPath statements by not diving into the (nested) table structures but looking for patterns of content. So if you want to extract a story ... there must be a link to a story.
Here is the tested code (with comments):
# -*- coding: utf-8 -*-
import scrapy

class MyItem(scrapy.Item):
    name = scrapy.Field()
    story = scrapy.Field()

class MySpider(scrapy.Spider):
    name = 'medical'
    allowed_domains = ['yananow.org']
    start_urls = ['http://yananow.org/query_stories.php']

    def parse(self, response):
        rows = response.xpath('//a[contains(@href,"display_story")]')
        # loop over all links to stories
        for row in rows:
            myItem = MyItem()  # create a new item
            myItem['name'] = row.xpath('./text()').extract()  # assign name from link
            story_url = response.urljoin(row.xpath('./@href').extract()[0])  # extract url from link
            request = scrapy.Request(url=story_url, callback=self.parse_detail)  # create request for detail page with story
            request.meta['myItem'] = myItem  # pass the item with the request
            yield request

    def parse_detail(self, response):
        myItem = response.meta['myItem']  # extract the item (with the name) from the response
        text_raw = response.xpath('//font[@size=3]//text()').extract()  # extract the story (text)
        myItem['story'] = ' '.join(map(unicode.strip, text_raw))  # clean up the text and assign to item
        yield myItem  # return the item
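Note that this answer targets Python 2 (unicode.strip; the question's scrapy.contrib imports are also Python 2-era). On Python 3 the equivalent of the clean-up line would presumably be:

myItem['story'] = ' '.join(map(str.strip, text_raw))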

How to use scrapy to scrape google play reviews of applications?

I wrote this spider to scrape reviews of apps from Google Play. I am partially successful in this: I am able to extract only the name, date, and review.
My questions:
How do I get all the reviews? I am only getting 41.
How do I get the rating from the <div>?
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin

class CompItem(scrapy.Item):
    rating = scrapy.Field()
    data = scrapy.Field()
    name = scrapy.Field()
    date = scrapy.Field()

class criticspider(CrawlSpider):
    name = "gaana"
    allowed_domains = ["play.google.com"]
    start_urls = ["https://play.google.com/store/apps/details?id=com.gaana&hl=en"]
    # rules = (
    #     Rule(
    #         SgmlLinkExtractor(allow=('search=jabong&page=1/+',)),
    #         callback="parse_start_url",
    #         follow=True),
    # )

    def parse(self, response):
        sites = response.xpath('//div[@class="single-review"]')
        items = []
        for site in sites:
            item = CompItem()
            item['data'] = site.xpath('.//div[@class="review-body"]/text()').extract()
            item['name'] = site.xpath('.//div/div/span[@class="author-name"]/a/text()').extract()[0]
            item['date'] = site.xpath('.//span[@class="review-date"]/text()').extract()[0]
            item['rating'] = site.xpath('div[@class="review-info-star-rating"]/aria-label/text()').extract()
            items.append(item)
        return items
You have:
item['rating'] = site.xpath('div[@class="review-info-star-rating"]/aria-label/text()').extract()
Should it not be something like:
item['rating'] = site.xpath('.//div[@class="review-info-star-rating"]/aria-label/text()').extract()
I don't know if it will work, but give it a try.
You can try this one out:
item['rating'] = site.xpath('.//div[@class="tiny-star star-rating-non-editable-container"]/@aria-label').extract()
