I am trying to crawl the 'Median Gross Monthly Income From Work' from a webpage using the following code:
class crawl_income(scrapy.Spider):
    name = "salary"
    allowed_domains = ["stats.mom.gov.sg"]
    url = 'http://stats.mom.gov.sg/Pages/Income-Summary-Table.aspx'

    def parse_data(self, response):
        table_headers = response.xpath('//tr[@class="odd"]/td/td')
        salary = []
        for value in table_headers:
            data = value.xpath('.//text()').extract()
            salary.append(data)
        print(salary)

process = CrawlerProcess()
process.crawl(crawl_income)
process.start()
But I do not see any values when I print the list that I created to store them.
Where did I go wrong?
First of all, your code won't work as written.
url should be start_urls to let Scrapy know where to start crawling.
parse_data should be parse: unless a request names a callback, Scrapy calls parse by default. If parse is missing, you get a NotImplementedError when Scrapy crawls the start URL.
When I run the code below (which includes all the changes mentioned) and print response.body to the console, I do not find any element with class="odd", so I guess the site makes AJAX/XHR calls that supply the data.
EDIT
After looking at your code again, I see that the XPath is a bit odd. You use tr[@class="odd"]/td/td, but a td element does not have another td as its child. If you want to skip the headers, change your extraction as in the code below. With this change I get results in the salary list.
import scrapy
from scrapy.crawler import CrawlerProcess

class crawl_income(scrapy.Spider):
    name = "salary"
    allowed_domains = ["stats.mom.gov.sg"]
    start_urls = ['http://stats.mom.gov.sg/Pages/Income-Summary-Table.aspx']

    def parse(self, response):
        print(response.body)
        # Select every cell in the rows marked class="odd"
        table_headers = response.xpath('//tr[@class="odd"]//td')
        salary = []
        # Skip the first cell, which holds the header
        for value in table_headers[1:]:
            data = value.xpath('./text()').extract()
            salary.append(data)
        print(salary)

process = CrawlerProcess()
process.crawl(crawl_income)
process.start()
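To see why the original td/td path matched nothing, here is a small self-contained sketch. It uses the standard-library xml.etree.ElementTree instead of Scrapy, on a made-up table fragment (the markup and values are assumptions for illustration, not the real page). ElementTree only supports a subset of XPath, but the child-step logic is the same.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the real page's markup and numbers may differ.
HTML = """
<table>
  <tr class="odd"><td>Label A</td><td>100</td><td>200</td></tr>
  <tr class="even"><td>Label B</td><td>300</td><td>400</td></tr>
</table>
"""

root = ET.fromstring(HTML)

# td/td asks for a <td> that is a direct child of another <td>: no match.
nested = root.findall(".//tr[@class='odd']/td/td")

# A single td step selects the cells of the matching row.
cells = root.findall(".//tr[@class='odd']/td")

# Skip the first cell (the row label), as the spider does with [1:].
values = [cell.text for cell in cells[1:]]

print(len(nested))
print(values)
```

If the real page fills the table through AJAX/XHR, as guessed above, no XPath will find the rows in the initial response; this sketch only covers the selector logic.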
Related
I am new to Scrapy and Python.
I want to scrape data from a filtered search result. Loading the page shows the default content, but what I need is the filtered content, including its pagination.
Here's the URL:
https://teslamotorsclub.com/tmc/post-ratings/6/posts
I need to scrape the items from the Time Filter: "Today" result.
I tried different approaches, but none worked.
What I have so far is mostly just the layout structure:
import scrapy

class TmcnfSpider(scrapy.Spider):
    name = 'tmcnf'
    allowed_domains = ['teslamotorsclub.com']
    start_urls = ['https://teslamotorsclub.com/tmc/post-ratings/6/posts']

    def start_requests(self):
        # Show form from a filtered search result
        pass

    def parse(self, response):
        # some code scraping item
        # Yield url for pagination
        pass
To get the posts for the "Today" filter, you need to send a POST request to https://teslamotorsclub.com/tmc/post-ratings/6/posts along with a form payload. The following should fetch the results you are interested in.
import scrapy

class TmcnfSpider(scrapy.Spider):
    name = "teslamotorsclub"
    start_urls = ["https://teslamotorsclub.com/tmc/post-ratings/6/posts"]

    def parse(self, response):
        payload = {'time_chooser': '4', '_xfToken': ''}
        yield scrapy.FormRequest(response.url, formdata=payload, callback=self.parse_results)

    def parse_results(self, response):
        for items in response.css("h3.title > a::text").getall():
            yield {"title": items.strip()}
I'm trying to scrape all the data from a website called Quotes to Scrape (quotes.toscrape.com). But when I run my code, it only gets one quote. It should collect at least all the quotes on that page, but it only takes one. Also, once I have the data from page 1, I want to get the data from all the other pages.
So how do I fix this so that it collects all the data from page 1?
How do I collect all the data on the following pages?
items.py file
import scrapy

class QuotetutorialItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    tag = scrapy.Field()
quotes_spider.py file
import scrapy
from ..items import QuotetutorialItem

class QuoteScrapy(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):
        items = QuotetutorialItem()
        all_div_quotes = response.css('div.quote')
        for quotes in all_div_quotes:
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()
            items['title'] = title
            items['author'] = author
            items['tag'] = tag
        yield items
Please tell me what I should change.
As reported, your yield is missing an indent level. To follow the next pages, just add a check for the next button and yield a request that follows it.
import scrapy

class QuoteScrapy(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):
        items = {}
        all_div_quotes = response.css('div.quote')
        for quotes in all_div_quotes:
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()
            items['title'] = title
            items['author'] = author
            items['tag'] = tag
            yield items
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield response.follow(next_page)
As @LanteDellaRovere has correctly identified in a comment, the yield statement should be executed on each iteration of the for loop, which is why you are only seeing a single (presumably the last) item from each page.
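The effect of the misplaced yield can be reproduced without Scrapy. The sketch below uses plain generators and made-up data (the function names and the quote strings are hypothetical): with yield outside the loop, only the last value survives; with yield inside it, every value is emitted.

```python
def parse_wrong(quotes):
    item = {}
    for text in quotes:
        item["title"] = text
    # yield sits outside the loop: it runs once, with the last value
    yield item


def parse_right(quotes):
    for text in quotes:
        # yield inside the loop, with a fresh dict for each quote
        yield {"title": text}


quotes = ["first", "second", "third"]  # made-up data
wrong = list(parse_wrong(quotes))
right = list(parse_right(quotes))

print(wrong)
print(right)
```

Building a new dict on each iteration, rather than reusing one object, also avoids later iterations overwriting items that were already yielded.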
As far as reading the continued pages, you could extract it from the <nav> element at the bottom of the page, but the structure is very simple - the links (when no tag is specified) are of the form
http://quotes.toscrape.com/page/N/
You will find that for N=1 you get the first page. So, as a simplistic solution, you can just request the URLs for increasing values of N until a request returns a 404.
Not knowing much about Scrapy I can't give you exact code, but the examples at https://docs.scrapy.org/en/latest/intro/tutorial.html#following-links are fairly helpful if you want a more sophisticated and Pythonic approach.
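As a rough sketch of the "increase N until 404" idea, the loop below generates page URLs and stops at the first 404. It assumes, as described above, that a page past the end returns 404; the helper names iter_page_urls and fake_status are hypothetical, and fake_status stands in for a real HTTP call so the example runs offline.

```python
from typing import Callable, Iterator

BASE = "http://quotes.toscrape.com/page/{}/"


def iter_page_urls(get_status: Callable[[str], int],
                   max_pages: int = 1000) -> Iterator[str]:
    """Yield page URLs for N = 1, 2, ... until one returns 404."""
    for n in range(1, max_pages + 1):
        url = BASE.format(n)
        if get_status(url) == 404:
            break
        yield url


def fake_status(url: str) -> int:
    # Stand-in for a real request: pretends only pages 1 to 3 exist.
    n = int(url.rstrip("/").rsplit("/", 1)[-1])
    return 200 if n <= 3 else 404


urls = list(iter_page_urls(fake_status))
print(urls)
```

The max_pages cap is a safety net in case the site never returns a 404. Following the "next" link is more robust, because it does not depend on how the site answers for missing pages.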
I am trying to scrape some information from the UK's Companies House using Scrapy.
I connected to the website through the shell with the command
scrapy shell https://beta.companieshouse.gov.uk/search?q=a
and with
response.xpath('//*[@id="results"]').extract()
I managed to get the results back.
I tried to put this into a program so I could export it to CSV or JSON, but I am having trouble getting it to work. This is what I have:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "gov2"

    def start_requests(self):
        start_urls = ['https://beta.companieshouse.gov.uk/search?q=a']

    def parse(self, response):
        products = response.xpath('//*[@id="results"]').extract()
        print(products)
It is very simple, but I have tried a lot. Any insight would be appreciated!
These lines of code are the problem:
    def start_requests(self):
        start_urls = ['https://beta.companieshouse.gov.uk/search?q=a']
The start_requests method should return an iterable of Requests; yours returns None.
The default start_requests creates this iterable from urls specified in start_urls, so simply defining that as a class variable (outside of any function) and not overriding start_requests will work as you want.
Try to do:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "gov2"
    start_urls = ["https://beta.companieshouse.gov.uk/search?q=a"]

    def parse(self, response):
        products = response.xpath('//*[@id="results"]').extract()
        print(products)
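The underlying Python behaviour can be shown without Scrapy. In this sketch, MiniSpider is a hypothetical stand-in base class (not Scrapy's actual code) whose default start_requests builds one entry per URL in start_urls. Overriding the method with a body that only assigns a local variable returns None, while setting the class attribute lets the inherited default do the work.

```python
class MiniSpider:
    """Hypothetical stand-in for a spider base class (not Scrapy's code)."""
    start_urls = []

    def start_requests(self):
        # Modelled default: one "request" tuple per start URL.
        return [("GET", url) for url in self.start_urls]


class Broken(MiniSpider):
    def start_requests(self):
        # Only a local variable is created; nothing is returned.
        start_urls = ["https://beta.companieshouse.gov.uk/search?q=a"]


class Fixed(MiniSpider):
    # Class attribute, picked up by the inherited start_requests.
    start_urls = ["https://beta.companieshouse.gov.uk/search?q=a"]


broken_result = Broken().start_requests()
fixed_result = Fixed().start_requests()

print(broken_result)
print(fixed_result)
```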
I'm using the latest version of scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make scrapy crawl only the URL(s) fed to it as part of start_url list. In most cases I want to crawl only 1 page, but in some cases there may be multiple pages that I will specify. I don't want it to crawl to other pages.
I've tried setting the depth level=1 but I'm not sure that in testing it accomplished what I was hoping to achieve.
Any help will be greatly appreciated!
Thank you!
2015-12-22 - Code update:
# -*- coding: utf-8 -*-
import scrapy
from generic.items import GenericItem

class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = GenericItem()
            item['entity_id'] = self.entity_id
            # gets the actual email address
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
            yield item
Below, in the first response, you mention using a generic spider --- isn't that what I'm doing in the code? Also are you suggesting I remove the
callback=self.parse_dir_contents
from the parse function?
Thank you.
It looks like you are using CrawlSpider, which is a special kind of Spider for crawling multiple categories inside pages.
To crawl only the URLs specified in start_urls, just override the parse method, as that is the default callback for the start requests.
Below is the code for a spider that scrapes the title and article text from a blog post (note: the XPath might not be the same for every blog).
Filename: /spiders/my_spider.py
import scrapy

# DmozItem is assumed to be an Item defined in your project's items.py
# with "title" and "article" fields; import it from there.

class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["www.blogtrepreneur.com"]
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]

    def parse(self, response):
        items = []
        item = DmozItem()
        item["title"] = response.xpath('//h1/text()').extract()
        item["article"] = response.xpath('//div[@id="tve_editor"]//p//text()').extract()
        items.append(item)
        # No further requests are yielded, so only start_urls is crawled
        return items
The above code will only fetch the title and the article body of the given article.
I had the same problem, because I was using
import scrapy
from scrapy.spiders import CrawlSpider
Then I changed it to
import scrapy
from scrapy.spiders import Spider
and changed the class to
class mySpider(Spider):
I am attempting to learn how to use scrapy, and am trying to do what I think is a simple project. I am attempting to pull 2 pieces of data from a single webpage - crawling additional links isn't needed. However, my code seems to be returning zero results. I have tested the xpaths in Scrapy Shell, and both return the expected results.
My item.py is:
import scrapy

class StockItem(scrapy.Item):
    quote = scrapy.Field()
    time = scrapy.Field()
My spider, named stockscrapy.py, is:
import scrapy

class StockSpider(scrapy.Spider):
    name = "ugaz"
    allowed_domains = ["nasdaq.com"]
    start_urls = ["http://www.nasdaq.com/symbol/ugaz/"]

def parse(self, response):
    stock = StockItem()
    stock['quote'] = response.xpath('//*[@id="qwidget_lastsale"]/text()').extract()
    stock['time'] = response.xpath('//*[@id="qwidget_markettime"]/text()').extract()
    return stock
To run the script, I use the command line:
scrapy crawl ugaz -o stocks.csv
Any and all help is greatly appreciated.
You need to indent the parse block.
import scrapy

# StockItem must also be imported from your project's items module.

class StockSpider(scrapy.Spider):
    name = "ugaz"
    allowed_domains = ["nasdaq.com"]
    start_urls = ["http://www.nasdaq.com/symbol/ugaz/"]

    # Indent this block so that parse is a method of the class
    def parse(self, response):
        stock = StockItem()
        stock['quote'] = response.xpath('//*[@id="qwidget_lastsale"]/text()').extract()
        stock['time'] = response.xpath('//*[@id="qwidget_markettime"]/text()').extract()
        return stock
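Why the indentation matters can be shown in plain Python. In the sketch below, BaseSpider is a hypothetical stand-in for a base class whose parse raises NotImplementedError (it is not Scrapy's implementation). A parse defined at module level is just a free function, so the class never sees it.

```python
class BaseSpider:
    """Hypothetical stand-in base class; not Scrapy's implementation."""

    def parse(self, response):
        raise NotImplementedError("parse is not defined")


class Unindented(BaseSpider):
    name = "ugaz"


# Defined at module level: a plain function, not a method of Unindented.
def parse(self, response):
    return {"quote": response}


class Indented(BaseSpider):
    name = "ugaz"

    # Indented under the class: this overrides BaseSpider.parse.
    def parse(self, response):
        return {"quote": response}


try:
    Unindented().parse("12.34")
    unindented_result = "called"
except NotImplementedError:
    unindented_result = "NotImplementedError"

indented_result = Indented().parse("12.34")

print(unindented_result)
print(indented_result)
```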