I am trying to learn how to use Scrapy, and am attempting what I think is a simple project: pulling 2 pieces of data from a single webpage - crawling additional links isn't needed. However, my code seems to be returning zero results. I have tested the XPaths in the Scrapy shell, and both return the expected results.
My item.py is:
import scrapy

class StockItem(scrapy.Item):
    quote = scrapy.Field()
    time = scrapy.Field()
My spider, named stockscrapy.py, is:
import scrapy

class StockSpider(scrapy.Spider):
    name = "ugaz"
    allowed_domains = ["nasdaq.com"]
    start_urls = ["http://www.nasdaq.com/symbol/ugaz/"]

def parse(self, response):
    stock = StockItem()
    stock['quote'] = response.xpath('//*[@id="qwidget_lastsale"]/text()').extract()
    stock['time'] = response.xpath('//*[@id="qwidget_markettime"]/text()').extract()
    return stock
To run the script, I use the command line:
scrapy crawl ugaz -o stocks.csv
Any and all help is greatly appreciated.
You need to indent the parse block.
import scrapy

class StockSpider(scrapy.Spider):
    name = "ugaz"
    allowed_domains = ["nasdaq.com"]
    start_urls = ["http://www.nasdaq.com/symbol/ugaz/"]

    # Indent this block
    def parse(self, response):
        stock = StockItem()
        stock['quote'] = response.xpath('//*[@id="qwidget_lastsale"]/text()').extract()
        stock['time'] = response.xpath('//*[@id="qwidget_markettime"]/text()').extract()
        return stock
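If it is not already imported elsewhere in the file, the spider also needs access to StockItem, otherwise parse will fail with a NameError once it actually runs. The import would look something like this (the package name stock is a guess, adjust it to your project's items module):

from stock.items import StockItem  # adjust "stock" to your project's package name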
I am trying to scrape some info from the UK's Companies House using scrapy.
I made a connection with the website through the shell using the command
scrapy shell https://beta.companieshouse.gov.uk/search?q=a
and with
response.xpath('//*[@id="results"]').extract()
I managed to get the results back.
I tried to put this into a program so I could export the results to a CSV or JSON file, but I am having trouble getting it to work. This is what I got:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "gov2"

    def start_requests(self):
        start_urls = ['https://beta.companieshouse.gov.uk/search?q=a']

    def parse(self, response):
        products = response.xpath('//*[@id="results"]').extract()
        print(products)
It seems very simple, but I have tried a lot. Any insight would be appreciated!
These lines of code are the problem:
def start_requests(self):
    start_urls = ['https://beta.companieshouse.gov.uk/search?q=a']
The start_requests method should return an iterable of Requests; yours returns None.
The default start_requests creates this iterable from urls specified in start_urls, so simply defining that as a class variable (outside of any function) and not overriding start_requests will work as you want.
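For comparison, a correct override has to yield (or return) Request objects. A minimal sketch, keeping the same single URL, would look like this:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "gov2"

    def start_requests(self):
        # Yield a Request per URL so the method returns an iterable of Requests
        urls = ['https://beta.companieshouse.gov.uk/search?q=a']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        products = response.xpath('//*[@id="results"]').extract()
        print(products)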
The simpler fix, though, is to drop the override and define start_urls as a class attribute:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "gov2"
    start_urls = ["https://beta.companieshouse.gov.uk/search?q=a"]

    def parse(self, response):
        products = response.xpath('//*[@id="results"]').extract()
        print(products)
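Since the goal was to export the results to CSV or JSON, it is also worth yielding the scraped data instead of printing it, so that Scrapy's feed exporter can pick it up. A rough sketch of such a parse method (the field name results_html is only an illustration):

    def parse(self, response):
        # Yield a dict per matched element so the feed exporter can write it out
        for result in response.xpath('//*[@id="results"]'):
            yield {"results_html": result.extract()}

Then running scrapy crawl gov2 -o results.json (or -o results.csv) writes the output to a file.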
I'm using the latest version of scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make scrapy crawl only the URL(s) fed to it as part of the start_urls list. In most cases I want to crawl only 1 page, but in some cases there may be multiple pages that I will specify. I don't want it to crawl to any other pages.
I've tried setting the depth limit to 1, but in testing I'm not sure it accomplished what I was hoping to achieve.
Any help will be greatly appreciated!
Thank you!
2015-12-22 - Code update:
# -*- coding: utf-8 -*-
import scrapy

from generic.items import GenericItem

class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = GenericItem()
            item['entity_id'] = self.entity_id
            # gets the actual email address
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
            yield item
Below, in the first response, you mention using a generic spider - isn't that what I'm doing in the code? Also, are you suggesting I remove the
callback=self.parse_dir_contents
from the parse function?
Thank you.
It looks like you are using CrawlSpider, which is a special kind of Spider designed to follow links and crawl multiple pages.
To crawl only the urls specified inside start_urls, just override the parse method, as that is the default callback of the start requests.
Below is code for a spider that will scrape the title from a blog (note: the XPath might not be the same for every blog):
Filename: /spiders/my_spider.py
import scrapy

# DmozItem is assumed to be defined in the project's items module
class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["www.blogtrepreneur.com"]
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]

    def parse(self, response):
        items = []
        item = DmozItem()
        item["title"] = response.xpath('//h1/text()').extract()
        item["article"] = response.xpath('//div[@id="tve_editor"]//p//text()').extract()
        items.append(item)
        return items
The above code will only fetch the title and the article body of the given article.
I had the same problem, because I was using
import scrapy
from scrapy.spiders import CrawlSpider
Then I changed to
import scrapy
from scrapy.spiders import Spider
And changed the class to
class mySpider(Spider):
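Put together, a minimal sketch of the adjusted spider (the name and start URL below are placeholders) would be:

import scrapy
from scrapy.spiders import Spider

class mySpider(Spider):
    name = "myspider"  # placeholder name
    start_urls = ["http://example.com"]  # placeholder URL

    def parse(self, response):
        # parse is the default callback for the start URLs; with a plain Spider
        # no link-following rules are applied, so only these pages are fetched
        pass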
New to scrapy and I definitely need pointers. I've run through some examples and I'm not getting some basics. I'm running scrapy 1.0.3
Spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from matrix_scrape.items import MatrixScrapeItem

class MySpider(BaseSpider):
    name = "matrix"
    allowed_domains = ["https://www.kickstarter.com/projects/2061039712/matrix-the-internet-of-things-for-everyonetm"]
    start_urls = ["https://www.kickstarter.com/projects/2061039712/matrix-the-internet-of-things-for-everyonetm"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = MatrixScrapeItem()
        item['backers'] = hxs.select("//*[@id="backers_count"]/data").extract()
        item['totalPledged'] = hxs.select("//*[@id="pledged"]/data").extract()
        print backers, totalPledged
item:
import scrapy

class MatrixScrapeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    backers = scrapy.Field()
    totalPledged = scrapy.Field()
    pass
I'm getting the error:
File "/home/will/Desktop/repos/scrapy/matrix_scrape/matrix_scrape/spiders/test.py", line 15
    item['backers'] = hxs.select("//*[@id="backers_count"]/data").extract()
My questions are: why isn't the selecting and extracting working properly? I do see people using Selector a lot instead of HtmlXPathSelector.
Also I'm trying to save this to a csv file and automate it based on time (extract these data points every 30 min). If anyone has any pointers for examples of that, they'd get super brownie points :)
The syntax error is caused by the way you use double quotes. Mix single and double quotes:
item['backers'] = hxs.select('//*[@id="backers_count"]/data').extract()
item['totalPledged'] = hxs.select('//*[@id="pledged"]/data').extract()
As a side note, you can use response.xpath() shortcut instead of instantiating HtmlXPathSelector:
def parse(self, response):
    item = MatrixScrapeItem()
    item['backers'] = response.xpath('//*[@id="backers_count"]/data').extract()
    item['totalPledged'] = response.xpath('//*[@id="pledged"]/data').extract()
    print item['backers'], item['totalPledged']
And you've probably meant to get the text() of the data elements:
//*[@id="backers_count"]/data/text()
//*[@id="pledged"]/data/text()
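As for the CSV part of the question: as long as parse returns or yields the item, Scrapy's built-in feed exports can write it to a file without any extra code, for example:

scrapy crawl matrix -o matrix.csv

The every-30-minutes part is usually handled outside Scrapy, e.g. with a cron job (or another scheduler) that runs that command at the desired interval.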
I am trying to crawl the 'Median Gross Monthly Income From Work' from a webpage using the following code:
class crawl_income(scrapy.Spider):
    name = "salary"
    allowed_domains = ["stats.mom.gov.sg"]
    url = 'http://stats.mom.gov.sg/Pages/Income-Summary-Table.aspx'

    def parse_data(self, response):
        table_headers = response.xpath('//tr[@class="odd"]/td/td')
        salary = []
        for value in table_headers:
            data = value.xpath('.//text()').extract()
            salary.append(data)
        print salary

process = CrawlerProcess()
process.crawl(crawl_income)
process.start()
But I do not see any values when I try to print out the list that I created to store the values.
Where did I go wrong?
First of all, your code won't work.
url should be start_urls to let Scrapy know where to start crawling.
parse_data should be parse, because without any other information Scrapy does not know which method to call, and the default callback is parse. Otherwise you also get a NotImplementedError when Scrapy crawls the start URL and the parse method is not present.
When I run the code below (which includes all the mentioned changes) and print response.body to the console, I do not find any element with class="odd", so I guess there are some AJAX/XHR calls inside the site which then provide the information.
EDIT
After looking at your code again, I see that the XPath is a bit odd. You use tr[@class="odd"]/td/td, however a td element does not have another td as its child. If you want to skip the headers, change the extraction as in the code below. With this change I get results in the salary list.
import scrapy
from scrapy.crawler import CrawlerProcess

class crawl_income(scrapy.Spider):
    name = "salary"
    allowed_domains = ["stats.mom.gov.sg"]
    start_urls = ['http://stats.mom.gov.sg/Pages/Income-Summary-Table.aspx']

    def parse(self, response):
        print response.body
        table_headers = response.xpath('//tr[@class="odd"]//td')
        salary = []
        for value in table_headers[1:]:
            data = value.xpath('./text()').extract()
            salary.append(data)
        print salary

process = CrawlerProcess()
process.crawl(crawl_income)
process.start()
Hi, can someone help me out? I seem to be stuck. I am learning how to crawl and save into MySQL using scrapy. I am trying to get scrapy to crawl all of the website's pages, starting with start_urls, but it only crawls that one page instead of crawling all of them automatically; it does save into MySQL with pipelines.py. It also crawls all pages and saves the data using pipelines.py when the URLs are provided via an f = open("urls.txt") file.
Here is my code.
test.py
import scrapy
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from gotp.items import GotPItem
from scrapy.log import *
from gotp.settings import *
from gotp.items import *

class GotP(CrawlSpider):
    name = "gotp"
    allowed_domains = ["www.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/sss"]

    rules = [
        Rule(SgmlLinkExtractor(
            allow=('')),
            callback="parse",
            follow=True
        )
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        prices = hxs.select('//div[@class="sliderforward arrow"]')
        for price in prices:
            item = GotPItem()
            item["price"] = price.select("text()").extract()
            yield item
If I understand correctly, you are trying to follow the pagination and extract the results.
In this case, you can avoid using CrawlSpider and use the regular Spider class.
The idea would be to parse the first page, extract the total results count, calculate how many pages to go through, and yield scrapy.Request instances to the same URL, providing the s GET parameter (the result offset).
Implementation example:
import scrapy

class GotP(scrapy.Spider):
    name = "gotp"
    allowed_domains = ["www.sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/sss"]

    results_per_page = 100

    def parse(self, response):
        total_count = int(response.xpath('//span[@class="totalcount"]/text()').extract()[0])
        for page in xrange(0, total_count, self.results_per_page):
            yield scrapy.Request("http://sfbay.craigslist.org/search/sss?s=%s&" % page,
                                 callback=self.parse_result, dont_filter=True)

    def parse_result(self, response):
        results = response.xpath("//p[@data-pid]")
        for result in results:
            try:
                print result.xpath(".//span[@class='price']/text()").extract()[0]
            except IndexError:
                print "Unknown price"
This would follow the pagination and print prices on the console. Hope this is a good starting point.
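If you want to collect the prices instead of just printing them, a small variation of parse_result (a sketch only) could yield items that the feed exporter understands:

    def parse_result(self, response):
        # Yield a dict per listing so the prices can be exported with -o
        for result in response.xpath("//p[@data-pid]"):
            price = result.xpath(".//span[@class='price']/text()").extract()
            yield {"price": price[0] if price else "Unknown price"}

Running scrapy crawl gotp -o prices.csv would then write them to a CSV file.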