Back to basics: Scrapy - python

I'm new to Scrapy and definitely need pointers. I've run through some examples but I'm not getting some of the basics. I'm running Scrapy 1.0.3.
Spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from matrix_scrape.items import MatrixScrapeItem

class MySpider(BaseSpider):
    name = "matrix"
    allowed_domains = ["https://www.kickstarter.com/projects/2061039712/matrix-the-internet-of-things-for-everyonetm"]
    start_urls = ["https://www.kickstarter.com/projects/2061039712/matrix-the-internet-of-things-for-everyonetm"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = MatrixScrapeItem()
        item['backers'] = hxs.select("//*[@id="backers_count"]/data").extract()
        item['totalPledged'] = hxs.select("//*[@id="pledged"]/data").extract()
        print backers, totalPledged
items.py:
import scrapy

class MatrixScrapeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    backers = scrapy.Field()
    totalPledged = scrapy.Field()
    pass
I'm getting the error:
File "/home/will/Desktop/repos/scrapy/matrix_scrape/matrix_scrape/spiders/test.py", line 15
item['backers'] = hxs.select("//*[#id="backers_count"]/data").extract()
My questions are: why isn't the selecting and extracting working properly? I also see people using Selector a lot instead of HtmlXPathSelector.
Also, I'm trying to save this to a CSV file and automate it on a schedule (extract these data points every 30 minutes). If anyone has pointers to examples of that, they'd get super brownie points :)

The syntax error is caused by the way you use double quotes. Mix single and double quotes:
item['backers'] = hxs.select('//*[@id="backers_count"]/data').extract()
item['totalPledged'] = hxs.select('//*[@id="pledged"]/data').extract()
As a side note, you can use the response.xpath() shortcut instead of instantiating HtmlXPathSelector:
def parse(self, response):
    item = MatrixScrapeItem()
    item['backers'] = response.xpath('//*[@id="backers_count"]/data').extract()
    item['totalPledged'] = response.xpath('//*[@id="pledged"]/data').extract()
    print item['backers'], item['totalPledged']
And you probably meant to get the text() of the data elements:
//*[@id="backers_count"]/data/text()
//*[@id="pledged"]/data/text()

Related

How to make Scrapy crawl only 1 page (make it non recursive)?

I'm using the latest version of Scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make it crawl only the URL(s) fed to it as part of the start_urls list. In most cases I want to crawl only one page, but in some cases there may be multiple pages that I will specify. I don't want it to crawl to other pages.
I've tried setting the depth limit to 1, but I'm not sure that my testing showed it accomplished what I was hoping to achieve.
Any help will be greatly appreciated!
Thank you!
2015-12-22 - Code update:
# -*- coding: utf-8 -*-
import scrapy
from generic.items import GenericItem

class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = GenericItem()
            item['entity_id'] = self.entity_id
            # gets the actual email address
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
            yield item
In the first answer below you mention using a generic spider; isn't that what I'm doing in the code? Also, are you suggesting I remove the
callback=self.parse_dir_contents
from the parse function?
Thank you.
It looks like you are using CrawlSpider, which is a special kind of Spider meant to crawl multiple categories inside pages.
To crawl only the URLs specified inside start_urls, just override the parse method, as that is the default callback of the start requests.
Below is code for a spider that will scrape the title from a blog (note: the XPath might not be the same for every blog).
Filename: /spiders/my_spider.py
import scrapy
# DmozItem is assumed to come from the project's items module, e.g.:
# from dmoz.items import DmozItem

class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["www.blogtrepreneur.com"]
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]

    def parse(self, response):
        items = []
        item = DmozItem()
        item["title"] = response.xpath('//h1/text()').extract()
        item["article"] = response.xpath('//div[@id="tve_editor"]//p//text()').extract()
        items.append(item)
        return items
The above code will only fetch the title and the article body of the given article.
I had the same problem, because I was using
import scrapy
from scrapy.spiders import CrawlSpider
Then I changed it to
import scrapy
from scrapy.spiders import Spider
and changed the class to
class mySpider(Spider):
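To tie this back to the asker's updated code: the recursion comes from parse() yielding a new Request for every link on the page. If the goal is to scrape only the pages in start_urls, drop that loop (and the parse_dir_contents callback) and extract directly in parse(). A minimal sketch reusing the question's own GenericItem, XPath, and regex (the generic.items import path is taken from the question):
import scrapy
from generic.items import GenericItem

class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id

    def parse(self, response):
        # no further Requests are yielded, so only the start_urls pages are fetched
        item = GenericItem()
        item['entity_id'] = self.entity_id
        # pulls the email addresses out of the mailto links on the start page
        item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
        yield item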

Scrapy returning zero results

I am attempting to learn how to use scrapy, and am trying to do what I think is a simple project. I am attempting to pull 2 pieces of data from a single webpage - crawling additional links isn't needed. However, my code seems to be returning zero results. I have tested the xpaths in Scrapy Shell, and both return the expected results.
My items.py is:
import scrapy

class StockItem(scrapy.Item):
    quote = scrapy.Field()
    time = scrapy.Field()
My spider, named stockscrapy.py, is:
import scrapy

class StockSpider(scrapy.Spider):
    name = "ugaz"
    allowed_domains = ["nasdaq.com"]
    start_urls = ["http://www.nasdaq.com/symbol/ugaz/"]

def parse(self, response):
    stock = StockItem()
    stock['quote'] = response.xpath('//*[@id="qwidget_lastsale"]/text()').extract()
    stock['time'] = response.xpath('//*[@id="qwidget_markettime"]/text()').extract()
    return stock
To run the script, I use the command line:
scrapy crawl ugaz -o stocks.csv
Any and all help is greatly appreciated.
You need to indent the parse block.
import scrapy

class StockSpider(scrapy.Spider):
    name = "ugaz"
    allowed_domains = ["nasdaq.com"]
    start_urls = ["http://www.nasdaq.com/symbol/ugaz/"]

    # Indent this block
    def parse(self, response):
        stock = StockItem()
        stock['quote'] = response.xpath('//*[@id="qwidget_lastsale"]/text()').extract()
        stock['time'] = response.xpath('//*[@id="qwidget_markettime"]/text()').extract()
        return stock
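As a side note, the spider also needs StockItem imported from the project's items module; the exact package name depends on how the project was created (the stocks package name below is only an assumed example):
from stocks.items import StockItem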

scrapy crawl multiple pages, extracting data and saving into mysql

Hi, can someone help me out? I seem to be stuck. I am learning how to crawl and save into MySQL using Scrapy. I am trying to get Scrapy to crawl all of the website's pages, starting with start_urls, but it does not seem to automatically crawl all of the pages, only the one; it does save into MySQL with pipelines.py. It also crawls all pages, and saves the data using pipelines.py, when it is given a list of URLs via f = open("urls.txt").
Here is my code:
test.py
import scrapy
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from gotp.items import GotPItem
from scrapy.log import *
from gotp.settings import *
from gotp.items import *

class GotP(CrawlSpider):
    name = "gotp"
    allowed_domains = ["www.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/sss"]
    rules = [
        Rule(SgmlLinkExtractor(allow=('')),
             callback="parse",
             follow=True)
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        prices = hxs.select('//div[@class="sliderforward arrow"]')
        for price in prices:
            item = GotPItem()
            item["price"] = price.select("text()").extract()
            yield item
If I understand correctly, you are trying to follow the pagination and extract the results.
In this case, you can avoid using CrawlSpider and use the regular Spider class.
The idea is to parse the first page, extract the total results count, calculate how many pages to go through, and yield scrapy.Request instances to the same URL, providing the s GET parameter value.
Implementation example:
import scrapy

class GotP(scrapy.Spider):
    name = "gotp"
    allowed_domains = ["www.sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/sss"]

    results_per_page = 100

    def parse(self, response):
        total_count = int(response.xpath('//span[@class="totalcount"]/text()').extract()[0])
        for page in xrange(0, total_count, self.results_per_page):
            yield scrapy.Request("http://sfbay.craigslist.org/search/sss?s=%s&" % page, callback=self.parse_result, dont_filter=True)

    def parse_result(self, response):
        results = response.xpath("//p[@data-pid]")
        for result in results:
            try:
                print result.xpath(".//span[@class='price']/text()").extract()[0]
            except IndexError:
                print "Unknown price"
This would follow the pagination and print prices on the console. Hope this is a good starting point.
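Since the asker already has a working MySQL pipeline, the final callback could yield items instead of printing, so the existing pipelines.py keeps receiving them. A sketch of parse_result reusing the GotPItem from the question:
    def parse_result(self, response):
        for result in response.xpath("//p[@data-pid]"):
            item = GotPItem()
            # the price span may be missing for some listings, so this can be an empty list
            item["price"] = result.xpath(".//span[@class='price']/text()").extract()
            yield item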

brute force web crawler, how to use Link Extractor towards increased automation. Scrapy

I'm using a Scrapy web crawler to extract a bunch of data, as I describe here. I've figured out a brute-force way to get the information I want, but it's really pretty crude: I just enumerate all the pages I want to scrape, which is a few hundred. I need to get this done, so I might just grit my teeth and bear it like a moron, but it would be so much nicer to automate this. How could this process be implemented with link extraction using Scrapy? I've looked at the documentation and made some experiments, as I describe in the question linked above, but nothing has worked yet. This is the brute-force code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from brute_force.items import BruteForceItem

class DmozSpider(BaseSpider):
    name = "brutus"
    allowed_domains = ["tool.httpcn.com"]
    start_urls = ["http://tool.httpcn.com/Html/Zi/21/PWAZAZAZXVILEPWXV.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQCQILEPWB.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQKOILEPWD.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQUYILEPWF.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQMEILEKOCQ.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQRNILEKOKO.shtml",
                  "http://tool.httpcn.com/Html/Zi/22/PWCQKOILUYUYKOTBCQ.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZAZRNILEPWRN.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQPWILEPWC.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQILILEPWE.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQTBILEKOAZ.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQXVILEKOPW.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZPWAZILEKOIL.shtml",
                  "http://tool.httpcn.com/Html/Zi/22/PWCQKOILRNUYKOTBUY.shtml"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = BruteForceItem()
        item["the_strokes"] = hxs.xpath('//*[@id="div_a1"]/div[2]').extract()
        item["character"] = hxs.xpath('//*[@id="div_a1"]/div[3]').extract()
        items.append(item)
        return items
I think this is what you want:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from brute_force.items import BruteForceItem
from urlparse import urljoin

class DmozSpider(BaseSpider):
    name = "brutus"
    allowed_domains = ["tool.httpcn.com"]
    start_urls = ['http://tool.httpcn.com/Zi/BuShou.html']

    def parse(self, response):
        for url in response.css('td a::attr(href)').extract():
            cb = self.parse if '/zi/bushou' in url.lower() else self.parse_item
            yield Request(urljoin(response.url, url), callback=cb)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = BruteForceItem()
        item["the_strokes"] = hxs.xpath('//*[@id="div_a1"]/div[2]').extract()
        item["character"] = hxs.xpath('//*[@id="div_a1"]/div[3]').extract()
        return item
Try this:
1. The spider starts with the start_urls.
2. self.parse: I just find all the a tags inside the td tags. If the URL contains '/zi/bushou', the response should go to self.parse again, because it is what you called the 'second layer'. If it does not contain '/zi/bushou' (I think a more specific regex would be better here), it is what you want and goes to the parse_item function.
3. self.parse_item: this is the function you use to get the information from the final page.
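For the link-extraction approach the question title asks about, a CrawlSpider with a LinkExtractor rule is the other common option. A sketch under the assumption (based on the item URLs above) that the character pages live under /Html/Zi/; the class and spider names are just placeholders:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from brute_force.items import BruteForceItem

class HttpcnSpider(CrawlSpider):
    name = "brutus_crawl"
    allowed_domains = ["tool.httpcn.com"]
    start_urls = ["http://tool.httpcn.com/Zi/BuShou.html"]

    rules = [
        # follow links whose URL matches the item-page pattern and parse each one
        Rule(LinkExtractor(allow=r'/Html/Zi/'), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        item = BruteForceItem()
        item["the_strokes"] = response.xpath('//*[@id="div_a1"]/div[2]').extract()
        item["character"] = response.xpath('//*[@id="div_a1"]/div[3]').extract()
        return item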

Scrapy Newbie Question - can't get tutorial file working

I am a complete newbie to Python and Scrapy so I started by trying to replicate the tutorial. I am trying to scrape the www.dmoz.org website as per the tutorial.
I composed dmoz_spider.py as indicated below:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dmoz.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz.org"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
but what I actually get is different from what the tutorial says I should see.
Any idea what I am screwing up?
I had this problem. Make sure you made the change below, as the tutorial tells you to do.
Open items.py and check whether you changed the class
class TutorialItem(Item):
    title = Field()
    link = Field()
    desc = Field()
into:
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
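For this to run, items.py also needs the corresponding imports; a minimal sketch of the full file:
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()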
There is nothing wrong with the code you pasted. The problem must be elsewhere; can you paste the whole output you get? (Your comment stops where the interesting part starts...)
You need to go to the directory containing the settings.py file and run
scrapy crawl dmoz from there.
Follow the structure of your project against https://github.com/scrapy/dirbot for clarity.
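One more detail: scrapy crawl takes the spider's name attribute, which in the spider pasted above is dmoz.org rather than dmoz, so for that spider the command would be:
scrapy crawl dmoz.org -o items.json
The optional -o flag dumps the scraped items to a file so you can compare them against the tutorial's expected output.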
