How do I follow links in this example: http://snippets.scrapy.org/snippets/7/ ?
The script stops after visiting the links on the first page.
class MySpider(BaseSpider):
    """Our ad-hoc spider"""
    name = "myspider"
    start_urls = ["http://stackoverflow.com/"]
    question_list_xpath = '//div[@id="content"]//div[contains(@class, "question-summary")]'

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for qxs in hxs.select(self.question_list_xpath):
            loader = XPathItemLoader(QuestionItem(), selector=qxs)
            loader.add_xpath('title', './/h3/a/text()')
            loader.add_xpath('summary', './/h3/a/@title')
            loader.add_xpath('tags', './/a[@rel="tag"]/text()')
            loader.add_xpath('user', './/div[@class="started"]/a[2]/text()')
            loader.add_xpath('posted', './/div[@class="started"]/a[1]/span/@title')
            loader.add_xpath('votes', './/div[@class="votes"]/div[1]/text()')
            loader.add_xpath('answers', './/div[contains(@class, "answered")]/div[1]/text()')
            loader.add_xpath('views', './/div[@class="views"]/div[1]/text()')
            yield loader.load_item()
I've tried changing:
class MySpider(BaseSpider):
to
class MySpider(CrawlSpider):
and adding:
    rules = (
        Rule(SgmlLinkExtractor(allow=()),
             callback='parse', follow=True),
    )
But it doesn't crawl the whole site.
Thanks,
Yes, you need to subclass CrawlSpider, and rename your parse function to something like parse_page, because CrawlSpider uses parse internally to start scraping.
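A minimal sketch of that change, staying with the old scrapy.contrib API used in the question (the allow pattern and the QuestionItem import path are illustrative assumptions, not the snippet's own):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from myproject.items import QuestionItem  # assumption: your project's item module

class MySpider(CrawlSpider):
    """CrawlSpider variant: parse is reserved, so the callback is parse_page."""
    name = "myspider"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/"]
    question_list_xpath = '//div[@id="content"]//div[contains(@class, "question-summary")]'

    rules = (
        # Illustrative pattern: follow question-list pagination links.
        Rule(SgmlLinkExtractor(allow=(r'\?page=\d+',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        for qxs in hxs.select(self.question_list_xpath):
            loader = XPathItemLoader(QuestionItem(), selector=qxs)
            loader.add_xpath('title', './/h3/a/text()')
            # ... same add_xpath calls as in the original spider ...
            yield loader.load_item()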
This was already answered.
I'm using the latest version of Scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make Scrapy crawl only the URL(s) fed to it in the start_urls list. In most cases I want to crawl only one page, but in some cases there may be multiple pages that I will specify. I don't want it to crawl to other pages.
I've tried setting the depth limit to 1, but in testing I'm not sure it accomplished what I was hoping to achieve.
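For reference, a depth limit of 1 is usually set like this (a sketch; a start URL has depth 0, so DEPTH_LIMIT = 1 still lets links found on the start pages be followed one level deep):
# settings.py
DEPTH_LIMIT = 1

# or per spider, from Scrapy 1.0 onwards:
import scrapy

class OnePageSpider(scrapy.Spider):   # placeholder spider name
    name = "one_page_spider"
    custom_settings = {"DEPTH_LIMIT": 1}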
Any help will be greatly appreciated!
Thank you!
2015-12-22 - Code update:
# -*- coding: utf-8 -*-
import scrapy

from generic.items import GenericItem


class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = GenericItem()
            item['entity_id'] = self.entity_id
            # gets the actual email address
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
            yield item
Below, in the first response, you mention using a generic spider. Isn't that what I'm doing in the code? Also, are you suggesting I remove the
callback=self.parse_dir_contents
from the parse function?
Thank you.
It looks like you are using CrawlSpider, which is a special kind of Spider for crawling through multiple categories of pages.
To crawl only the URLs specified in start_urls, just override the parse method, as that is the default callback for the start requests.
Below is code for a spider that will scrape the title from a blog (note: the XPath might not be the same for every blog):
Filename: /spiders/my_spider.py
import scrapy
# DmozItem is the item class defined in the project's items.py


class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["www.blogtrepreneur.com"]
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]

    def parse(self, response):
        items = []
        item = DmozItem()
        item["title"] = response.xpath('//h1/text()').extract()
        item["article"] = response.xpath('//div[@id="tve_editor"]//p//text()').extract()
        items.append(item)
        return items
The above code will only fetch the title and the article body of the given article.
I got the same problem, because I was using:
import scrapy
from scrapy.spiders import CrawlSpider
Then I changed it to:
import scrapy
from scrapy.spiders import Spider
And changed the class to:
class mySpider(Spider):
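A minimal sketch of what that plain Spider ends up looking like (the class name, URLs, and yielded fields are placeholders):
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = [
        "http://example.com/page-1",
        "http://example.com/page-2",
    ]

    def parse(self, response):
        # Only the start URLs are requested; no links are followed
        # unless you yield scrapy.Request yourself.
        yield {
            "url": response.url,
            "title": response.xpath("//title/text()").extract_first(),
        }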
I'm scraping reddit to get the link of every entry in a subreddit, and I would like to follow the links that match http://imgur.com/gallery/\w* too. But I'm having problems running the callback for Imgur; it just doesn't execute. What's failing?
I'm detecting the Imgur URLs with a simple if "http://imgur.com/gallery/" in item['link'][0]: statement. Maybe Scrapy provides a better way to detect them?
This is what I tried:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from reddit.items import RedditItem


class RedditSpider(CrawlSpider):
    name = "reddit"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "http://www.reddit.com/r/pics",
    ]

    rules = [
        Rule(
            LinkExtractor(allow=['/r/pics/\?count=\d.*&after=\w.*']),
            callback='parse_item',
            follow=True
        )
    ]

    def parse_item(self, response):
        for title in response.xpath("//div[contains(@class, 'entry')]/p/a"):
            item = RedditItem()
            item['title'] = title.xpath('text()').extract()
            item['link'] = title.xpath('@href').extract()
            yield item

            if "http://imgur.com/gallery/" in item['link'][0]:
                # print item['link'][0]
                url = response.urljoin(item['link'][0])
                print url
                yield scrapy.Request(url, callback=self.parse_imgur_gallery)

    def parse_imgur_gallery(self, response):
        print response.url
This is my Item class:
import scrapy


class RedditItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
This is the output when executing the spider with --nolog and printing the url variable inside the if condition (it's not the response.url output); it still doesn't run the callback:
PS C:\repos\python\scrapy\reddit> scrapy crawl --output=export.json --nolog reddit
http://imgur.com/gallery/W7sXs/new
http://imgur.com/gallery/v26KnSX
http://imgur.com/gallery/fqqBq
http://imgur.com/gallery/9GDTP/new
http://imgur.com/gallery/5gjLCPV
http://imgur.com/gallery/l6Tpavl
http://imgur.com/gallery/Ow4gQ
...
I've found it. The imgur.com domain wasn't allowed; I just needed to add it:
allowed_domains = ["reddit.com", "imgur.com"]
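With that change, the relevant part of the spider looks roughly like this (a sketch; the second Rule is just one possible alternative to the in-callback if check, letting a LinkExtractor detect the gallery links instead):
class RedditSpider(CrawlSpider):
    name = "reddit"
    # imgur.com must be allowed, otherwise the offsite middleware silently
    # drops the imgur requests and parse_imgur_gallery never runs.
    allowed_domains = ["reddit.com", "imgur.com"]
    start_urls = ["http://www.reddit.com/r/pics"]

    rules = [
        Rule(LinkExtractor(allow=['/r/pics/\?count=\d.*&after=\w.*']),
             callback='parse_item', follow=True),
        # Alternative: match the gallery URLs directly instead of checking
        # item['link'][0] inside parse_item.
        Rule(LinkExtractor(allow=['imgur\.com/gallery/\w+']),
             callback='parse_imgur_gallery'),
    ]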
I want to crawl all the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted all the URLs in the sitemap. Now I want to crawl each link of the sitemap. Any help would be highly useful. The code so far is:
class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print response.url
Essentially you could create new Request objects for the URLs found by the SitemapSpider and parse the responses with a new callback:
from scrapy.http import Request


class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print response.url
        # dont_filter=True, otherwise the duplicate filter drops the re-request
        return Request(response.url, callback=self.parse_sitemap_url, dont_filter=True)

    def parse_sitemap_url(self, response):
        # do stuff with your sitemap links
        pass
You need to add sitemap_rules to process the data in the crawled URLs, and you can create as many as you want.
For instance, say you have a page named http://www.xyz.nl/x/ for which you want to create a rule:
from scrapy.selector import Selector


class MySpider(SitemapSpider):
    name = 'xyz'
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # list of (regex, callback name) tuples - this example contains one page
    sitemap_rules = [('/x/', 'parse_x')]

    def parse_x(self, response):
        sel = Selector(response)
        paragraph = sel.xpath('//p').extract()
        # return an item/dict rather than a bare list of strings
        return {'paragraph': paragraph}
I am trying to collect all the URLs under a domain using Scrapy. I was trying to use CrawlSpider to start from the homepage and crawl the whole site. For each page, I want to use XPath to extract all the hrefs and store the data as key-value pairs.
Key: the current URL
Value: all the links on this page
class MySpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com']

    rules = (Rule(SgmlLinkExtractor()), )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = AbcItem()
        item['key'] = response.url
        item['value'] = hxs.select('//a/@href').extract()
        return item
I defined my AbcItem() as below:
from scrapy.item import Item, Field


class AbcItem(Item):
    # key: url
    # value: list of links existing in the key url
    key = Field()
    value = Field()
And when I run my code like this:
nohup scrapy crawl abc.com -o output -t csv &
The spider seems to begin crawling, and I can see the nohup.out file being populated by all the configuration logs, but there is no information in my output file, which is what I am trying to collect. Can anyone help me with this? What might be wrong with my spider?
You should have defined a callback for the rule. Here's an example that gets all the links from the twitter.com main page (follow=False):
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class MyItem(Item):
    url = Field()


class MySpider(CrawlSpider):
    name = 'twitter.com'
    allowed_domains = ['twitter.com']
    start_urls = ['http://www.twitter.com']

    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=False), )

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url
        return item
Then, in the output file, I see:
http://status.twitter.com/
https://twitter.com/
http://support.twitter.com/forums/26810/entries/78525
http://support.twitter.com/articles/14226-how-to-find-your-twitter-short-code-or-long-code
...
Hope that helps.
If you don't set the callback explicitly, Scrapy will use the parse method to process crawled pages, so you should either add parse_item as the callback in your rule or rename the method to parse.
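Applied to the spider from the question, the fix is roughly (a sketch):
class MySpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com']

    # Without callback=... the extracted pages are only followed, never
    # passed to parse_item, so no items ever reach the output file.
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True), )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = AbcItem()
        item['key'] = response.url
        item['value'] = hxs.select('//a/@href').extract()
        return item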
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc
This is my code. I want to scrape plenty of URLs using a loop, so how am I supposed to do that? I did put multiple URLs in there, but I didn't get output from all of them. Some URLs stop responding. How can I reliably get the data with this code?
Your code looks OK, but are you sure that the start_urls shouldn't start with http://?
start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
UPD
start_urls is the list of URLs Scrapy starts with. Usually it has one or two links, rarely more.
These pages must have an identical HTML structure, because the Scrapy spider processes them all the same way.
See, if I put 4-5 URLs in start_urls, it gives output OK for the first 2-3 URLs.
I don't believe this, because Scrapy doesn't care how many links are in the start_urls list.
But it stops responding. Also, tell me how I can implement a GUI for this?
Scrapy has a debug shell to test your code.
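For example, you can load one of the start URLs in the shell and try the selectors interactively before putting them in the spider (older Scrapy versions expose the page as hxs, newer ones as response):
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

# then, inside the shell:
hxs.select('//ul/li/a/text()').extract()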
You just posted the code from the tutorial. What you should do is actually read the whole documentation, especially the basic concepts part. What you basically want is the crawl spider, where you can define rules that the spider will follow and process with your given code.
To quote the docs with the example:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        item = Item()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
        return item