I'm using the latest version of Scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make it crawl only the URL(s) passed to it in the start_urls list. In most cases I want to crawl only one page, but in some cases I will specify several. I don't want it to follow links to any other pages.
I've tried setting the depth limit to 1, but from my testing I'm not sure it did what I was hoping for.
Any help will be greatly appreciated!
Thank you!
2015-12-22 - Code update:
# -*- coding: utf-8 -*-
import scrapy
from generic.items import GenericItem


class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = GenericItem()
            item['entity_id'] = self.entity_id
            # gets the actual email address
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
            yield item
Below, in the first response, you mention using a generic spider --- isn't that what I'm doing in the code? Also are you suggesting I remove the
callback=self.parse_dir_contents
from the parse function?
Thank you.
It looks like you are using CrawlSpider, which is a special kind of spider that follows links across pages according to rules.
To crawl only the URLs listed in start_urls, just override the parse method, since it is the default callback for the start requests.
Below is the code for a spider that scrapes the title and the body from a blog post (note: the XPath might not be the same for every blog).
Filename: /spiders/my_spider.py
import scrapy
# DmozItem must be imported from your project's items module


class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["www.blogtrepreneur.com"]
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]

    def parse(self, response):
        # No new requests are yielded here, so only start_urls is crawled
        items = []
        item = DmozItem()
        item["title"] = response.xpath('//h1/text()').extract()
        item["article"] = response.xpath('//div[@id="tve_editor"]//p//text()').extract()
        items.append(item)
        return items
The above code will only fetch the title and the article body of the given article.
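Applied to the e-mail spider from the question, the same idea means doing the extraction in parse() itself instead of yielding new requests. Below is a minimal sketch of just the extraction step in plain Python, so it can be tried without running a crawl. extract_emails is a hypothetical helper, not part of Scrapy, and the regular expression is only one possible way to match mailto links.

```python
import re

# Matches href="mailto:..." and captures the address, stopping at a quote,
# whitespace, or the "?" that starts optional parameters such as ?subject=...
MAILTO_RE = re.compile(r'href\s*=\s*["\']mailto:\s*([^"\'?\s]+)', re.IGNORECASE)


def extract_emails(html):
    """Return the unique mailto addresses found in html, in first-seen order.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    seen = []
    for address in MAILTO_RE.findall(html):
        if address not in seen:
            seen.append(address)
    return seen
```

Inside parse() you could then set item['emails'] = extract_emails(response.text) and yield the item. As long as parse() yields no scrapy.Request, only the pages in start_urls are fetched.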
I had the same problem because I was using
import scrapy
from scrapy.spiders import CrawlSpider
Then I changed it to
import scrapy
from scrapy.spiders import Spider
and changed the class to
class mySpider(Spider):
Related
I tried using a generic scrapy.Spider to follow links, but it didn't work. So I hit upon the idea of simplifying the process by reading the sitemap.txt instead, but that didn't work either!
I wrote a simple example (to help me understand the algorithm) of a spider to follow the sitemap specified on my site: https://legion-216909.appspot.com/sitemap.txt It is meant to navigate the URLs specified on the sitemap, print them out to screen and output the results into a links.txt file. The code:
import scrapy
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    name = "spyder_PAGE"
    sitemap_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        print(response.url)
        return response.url
I ran the above spider as scrapy crawl spyder_PAGE > links.txt, but that produced an empty text file. I have gone through the Scrapy docs multiple times, but I am still missing something. Where am I going wrong?
SitemapSpider expects a sitemap in XML format, so it ignores your file, logs the warning below, and finishes without crawling anything:
[scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://legion-216909.appspot.com/sitemap.txt>
Since your sitemap.txt file is just a simple list of URLs, it is easier to split it into lines with a string method.
For example:
from scrapy import Spider, Request


class MySpider(Spider):
    name = "spyder_PAGE"
    start_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        links = response.text.split('\n')
        for link in links:
            # yield a request to get this link
            print(link)
            # https://legion-216909.appspot.com/index.html
            # https://legion-216909.appspot.com/content.htm
            # https://legion-216909.appspot.com/Dataset/module_4_literature/Unit_1/.DS_Store
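One detail worth handling: splitting on '\n' leaves an empty string when the file ends with a newline, and leaves a trailing '\r' on every line if the file uses Windows line endings. A minimal sketch of a more forgiving split follows. parse_txt_sitemap is a hypothetical helper name, and keeping only absolute http(s) URLs is an assumption about what the file should contain.

```python
def parse_txt_sitemap(text):
    """Return the URLs listed one per line in a plain-text sitemap.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    urls = []
    for line in text.splitlines():  # handles both \n and \r\n line endings
        line = line.strip()
        # Skip blank lines and anything that is not an absolute http(s) URL.
        if line.startswith(("http://", "https://")):
            urls.append(line)
    return urls
```

In parse() you would loop over parse_txt_sitemap(response.text) and yield a Request for each URL, with whatever callback handles the individual pages.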
You only need to override _parse_sitemap(self, response) from SitemapSpider with the following:
from scrapy import Request
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = [...]
    sitemap_rules = [...]

    def _parse_sitemap(self, response):
        # yield a request for each url in the txt file that matches your filters
        urls = response.text.splitlines()
        it = self.sitemap_filter(urls)
        for loc in it:
            for r, c in self._cbs:
                if r.search(loc):
                    yield Request(loc, callback=c)
                    break
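The inner loop above gives each URL to the first rule whose regex matches and then stops. That routing logic can be sketched on its own in plain Python, which makes it easy to check your rules before running a crawl. route_urls and the callback names parse_dataset and parse_page are hypothetical, chosen only for illustration.

```python
import re


def route_urls(urls, rules):
    """Pair each URL with the callback name of the first rule it matches.

    rules is a list of (pattern, callback_name) tuples, checked in order.
    URLs that match no rule are dropped. Hypothetical helper, not Scrapy API.
    """
    compiled = [(re.compile(pattern), name) for pattern, name in rules]
    routed = []
    for url in urls:
        for regex, name in compiled:
            if regex.search(url):
                routed.append((url, name))
                break  # first matching rule wins
    return routed
```

Because the first match wins, the order of the rules matters: put the more specific patterns first.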
Hi, can someone help me out? I seem to be stuck. I am learning how to crawl a site and save the results into MySQL using Scrapy. I am trying to get Scrapy to crawl all of the website's pages, starting from start_urls, but it does not automatically crawl all of the pages, only the first one. That page is saved into MySQL by pipelines.py. It does crawl all pages when the URLs are read from a file with f = open("urls.txt"), and that data is saved by pipelines.py as well.
Here is my code:
test.py
import scrapy
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from gotp.items import GotPItem
from scrapy.log import *
from gotp.settings import *
from gotp.items import *


class GotP(CrawlSpider):
    name = "gotp"
    allowed_domains = ["www.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/sss"]
    rules = [
        Rule(SgmlLinkExtractor(
            allow=('')),
            callback="parse",
            follow=True
        )
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        prices = hxs.select('//div[@class="sliderforward arrow"]')
        for price in prices:
            item = GotPItem()
            item["price"] = price.select("text()").extract()
            yield item
If I understand correctly, you are trying to follow the pagination and extract the results.
In this case, you can avoid CrawlSpider and use the regular Spider class.
The idea is to parse the first page, extract the total result count, calculate how many pages there are, and yield scrapy.Request instances for the same URL, each with a different value of the s GET parameter.
Implementation example:
import scrapy


class GotP(scrapy.Spider):
    name = "gotp"
    allowed_domains = ["www.sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/sss"]
    results_per_page = 100

    def parse(self, response):
        total_count = int(response.xpath('//span[@class="totalcount"]/text()').extract()[0])
        # One request per results page; s is the offset of the first result
        for page in range(0, total_count, self.results_per_page):
            yield scrapy.Request("http://sfbay.craigslist.org/search/sss?s=%s&" % page, callback=self.parse_result, dont_filter=True)

    def parse_result(self, response):
        results = response.xpath("//p[@data-pid]")
        for result in results:
            try:
                print(result.xpath(".//span[@class='price']/text()").extract()[0])
            except IndexError:
                print("Unknown price")
This would follow the pagination and print prices on the console. Hope this is a good starting point.
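The offset arithmetic is the part that is easiest to get wrong, so here is a minimal sketch of it on its own, assuming 100 results per page as in the spider above. page_urls is a hypothetical helper name, and the URL layout is taken from the example rather than from any documented Craigslist API.

```python
from urllib.parse import urlencode

BASE_URL = "http://sfbay.craigslist.org/search/sss"


def page_urls(total_count, results_per_page=100, base_url=BASE_URL):
    """Build one URL per results page, using the s offset parameter.

    Hypothetical helper for illustration; the offsets are 0, 100, 200, ...
    up to (but not including) total_count.
    """
    return [
        "%s?%s" % (base_url, urlencode({"s": offset}))
        for offset in range(0, total_count, results_per_page)
    ]
```

With a total count of 250 this gives three pages, with offsets 0, 100 and 200. A total count of 0 gives no pages at all.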
I'm using a Scrapy web crawler to extract a bunch of data. I've figured out a brute-force way to get the information I want, but it's really pretty crude: I just enumerate all the pages I want to scrape, which is a few hundred. I need to get this done, so I might just grit my teeth and bear it like a moron, but it would be so much nicer to automate this. How could this process be implemented with link extraction using Scrapy? I've looked at the documentation and made some experiments, as I describe in the linked question, but nothing has worked yet. This is the brute-force code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from brute_force.items import BruteForceItem


class DmozSpider(BaseSpider):
    name = "brutus"
    allowed_domains = ["tool.httpcn.com"]
    start_urls = ["http://tool.httpcn.com/Html/Zi/21/PWAZAZAZXVILEPWXV.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQCQILEPWB.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQKOILEPWD.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQUYILEPWF.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQMEILEKOCQ.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQRNILEKOKO.shtml",
                  "http://tool.httpcn.com/Html/Zi/22/PWCQKOILUYUYKOTBCQ.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZAZRNILEPWRN.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQPWILEPWC.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQILILEPWE.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQTBILEKOAZ.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZCQXVILEKOPW.shtml",
                  "http://tool.httpcn.com/Html/Zi/21/PWAZAZPWAZILEKOIL.shtml",
                  "http://tool.httpcn.com/Html/Zi/22/PWCQKOILRNUYKOTBUY.shtml"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = BruteForceItem()
        item["the_strokes"] = hxs.xpath('//*[@id="div_a1"]/div[2]').extract()
        item["character"] = hxs.xpath('//*[@id="div_a1"]/div[3]').extract()
        items.append(item)
        return items
I think this is what you want:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from brute_force.items import BruteForceItem
from urlparse import urljoin


class DmozSpider(BaseSpider):
    name = "brutus"
    allowed_domains = ["tool.httpcn.com"]
    start_urls = ['http://tool.httpcn.com/Zi/BuShou.html']

    def parse(self, response):
        for url in response.css('td a::attr(href)').extract():
            cb = self.parse if '/zi/bushou' in url.lower() else self.parse_item
            yield Request(urljoin(response.url, url), callback=cb)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = BruteForceItem()
        item["the_strokes"] = hxs.xpath('//*[@id="div_a1"]/div[2]').extract()
        item["character"] = hxs.xpath('//*[@id="div_a1"]/div[3]').extract()
        return item
Try this. Here is how it works:
1. The spider starts with the URLs in start_urls.
2. self.parse finds all the a tags inside td tags. If a URL contains '/zi/bushou', the response goes to self.parse again, because that is what you called the 'second layer'. If it does not contain '/zi/bushou' (a more specific regex would be better here), I assume it is a page you want, and it goes to the parse_item function.
3. self.parse_item is the function that extracts the information from the final page.
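As a sketch of what a more specific regex could look like: the item pattern below is inferred only from the URLs listed in the question (/Html/Zi/<number>/<capital letters>.shtml), so treat it as an assumption about the site's URL scheme. classify_link is a hypothetical helper name, and the code is written for Python 3 (urllib.parse instead of urlparse).

```python
import re
from urllib.parse import urljoin

# Index ("second layer") pages, e.g. /Zi/BuShou.html
INDEX_RE = re.compile(r"/zi/bushou", re.IGNORECASE)
# Final character pages, e.g. /Html/Zi/21/PWAZAZAZXVILEPWXV.shtml
# (pattern assumed from the URLs in the question)
ITEM_RE = re.compile(r"/Html/Zi/\d+/[A-Z]+\.shtml$")


def classify_link(page_url, href):
    """Return (absolute_url, kind), where kind is 'index', 'item' or None.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    url = urljoin(page_url, href)
    if INDEX_RE.search(url):
        return url, "index"
    if ITEM_RE.search(url):
        return url, "item"
    return url, None
```

In parse() you would send 'index' links back to self.parse and 'item' links to self.parse_item, and skip links whose kind is None, so unrelated pages are never requested.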
I am trying to collect all the URLs under a domain using Scrapy. I am trying to use CrawlSpider to start from the homepage and crawl the whole site. For each page, I want to use XPath to extract all the hrefs and store the data as key-value pairs:
Key: the current URL
Value: all the links on this page.
class MySpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com']
    rules = (Rule(SgmlLinkExtractor()), )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = AbcItem()
        item['key'] = response.url
        item['value'] = hxs.select('//a/@href').extract()
        return item
I defined AbcItem() as follows:
from scrapy.item import Item, Field


class AbcItem(Item):
    # key: url
    # value: list of links existing in the key url
    key = Field()
    value = Field()
    pass
And when I run my code like this:
nohup scrapy crawl abc.com -o output -t csv &
The robot seems to start crawling, and I can see the nohup.out file being populated with the configuration log, but nothing is written to my output file, which is what I am trying to collect. Can anyone help me with this? What might be wrong with my robot?
You need to define a callback for the rule. Here's an example that gets all the links from the twitter.com main page (follow=False):
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class MyItem(Item):
    url = Field()


class MySpider(CrawlSpider):
    name = 'twitter.com'
    allowed_domains = ['twitter.com']
    start_urls = ['http://www.twitter.com']
    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=False), )

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url
        return item
Then, in the output file, I see:
http://status.twitter.com/
https://twitter.com/
http://support.twitter.com/forums/26810/entries/78525
http://support.twitter.com/articles/14226-how-to-find-your-twitter-short-code-or-long-code
...
Hope that helps.
If you don't set the callback function explicitly, Scrapy will use the parse method to process crawled pages. So you should either add parse_item as the callback or change its name to parse.
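For the key/value layout asked about in the question, note that //a/@href returns the raw attribute values, which are often relative. Below is a minimal sketch in plain Python that builds one record per page with absolute links, using only the standard library so it can be tried without a crawl. links_record and _HrefCollector are hypothetical names, not Scrapy API.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_record(page_url, html):
    """Return {'key': page_url, 'value': [absolute links found in html]}.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    collector = _HrefCollector()
    collector.feed(html)
    return {
        "key": page_url,
        "value": [urljoin(page_url, href) for href in collector.hrefs],
    }
```

In a spider callback the equivalent would be to pass each extracted href through response.urljoin before storing it in item['value'].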
How do I follow links in this example: http://snippets.scrapy.org/snippets/7/ ?
The script stops after visiting the first page.
class MySpider(BaseSpider):
    """Our ad-hoc spider"""
    name = "myspider"
    start_urls = ["http://stackoverflow.com/"]
    question_list_xpath = '//div[@id="content"]//div[contains(@class, "question-summary")]'

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for qxs in hxs.select(self.question_list_xpath):
            loader = XPathItemLoader(QuestionItem(), selector=qxs)
            loader.add_xpath('title', './/h3/a/text()')
            loader.add_xpath('summary', './/h3/a/@title')
            loader.add_xpath('tags', './/a[@rel="tag"]/text()')
            loader.add_xpath('user', './/div[@class="started"]/a[2]/text()')
            loader.add_xpath('posted', './/div[@class="started"]/a[1]/span/@title')
            loader.add_xpath('votes', './/div[@class="votes"]/div[1]/text()')
            loader.add_xpath('answers', './/div[contains(@class, "answered")]/div[1]/text()')
            loader.add_xpath('views', './/div[@class="views"]/div[1]/text()')
            yield loader.load_item()
I've tried changing:
class MySpider(BaseSpider):
to
class MySpider(CrawlSpider):
and adding
rules = (
    Rule(SgmlLinkExtractor(allow=()),
         callback='parse', follow=True),
)
But it doesn't crawl the whole site.
Thanks,
Yes, you need to subclass CrawlSpider and rename the parse function to something like parse_page (and point the rule's callback at the new name), because CrawlSpider uses parse itself to apply its rules.
This was already answered