Stop Scrapy after N items scraped

I'm having trouble with Scrapy. I need code that will scrape up to 1000 internal links per given URL. My code works when run from the command line, but the spider doesn't stop; it only receives the message.
My code is as follows:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from scrapy.contrib.closespider import CloseSpider

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = 'testspider1'
    allowed_domains = ['angieslist.com']
    start_urls = ['http://www.angieslist.com']
    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=True), )

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url
        scrape_count = self.crawler.stats.get_value('item_scraped_count')
        print scrape_count
        limit = 10
        if scrape_count == limit:
            raise CloseSpider('Limit Reached')
        return item

My problem was that I was trying to apply CloseSpider in the wrong place. It's a setting that belongs in the settings.py file. When I set it manually there, or passed it as an argument on the command line, it worked (stopping within 10-20 of N, for what it's worth).
settings.py:
BOT_NAME = 'internal_links'
SPIDER_MODULES = ['internal_links.spiders']
NEWSPIDER_MODULE = 'internal_links.spiders'
CLOSESPIDER_PAGECOUNT = 1000
ITEM_PIPELINES = ['internal_links.pipelines.CsvWriterPipeline']
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'yo mama'
LOG_LEVEL = 'DEBUG'
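
Since the question asks for a cap on items rather than pages, it may be worth noting that the same CloseSpider extension also reads an item-based setting, CLOSESPIDER_ITEMCOUNT. A minimal settings.py fragment, with the value 1000 taken from the question rather than from any recommendation:

```python
# settings.py (fragment)

# Close the spider once about this many items have passed the item pipeline.
CLOSESPIDER_ITEMCOUNT = 1000

# Alternatively, cap the number of responses crawled, as in the settings above.
CLOSESPIDER_PAGECOUNT = 1000
```

Requests that are already in flight when the limit is hit are still processed, which is consistent with the spider stopping a few items past N rather than exactly at N.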

Related

Network Graph output from Scrapy

I'm pretty new to using Scrapy and I'm having difficulties. I'm trying to work with scrapy to crawl a website and return a list of nodes and edges to build a network graph of internal and external websites from my start page to a depth of x (to be determined).
I have the following code and I'm having trouble figuring out what the issue is.
My items.py file looks like this:
from scrapy.item import Item, Field

class SitegraphItem(Item):
    url = Field()
    linkedurls = Field()
My graphspider.py file is as follows:
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.utils.url import urljoin_rfc
from sitegraph.items import SitegraphItem

class GraphspiderSpider(CrawlSpider):
    name = 'graphspider'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/products/']
    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = SitegraphItem()
        i['url'] = response.url
        i['http_status'] = response.status
        llinks = []
        for anchor in hxs.select('//a[@href]'):
            href = anchor.select('@href').extract()[0]
            if not href.lower().startswith("javascript"):
                llinks.append(urljoin_rfc(response.url, href))
        i['linkedurls'] = llinks
        return i
and I modified the settings.py file to include:
BOT_NAME = 'sitegraph'
SPIDER_MODULES = ['sitegraph.spiders']
NEWSPIDER_MODULE = 'sitegraph.spiders'
FEED_FORMAT="jsonlines"
FEED_URI="C:\\Users\Merrie\\Desktop\\testscrape\\sitegraph\\sitegraph.json"
When I run it I'm using the following code:
$ scrapy crawl graphspider -o attempt2.csv
And my output table is empty. It also keeps throwing this error: "KeyError: 'SitegraphItem does not support field: http_status'"

The error is caused by the missing http_status field in your items.py. Add it:
from scrapy.item import Item, Field

class SitegraphItem(Item):
    url = Field()
    linkedurls = Field()
    http_status = Field()

Scrapy is Visiting same Url despite dont_filter=False

Problem: Scrapy keeps visiting a single URL and scraping it recursively. I have checked response.url to confirm that it is a single page being scraped repeatedly, and that no query string is involved that might render the same page under different URLs.
What I have done to resolve it:
In Scrapy/spider.py I noticed that dont_filter was set to True and changed it to False, but it didn't help.
I have also set unique = True in the code, but this didn't help either.
Additional information
The page given as start_url has only one link, to a page a.html. Scrapy keeps scraping a.html again and again.
Code
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from kt.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["datacaredubai.com"]
    start_urls = ["http://www.datacaredubai.com/aj/link.html"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('/aj'), unique=('Yes')), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('/html/head/meta[3]').extract()
            item['req_url'] = response.url
            items.append(item)
        return items

By default, Scrapy appends to the output file if it already exists. What you see in output.csv is the result of multiple spider runs. Remove output.csv before running the spider again.

scrapy crawl spider ajax pagination

I was trying to scrape a link that uses an AJAX call for pagination.
I am trying to crawl the http://www.demo.com link, and in the .py file I provided this code with a restrict_xpaths rule:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import sumSpider, Rule
from scrapy.selector import HtmlXPathSelector
from sum.items import sumItem

class Sumspider1(sumSpider):
    name = 'sumDetailsUrls'
    allowed_domains = ['sum.com']
    start_urls = ['http://www.demo.com']
    rules = (
        Rule(LinkExtractor(restrict_xpaths='.//ul[@id="pager"]/li[8]/a'), callback='parse_start_url', follow=True),
    )

    #use parse_start_url if your spider wants to crawl from first page , so overriding
    def parse_start_url(self, response):
        print '********************************************1**********************************************'
        #//div[@class="showMoreCars hide"]/a
        #.//ul[@id="pager"]/li[8]/a/@href
        self.log('Inside - parse_item %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = sumItem()
        item['page'] = response.url
        title = hxs.xpath('.//h1[@class="page-heading"]/text()').extract()
        print '********************************************title**********************************************',title
        urls = hxs.xpath('.//a[@id="linkToDetails"]/@href').extract()
        print '**********************************************2***url*****************************************',urls
        finalurls = []
        for url in urls:
            print '---------url-------',url
            finalurls.append(url)
        item['urls'] = finalurls
        return item
My items.py file contains
from scrapy.item import Item, Field

class sumItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    page = Field()
    urls = Field()
Still, I'm not getting the expected output: I'm not able to fetch all pages when I crawl it.

I hope the below code will help.
somespider.py
# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from demo.items import DemoItem
from selenium import webdriver

def removeUnicodes(strData):
    if(strData):
        strData = strData.encode('utf-8').strip()
        strData = re.sub(r'[\n\r\t]',r' ',strData.strip())
    return strData

class demoSpider(scrapy.Spider):
    name = "domainurls"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/used/cars-in-trichy/']

    def __init__(self):
        self.driver = webdriver.Remote("http://127.0.0.1:4444/wd/hub", webdriver.DesiredCapabilities.HTMLUNITWITHJS)

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.implicitly_wait(5)
        hxs = Selector(response)
        item = DemoItem()
        finalurls = []
        while True:
            next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')
            try:
                next.click()
                # get the data and write it to scrapy items
                item['pageurl'] = response.url
                item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
                urls = self.driver.find_elements_by_xpath('.//a[@id="linkToDetails"]')
                for url in urls:
                    url = url.get_attribute("href")
                    finalurls.append(removeUnicodes(url))
                item['urls'] = finalurls
            except:
                break
        self.driver.close()
        return item
items.py
from scrapy.item import Item, Field

class DemoItem(Item):
    page = Field()
    urls = Field()
    pageurl = Field()
    title = Field()
Note:
You need to have a Selenium RC server running, because HTMLUNITWITHJS works only with Selenium RC when using Python.
Run your Selenium RC server by issuing the command:
java -jar selenium-server-standalone-2.44.0.jar
Run your spider using the command:
scrapy crawl domainurls -o someoutput.json

You can check with your browser how the requests are made.
Behind the scenes, right after you click the "show more cars" button, your browser requests JSON data to feed the next page. You can take advantage of this and work directly with the JSON data, without needing a JavaScript engine such as Selenium or PhantomJS.
In your case, as a first step you should simulate a user scrolling down the page given by your start_url parameter while profiling your network requests to discover the endpoint the browser uses to request that JSON. In general, the browser's profiling tool has an XHR (XMLHttpRequest) section, as here in Safari, where you can navigate through all the resources/endpoints used to request the data.
Once you discover this endpoint, the task is straightforward: give your spider the endpoint as its start_url and, as you process and navigate through the JSON, work out whether there is a next page to request.
P.S.: I checked for you, and the endpoint URL is http://www.carwale.com/webapi/classified/stockfilters/?city=194&kms=0-&year=0-&budget=0-&pn=2
In this case my browser requested the second page, as you can see from the pn parameter. It is important that you set some header parameters before you send the request. I noticed that in your case the headers are:
Accept text/plain, */*; q=0.01
Referer http://www.carwale.com/used/cars-in-trichy/
X-Requested-With XMLHttpRequest
sourceid 1
User-Agent Mozilla/5.0...
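
As a rough sketch of the approach described above, the endpoint URL for each page and the headers can be assembled with the standard library alone. The names ENDPOINT, BASE_PARAMS, AJAX_HEADERS and page_url are made up for illustration; the values are copied from this answer, and whether the site still responds this way is not something the sketch checks.

```python
from urllib.parse import urlencode

# Endpoint and query parameters copied from the URL quoted above.
ENDPOINT = "http://www.carwale.com/webapi/classified/stockfilters/"
BASE_PARAMS = {"city": 194, "kms": "0-", "year": "0-", "budget": "0-"}

# Headers as listed above. "*/*" is the usual wildcard form of this Accept
# value. The User-Agent is truncated in the answer, so it is left out here
# rather than guessed.
AJAX_HEADERS = {
    "Accept": "text/plain, */*; q=0.01",
    "Referer": "http://www.carwale.com/used/cars-in-trichy/",
    "X-Requested-With": "XMLHttpRequest",
    "sourceid": "1",
}

def page_url(page_number):
    """Build the endpoint URL for one result page (pn is the page number)."""
    params = dict(BASE_PARAMS, pn=page_number)
    return ENDPOINT + "?" + urlencode(params)

print(page_url(2))
```

Inside a spider, each URL and the header dict could then be passed to a request, for example scrapy.Request(page_url(n), headers=AJAX_HEADERS, callback=...), increasing n until the JSON indicates there is no further page.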

Avoid bad requests due to relative urls

I am trying to crawl a website using Scrapy, and the URLs of every page I want to scrape are all written using a relative path of this kind:
<!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) -->
<a href="../../en/item-to-scrap.html">Link</a>
Now, in my browser, these links work, and you get to URLs like https://www.domain-name.com/en/item-to-scrap.html (despite the relative path going back up twice in the hierarchy instead of once).
But my CrawlSpider does not manage to translate these URLs into a "correct" one, and all I get is errors of this kind:
2013-10-13 09:30:41-0500 [domain-name.com] DEBUG: Retrying <GET https://www.domain-name.com/../en/item-to-scrap.html> (failed 1 times): 400 Bad Request
Is there a way to fix this, or am I missing something?
Here is my spider's code, fairly basic (on the basis of item urls matching "/en/item-*-scrap.html") :
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Product(Item):
    name = Field()

class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), follow=True),
    )

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product

Basically, deep down, Scrapy uses http://docs.python.org/2/library/urlparse.html#urlparse.urljoin to get the next URL by joining the current URL and the scraped link. And if you join the URLs you gave as an example,
<!-- on page https://www.domain-name.com/en/somelist.html -->
<a href="../../en/item-to-scrap.html">Link</a>
the returned URL is the same as the URL in the Scrapy error. Try this in a Python shell:
import urlparse
urlparse.urljoin("https://www.domain-name.com/en/somelist.html","../../en/item-to-scrap.html")
The urljoin behaviour seems to be valid. See : https://www.rfc-editor.org/rfc/rfc1808.html#section-5.2
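
For comparison, here is a small sketch of the same join with Python 3's urllib.parse.urljoin. Since Python 3.5 that function follows RFC 3986 rather than RFC 1808, so the excess ".." is discarded the way a browser does it. This only illustrates the standard-library call; which join a given Scrapy version uses internally is not shown here.

```python
from urllib.parse import urljoin

base = "https://www.domain-name.com/en/somelist.html"
link = "../../en/item-to-scrap.html"

# On Python 3.5+ the ".." that would climb above the root is dropped.
resolved = urljoin(base, link)
print(resolved)  # https://www.domain-name.com/en/item-to-scrap.html
```

Under the Python 2 urlparse module used above, the same inputs give the /../en/ form seen in the error log.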
If possible, can you share the site you are crawling?
With this understanding, the possible solutions are:
1. Manipulate the URLs generated in the CrawlSpider (remove those two dots and the slash). Basically, override parse or _requests_to_follow.
Source of CrawlSpider: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py
2. Manipulate the URL in a downloader middleware; this might be cleaner. You remove the ../ in the process_request of the downloader middleware.
Documentation for downloader middleware: http://scrapy.readthedocs.org/en/0.16/topics/downloader-middleware.html
3. Use BaseSpider and return the manipulated URL requests you want to crawl further.
Documentation for BaseSpider: http://scrapy.readthedocs.org/en/0.16/topics/spiders.html#basespider
Please let me know if you have any questions.

I finally found a solution thanks to this answer. I used process_links as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Product(Item):
    name = Field()

class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), process_links='process_links', callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), process_links='process_links', follow=True),
    )

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product

    def process_links(self, links):
        for i, w in enumerate(links):
            w.url = w.url.replace("../", "")
            links[i] = w
        return links
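
The blanket replace("../", "") above also rewrites any "../" that happens to appear in a query string. A slightly narrower variant, sketched here with the standard library only, touches just the path component. The helper name drop_parent_refs is hypothetical, and like the original it simply drops ".." segments rather than resolving them.

```python
from urllib.parse import urlsplit, urlunsplit

def drop_parent_refs(url):
    """Remove ".." segments from the path of a URL, leaving the rest untouched."""
    parts = urlsplit(url)
    segments = [seg for seg in parts.path.split("/") if seg != ".."]
    return urlunsplit(parts._replace(path="/".join(segments)))

print(drop_parent_refs("https://www.domain-name.com/../en/item-to-scrap.html"))
# https://www.domain-name.com/en/item-to-scrap.html
```

In process_links, the assignment would then read w.url = drop_parent_refs(w.url).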

Using Scrapy to crawl the urls in the webpage

I'm using Scrapy to extract data from certain websites. The problem is that my spider can only crawl the pages in the initial start_urls; it can't crawl the URLs found in those pages.
I copied the same spider exactly:
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from nextlink.items import NextlinkItem

class Nextlink_Spider(BaseSpider):
    name = "Nextlink"
    allowed_domains = ["Nextlink"]
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//body/div[2]/div[3]/div/ul/li[2]/a/@href')
        for site in sites:
            relative_url = site.extract()
            url = self._urljoin(response,relative_url)
            yield Request(url, callback = self.parsetext)

    def parsetext(self, response):
        log = open("log.txt", "a")
        log.write("test if the parsetext is called")
        hxs = HtmlXPathSelector(response)
        items = []
        texts = hxs.select('//div').extract()
        for text in texts:
            item = NextlinkItem()
            item['text'] = text
            items.append(item)
            log = open("log.txt", "a")
            log.write(text)
        return items

    def _urljoin(self, response, url):
        """Helper to convert relative urls to absolute"""
        return urljoin_rfc(response.url, url, response.encoding)
I use log.txt to test whether parsetext is called. However, after I ran my spider, there is nothing in log.txt.

See here:
http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html?highlight=allowed_domains#scrapy.spider.BaseSpider.allowed_domains
allowed_domains
An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won’t be followed if OffsiteMiddleware is enabled.
Note that the OffsiteMiddleware is enabled by default, so unless you disabled it in your settings, requests to domains not listed in allowed_domains are dropped.
Check in settings.py whether the OffsiteMiddleware is active. If you want your spider to crawl any domain, leave allowed_domains out completely or disable the OffsiteMiddleware.

I think the problem is that you didn't tell Scrapy to follow each crawled URL. For my own blog I've implemented a CrawlSpider that uses LinkExtractor-based Rules to extract all relevant links from my blog pages:
# -*- coding: utf-8 -*-
'''
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*
* @author Marcel Lange <info@ask-sheldon.com>
* @package ScrapyCrawler
'''
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import Crawler.settings
from Crawler.items import PageCrawlerItem

class SheldonSpider(CrawlSpider):
    name = Crawler.settings.CRAWLER_NAME
    allowed_domains = Crawler.settings.CRAWLER_DOMAINS
    start_urls = Crawler.settings.CRAWLER_START_URLS
    rules = (
        Rule(
            LinkExtractor(
                allow_domains=Crawler.settings.CRAWLER_DOMAINS,
                allow=Crawler.settings.CRAWLER_ALLOW_REGEX,
                deny=Crawler.settings.CRAWLER_DENY_REGEX,
                restrict_css=Crawler.settings.CSS_SELECTORS,
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback='parse_item',
            process_links='filter_links'
        ),
    )

    # Filter links with the nofollow attribute
    def filter_links(self, links):
        return_links = list()
        if links:
            for link in links:
                if not link.nofollow:
                    return_links.append(link)
                else:
                    self.logger.debug('Dropped link %s because nofollow attribute was set.' % link.url)
        return return_links

    def parse_item(self, response):
        # self.logger.info('Parsed URL: %s with STATUS %s', response.url, response.status)
        item = PageCrawlerItem()
        item['status'] = response.status
        item['title'] = response.xpath('//title/text()')[0].extract()
        item['url'] = response.url
        item['headers'] = response.headers
        return item
On https://www.ask-sheldon.com/build-a-website-crawler-using-scrapy-framework/ I've described in detail how I implemented a website crawler to warm up my WordPress full-page cache.

My guess would be this line:
allowed_domains = ["Nextlink"]
This isn't a domain like domain.tld, so it would reject any links.
Use a real domain instead, as in the example from the documentation: allowed_domains = ["dmoz.org"]
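
To make the effect concrete, here is a simplified illustration of a host-against-allowed-domains check. It is a sketch of the idea only, not Scrapy's actual code, and the function name host_is_allowed is made up for this example.

```python
import re
from urllib.parse import urlparse

def host_is_allowed(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of, an allowed domain."""
    if not allowed_domains:
        # Assumption for this sketch: an empty list means no restriction.
        return True
    host = urlparse(url).hostname or ""
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.search(pattern, host) is not None

url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
print(host_is_allowed(url, ["Nextlink"]))  # False
print(host_is_allowed(url, ["dmoz.org"]))  # True
```

With "Nextlink" as the only entry, no dmoz.org URL can match, so every request yielded from parse would be filtered out before parsetext is ever called.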
