scrapy crawling at depth not working - python

I am writing scrapy code to crawl first page and one additional depth of given webpage
Somehow my crawler doesn't enter additional depth. Just crawls given starting urls and ends its operation.
I added filter_links callback function but even thts not getting called so clearly rules are getting ignored. what can be possible reason and what can i change to make it follow rules
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from crawlWeb.items import CrawlwebItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
class DmozSpider(CrawlSpider):
name = "premraj"
start_urls = [
"http://www.broadcom.com",
"http://www.qualcomm.com"
]
rules = [Rule(SgmlLinkExtractor(), callback='parse',process_links="process_links",follow=True)]
def parse(self, response):
#print dir(response)
#print dir(response)
item=CrawlwebItem()
item["html"]=response.body
item["url"]=response.url
yield item
def process_links(self,links):
print links
print "hey!!!!!!!!!!!!!!!!!!!!!"

There is a Warning box in the CrawlSpider documentation. It says:
When writing crawl spider rules, avoid using parse as callback, since
the CrawlSpider uses the parse method itself to implement its logic.
So if you override the parse method, the crawl spider will no longer
work.
Your code does probably not work as expected because you do use parse as callback.

Related

Scrapy keeps crawling and never stops... CrawlSpider rules

I'm very new to python and scrapy and decided to try and built a spider instead of just being scared of the new/challenging looking language.
So this is the first spider and it's purpose :
It runs through a website's pages (through links it finds on every
page)
List all the links (a>href) that exist on every page
Writes down in each row: the page where the links were found, the links themselves
(decoded+languages), number of links on every page, and http response code of every link.
The problem I'm encountering is that it's never stopping the crawl, it seems stuck in a loop and always re-crawling every page more then once...
What did I do wrong? (obviously many things since I never wrote a python code before, but still)
How can I make the spider crawl every page only once?
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import urllib.parse
import requests
import threading
class TestSpider(CrawlSpider):
name = "test"
allowed_domains = ["cerve.co"]
start_urls = ["https://cerve.co"]
rules = [Rule (LinkExtractor(allow=['.*'], tags='a', attrs='href'), callback='parse_item', follow=True)]
def parse_item(self, response):
alllinks = response.css('a::attr(href)').getall()
for link in alllinks:
link = response.urljoin(link)
yield {
'page': urllib.parse.unquote(response.url),
'links': urllib.parse.unquote(link),
'number of links': len(alllinks),
'status': requests.get(link).status_code
}
Scrapy said :
By default, Scrapy filters out duplicated requests to URLs already visited. This can be configured by the setting DUPEFILTER_CLASS.
Solution 1 : https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-DUPEFILTER_CLASS
My experience with your code :
There are so many links . And i did not see any duplicates urls being visited twice.
Solutions 2 in worst case
In settings.py set DEPTH_LIMIT= some number of your choice

Python Scrapy does not crawl website

I am new to python scrapy and trying to get through a small example, however I am having some problems!
I am able to crawl the first given URL only, but I am unable to crawl more than one page or an entire website for that matter!
Please help me or give me some advice on how I can crawl an entire website or more pages in general...
The example I am doing is very simple...
My items.py
import scrapy
class WikiItem(scrapy.Item):
title = scrapy.Field()
my wikip.py (the spider)
import scrapy
from wiki.items import WikiItem
class CrawlSpider(scrapy.Spider):
name = "wikip"
allowed_domains = ["en.wikipedia.org/wiki/"]
start_urls = (
'http://en.wikipedia.org/wiki/Portal:Arts',
)
def parse(self, response):
for sel in response.xpath('/html'):
item = WikiItem()
item['title'] = sel.xpath('//h1[#id="firstHeading"]/text()').extract()
yield item
When I run scrapy crawl wikip -o data.csv in the root project diretory the result is:
title
Portal:Arts
Can anyone give me insight as to why it is not following urls and crawling deeper?
I have checked some related SO questions but they have not helped to solve the issue
scrapy.Spider is the simplest spider. Change the name CrawlSpider, since Crawl Spider is one of the generic spiders of scrapy.
One of the below option can be used:
eg: 1. class WikiSpider(scrapy.Spider)
or 2. class WikiSpider(CrawlSpider)
If you are using first option you need to code the logic for following the links you need to follow on that webpage.
For second option you can do the below:
After the start urls you need to define the rule as below:
rules = (
Rule(LinkExtractor(allow=('https://en.wikipedia.org/wiki/Portal:Arts\?.*?')), callback='parse_item', follow=True,),
)
Also please change the name of the function defined as "parse" if you use CrawlSpider. The Crawl Spider uses parse method to implement the logic. Thus, here you are trying to override the parse method and hence the crawl spider doesn't work.

How does scrapy use rules?

I'm new to using Scrapy and I wanted to understand how the rules are being used within the CrawlSpider.
If I have a rule where I'm crawling through the yellowpages for cupcake listings in Tucson, AZ, how does yielding a URL request activate the rule - specifically how does it activiate the restrict_xpath attribute?
Thanks.
The rules attribute for a CrawlSpider specify how to extract the links from a page and which callbacks should be called for those links. They are handled by the default parse() method implemented in that class -- look here to read the source.
So, whenever you want to trigger the rules for an URL, you just need to yield a scrapy.Request(url, self.parse), and the Scrapy engine will send a request to that URL and apply the rules to the response.
The extraction of the links (that may or may not use restrict_xpaths) is done by the LinkExtractor object registered for that rule. It basically searches for all the <a>s and <area>s elements in the whole page or only in the elements obtained after applying the restrict_xpaths expressions if the attribute is set.
Example:
For example, say you have a CrawlSpider like so:
from scrapy.contrib.spiders.crawl import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
class MySpider(CrawlSpider):
start_urls = ['http://someurlhere.com']
rules = (
Rule(
LinkExtractor(restrict_xpaths=[
"//ul[#class='menu-categories']",
"//ul[#class='menu-subcategories']"]),
callback='parse'
),
Rule(
LinkExtractor(allow='/product.php?id=\d+'),
callback='parse_product_page'
),
)
def parse_product_page(self, response):
# yield product item here
The engine starts sending requests to the urls in start_urls and executing the default callback (the parse() method in CrawlSpider) for their response.
For each response, the parse() method will execute the link extractors on it to get the links from the page. Namely, it calls the LinkExtractor.extract_links(response) for each response object to get the urls, and then yields scrapy.Request(url, <rule_callback>) objects.
The example code is an skeleton for a spider that crawls an e-commerce site following the links of product categories and subcategories, to get links for each of the product pages.
For the rules registered specifically in this spider, it would crawl the links inside the lists of "categories" and "subcategories" with the parse() method as callback (which will trigger the crawl rules to be called for these pages), and the links matching the regular expression product.php?id=\d+ with the callback parse_product_page() -- which would finally scrape the product data.
As you can see, pretty powerful stuff. =)
Read more:
CrawlSpider - Scrapy docs
Link extractors - Scrapy docs

scrapy didn't crawl all link

I want to extract data from http://community.sellfree.co.kr/. Scrapy is working, however it appears to only scrape the start_urls, and doesn't crawl any links.
I would like the spider to crawl the entire site.
The following is my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from metacritic.items import MetacriticItem
class MetacriticSpider(BaseSpider):
name = "metacritic" # Name of the spider, to be used when crawling
allowed_domains = ["sellfree.co.kr"] # Where the spider is allowed to go
start_urls = [
"http://community.sellfree.co.kr/"
]
rules = (Rule (SgmlLinkExtractor(allow=('.*',))
,callback="parse", follow= True),
)
def parse(self, response):
hxs = HtmlXPathSelector(response) # The XPath selector
sites = hxs.select('/html/body')
items = []
for site in sites:
item = MetacriticItem()
item['title'] = site.select('//a[#title]').extract()
items.append(item)
return items
There are two kinds of links on the page. One is onclick="location='../bbs/board.php?bo_table=maket_5_3' and another is <span class="list2">solution</span>
How can I get the crawler to follow both kinds of links?
Before I get started, I'd highly recommend using an updated version of Scrapy. It appears you're still using an old one, as many of the methods/classes you're using have been moved around or deprecated.
To the problem at hand: the scrapy.spiders.BaseSpider class will not do anything with the rules you specify. Instead, use the scrapy.contrib.spiders.CrawlSpider class, which has functionality to handle rules built into.
Next, you'll need to switch your parse() method to a new name, since the the CrawlSpider uses parse() internally to work. (We'll assume parse_page() for the rest of this answer)
To pick up all basic links, and have them crawled, your link extractor will need to be changed. By default, you shouldn't use regular expression syntax for domains you want to follow. The following will pick it up, and your DUPEFILTER will filter out links not on the site:
rules = (
Rule(SgmlLinkExtractor(allow=('')), callback="parse_page", follow=True),
)
As for the onclick=... links, these are JavaScript links, and the page you are trying to process relies on them heavily. Scrapy cannot crawl things like onclick=location.href="javascript:showLayer_tap('2')" or onclick="win_open('./bbs/profile.php?mb_id=wlsdydahs', because it can't execute showLayer_tap() or win_open() in Javascript.
(the following is untested, but should work and provide the basic idea of what you need to do)
You can write your own functions for parsing these, though. For instance, the following can handle onclick=location.href="./photo/":
def process_onclick(value):
m = re.search("location.href=\"(.*?)\"", value)
if m:
return m.group(1)
Then add the following rule (this only handles tables, expand it as needed):
Rule(SgmlLinkExtractor(allow=(''), tags=('table',),
attrs=('onclick',), process_value=process_onclick),
callback="parse_page", follow=True),

Trying to crawl all links of a webpage with scrapy. But I cannot output the links on a page

My first question here :)
I was trying to crawl my schools website for all possible webpages there are. But I cannot get the links into a text file. I have the right permissions, so that is not the problem.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider
class hsleidenSpider(CrawlSpider):
name = "hsleiden1"
allowed_domains = ["hsleiden.nl"]
start_urls = ["http://hsleiden.nl"]
# allow=() is used to match all links
rules = [
Rule(SgmlLinkExtractor(allow=()), follow=True),
Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
]
def parse_item(self, response):
x = HtmlXPathSelector(response)
filename = "hsleiden-output.txt"
open(filename, 'ab').write(response.url)
So I am only scanning on the hsleiden.nl page. And I would like to have the response.url into the textfile hsleiden-output.txt.
Is there any way to do this right?
With reference to the documentation for CrawlSpider, if multiple rules match the same link then only the first will be used.
Thus, as a result of redirects, using the first rule results in a seemingly infinite loop. Since the second rule is ignored, none of the matching links are ever passed to the parse_item callback, which means no output file.
Some investigation is required to fix the redirect issue (and to modify the first rule so that it doesn't clash with the second), but commenting it out entirely will produce an output file of links like so:
http://www.hsleiden.nl/activiteitenkalenderhttp://www.hsleiden.nlhttp://www.hsleiden.nl/vind-je-studie/proefstuderenhttp://www.hsleiden.nl/studiumgenerale
etc
They were all munged together on a single line, so you might want to add a newline character or separator each time you write to the output file.

Categories