I have implemented the crawler below.
It works and crawls the pages matched by the link extractor.
What I am trying to do is extract information from different places in each page:
- the href and text() under the class 'news' (if it exists)
- the image URL under the class 'think block' (if it exists)
I have three problems with my Scrapy spider:
1) The link extractor seems to duplicate pages
It seems to process the same page more than once. (I checked the export file and found the same ~.img appearing many times, which should hardly be possible.)
Every page on the site has hyperlinks at the bottom that direct users to the topics they are interested in. My objective is to extract information from each topic page (which lists the titles of several passages under the same topic) and the images found within each passage's page (you reach a passage's page by clicking its title on the topic page).
I suspect the link extractor loops over the same pages again because of this.
(Maybe this can be solved with depth_limit?)
2) Improving parse_item
I think parse_item is not very efficient. How could I improve it? I need to extract information from different places in the page (only when it exists, of course). Besides, it looks like parse_item only produces HKejImage but not HKejItem (again, I checked the output file). How should I tackle this?
3) I need the spider to be able to read Chinese.
I am crawling a site in HK, so being able to handle Chinese text is essential.
The site:
http://www1.hkej.com/dailynews/headline/article/1105148/IMF%E5%82%B3%E4%BF%83%E4%B8%AD%E5%9C%8B%E9%80%80%E5%87%BA%E6%95%91%E5%B8%82
Anything that belongs to 'dailynews' is what I want.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors import LinkExtractor
import items

class EconjournalSpider(CrawlSpider):
    name = "econJournal"
    allowed_domains = ["hkej.com"]
    login_page = 'http://www.hkej.com/template/registration/jsp/login.jsp'
    start_urls = 'http://www.hkej.com/dailynews'
    rules = (
        Rule(LinkExtractor(allow=('dailynews', ), unique=True), callback='parse_item', follow=True),
    )

    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    # name column
    def login(self, response):
        return FormRequest.from_response(response,
            formdata={'name': 'users', 'password': 'my password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "username" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            return Request(url=self.start_urls)
        else:
            self.log("\n\n\nYou are not logged in.\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens

    def parse_item(self, response):
        hxs = Selector(response)
        news = hxs.xpath("//div[@class='news']")
        images = hxs.xpath('//p')
        for image in images:
            allimages = items.HKejImage()
            allimages['image'] = image.xpath('a/img[not(@data-original)]/@src').extract()
            yield allimages
        for new in news:
            allnews = items.HKejItem()
            allnews['news_title'] = new.xpath('h2/@text()').extract()
            allnews['news_url'] = new.xpath('h2/@href').extract()
            yield allnews
Thank you very much and I would appreciate any help!
First, to change settings, either do it in the settings.py file or specify the custom_settings attribute on the spider, like:
custom_settings = {
    'DEPTH_LIMIT': 3,
}
Then, you have to make sure the spider is actually reaching the parse_item method (I think it isn't, but I haven't tested it yet). Also note how callback and follow interact on a rule: when a callback is given, follow defaults to False, so links from those pages are followed only because you set follow=True explicitly.
It is usually clearer to split this into two rules: one that decides which links to follow, and one that decides which pages are passed to the callback and returned as items.
Second, the XPath expressions in your parse_item method are incorrect. To get all the images, you could use something like:
images = hxs.xpath('//img')
and then, to get the image URL:
allimages['image'] = image.xpath('./@src').extract()
For the news, this looks like it could work:
allnews['news_title'] = new.xpath('.//a/text()').extract()
allnews['news_url'] = new.xpath('.//a/@href').extract()
Now, as I understand your problem, this isn't a LinkExtractor duplication error but a matter of poorly specified rules. Also make sure your XPath is valid, since your question didn't indicate that you needed XPath corrections.
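The relative lookups suggested above (search inside each 'news' block for its links, and take src from every img) can be tried without running a crawl. This is only a minimal sketch: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's Selector, which supports just a subset of XPath, so attributes are read with .get() instead of /@href. The markup is made up for illustration and is not taken from the real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup (an assumption, not the real page).
html = """
<html><body>
  <div class="news"><h2><a href="/dailynews/a1">First</a></h2></div>
  <div class="news"><h2><a href="/dailynews/a2">Second</a></h2></div>
  <p><a href="/x"><img src="/img/1.jpg"/></a></p>
</body></html>
"""
root = ET.fromstring(html)

# For every news block, look for links relative to that block (".//a").
news_items = []
for block in root.findall(".//div[@class='news']"):
    for a in block.findall(".//a"):
        news_items.append({"news_title": a.text, "news_url": a.get("href")})

# Every <img> anywhere in the document, whatever its parent is.
image_urls = [img.get("src") for img in root.iter("img")]

print(news_items)
print(image_urls)
```

In the spider itself the same idea is expressed with the XPath strings shown above ('.//a/text()', './/a/@href', './@src').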
Related
I'm very new to Python and Scrapy, and decided to try to build a spider instead of just being scared of a new, challenging-looking language.
So this is my first spider, and its purpose is to:
- Run through a website's pages (through the links it finds on every page)
- List all the links (a > href) that exist on every page
- Write down in each row: the page where the links were found, the links themselves (decoded + languages), the number of links on every page, and the HTTP response code of every link.
The problem I'm encountering is that it never stops crawling; it seems stuck in a loop, always re-crawling every page more than once...
What did I do wrong? (Obviously many things, since I had never written Python code before, but still.)
How can I make the spider crawl every page only once?
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import urllib.parse
import requests
import threading

class TestSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["cerve.co"]
    start_urls = ["https://cerve.co"]
    rules = [Rule(LinkExtractor(allow=['.*'], tags='a', attrs='href'), callback='parse_item', follow=True)]

    def parse_item(self, response):
        alllinks = response.css('a::attr(href)').getall()
        for link in alllinks:
            link = response.urljoin(link)
            yield {
                'page': urllib.parse.unquote(response.url),
                'links': urllib.parse.unquote(link),
                'number of links': len(alllinks),
                'status': requests.get(link).status_code
            }
The Scrapy documentation says:
By default, Scrapy filters out duplicated requests to URLs already visited. This can be configured by the setting DUPEFILTER_CLASS.
Solution 1: https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-DUPEFILTER_CLASS
My experience with your code:
There are a great many links, and I did not see any URL being visited twice.
Solution 2 (worst case):
In settings.py, set DEPTH_LIMIT to a number of your choice.
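As a rough illustration of the second solution, the settings.py fragment might look like the following. The value 3 is only an assumed example; DUPEFILTER_DEBUG is an existing Scrapy setting that makes the duplicate filter log every request it drops, which can help confirm whether pages really are being requested twice.

```python
# settings.py (sketch; the depth value is an arbitrary example)

# Stop following links more than this many hops away from the start URLs.
DEPTH_LIMIT = 3

# Log every duplicate request that gets filtered, not only the first one.
DUPEFILTER_DEBUG = True
```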
I’m trying to use Scrapy to log into a website, then navigate within that website, and eventually download data from it. Currently I’m stuck in the middle of the navigation part. Here are the things I looked into to solve the problem on my own.
Datacamp course on Scrapy
Following Pagination Links with Scrapy
http://scrapingauthority.com/2016/11/22/scrapy-login/
Scrapy - Following Links
Relative URL to absolute URL Scrapy
However, I do not seem to connect the dots.
Below is the code I currently use. I manage to log in (when I call the "open_in_browser" function, I see that I’m logged in). I also manage to "click" on the first button on the website in the "parse2" part (if I call "open_in_browser" after parse2, I see that the navigation bar at the top of the website has gone one level deeper).
The main problem is now in the "parse3" part, as I cannot navigate another level deeper (or maybe I can, but "open_in_browser" does not open the website any more, only if I put it after parse or parse2). My understanding is that I chain multiple parse functions one after another to navigate through the website.
Datacamp says I always need to start with a "start requests" function, which is what I tried, but in YouTube videos etc. I saw that most people start directly with parse functions. Using "inspect" on the website for parse3, I see that this time the href is a relative link, and I tried different methods (see source 5) to navigate to it, as I thought this might be the source of the error.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.crawler import CrawlerProcess

class LoginNeedScraper(scrapy.Spider):
    name = "login"
    start_urls = ["<some website>"]

    def parse(self, response):
        loginTicket = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[1]/@value').extract_first()
        execution = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[2]/@value').extract_first()
        return FormRequest.from_response(response, formdata={
            'loginTicket': loginTicket,
            'execution': execution,
            'username': '<someusername>',
            'password': '<somepassword>'},
            callback=self.parse2)

    def parse2(self, response):
        next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/@href').extract_first()
        yield scrapy.Request(url=next_page_url, callback=self.parse3)

    def parse3(self, response):
        next_page_url_2 = response.xpath('/html//div[@class = "headerPanel"]/div[3]/a/@href').extract_first()
        absolute_url = response.urljoin(next_page_url_2)
        yield scrapy.Request(url=absolute_url, callback=self.start_scraping)

    def start_scraping(self, response):
        open_in_browser(response)

process = CrawlerProcess()
process.crawl(LoginNeedScraper)
process.start()
You need to define rules in order to crawl a website completely; note that rules are only processed by CrawlSpider subclasses, not by a plain scrapy.Spider. Let's say you want to crawl all the links in the header of the website and then open each link to see the page it refers to.
To achieve this, first identify what you need to scrape, work out CSS or XPath selectors for those links, and put them in a rule. A rule without a callback only follows the links it extracts; give it a callback to have the matching pages handled by one of your methods. Below is a dummy example of creating rules, which you can map to your case:
rules = (
    Rule(LinkExtractor(restrict_css=[crawl_css_selectors])),
    Rule(LinkExtractor(restrict_css=[product_css_selectors]), callback='parse_item')
)
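On the relative-link point raised in the question: response.urljoin resolves an href against the URL of the page it was found on, in the same way as the standard library's urljoin. A minimal sketch of that resolution, using made-up example.com URLs (assumptions, not the site from the question):

```python
from urllib.parse import urljoin

# Hypothetical page URL standing in for response.url.
base = "https://example.com/app/section/index.html"

# Different shapes an href attribute can take.
hrefs = [
    "details/item.html",            # relative to the current directory
    "/reports/2020",                # relative to the site root
    "../other/",                    # one directory up
    "https://other.example.org/x",  # already absolute, returned unchanged
]

absolute = [urljoin(base, href) for href in hrefs]
for href, url in zip(hrefs, absolute):
    print(href, "->", url)
```

A scrapy.Request needs an absolute URL, so a relative href has to go through response.urljoin first, as parse3 does; in newer Scrapy versions response.follow accepts relative URLs directly.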
I am conducting research related to distributing the indexing of the internet.
While several such projects exist (IRLbot, Distributed-indexing, Cluster-Scrapy, Common-Crawl etc.), mine is more focused on incentivising such behavior. I am looking for a simple way to crawl real web pages without knowing anything about their URL or HTML structure, and to:
- extract all their text (in order to index it)
- collect all their URLs and add them to the URLs to crawl
- avoid crashing and continue gracefully (even without the scraped text) in case of a malformed web page
To clarify - this is only for Proof of Concept (PoC), so I don't mind it won't scale, it's slow, etc. I am aiming at scraping most of the text which is presented to the user, in most cases, with or without dynamic content, and with as little "garbage" such as functions, tags, keywords etc. A working simple partial solution which works out of the box is preferred over the perfect solution which requires a lot of expertise to deploy.
A secondary issue is the storing of the (url,extracted text) for indexing (by a different process?), but I think I will be able to figure it out myself with some more digging.
Any advice on how to augment "itsy"'s parse function will be highly appreciated!
import scrapy
from scrapy_1.tutorial.items import WebsiteItem

class FirstSpider(scrapy.Spider):
    name = 'itsy'
    # allowed_domains = ['dmoz.org']
    start_urls = \
        [
            "http://www.stackoverflow.com"
        ]

    # def parse(self, response):
    #     filename = response.url.split("/")[-2] + '.html'
    #     with open(filename, 'wb') as f:
    #         f.write(response.body)

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = WebsiteItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['body_text'] = sel.xpath('text()').extract()
            yield item
What you are looking for here is Scrapy's CrawlSpider.
CrawlSpider lets you define crawling rules that are applied to every page. Its default link extractor skips links to images, documents and other files that are not web pages, so it pretty much does the whole thing for you.
Here's an example of how your spider might look with CrawlSpider:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'crawlspider'
    start_urls = ['http://scrapy.org']
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        item['title'] = response.meta['link_text']
        # extracting basic body
        item['body'] = '\n'.join(response.xpath('//text()').extract())
        # or better just save whole source
        item['source'] = response.body
        return item
This spider will crawl every web page it can find on the website and record the title, URL and whole text body.
For the text body, you might want to extract it in some smarter way (to exclude JavaScript and other unwanted text nodes), but that's an issue of its own.
Actually, for what you are describing you probably want to save the full HTML source rather than text only, since unstructured text is useless for any sort of analytics or indexing.
There are also a bunch of Scrapy settings that can be adjusted for this type of crawling. They are nicely described in the Broad Crawls page of the docs.
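One possible reading of "extract it in some smarter way" is to keep only text nodes that sit outside script and style elements. The sketch below does this with the standard library's html.parser rather than Scrapy, so it runs on its own; VisibleTextParser and visible_text are hypothetical helper names, and the sample markup is made up. It ignores other sources of noise such as hidden elements or comments.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the non-script, non-style text of an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>"
)
print(visible_text(sample))
```

Inside parse_item, something like visible_text(response.text) could replace the '//text()' join, under the same caveats.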
*Note: I wrote the code in Spyder and ran it from the Anaconda command prompt with scrapy crawl KMSS
Question A:
I get an import error for my items module here, and so far there is no answer:
Import Module Error (I have just added some extra details to that question)
However, the import error does not stop me from running the script from the Anaconda command prompt (if I have understood it correctly).
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from crawlKMSS.items import CrawlkmssItem

class KmssSpider(Spider):
    name = "KMSS"
    allowed_domains = ["~/LotusQuickr/dept"]
    loginp = (
        'https://~/LotusQuickr/dept/Main.nsf?OpenDatabase&login',
    )
    start_urls = ('https://~/LotusQuickr/dept/Main.nsf',)
    rules = (
        Rule(LinkExtractor(),
             callback='parse', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.loginp, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'name': 'username', 'password': 'pw'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "what_should_I_put_here" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            self.initialized()
        else:
            self.log("You are not logged in.")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        hxs = Selector(response)
        tabs = [hxs.xpath('//div[@class="lotusBottomCorner"]/span/ul/li |//div[@class="q-otherItem"]/h4 | //div[@class="q-folderItem"]/h4')]
        for tab in tabs:
            kmTab = CrawlkmssItem()
            kmTab['title'] = tab.xpath(
                "a[contains(@class,'qtocsprite')]/text()").extract()
            kmTab['url'] = tab.xpath(
                "a[contains(@class,'qtocsprite')]/@href").extract()
            kmTab['fileurl'] = tab.xpath('a/@href').extract()
            kmTab['filename'] = tab.xpath('a/text()').extract()
            kmTab['folderurl'] = tab.xpath('a/@href').extract()
            kmTab['foldername'] = tab.xpath('a/test()').extract()
            yield kmTab
My first crawling project is written as above. My task is to extract information from our company's intranet (my computer has been configured to access the intranet).
Question B:
Is it possible to crawl an intranet?
The intranet requires authentication for everything except the login page (loginp).
(I used '~' to hide the actual site, as it is not supposed to be published, but all the '~'s are identical.)
I implemented the log-in step in the function login by referring to previous questions answered on Stack Overflow. However, I have to fill in the 'if something in response.body' check in the function check_login_response.
Question C:
I have no idea what I should put in place of 'something'.
After logging in (and I have no idea how to tell whether it has logged in or not), I should be able to go through every URL found from start_urls, and the spider should keep following every possible URL with the link extractor, under the format mentioned below.
Question D:
The spider starts with
https://~/LotusQuickr/dept/Main.nsf
while all the URLs follow the format:
https://~/LotusQuickr/dept/... (some with Main.nsf and some without Main.nsf)
I have to use allow=[''] under Rule so that it works for URLs with the format above. Am I correct? (The format is also listed under allowed_domains.)
With the selector, I need to extract three types of information:
1) The href and text() (if both exist) for each <li> under the <div> of class lotusBottomCorner.
2) The href and text() (if both exist) for each <h4> under each <td> with the class q-folderItem (if this class exists).
3) Finally, the href and text() (if both exist) for each <h4> under each <td> with the class q-otherItem (if this class exists).
Question E:
I tested each selector in my Chrome console to make sure it works. However, when I combined them with |, they no longer work. How should I fix or restructure this so that I can obtain all three kinds of information for every page?
I have my items.py as below:
import scrapy

class CrawlkmssItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    foldername = scrapy.Field()
    folderurl = scrapy.Field()
    filename = scrapy.Field()
    fileurl = scrapy.Field()
Sorry for asking such a lengthy question. I am very new to Scrapy and have already read through several tutorials and the documentation, yet I still have not managed to implement it.
I really appreciate all the help!
I want to extract data from http://community.sellfree.co.kr/. Scrapy is working, however it appears to only scrape the start_urls, and doesn't crawl any links.
I would like the spider to crawl the entire site.
The following is my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from metacritic.items import MetacriticItem

class MetacriticSpider(BaseSpider):
    name = "metacritic"  # Name of the spider, to be used when crawling
    allowed_domains = ["sellfree.co.kr"]  # Where the spider is allowed to go
    start_urls = [
        "http://community.sellfree.co.kr/"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=('.*',)),
             callback="parse", follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)  # The XPath selector
        sites = hxs.select('/html/body')
        items = []
        for site in sites:
            item = MetacriticItem()
            item['title'] = site.select('//a[@title]').extract()
            items.append(item)
        return items
There are two kinds of links on the page. One is onclick="location='../bbs/board.php?bo_table=maket_5_3' and another is <span class="list2">solution</span>
How can I get the crawler to follow both kinds of links?
Before I get started, I'd highly recommend using an updated version of Scrapy. It appears you're still using an old one, as many of the methods/classes you're using have been moved around or deprecated.
To the problem at hand: the scrapy.spiders.BaseSpider class will not do anything with the rules you specify. Instead, use the scrapy.contrib.spiders.CrawlSpider class, which has functionality to handle rules built in.
Next, you'll need to rename your parse() method, since CrawlSpider uses parse() internally. (We'll assume parse_page() for the rest of this answer.)
To pick up all basic links and have them crawled, your link extractor will need to be changed. You don't need regular expression syntax such as '.*' to follow everything; an empty allow pattern already matches every link. The following will pick them up, and the offsite filtering driven by allowed_domains will drop links that are not on the site:
rules = (
    Rule(SgmlLinkExtractor(allow=('')), callback="parse_page", follow=True),
)
As for the onclick=... links, these are JavaScript links, and the page you are trying to process relies on them heavily. Scrapy cannot crawl things like onclick=location.href="javascript:showLayer_tap('2')" or onclick="win_open('./bbs/profile.php?mb_id=wlsdydahs', because it can't execute showLayer_tap() or win_open() in Javascript.
(the following is untested, but should work and provide the basic idea of what you need to do)
You can write your own functions for parsing these, though. For instance, the following can handle onclick=location.href="./photo/":
import re

def process_onclick(value):
    # Pull the target URL out of an onclick value such as location.href="./photo/"
    m = re.search(r'location\.href="(.*?)"', value)
    if m:
        return m.group(1)
Then add the following rule (this only handles tables, expand it as needed):
Rule(SgmlLinkExtractor(allow=(''), tags=('table',),
                       attrs=('onclick',), process_value=process_onclick),
     callback="parse_page", follow=True),
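The question's own links use the form location='...' with single quotes, which the pattern for location.href="..." would not match. A hedged variant of process_onclick that accepts either form and either quote style might look like the sketch below; ONCLICK_URL is a hypothetical name, and returning None relies on the link extractor's documented behaviour of ignoring a link when process_value returns None.

```python
import re

# Matches location='...' or location.href="..." with either quote style.
# Group 1 is the quote character, group 2 the URL between the quotes.
ONCLICK_URL = re.compile(r"""location(?:\.href)?\s*=\s*(['"])(.*?)\1""")


def process_onclick(value):
    """Return the URL assigned to location in an onclick value, or None."""
    m = ONCLICK_URL.search(value)
    if not m:
        return None  # e.g. win_open(...) calls, which need their own handling
    url = m.group(2)
    if url.lower().startswith("javascript:"):
        return None  # a script call, not a navigable URL
    return url


print(process_onclick("location='../bbs/board.php?bo_table=maket_5_3'"))
print(process_onclick('location.href="./photo/"'))
```

Handlers such as win_open('...') would need an additional pattern of their own, since the URL there is a function argument rather than an assignment to location.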