I am very new to Python. I am trying to print (and save) all blog posts from a website using Scrapy, and I want the spider to crawl only within the main content section. This is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from people.items import PeopleCommentItem

class people(CrawlSpider):
    name = "people"
    allowed_domains = ["http://blog.sina.com.cn/"]
    start_urls = ["http://blog.sina.com.cn/s/blog_53d7b5ce0100e7y0.html"]
    rules = [Rule(SgmlLinkExtractor(allow=("http://blog.sina.com.cn/",)), callback='parse_item', follow=True),
             # restrict the crawling to the articalContent section only
             Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="articalContent "]//a/@href')))
             ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        print hxs.select('//div[@class="articalContent "]//a/text()').extract()
Nothing is printed after:
DEBUG: Crawled (200) <GET http://blog.sina.com.cn/s/blog_53d7b5ce0100e7y0.html> (referer: None)
ScrapyDeprecationWarning: scrapy.selector.HtmlXPathSelector is deprecated, instantiate scrapy.Selector instead.
hxs=HtmlXPathSelector(response)
ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
titles= hxs.select('//div[#class="articalContent "]//a/text()').extract()
2015-03-09 15:46:47-0700 [people] INFO: Closing spider (finished)
Can somebody advise what is wrong?
Thanks!!
I had some success with this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request

class people(CrawlSpider):
    name = "people"
    allowed_domains = ["http://blog.sina.com.cn/"]
    start_urls = ["http://blog.sina.com.cn/s/blog_53d7b5ce0100e7y0.html"]
    rules = (Rule(SgmlLinkExtractor(allow=("http://blog.sina.com.cn/",)), callback='parse_item', follow=True),
             # restrict the crawling to the articalContent section only
             Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "articalContent")]'))),
             )

    def parse(self, response):
        links = Selector(text=response.body).xpath('//div[contains(@class, "articalContent")]//a//text()')
        for link in links:
            print link.extract()
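SgmlLinkExtractor and HtmlXPathSelector have since been deprecated. A roughly equivalent spider with the current imports might look like this (a sketch assuming a recent Scrapy release; the class name and XPaths are carried over from the question):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class people(CrawlSpider):
    name = "people"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["blog.sina.com.cn"]
    start_urls = ["http://blog.sina.com.cn/s/blog_53d7b5ce0100e7y0.html"]
    # follow only links found inside the articalContent div
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[contains(@class, "articalContent")]'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # print the anchor texts inside the content section
        for text in response.xpath('//div[contains(@class, "articalContent")]//a//text()').extract():
            print(text)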
Related
I found this example code in a textbook about web scraping. After running the spider it showed an error, and I found out that scrapy.contrib was removed in the 1.16 release of Scrapy. How should I change this so it works? I am new to web scraping, by the way.
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class ArticleSpider(CrawlSpider):
    name = 'articles'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/'
                  'Benevolent_dictator_for_life']
    rules = [Rule(LinkExtractor(allow='.*'), callback='parse_items',
                  follow=True)]

    def parse_items(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        text = response.xpath('//div[@id="mw-content-text"]//text()').extract()
        lastUpdated = response.css('li#footer-info-lastmod::text').extract_first()
        lastUpdated = lastUpdated.replace(
            'This page was last edited on ', '')
        print('URL is: {}'.format(url))
        print('title is: {}'.format(title))
        print('text is: {}'.format(text))
        print('Last updated: {}'.format(lastUpdated))
In newer versions of Scrapy you can simply import the modules as below:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# add the rest of the code
Read more from the docs.
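Applied to the textbook spider above, the fix is just those two import lines; the rest of the code can stay as it is. A sketch with the modern imports (and the newer .get() shorthand for .extract_first(), available in recent Scrapy releases):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ArticleSpider(CrawlSpider):
    name = 'articles'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/'
                  'Benevolent_dictator_for_life']
    rules = [Rule(LinkExtractor(allow='.*'), callback='parse_items', follow=True)]

    def parse_items(self, response):
        title = response.css('h1::text').get()
        last_updated = response.css('li#footer-info-lastmod::text').get()
        if last_updated:
            last_updated = last_updated.replace('This page was last edited on ', '')
        print('URL is: {}'.format(response.url))
        print('title is: {}'.format(title))
        print('Last updated: {}'.format(last_updated))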
I am trying to learn more advanced Scrapy options: working with response.meta and parsing data from followed pages. The code I have written does work; it visits all the intended pages, but it does not scrape data from all of them.
I tried changing the rules for following links inside LinkExtractor and restricting the XPaths to different areas of the website, but this does not change Scrapy's behavior. I also tried NOT using the raw-string regex (r'...') prefix, but that doesn't change anything besides Scrapy wandering off throughout the whole site.
EDIT: I think the problem lies within def category_page, where I do the next_page navigation on the category page. If I remove this function and the following of those links, Scrapy gets all the results from the page.
What I am trying to accomplish is:
Visit each category page in start_urls
Extract all defined items from the /product/view and /pref_product/view pages reached from the category page, and follow further from those to /member/view
Extract all defined items on the /member/view page
Iterate further to the next_page of each category from start_urls
Scrapy does all of those things, but misses a big part of the data!
For example, here is a sample from the log. None of these pages was scraped:
DEBUG: Crawled (200) <GET https://www.go4worldbusiness.com/product/view/275725/car-elevator.html> (referer: https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?region=worldwide&pg_suppliers=5)
DEBUG: Crawled (200) <GET https://www.go4worldbusiness.com/product/view/239895/guide-roller.html> (referer: https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?region=worldwide&pg_suppliers=5)
DEBUG: Crawled (200) <GET https://www.go4worldbusiness.com/product/view/289815/elevator.html> (referer: https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?region=worldwide&pg_suppliers=5)
Here is the code I am using:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from urlparse import urljoin
from scrapy import Selector
from go4world.items import Go4WorldItem

class ElectronicsSpider(CrawlSpider):
    name = "m17"
    allowed_domains = ["go4worldbusiness.com"]
    start_urls = [
        'https://www.go4worldbusiness.com/suppliers/furniture-interior-decoration-furnishings.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/agri-food-processing-machinery-equipment.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/alcoholic-beverages-tobacco-related-products.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/bar-accessories-and-related-products.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?pg_suppliers=1'
    ]
    rules = (
        Rule(LinkExtractor(allow=(r'/furniture-interior-decoration-furnishings.html?',
                                  r'/furniture-interior-decoration-furnishings.html?',
                                  r'/agri-food-processing-machinery-equipment.html?',
                                  r'/alcoholic-beverages-tobacco-related-products.html?',
                                  r'/bar-accessories-and-related-products.html?',
                                  r'/elevators-escalators.html?'
                                  ), restrict_xpaths=('//div[4]/div[1]/div[2]/div/div[2]/div/div/div[23]/ul'), ),
             callback="category_page",
             follow=True),
        Rule(LinkExtractor(allow=('/product/view/', '/pref_product/view/'), restrict_xpaths=('//div[4]/div[1]/..'), ),
             callback="parse_attr",
             follow=False),
        Rule(LinkExtractor(restrict_xpaths=('/div[4]/div[1]/..'), ),
             callback="category_page",
             follow=False),
    )
    BASE_URL = 'https://www.go4worldbusiness.com'

    def category_page(self, response):
        next_page = response.xpath('//div[4]/div[1]/div[2]/div/div[2]/div/div/div[23]/ul/@href').extract()
        for item in self.parse_attr(response):
            yield item
        if next_page:
            path = next_page.extract_first()
            nextpage = response.urljoin(path)
            yield scrapy.Request(nextpage, callback=category_page)

    def parse_attr(self, response):
        for resource in response.xpath('//div[4]/div[1]/..'):
            item = Go4WorldItem()
            item['NameOfProduct'] = response.xpath('//div[4]/div[1]/div[1]/div/h1/text()').extract()
            item['NameOfCompany'] = response.xpath('//div[4]/div[1]/div[2]/div[1]/span/span/a/text()').extract()
            item['Country'] = response.xpath('//div[4]/div[1]/div[3]/div/div[1]/text()').extract()
            company_page = response.urljoin(resource.xpath('//div[4]/div[1]/div[4]/div/ul/li[1]/a/@href').extract_first())
            request = scrapy.Request(company_page, callback=self.company_data)
            request.meta['item'] = item
            yield request

    def company_data(self, response):
        item = response.meta['item']
        item['CompanyTags'] = response.xpath('//div[4]/div[1]/div[6]/div/div[1]/a/text()').extract()
        item['Contact'] = response.xpath('//div[4]/div[1]/div[5]/div/address/text()').extract()
        yield item
I want Scrapy to grab data from all crawled links. I cannot understand where the error lies that stops Scrapy from scraping certain pages.
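For reference, the usual pattern for handing an item to a follow-up request via response.meta and for paginating from inside a callback looks roughly like this (a generic sketch, not the fix for the spider above; the URLs and CSS selectors are placeholders):

import scrapy


class MetaPatternSpider(scrapy.Spider):
    """Minimal illustration of the response.meta hand-off plus pagination."""
    name = "meta_pattern"
    start_urls = ["https://example.com/category?page=1"]  # placeholder URL

    def parse(self, response):
        for href in response.css("a.product::attr(href)").getall():  # placeholder selector
            item = {"product_url": response.urljoin(href)}
            # carry the partially filled item to the detail page
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_detail,
                                 meta={"item": item})

        next_page = response.css("a.next::attr(href)").get()  # placeholder selector
        if next_page:
            # note the bound method: self.parse, not a bare function name
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_detail(self, response):
        item = response.meta["item"]
        item["title"] = response.css("h1::text").get()
        yield item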
I'm new to Scrapy and have an issue logging in to a salesforce.com-based site. I use the loginform package to populate Scrapy's FormRequest. When run, it does a GET of the login page and a successful POST of the FormRequest login, as expected. But then the spider stops and no page gets scraped.
[...]
2017-06-25 14:02:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://testdomain.secure.force.com/jSites_Home> (referer: None)
2017-06-25 14:02:29 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://login.salesforce.com/> (referer: https://testdomain.secure.force.com/jSites_Home)
2017-06-25 14:02:29 [scrapy.core.engine] INFO: Closing spider (finished)
[...]
The (slightly redacted) script:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from loginform import fill_login_form
from harvest.items import HarvestItem

class TestSpider(Spider):
    name = 'test'
    allowed_domains = ['testdomain.secure.force.com', 'login.salesforce.com']
    login_url = 'https://testdomain.secure.force.com/jSites_Home'
    login_user = 'someuser'
    login_password = 'p4ssw0rd'

    def start_requests(self):
        yield scrapy.Request(self.login_url, self.parse_login)

    def parse_login(self, response):
        data, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_password)
        return scrapy.FormRequest(url, formdata=dict(data), method=method, callback=self.get_assignments)

    def get_assignments(self, response):
        assignment_selector = response.xpath('//*[@id="nav"]/ul/li/a[@title="Assignments"]/@href')
        return Request(urljoin(response.url, assignment_selector.extract()), callback=self.parse_item)

    def parse_item(self, response):
        items = HarvestItem()
        items['startdatum'] = response.xpath('(//*/table[@class="detailList"])[2]/tbody/tr[1]/td[1]/span/text()')\
            .extract()
        return items
When I check the body of the FormRequest, it looks like a legitimate POST to the page 'login.salesforce.com'. If I log in manually, I notice several redirects. However, when I force a parse by adding callback='parse' to the FormRequest, still nothing happens.
Am I right in thinking the login went OK, judging by the 200 response?
I don't see any redirects in the Scrapy output. Could it be that Scrapy doesn't handle the redirects properly, causing the script not to do any scraping?
Any ideas on getting the script to scrape the final redirected page after login?
Thanks
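One way to see what actually came back from the POST is to send it to a diagnostic callback first (a sketch to drop into the spider above; it assumes Scrapy's default RedirectMiddleware is enabled, which records any followed redirects in request.meta['redirect_urls']):

    # inside TestSpider: route the login POST to a diagnostic callback
    def parse_login(self, response):
        data, url, method = fill_login_form(response.url, response.body,
                                            self.login_user, self.login_password)
        return scrapy.FormRequest(url, formdata=dict(data), method=method,
                                  callback=self.after_login)

    def after_login(self, response):
        # an empty 'redirect_urls' means the 200 is the raw POST response,
        # not a post-login landing page reached via redirects
        self.logger.info("landed on %s, redirect chain: %s",
                         response.url,
                         response.request.meta.get('redirect_urls', []))
        # crude sanity check (assumption: a failed login serves the login form again)
        if b'login' in response.body.lower():
            self.logger.warning("response still looks like the login page")
        return self.get_assignments(response)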
I'm trying to scrape the results from certain keywords using the advanced search form of The Guardian.
from scrapy.spider import Spider
from scrapy.http import FormRequest, Request
from scrapy.selector import HtmlXPathSelector

class IndependentSpider(Spider):
    name = "IndependentSpider"
    start_urls = ["http://www.independent.co.uk/advancedsearch"]

    def parse(self, response):
        yield [FormRequest.from_response(response, formdata={"all": "Science"}, callback=self.parse_results)]

    def parse_results(self):
        hxs = HtmlXPathSelector(response)
        print hxs.select('//h3').extract()
The form redirects me to
DEBUG: Redirecting (301) to <GET http://www.independent.co.uk/ind/advancedsearch/> from <GET http://www.independent.co.uk/advancedsearch>
which is a page that doesn't seem to exist.
Do you know what I am doing wrong?
Thanks!
It seems you need a trailing /.
Try start_urls= ["http://www.independent.co.uk/advancedsearch/"]
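For reference, a version of the spider with the trailing slash in place, the request yielded directly rather than inside a list, and a parse_results signature that accepts response (a sketch that keeps the question's old-style imports; the form field name "all" is taken from the question):

from scrapy.spider import Spider
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector


class IndependentSpider(Spider):
    name = "IndependentSpider"
    start_urls = ["http://www.independent.co.uk/advancedsearch/"]  # note the trailing slash

    def parse(self, response):
        # yield the request itself, not a list of requests
        yield FormRequest.from_response(response, formdata={"all": "Science"},
                                        callback=self.parse_results)

    def parse_results(self, response):
        hxs = HtmlXPathSelector(response)
        print hxs.select('//h3').extract()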
I am trying to get a scrapy spider working, but there seems to be a problem with SgmlLinkExtractor.
Here is the signature:
SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href'), canonicalize=True, unique=True, process_value=None)
I am using the allow() option, here is my code:
start_urls = ['http://bigbangtrans.wordpress.com']
rules = [Rule(SgmlLinkExtractor(allow=[r'series-\d{1}-episode-\d{2}.']), callback='parse_item')]
A sample URL looks like http://bigbangtrans.wordpress.com/series-1-episode-11-the-pancake-batter-anomaly/
The output of scrapy crawl tbbt contains
[tbbt] DEBUG: Crawled (200) <GET http://bigbangtrans.wordpress.com/series-3-episode-17-the-precious-fragmentation/> (referer: http://bigbangtrans.wordpress.com)
The parse_item callback, however, is not called, and I cannot figure out why.
This is the whole spider code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class TbbtSpider(CrawlSpider):
    # print '\n TbbtSpider \n'
    name = 'tbbt'
    start_urls = ['http://bigbangtrans.wordpress.com']  # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(allow=[r'series-\d{1}-episode-\d{2}.']), callback='parse_item')]

    def parse_item(self, response):
        print '\n parse_blogpost \n'
        hxs = HtmlXPathSelector(response)
        item = TbbtItem()
        # Extract title
        item['title'] = hxs.select('//div[@id="post-5"]/div/p/span/text()').extract()  # XPath selector for title
        return item
Okay, the reason this code is not working is that the syntax of your rule is incorrect. I fixed the syntax without making any other changes and was able to hit the parse_item callback.
rules = (
    Rule(SgmlLinkExtractor(allow=(r'series-\d{1}-episode-\d{2}.',),
                           ),
         callback='parse_item'),
)
However, the titles were all blank, which suggests that the hxs.select statement in parse_item is incorrect. The following XPath may be more suitable (I made an educated guess about the required title, but I could be barking up the wrong tree entirely):
item['title'] = hxs.select('//h2[@class="title"]/text()').extract()
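A quick way to sanity-check an XPath like that before re-running the whole crawl is the Scrapy shell (a sketch; depending on the Scrapy version the shell exposes hxs.select or response.xpath, and the exact output depends on the page's markup):

scrapy shell "http://bigbangtrans.wordpress.com/series-1-episode-11-the-pancake-batter-anomaly/"
>>> hxs.select('//h2[@class="title"]/text()').extract()       # older Scrapy
>>> response.xpath('//h2[@class="title"]/text()').extract()   # newer Scrapy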