I'm trying to scrape the results for certain keywords using the advanced search form of The Independent.
from scrapy.spider import Spider
from scrapy.http import FormRequest, Request
from scrapy.selector import HtmlXPathSelector

class IndependentSpider(Spider):
    name = "IndependentSpider"
    start_urls = ["http://www.independent.co.uk/advancedsearch"]

    def parse(self, response):
        yield FormRequest.from_response(response, formdata={"all": "Science"}, callback=self.parse_results)

    def parse_results(self, response):
        hxs = HtmlXPathSelector(response)
        print hxs.select('//h3').extract()
The form redirects me to
DEBUG: Redirecting (301) to <GET http://www.independent.co.uk/ind/advancedsearch/> from <GET http://www.independent.co.uk/advancedsearch>
which is a page that doesn't seem to exist.
Do you know what I am doing wrong?
Thanks!
It seems you need a trailing /.
Try start_urls= ["http://www.independent.co.uk/advancedsearch/"]
I am trying to learn more advanced Scrapy options, working with response.meta and parsing data from a followed page. The code I have written does work: it visits all the intended pages, but it does not scrape data from all of them.
I tried changing the rules for following links inside the LinkExtractor and restricting the XPaths to different areas of the website, but this does not change Scrapy's behaviour. I also tried NOT using the regex 'r/', but this doesn't change anything except that Scrapy wanders off through the whole page.
EDIT: I think the problem lies within def category_page, where I do the next_page navigation on the category page. If I remove this function and the following of those links, Scrapy gets all the results from the page.
What I am trying to accomplish is:
Visit the category pages in start_urls
Extract all defined items from the /product/view and /pref_product/view pages linked from the category page, then follow from those to /member/view
Extract all defined items on the /member/view page
Iterate on to the next_page of the category from start_urls
Scrapy does all of those things, but misses a big part of the data!
For example, here is a sample from the log; none of these pages were scraped:
DEBUG: Crawled (200) <GET https://www.go4worldbusiness.com/product/view/275725/car-elevator.html> (referer: https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?region=worldwide&pg_suppliers=5)
DEBUG: Crawled (200) <GET https://www.go4worldbusiness.com/product/view/239895/guide-roller.html> (referer: https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?region=worldwide&pg_suppliers=5)
DEBUG: Crawled (200) <GET https://www.go4worldbusiness.com/product/view/289815/elevator.html> (referer: https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?region=worldwide&pg_suppliers=5)
Here is the code I am using:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from urlparse import urljoin
from scrapy import Selector
from go4world.items import Go4WorldItem

class ElectronicsSpider(CrawlSpider):
    name = "m17"
    allowed_domains = ["go4worldbusiness.com"]
    start_urls = [
        'https://www.go4worldbusiness.com/suppliers/furniture-interior-decoration-furnishings.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/agri-food-processing-machinery-equipment.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/alcoholic-beverages-tobacco-related-products.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/bar-accessories-and-related-products.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?pg_suppliers=1'
    ]
    rules = (
        Rule(LinkExtractor(allow=(r'/furniture-interior-decoration-furnishings.html?',
                                  r'/furniture-interior-decoration-furnishings.html?',
                                  r'/agri-food-processing-machinery-equipment.html?',
                                  r'/alcoholic-beverages-tobacco-related-products.html?',
                                  r'/bar-accessories-and-related-products.html?',
                                  r'/elevators-escalators.html?'
                                  ), restrict_xpaths=('//div[4]/div[1]/div[2]/div/div[2]/div/div/div[23]/ul'), ),
             callback="category_page",
             follow=True),
        Rule(LinkExtractor(allow=('/product/view/', '/pref_product/view/'), restrict_xpaths=('//div[4]/div[1]/..'), ),
             callback="parse_attr",
             follow=False),
        Rule(LinkExtractor(restrict_xpaths=('/div[4]/div[1]/..'), ),
             callback="category_page",
             follow=False),
    )
    BASE_URL = 'https://www.go4worldbusiness.com'

    def category_page(self, response):
        next_page = response.xpath('//div[4]/div[1]/div[2]/div/div[2]/div/div/div[23]/ul/@href').extract()
        for item in self.parse_attr(response):
            yield item
        if next_page:
            path = next_page.extract_first()
            nextpage = response.urljoin(path)
            yield scrapy.Request(nextpage, callback=category_page)

    def parse_attr(self, response):
        for resource in response.xpath('//div[4]/div[1]/..'):
            item = Go4WorldItem()
            item['NameOfProduct'] = response.xpath('//div[4]/div[1]/div[1]/div/h1/text()').extract()
            item['NameOfCompany'] = response.xpath('//div[4]/div[1]/div[2]/div[1]/span/span/a/text()').extract()
            item['Country'] = response.xpath('//div[4]/div[1]/div[3]/div/div[1]/text()').extract()
            company_page = response.urljoin(resource.xpath('//div[4]/div[1]/div[4]/div/ul/li[1]/a/@href').extract_first())
            request = scrapy.Request(company_page, callback=self.company_data)
            request.meta['item'] = item
            yield request

    def company_data(self, response):
        item = response.meta['item']
        item['CompanyTags'] = response.xpath('//div[4]/div[1]/div[6]/div/div[1]/a/text()').extract()
        item['Contact'] = response.xpath('//div[4]/div[1]/div[5]/div/address/text()').extract()
        yield item
I want Scrapy to grab data from all crawled links. I cannot understand where the error lies that stops Scrapy from scraping certain pages.
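Two details in category_page stand out and may account for at least part of the missing data: next_page is already a plain list of strings after .extract(), so the later next_page.extract_first() call raises an AttributeError inside the callback, and callback=category_page refers to an undefined name rather than self.category_page. Below is a sketch of the pagination part with those two issues addressed; the long positional XPath is kept from the code above, except that @href directly on the ul only matches if the ul element itself carries an href attribute, so the sketch assumes the links live in the a tags inside it.

    # inside ElectronicsSpider
    def category_page(self, response):
        # keep the SelectorList so extract_first() is available
        next_page = response.xpath(
            '//div[4]/div[1]/div[2]/div/div[2]/div/div/div[23]/ul//a/@href')
        for item in self.parse_attr(response):
            yield item
        if next_page:
            nextpage = response.urljoin(next_page.extract_first())
            # reference the callback through self
            yield scrapy.Request(nextpage, callback=self.category_page)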
I'm new to Scrapy and have an issue logging in to a salesforce.com-based site. I use the loginform package to populate Scrapy's FormRequest. When run, it does a GET of the login page and a successful POST of the FormRequest login, as expected. But then the spider stops; no page gets scraped.
[...]
2017-06-25 14:02:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://testdomain.secure.force.com/jSites_Home> (referer: None)
2017-06-25 14:02:29 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://login.salesforce.com/> (referer: https://testdomain.secure.force.com/jSites_Home)
2017-06-25 14:02:29 [scrapy.core.engine] INFO: Closing spider (finished)
[...]
The (slightly redacted) script:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule, Spider
from scrapy.http import Request
from urlparse import urljoin  # urllib.parse on Python 3
from loginform import fill_login_form
from harvest.items import HarvestItem

class TestSpider(Spider):
    name = 'test'
    allowed_domains = ['testdomain.secure.force.com', 'login.salesforce.com']
    login_url = 'https://testdomain.secure.force.com/jSites_Home'
    login_user = 'someuser'
    login_password = 'p4ssw0rd'

    def start_requests(self):
        yield scrapy.Request(self.login_url, self.parse_login)

    def parse_login(self, response):
        data, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_password)
        return scrapy.FormRequest(url, formdata=dict(data), method=method, callback=self.get_assignments)

    def get_assignments(self, response):
        assignment_selector = response.xpath('//*[@id="nav"]/ul/li/a[@title="Assignments"]/@href')
        return Request(urljoin(response.url, assignment_selector.extract()), callback=self.parse_item)

    def parse_item(self, response):
        items = HarvestItem()
        items['startdatum'] = response.xpath('(//*/table[@class="detailList"])[2]/tbody/tr[1]/td[1]/span/text()')\
            .extract()
        return items
When I check the body of the FormRequest, it looks like a legitimate POST to 'login.salesforce.com'. If I log in manually, I notice several redirects. However, even when I force a parse by adding callback='parse' to the FormRequest, nothing happens.
Am I right in thinking the login went OK, looking at the 200 response?
I don't see any redirects in the scrapy output. Could it be that scrapy doesn't handle the redirects properly, causing the script to not do any scraping?
Any ideas on getting the script to scrape the final redirected page after login?
Thanks
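One thing worth ruling out before blaming the redirects: assignment_selector.extract() returns a list, while urljoin() expects a single string, so if get_assignments is reached it raises a TypeError and the crawl ends without scheduling any further request, which would also look like a spider that simply finishes after the POST. Here is a debugging sketch for that callback; open_in_browser and self.logger are standard Scrapy utilities, and the XPath is the one from the script above.

from scrapy.utils.response import open_in_browser

    # inside TestSpider
    def get_assignments(self, response):
        # log where the login POST actually landed and dump the received HTML
        # into a local browser tab; if it still shows the login form, the
        # session was never established
        self.logger.info("after login POST: %s (status %s)", response.url, response.status)
        open_in_browser(response)

        # extract a single href (a string) instead of a list
        href = response.xpath('//*[@id="nav"]/ul/li/a[@title="Assignments"]/@href').extract_first()
        if href:
            return scrapy.Request(response.urljoin(href), callback=self.parse_item)
        self.logger.warning("Assignments link not found; the login may not have succeeded")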
I'm trying to parse a forum with this rule:
rules = (Rule(LinkExtractor(allow=(r'page-\d+$')), callback='parse_item', follow=True),)
I've tried several approaches, with/without r at the beginning, with/without $ at the end of the pattern, etc., but every time Scrapy produces links ending with an equals sign, even though there is no = in the links on the page or in the pattern.
Here is an example of the extracted links (I also use parse_start_url, so the start URL is here too; and yes, I've tried deleting it, it doesn't help):
[<GET http://www.example.com/index.php?threads/topic.0000/>,
<GET http://www.example.com/index.php?threads%2Ftopic.0000%2Fpage-2=>,
<GET http://www.example.com/index.php?threads%2Ftopic.0000%2Fpage-3=>]
If I open these links in a browser or fetch them in the scrapy shell, I get wrong pages with nothing to parse, but deleting the equals signs solves the problem.
So why is it happening and how can I handle it?
EDIT 1 (additional info):
Scrapy 1.0.3;
Other CrawlSpiders are fine.
EDIT 2:
Spider's code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request

class BmwclubSpider(CrawlSpider):
    name = "bmwclub"
    allowed_domains = ["www.bmwclub.ru"]
    start_urls = []
    start_url_objects = []

    rules = (Rule(LinkExtractor(allow=(r'page-\d+$')), callback='parse_item'),)

    def parse_start_url(self, response):
        return Request(url=response.url, callback=self.parse_item, meta={'site_url': response.url})

    def parse_item(self, response):
        return []
Command to collect links:
scrapy parse http://www.bmwclub.ru/index.php?threads/bamper-novyj-x6-torg-umesten-150000rub.1051898/ --noitems --spider bmwclub
Output of the command:
>>> STATUS DEPTH LEVEL 1 <<<
# Requests -----------------------------------------------------------------
[<GET http://www.bmwclub.ru/index.php?threads/bamper-novyj-x6-torg-umesten-150000rub.1051898/>,
<GET http://www.bmwclub.ru/index.php?threads%2Fbamper-novyj-x6-torg-umesten-150000rub.1051898%2Fpage-2=>,
<GET http://www.bmwclub.ru/index.php?threads%2Fbamper-novyj-x6-torg-umesten-150000rub.1051898%2Fpage-3=>]
This is because of canonicalization issues: by default the link extractor canonicalizes the URLs it extracts, and that is what is mangling these thread links.
You can disable it on the LinkExtractor like this:
rules = (
    Rule(LinkExtractor(allow=(r'page-\d+$',), canonicalize=False), callback='parse_item'),
)
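To see where the = comes from, the same canonicalization can be reproduced with w3lib, the library Scrapy's link extractors use for this. The thread URL's query string contains no key=value pair, so canonicalization re-encodes it as a single blank-valued parameter: the slashes get percent-encoded and a trailing = is appended.

>>> from w3lib.url import canonicalize_url
>>> canonicalize_url('http://www.bmwclub.ru/index.php?threads/bamper-novyj-x6-torg-umesten-150000rub.1051898/page-2')
'http://www.bmwclub.ru/index.php?threads%2Fbamper-novyj-x6-torg-umesten-150000rub.1051898%2Fpage-2='

With canonicalize=False the extracted links keep the form they have in the page markup.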
I am very new to Python. I am trying to print (and save) all blog posts from a website using Scrapy. I want the spider to crawl only the main content section. This is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from people.items import PeopleCommentItem

class people(CrawlSpider):
    name = "people"
    allowed_domains = ["http://blog.sina.com.cn/"]
    start_urls = ["http://blog.sina.com.cn/s/blog_53d7b5ce0100e7y0.html"]
    rules = [Rule(SgmlLinkExtractor(allow=("http://blog.sina.com.cn/",)), callback='parse_item', follow=True),
             # restrict the crawling to the articalContent section only
             Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="articalContent "]//a/@href')))
             ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        print hxs.select('//div[@class="articalContent "]//a/text()').extract()
Nothing is printed after:
DEBUG: Crawled (200) <GET http://blog.sina.com.cn/s/blog_53d7b5ce0100e7y0.html> (referer: None)
ScrapyDeprecationWarning: scrapy.selector.HtmlXPathSelector is deprecated, instantiate scrapy.Selector instead.
hxs=HtmlXPathSelector(response)
ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
titles = hxs.select('//div[@class="articalContent "]//a/text()').extract()
2015-03-09 15:46:47-0700 [people] INFO: Closing spider (finished)
Can somebody advise what is wrong?
Thanks!!
I had some success with this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request

class people(CrawlSpider):
    name = "people"
    allowed_domains = ["http://blog.sina.com.cn/"]
    start_urls = ["http://blog.sina.com.cn/s/blog_53d7b5ce0100e7y0.html"]
    rules = (Rule(SgmlLinkExtractor(allow=("http://blog.sina.com.cn/",)), callback='parse_item', follow=True),
             # restrict the crawling to the articalContent section only
             Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "articalContent")]'))),
             )

    def parse(self, response):
        links = Selector(text=response.body).xpath('//div[contains(@class, "articalContent")]//a//text()')
        for link in links:
            print link.extract()
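Since the log above already warns that HtmlXPathSelector and .select() are deprecated, the same idea can also be written against the newer API. This is only a sketch: the class name, URL and XPaths are taken from the code above, and parse_item is used so that CrawlSpider's own parse method is not overridden.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class people(CrawlSpider):
    name = "people"
    allowed_domains = ["blog.sina.com.cn"]  # domain only, without the scheme
    start_urls = ["http://blog.sina.com.cn/s/blog_53d7b5ce0100e7y0.html"]
    # follow only links found inside the article content block
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[contains(@class, "articalContent")]'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # response.xpath() replaces HtmlXPathSelector(...).select()
        for text in response.xpath('//div[contains(@class, "articalContent")]//a/text()').extract():
            print text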
I'm quite unsure, given the information available, which class I should be inheriting from for a crawling spider.
My example below attempts to start with an authentication page and then crawl all logged-in pages. As per the console output posted, it authenticates fine, but it cannot output even the first page to JSON and halts after the first 200-status page.
I get this (a newline followed by a left square bracket):
JSON file
[
Console output
DEBUG: Crawled (200) <GET https://www.mydomain.com/users/sign_in> (referer: None)
DEBUG: Redirecting (302) to <GET https://www.mydomain.com/> from <POST https://www.mydomain.com/users/sign_in>
DEBUG: Crawled (200) <GET https://www.mydomain.com/> (referer: https://www.mydomain.com/users/sign_in)
DEBUG: am logged in
INFO: Closing spider (finished)
When running this:
scrapy crawl MY_crawler -o items.json
Using this spider:
import scrapy
from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.spiders import Rule
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors import LinkExtractor
from cmrcrawler.items import MycrawlerItem

class MyCrawlerSpider(InitSpider):
    name = "MY_crawler"
    allowed_domains = ["mydomain.com"]
    login_page = 'https://www.mydomain.com/users/sign_in'
    start_urls = [
        "https://www.mydomain.com/",
    ]
    rules = (
        # requires trailing comma to force iterable vs tuple
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def init_request(self):
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        auth_token = response.xpath('authxpath').extract()[0]
        return FormRequest.from_response(
            response,
            formdata={'user[email]': '***', 'user[password]': '***', 'authenticity_token': auth_token},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "Signed in successfully" in response.body:
            self.log("am logged in")
            self.initialized()
        else:
            self.log("couldn't login")
            print response.body

    def parse_item(self, response):
        item = MycrawlerItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()[0]
        yield item