Scrapy doesn't retrieve data into items from some pages - Python

I am trying to learn more advanced Scrapy features, working with response.meta and parsing data from a followed page. The code does run: it visits all the intended pages, but it does not scrape data from all of them.
I tried changing the link-following rules inside the LinkExtractor and restricting the XPaths to different areas of the website, but this does not change Scrapy's behavior. I also tried NOT using the r'...' raw-string prefix on the patterns, but that changes nothing except that Scrapy wanders off across the whole site.
EDIT: I think the problem lies within def category_page, where I do the next_page navigation on the category page. If I remove this function and the following of those links, Scrapy gets all the results from the page.
What I am trying to accomplish is:
Visit each category page in start_urls
Extract all defined items from the /product/view and /pref_product/view pages linked from the category page, then follow from those to /member/view
Extract all defined items on the /member/view page
Iterate further to the next_page of the category from start_urls
Scrapy does all of those things, but misses a big part of the data!
For example, here is a sample of the log. None of these pages were scraped:
DEBUG: Crawled (200) <GET https://www.go4worldbusiness.com/product/view/275725/car-elevator.html> (referer: https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?region=worldwide&pg_suppliers=5)
DEBUG: Crawled (200) <GET https://www.go4worldbusiness.com/product/view/239895/guide-roller.html> (referer: https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?region=worldwide&pg_suppliers=5)
DEBUG: Crawled (200) <GET https://www.go4worldbusiness.com/product/view/289815/elevator.html> (referer: https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?region=worldwide&pg_suppliers=5)
Here is the code I am using:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from urlparse import urljoin
from scrapy import Selector
from go4world.items import Go4WorldItem

class ElectronicsSpider(CrawlSpider):
    name = "m17"
    allowed_domains = ["go4worldbusiness.com"]
    start_urls = [
        'https://www.go4worldbusiness.com/suppliers/furniture-interior-decoration-furnishings.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/agri-food-processing-machinery-equipment.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/alcoholic-beverages-tobacco-related-products.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/bar-accessories-and-related-products.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?pg_suppliers=1'
    ]
    rules = (
        Rule(LinkExtractor(allow=(r'/furniture-interior-decoration-furnishings.html?',
                                  r'/furniture-interior-decoration-furnishings.html?',
                                  r'/agri-food-processing-machinery-equipment.html?',
                                  r'/alcoholic-beverages-tobacco-related-products.html?',
                                  r'/bar-accessories-and-related-products.html?',
                                  r'/elevators-escalators.html?'
                                  ), restrict_xpaths=('//div[4]/div[1]/div[2]/div/div[2]/div/div/div[23]/ul'), ),
             callback="category_page",
             follow=True),
        Rule(LinkExtractor(allow=('/product/view/', '/pref_product/view/'), restrict_xpaths=('//div[4]/div[1]/..'), ),
             callback="parse_attr",
             follow=False),
        Rule(LinkExtractor(restrict_xpaths=('/div[4]/div[1]/..'), ),
             callback="category_page",
             follow=False),
    )
    BASE_URL = 'https://www.go4worldbusiness.com'

    def category_page(self, response):
        next_page = response.xpath('//div[4]/div[1]/div[2]/div/div[2]/div/div/div[23]/ul/@href').extract()
        for item in self.parse_attr(response):
            yield item
        if next_page:
            path = next_page.extract_first()
            nextpage = response.urljoin(path)
            yield scrapy.Request(nextpage, callback=category_page)

    def parse_attr(self, response):
        for resource in response.xpath('//div[4]/div[1]/..'):
            item = Go4WorldItem()
            item['NameOfProduct'] = response.xpath('//div[4]/div[1]/div[1]/div/h1/text()').extract()
            item['NameOfCompany'] = response.xpath('//div[4]/div[1]/div[2]/div[1]/span/span/a/text()').extract()
            item['Country'] = response.xpath('//div[4]/div[1]/div[3]/div/div[1]/text()').extract()
            company_page = response.urljoin(resource.xpath('//div[4]/div[1]/div[4]/div/ul/li[1]/a/@href').extract_first())
            request = scrapy.Request(company_page, callback=self.company_data)
            request.meta['item'] = item
            yield request

    def company_data(self, response):
        item = response.meta['item']
        item['CompanyTags'] = response.xpath('//div[4]/div[1]/div[6]/div/div[1]/a/text()').extract()
        item['Contact'] = response.xpath('//div[4]/div[1]/div[5]/div/address/text()').extract()
        yield item
I want Scrapy to grab data from all crawled links. I cannot work out where the error is that stops Scrapy from scraping certain pages.
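For comparison, here is a minimal sketch of the next-page handling described in the EDIT above. It is only an illustration, not the original spider: it assumes the pager links sit under the same //div[...]/ul container and are reached via a/@href, and it relies on two details of the Selector API - .extract() already returns a plain list of strings (so .extract_first() has to be called on the selector, not on that list), and a bound method used as a callback needs the self. prefix.

    import scrapy

    def category_page(self, response):
        # scrape the products listed on this category page first
        for item in self.parse_attr(response):
            yield item
        # hypothetical pager XPath: the @href of a link inside the pager <ul>
        next_page = response.xpath(
            '//div[4]/div[1]/div[2]/div/div[2]/div/div/div[23]/ul//a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page),
                                 callback=self.category_page)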

Related

Scrapy doesn't do any scraping after login on salesforce.com based site

I'm new to Scrapy and have an issue logging in to a salesforce.com-based site. I use the loginform package to populate Scrapy's FormRequest. When run, it does a GET of the login page and a successful POST of the FormRequest login, as expected. But then the spider stops and no page gets scraped.
[...]
2017-06-25 14:02:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://testdomain.secure.force.com/jSites_Home> (referer: None)
2017-06-25 14:02:29 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://login.salesforce.com/> (referer: https://testdomain.secure.force.com/jSites_Home)
2017-06-25 14:02:29 [scrapy.core.engine] INFO: Closing spider (finished)
[...]
The (slightly redacted) script:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from loginform import fill_login_form
from harvest.items import HarvestItem

class TestSpider(Spider):
    name = 'test'
    allowed_domains = ['testdomain.secure.force.com', 'login.salesforce.com']
    login_url = 'https://testdomain.secure.force.com/jSites_Home'
    login_user = 'someuser'
    login_password = 'p4ssw0rd'

    def start_requests(self):
        yield scrapy.Request(self.login_url, self.parse_login)

    def parse_login(self, response):
        data, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_password)
        return scrapy.FormRequest(url, formdata=dict(data), method=method, callback=self.get_assignments)

    def get_assignments(self, response):
        assignment_selector = response.xpath('//*[@id="nav"]/ul/li/a[@title="Assignments"]/@href')
        return Request(urljoin(response.url, assignment_selector.extract()), callback=self.parse_item)

    def parse_item(self, response):
        items = HarvestItem()
        items['startdatum'] = response.xpath('(//*/table[@class="detailList"])[2]/tbody/tr[1]/td[1]/span/text()')\
            .extract()
        return items
When I check the body of the FormRequest, it looks like a legitimate POST to the page 'login.salesforce.com'. If I log in manually, I notice several redirects. However, when I force a parse by adding callback='parse' to the FormRequest, still nothing happens.
Am I right in thinking the login went OK, looking at the 200 response?
I don't see any redirects in the Scrapy output. Could it be that Scrapy doesn't handle the redirects properly, causing the script not to do any scraping?
Any ideas on getting the script to scrape the final redirected page after login?
Thanks
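One way to check what the POST actually returned (a debugging sketch, not part of the original script; it assumes Scrapy 1.x, where spiders expose self.logger and RedirectMiddleware records any followed redirects under the redirect_urls meta key) is to log the landing URL, status and redirect chain inside the callback:

    def get_assignments(self, response):
        # where did the login POST actually land, and through which redirects?
        self.logger.info("landed on %s (status %s)", response.url, response.status)
        self.logger.info("redirect chain: %s", response.meta.get('redirect_urls', []))
        # if this still logs login.salesforce.com, the session was not established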

scrapy link extractor adds equal signs to the end of links

I'm trying to parse a forum with this rule:
rules = (Rule(LinkExtractor(allow=(r'page-\d+$')), callback='parse_item', follow=True),)
I've tried several approaches with/without r at the beginning and with/without $ at the end of the pattern, etc., but every time Scrapy produces links ending with an equals sign, even though there is no = in the links on the page or in the pattern.
Here is an example of the extracted links (I also use parse_start_url, so the start URL is here too - and yes, I've tried deleting it; it doesn't help):
[<GET http://www.example.com/index.php?threads/topic.0000/>,
<GET http://www.example.com/index.php?threads%2Ftopic.0000%2Fpage-2=>,
<GET http://www.example.com/index.php?threads%2Ftopic.0000%2Fpage-3=>]
If I open these links in a browser or fetch them in the Scrapy shell, I get wrong pages with nothing to parse, but deleting the equals signs solves the problem.
So why is it happening and how can I handle it?
EDIT 1 (additional info):
Scrapy 1.0.3;
Other CrawlSpiders are fine.
EDIT 2:
Spider's code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request

class BmwclubSpider(CrawlSpider):
    name = "bmwclub"
    allowed_domains = ["www.bmwclub.ru"]
    start_urls = []
    start_url_objects = []

    rules = (Rule(LinkExtractor(allow=(r'page-\d+$')), callback='parse_item'),)

    def parse_start_url(self, response):
        return Request(url=response.url, callback=self.parse_item, meta={'site_url': response.url})

    def parse_item(self, response):
        return []
Command to collect links:
scrapy parse http://www.bmwclub.ru/index.php?threads/bamper-novyj-x6-torg-umesten-150000rub.1051898/ --noitems --spider bmwclub
Output of the command:
>>> STATUS DEPTH LEVEL 1 <<<
# Requests -----------------------------------------------------------------
[<GET http://www.bmwclub.ru/index.php?threads/bamper-novyj-x6-torg-umesten-150000rub.1051898/>,
<GET http://www.bmwclub.ru/index.php?threads%2Fbamper-novyj-x6-torg-umesten-150000rub.1051898%2Fpage-2=>,
<GET http://www.bmwclub.ru/index.php?threads%2Fbamper-novyj-x6-torg-umesten-150000rub.1051898%2Fpage-3=>]
This is because of canonicalization issues.
You can disable it on the LinkExtractor like this:
rules = (
    Rule(LinkExtractor(allow=(r'page-\d+$',), canonicalize=False), callback='parse_item'),
)
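The equals sign itself comes from how these forum URLs are shaped: everything after the ? is a path-like string with no key=value pairs, so canonicalization treats it as a parameter name with an empty value and appends =. A quick illustration, assuming w3lib (the library Scrapy delegates URL canonicalization to) is available:

    from w3lib.url import canonicalize_url

    # a query string with no '=' gets re-encoded as "<name>=" with an empty value
    print(canonicalize_url('http://www.example.com/index.php?threads/topic.0000/page-2'))
    # expected output, matching the links above:
    # http://www.example.com/index.php?threads%2Ftopic.0000%2Fpage-2=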

InitSpider not crawling or capturing data

Given the information available, I am quite unsure which class I should be inheriting from for a crawling spider.
My example below attempts to start with an authentication page and then crawl all logged-in pages. As per the console output posted below, it authenticates fine, but it cannot output even the first page to JSON and halts after the first 200-status page.
All I get is this (a newline followed by a left square bracket):
JSON file
[
Console output
DEBUG: Crawled (200) <GET https://www.mydomain.com/users/sign_in> (referer: None)
DEBUG: Redirecting (302) to <GET https://www.mydomain.com/> from <POST https://www.mydomain.com/users/sign_in>
DEBUG: Crawled (200) <GET https://www.mydomain.com/> (referer: https://www.mydomain.com/users/sign_in)
DEBUG: am logged in
INFO: Closing spider (finished)
When running this:
scrapy crawl MY_crawler -o items.json
Using this spider:
import scrapy
from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.spiders import Rule
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors import LinkExtractor
from cmrcrawler.items import MycrawlerItem

class MyCrawlerSpider(InitSpider):
    name = "MY_crawler"
    allowed_domains = ["mydomain.com"]
    login_page = 'https://www.mydomain.com/users/sign_in'
    start_urls = [
        "https://www.mydomain.com/",
    ]
    rules = (
        # requires trailing comma to force iterable vs tuple
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def init_request(self):
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        auth_token = response.xpath('authxpath').extract()[0]
        return FormRequest.from_response(
            response,
            formdata={'user[email]': '***', 'user[password]': '***', 'authenticity_token': auth_token},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "Signed in successfully" in response.body:
            self.log("am logged in")
            self.initialized()
        else:
            self.log("couldn't login")
            print response.body

    def parse_item(self, response):
        item = MycrawlerItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()[0]
        yield item
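Two details of the InitSpider machinery are worth sketching here (based on how scrapy.contrib.spiders.init.InitSpider is implemented; treat this as a sketch rather than a verified fix). First, initialized() returns the deferred start_urls requests, so they are only scheduled if the login callback returns (or yields) its result. Second, the rules attribute is interpreted by CrawlSpider, so a plain InitSpider will not act on it.

    def check_login_response(self, response):
        if "Signed in successfully" in response.body:
            self.log("am logged in")
            # initialized() hands back the queued start_urls requests;
            # they are only crawled if this callback returns them
            return self.initialized()
        else:
            self.log("couldn't login")
            print response.body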

Scrapy search form follows a nonexistent page

I'm trying to scrape the results from certain keywords using the advanced search form of The Guardian.
from scrapy.spider import Spider
from scrapy.http import FormRequest, Request
from scrapy.selector import HtmlXPathSelector

class IndependentSpider(Spider):
    name = "IndependentSpider"
    start_urls = ["http://www.independent.co.uk/advancedsearch"]

    def parse(self, response):
        yield [FormRequest.from_response(response, formdata={"all": "Science"}, callback=self.parse_results)]

    def parse_results(self):
        hxs = HtmlXPathSelector(response)
        print hxs.select('//h3').extract()
The form redirects me to
DEBUG: Redirecting (301) to <GET http://www.independent.co.uk/ind/advancedsearch/> from <GET http://www.independent.co.uk/advancedsearch>
which is a page that doesn't seem to exist.
Do you know what I am doing wrong?
Thanks!
It seems you need a trailing /.
Try start_urls= ["http://www.independent.co.uk/advancedsearch/"]
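Putting that together, a minimal sketch of the spider with the trailing slash applied (it also assumes parse_results takes the response argument and that the FormRequest is yielded on its own rather than wrapped in a list, since Scrapy expects callbacks to yield Request or Item objects):

    from scrapy.spider import Spider
    from scrapy.http import FormRequest
    from scrapy.selector import HtmlXPathSelector

    class IndependentSpider(Spider):
        name = "IndependentSpider"
        start_urls = ["http://www.independent.co.uk/advancedsearch/"]

        def parse(self, response):
            # yield the request itself, not a list containing it
            yield FormRequest.from_response(response, formdata={"all": "Science"},
                                            callback=self.parse_results)

        def parse_results(self, response):
            # the callback receives the response as its argument
            hxs = HtmlXPathSelector(response)
            print hxs.select('//h3').extract()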

Scrapy is not downloading the tags from the responses

I am presently working with Scrapy; below is my spider.py code:
class ExampleSpider(BaseSpider):
    name = "example"
    allowed_domains = {"careers-preftherapy.icims.com"}
    start_urls = [
        "https://careers-preftherapy.icims.com/jobs/search"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        pageCount = hxs.select('//td[@class = "iCIMS_JobsTablePaging"]/table/tr/td[2]/text()').extract()[0].rstrip().lstrip()[-2:].strip()
        for i in range(1, int(pageCount) + 1):
            yield Request("https://careers-preftherapy.icims.com/jobs/search?pr=%d" % i, callback=self.parsePage)

    def parsePage(self, response):
        hxs = HtmlXPathSelector(response)
        urls_list_odd_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"]/a/@href').extract()
        print urls_list_odd_id, ">>>>>>>odddddd>>>>>>>>>>>>>>>>"
        urls_list_even_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"]/a/@href').extract()
        print urls_list_odd_id, ">>>>>>>Evennnn>>>>>>>>>>>>>>>>"
        urls_list = []
        urls_list.extend(urls_list_odd_id)
        urls_list.extend(urls_list_even_id)
        for i in urls_list:
            yield Request(i.encode('utf-8'), callback=self.parseJob)

    def parseJob(self, response):
        pass
After opening the page, I am handling pagination like this:
https://careers-preftherapy.icims.com/jobs/search?pr=1
https://careers-preftherapy.icims.com/jobs/search?pr=2
...and so on.
I yield a request for each URL (suppose there are 6 pages here). When Scrapy reaches the 1st URL
(https://careers-preftherapy.icims.com/jobs/search?pr=1)
I try to collect all the href attributes from it, and when it reaches the second URL it does the same, collecting all the href attributes.
As you can see in my code, there are 20 href attributes in total on each page: 10 of them are under td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"]
and the remaining ones are under td[@class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"].
The problem is that Scrapy sometimes downloads the tags and sometimes doesn't, and I don't know what is happening. I mean, when I run the spider one time it downloads them, and another time it returns an empty list, like below.
1st time run:
2012-07-17 17:05:20+0530 [Preferredtherapy] DEBUG: Crawled (200) <GET https://careers-preftherapy.icims.com/jobs/search?pr=2> (referer: https://careers-preftherapy.icims.com/jobs/search)
[] >>>>>>>odddddd>>>>>>>>>>>>>>>>
[] >>>>>>>Evennnn>>>>>>>>>>>>>>>>
2nd time run:
2012-07-17 17:05:20+0530 [Preferredtherapy] DEBUG: Crawled (200) <GET https://careers-preftherapy.icims.com/jobs/search?pr=2> (referer: https://careers-preftherapy.icims.com/jobs/search)
[u'https://careers-preftherapy.icims.com/jobs/1836/job', u'https://careers-preftherapy.icims.com/jobs/1813/job', u'https://careers-preftherapy.icims.com/jobs/1763/job']>>>>>>>odddddd>>>>>>>>>>>>>>>>
[preftherapy.icims.com/jobs/1811/job', u'https://careers-preftherapy.icims.com/jobs/1787/job']>>>>>>>Evennnn>>>>>>>>>>>>>>>>
My question is why it sometimes downloads them and sometimes not. Please reply; it would really help me.
Thanks in advance.
The problem is that Scrapy sometimes downloads the tags and
sometimes doesn't, and I don't know what is happening
To understand what is happening, you should debug. My guess is that your XPath query returns an empty list because you got an unexpected page.
Do something like:
def parsePage(self, response):
    hxs = HtmlXPathSelector(response)
    urls_list_odd_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"]/a/@href').extract()
    print urls_list_odd_id, ">>>>>>>odddddd>>>>>>>>>>>>>>>>"
    urls_list_even_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"]/a/@href').extract()
    print urls_list_even_id, ">>>>>>>Evennnn>>>>>>>>>>>>>>>>"
    if not urls_list_odd_id or not urls_list_even_id:
        # drop into an interactive shell whenever either list comes back empty
        from scrapy.shell import inspect_response
        inspect_response(response)
    urls_list = []
    urls_list.extend(urls_list_odd_id)
    urls_list.extend(urls_list_even_id)
    for i in urls_list:
        yield Request(i.encode('utf-8'), callback=self.parseJob)
When you get to the shell, type view(response) to view the downloaded page in a browser (for example Firefox), and you will be able to test your XPath queries and find out why they return nothing.
Here is more info about scrapy shell.
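Roughly what that session looks like once inspect_response drops you into the shell (a sketch; hxs was the selector shortcut the shell exposed in Scrapy versions of that era):

    >>> view(response)   # opens the fetched page in your default browser
    >>> response.url     # confirm which page was actually downloaded
    >>> hxs.select('//table[@class="iCIMS_JobsTable"]').extract()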
You can use open_in_browser() to open the response in the browser:
def parsePage(self, response):
    from scrapy.utils.response import open_in_browser
    open_in_browser(response)
