I'm not sure, from the information available, which class I should inherit from for a crawling spider.
My example below tries to start at an authentication page and then crawl all logged-in pages. As the console output below shows, it authenticates fine, but it never writes even the first page to JSON and halts after the first 200 response.
All I get is this (a newline followed by a left square bracket):
JSON file
[
Console output
DEBUG: Crawled (200) <GET https://www.mydomain.com/users/sign_in> (referer: None)
DEBUG: Redirecting (302) to <GET https://www.mydomain.com/> from <POST https://www.mydomain.com/users/sign_in>
DEBUG: Crawled (200) <GET https://www.mydomain.com/> (referer: https://www.mydomain.com/users/sign_in)
DEBUG: am logged in
INFO: Closing spider (finished)
When running this:
scrapy crawl MY_crawler -o items.json
Using this spider:
import scrapy
from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.spiders import Rule
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors import LinkExtractor
from cmrcrawler.items import MycrawlerItem

class MyCrawlerSpider(InitSpider):
    name = "MY_crawler"
    allowed_domains = ["mydomain.com"]
    login_page = 'https://www.mydomain.com/users/sign_in'
    start_urls = [
        "https://www.mydomain.com/",
    ]
    rules = (
        # requires trailing comma so this is a one-element tuple, not just parentheses
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def init_request(self):
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        auth_token = response.xpath('authxpath').extract()[0]
        return FormRequest.from_response(
            response,
            formdata={'user[email]': '***', 'user[password]': '***', 'authenticity_token': auth_token},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "Signed in successfully" in response.body:
            self.log("am logged in")
            self.initialized()
        else:
            self.log("couldn't login")
            print response.body

    def parse_item(self, response):
        item = MycrawlerItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()[0]
        yield item
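For comparison, the same login-then-crawl flow is more often built on CrawlSpider, which is the class that actually applies rules once the session is established. Below is a minimal sketch, reusing the URLs, form fields and success message from the code above; the class name LoggedInCrawler and the unchanged '***' placeholders are mine, and nothing here has been verified against the real site:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request, FormRequest

class LoggedInCrawler(CrawlSpider):
    name = "MY_crawler_alt"
    allowed_domains = ["mydomain.com"]
    login_page = 'https://www.mydomain.com/users/sign_in'
    start_urls = ["https://www.mydomain.com/"]
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def start_requests(self):
        # fetch the login page first instead of the start_urls
        yield Request(self.login_page, callback=self.login)

    def login(self, response):
        # from_response copies hidden inputs such as authenticity_token automatically
        return FormRequest.from_response(
            response,
            formdata={'user[email]': '***', 'user[password]': '***'},
            callback=self.after_login)

    def after_login(self, response):
        if "Signed in successfully" in response.text:
            # hand the start URLs back to CrawlSpider so the rules take over;
            # dont_filter in case the login redirect already landed on this URL
            for url in self.start_urls:
                yield Request(url, dont_filter=True)
        else:
            self.logger.error("couldn't log in")

    def parse_item(self, response):
        yield {'url': response.url,
               'title': response.xpath('//title/text()').extract_first()}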
Related
I am working on a project for which I have to scrape the website "http://app.bmiet.net/student/login" after logging into it. However, I can't log in using scrapy. I think it's because my code is unable to read the CSRF token from the website, but I am still learning scrapy, so I am not sure. Please help me with my code and tell me what my mistake was. The code is given below.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser

class spidey(scrapy.Spider):
    name = 'spidyy'
    start_urls = [
        'http://app.bmiet.net/student/login'
    ]

def parse(self, response):
    token = response.css('form input::attr(value)').extract_first()
    return FormRequest.from_response(response, formdata={
        'csrf_token': token,
        'username': '//username//',
        'password': '//password//'
    }, callback=self.start_scrapping)

def start_scrapping(self, response):
    open_in_browser(response)
    all = response.css('.table-hover td')
    for x in all:
        att = x.css('td:nth-child(2)::text').extract()
        sub = x.css('td~ td+ td::text').extract()
        yield {
            'Subject': sub,
            'Status': att
        }
I have removed the username and password for obvious reasons.
I am also sharing what I am getting at the terminal when running the program.
2020-03-21 17:06:49 [scrapy.core.engine] INFO: Spider opened
2020-03-21 17:06:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-03-21 17:06:49 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-03-21 17:06:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://app.bmiet.net/robots.txt> (referer: None)
2020-03-21 17:06:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://app.bmiet.net/student/login> (referer: None)
2020-03-21 17:06:54 [scrapy.core.scraper] ERROR: Spider error processing <GET http://app.bmiet.net/student/login> (referer: None)
Traceback (most recent call last):
File "c:\users\administrator\pycharmprojects\sarthak_project\venv\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "c:\users\administrator\pycharmprojects\sarthak_project\venv\lib\site-packages\scrapy\spiders\__init__.py", line 84, in parse
raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
NotImplementedError: spidey.parse callback is not defined
2020-03-21 17:06:54 [scrapy.core.engine] INFO: Closing spider (finished)
I would suggest you reformat your code and indent the methods so that they are part of the class, like so:
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser

class spidey(scrapy.Spider):
    name = 'spidyy'
    start_urls = [
        'http://app.bmiet.net/student/login'
    ]

    def parse(self, response):
        token = response.css('form input::attr(value)').extract_first()
        return FormRequest.from_response(response, formdata={
            'csrf_token': token,
            'username': '//username//',
            'password': '//password//'
        }, callback=self.start_scrapping)

    def start_scrapping(self, response):
        open_in_browser(response)
        all = response.css('.table-hover td')
        for x in all:
            att = x.css('td:nth-child(2)::text').extract()
            sub = x.css('td~ td+ td::text').extract()
            yield {
                'Subject': sub,
                'Status': att
            }
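As a side note, FormRequest.from_response already copies the form's hidden inputs (which is where the CSRF token normally lives) into the request, so the token usually does not need to be pulled out by hand. A trimmed-down sketch of the parse callback under that assumption, keeping the placeholder credentials from the question:

    def parse(self, response):
        # hidden inputs, including the CSRF token, are filled in by from_response itself,
        # so only the visible credential fields are supplied here
        return FormRequest.from_response(
            response,
            formdata={'username': '//username//', 'password': '//password//'},
            callback=self.start_scrapping)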
I am trying to learn more advanced Scrapy options, working with response.meta and parsing data from followed pages. The code does run and visits all intended pages, but it does not scrape data from all of them.
I tried changing the link-following rules inside LinkExtractor and restricting the XPaths to different areas of the website, but this does not change Scrapy's behaviour. I also tried not using the r'...' regex patterns, but that changes nothing except that Scrapy wanders off across the whole site.
EDIT: I think the problem lies within def category_page, where I do the next_page navigation on the category page. If I remove this function and the following of those links, Scrapy gets all results from the page.
What I am trying to accomplish is:
Visit a category page from start_urls
Extract all defined items from the /product/view and /pref_product/view pages linked from the category page, and follow further from those to /member/view
Extract all defined items on the /member/view page
Iterate on to the next_page of the category from start_urls
Scrapy does all of those things, but misses a big part of the data!
For example, here is a sample of the log; none of these pages were scraped:
DEBUG: Crawled (200) <GET https://www.go4worldbusiness.com/product/view/275725/car-elevator.html> (referer: https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?region=worldwide&pg_suppliers=5)
DEBUG: Crawled (200) <GET https://www.go4worldbusiness.com/product/view/239895/guide-roller.html> (referer: https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?region=worldwide&pg_suppliers=5)
DEBUG: Crawled (200) <GET https://www.go4worldbusiness.com/product/view/289815/elevator.html> (referer: https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?region=worldwide&pg_suppliers=5)
Here is the code I am using:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from urlparse import urljoin
from scrapy import Selector
from go4world.items import Go4WorldItem

class ElectronicsSpider(CrawlSpider):
    name = "m17"
    allowed_domains = ["go4worldbusiness.com"]
    start_urls = [
        'https://www.go4worldbusiness.com/suppliers/furniture-interior-decoration-furnishings.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/agri-food-processing-machinery-equipment.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/alcoholic-beverages-tobacco-related-products.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/bar-accessories-and-related-products.html?pg_suppliers=1',
        'https://www.go4worldbusiness.com/suppliers/elevators-escalators.html?pg_suppliers=1'
    ]
    rules = (
        Rule(LinkExtractor(allow=(r'/furniture-interior-decoration-furnishings.html?',
                                  r'/furniture-interior-decoration-furnishings.html?',
                                  r'/agri-food-processing-machinery-equipment.html?',
                                  r'/alcoholic-beverages-tobacco-related-products.html?',
                                  r'/bar-accessories-and-related-products.html?',
                                  r'/elevators-escalators.html?'
                                  ), restrict_xpaths=('//div[4]/div[1]/div[2]/div/div[2]/div/div/div[23]/ul'), ),
             callback="category_page",
             follow=True),
        Rule(LinkExtractor(allow=('/product/view/', '/pref_product/view/'), restrict_xpaths=('//div[4]/div[1]/..'), ),
             callback="parse_attr",
             follow=False),
        Rule(LinkExtractor(restrict_xpaths=('/div[4]/div[1]/..'), ),
             callback="category_page",
             follow=False),
    )
    BASE_URL = 'https://www.go4worldbusiness.com'

    def category_page(self, response):
        next_page = response.xpath('//div[4]/div[1]/div[2]/div/div[2]/div/div/div[23]/ul/@href').extract()
        for item in self.parse_attr(response):
            yield item
        if next_page:
            path = next_page.extract_first()
            nextpage = response.urljoin(path)
            yield scrapy.Request(nextpage, callback=category_page)

    def parse_attr(self, response):
        for resource in response.xpath('//div[4]/div[1]/..'):
            item = Go4WorldItem()
            item['NameOfProduct'] = response.xpath('//div[4]/div[1]/div[1]/div/h1/text()').extract()
            item['NameOfCompany'] = response.xpath('//div[4]/div[1]/div[2]/div[1]/span/span/a/text()').extract()
            item['Country'] = response.xpath('//div[4]/div[1]/div[3]/div/div[1]/text()').extract()
            company_page = response.urljoin(resource.xpath('//div[4]/div[1]/div[4]/div/ul/li[1]/a/@href').extract_first())
            request = scrapy.Request(company_page, callback=self.company_data)
            request.meta['item'] = item
            yield request

    def company_data(self, response):
        item = response.meta['item']
        item['CompanyTags'] = response.xpath('//div[4]/div[1]/div[6]/div/div[1]/a/text()').extract()
        item['Contact'] = response.xpath('//div[4]/div[1]/div[5]/div/address/text()').extract()
        yield item
I want Scrapy to grab data from all crawled links. I cannot understand where the error lies that stops Scrapy from scraping certain pages.
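Two things stand out in category_page as posted: extract() already returns a list of strings, so calling extract_first() on that list will fail, and callback=category_page needs the self. prefix. A minimal sketch of how that method could look inside the spider class; the //a/@href step inside the pagination ul is my assumption and is not verified against the site:

import scrapy

    def category_page(self, response):
        # emit the items found on this listing page first
        for item in self.parse_attr(response):
            yield item
        # then queue the next listing page, if any; extract_first() is called
        # on the selector, not on an already-extracted list
        next_page = response.xpath(
            '//div[4]/div[1]/div[2]/div/div[2]/div/div/div[23]/ul//a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.category_page)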
I'm very new to Scrapy; while running my code, I am getting this error.
My code:
import urlparse
from scrapy.http import Request
from scrapy.spiders import BaseSpider

class legco(BaseSpider):
    name = "sec_gov"
    allowed_domains = ["www.sec.gov", "search.usa.gov", "secsearch.sec.gov"]
    start_urls = ["https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany"]

#extract home page search results
def parse(self, response):
    for link in response.xpath('//div[@id="seriesDiv"]//table[@class="tableFile2"]/a/@href').extract():
        req = Request(url = link, callback = self.parse_page)
        print link
        yield req

#extract second link search results
def parse_second(self, response):
    for link in response.xpath('//div[@id="seriesDiv"]//table[@class="tableFile2"]//*[@id="documentsbutton"]/a/@href').extract():
        req = Request(url = link, callback = self.parse_page)
        print link
        yield req
Once I try to run this code with scrapy crawl sec_gov, I am getting this error.
2018-11-14 15:37:26 [scrapy.core.engine] INFO: Spider opened
2018-11-14 15:37:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-14 15:37:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-11-14 15:37:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany> (referer: None)
2018-11-14 15:37:27 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany> (referer: None)
Traceback (most recent call last):
File "/home/surukam/.local/lib/python2.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/home/surukam/.local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 90, in parse
raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
NotImplementedError: legco.parse callback is not defined
2018-11-14 15:37:27 [scrapy.core.engine] INFO: Closing spider (finished)
Can anyone help me with this? Thanks in advance.
Your code should not run at all. There are several things to fix before your script will run. Where did you find this self.parse_page, and what is it doing within your script? Your script is badly indented. I've fixed the script so that it now follows each URL from the landing page through to the related documentation links on the inner pages. Try this to get the content:
import scrapy

class legco(scrapy.Spider):
    name = "sec_gov"
    start_urls = ["https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany"]

    def parse(self, response):
        for link in response.xpath('//table[@summary="Results"]//td[@scope="row"]/a/@href').extract():
            absoluteLink = response.urljoin(link)
            yield scrapy.Request(url = absoluteLink, callback = self.parse_page)

    def parse_page(self, response):
        for links in response.xpath('//table[@summary="Results"]//a[@id="documentsbutton"]/@href').extract():
            targetLink = response.urljoin(links)
            yield {"links": targetLink}
I'm new to Scrapy and have an issue logging in to a salesforce.com-based site. I use the loginform package to populate Scrapy's FormRequest. When run, it does a GET of the login page and a successful POST of the FormRequest login, as expected. But then the spider stops and no page gets scraped.
[...]
2017-06-25 14:02:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://testdomain.secure.force.com/jSites_Home> (referer: None)
2017-06-25 14:02:29 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://login.salesforce.com/> (referer: https://testdomain.secure.force.com/jSites_Home)
2017-06-25 14:02:29 [scrapy.core.engine] INFO: Closing spider (finished)
[...]
The (slightly redacted) script:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from loginform import fill_login_form
from harvest.items import HarvestItem

class TestSpider(Spider):
    name = 'test'
    allowed_domains = ['testdomain.secure.force.com', 'login.salesforce.com']
    login_url = 'https://testdomain.secure.force.com/jSites_Home'
    login_user = 'someuser'
    login_password = 'p4ssw0rd'

    def start_requests(self):
        yield scrapy.Request(self.login_url, self.parse_login)

    def parse_login(self, response):
        data, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_password)
        return scrapy.FormRequest(url, formdata=dict(data), method=method, callback=self.get_assignments)

    def get_assignments(self, response):
        assignment_selector = response.xpath('//*[@id="nav"]/ul/li/a[@title="Assignments"]/@href')
        return Request(urljoin(response.url, assignment_selector.extract()), callback=self.parse_item)

    def parse_item(self, response):
        items = HarvestItem()
        items['startdatum'] = response.xpath('(//*/table[@class="detailList"])[2]/tbody/tr[1]/td[1]/span/text()')\
            .extract()
        return items
When I check the body of the FormRequest, it looks like a legitimate POST to the page 'login.salesforce.com'. If I log in manually, I notice several redirects. However, when I force a parse by adding callback='parse' to the FormRequest, still nothing happens.
Am I right in thinking the login went OK, looking at the 200 response?
I don't see any redirects in the scrapy output. Could it be that scrapy doesn't handle the redirects properly, causing the script to not do any scraping?
Any ideas on getting the script to scrape the final redirected page after login?
Thanks
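One way to narrow this down (a debugging suggestion, not part of the original script) is to dump the response that comes back from the login POST into a browser with Scrapy's open_in_browser helper; a 200 on the POST can still be the login page re-rendered with an error message:

from scrapy.utils.response import open_in_browser

    def get_assignments(self, response):
        # shows exactly what Scrapy received after the login POST,
        # so you can see whether it is the logged-in page or an error page
        open_in_browser(response)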
I'm having trouble with Python Scrapy.
I have a spider that attempts to log in to a site before crawling it; however, the site is configured to return HTTP 401 on the login page, which stops the spider from continuing (even though the login form is right there in the body of that response, ready to submit).
These are the relevant parts of my crawler:
class LoginSpider(Spider):
    name = "login"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # Initial user/pass submit
        self.log("Logging in...", level=log.INFO)
The above yields:
2014-02-23 11:52:09+0000 [login] DEBUG: Crawled (401) <GET https://example.com/login> (referer: None)
2014-02-23 11:52:09+0000 [login] INFO: Closing spider (finished)
However, if I give it another URL to start on (not the login page), one which returns a 200:
2014-02-23 11:50:19+0000 [login] DEBUG: Crawled (200) <GET https://example.com/other-page> (referer: None)
2014-02-23 11:50:19+0000 [login] INFO: Logging in...
You see it goes on to execute my parse() method and make the log entry.
How do I make Scrapy continue to work with the page despite a 401 response code?
On the off-chance this question isn't closed as a duplicate: explicitly adding 401 to handle_httpstatus_list fixed the issue.
class LoginSpider(Spider):
    handle_httpstatus_list = [401]
    name = "login"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # Initial user/pass submit
        self.log("Logging in...", level=log.INFO)
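If you would rather not touch the spider, the same effect can be achieved project-wide through the HttpError middleware; a minimal sketch for settings.py:

# settings.py
# let 401 responses through to spider callbacks instead of being dropped
HTTPERROR_ALLOWED_CODES = [401]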