scrapy ERROR: Spider error processing issue - python

I'm very new to Scrapy. While running my code, I get this error.
My Code
import urlparse
from scrapy.http import Request
from scrapy.spiders import BaseSpider

class legco(BaseSpider):
    name = "sec_gov"
    allowed_domains = ["www.sec.gov", "search.usa.gov", "secsearch.sec.gov"]
    start_urls = ["https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany"]

#extract home page search results
def parse(self, response):
    for link in response.xpath('//div[@id="seriesDiv"]//table[@class="tableFile2"]/a/@href').extract():
        req = Request(url = link, callback = self.parse_page)
        print link
        yield req

#extract second link search results
def parse_second(self, response):
    for link in response.xpath('//div[@id="seriesDiv"]//table[@class="tableFile2"]//*[@id="documentsbutton"]/a/@href').extract():
        req = Request(url = link, callback = self.parse_page)
        print link
        yield req
When I try to run this code with scrapy crawl sec_gov, I get this error:
2018-11-14 15:37:26 [scrapy.core.engine] INFO: Spider opened
2018-11-14 15:37:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-14 15:37:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-11-14 15:37:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany> (referer: None)
2018-11-14 15:37:27 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany> (referer: None)
Traceback (most recent call last):
File "/home/surukam/.local/lib/python2.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/home/surukam/.local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 90, in parse
raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
NotImplementedError: legco.parse callback is not defined
2018-11-14 15:37:27 [scrapy.core.engine] INFO: Closing spider (finished)
Can anyone help me with this? Thanks in advance.

Your code should not run at all; there are several things to fix before your script will work. Where did you find this self.parse_page, and what is it doing in your script? Your script is also badly indented. I've fixed the script so that it now follows each URL from the landing page through to the document links on the corresponding inner pages. Try this to get the content.
import scrapy

class legco(scrapy.Spider):
    name = "sec_gov"
    start_urls = ["https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany"]

    def parse(self, response):
        for link in response.xpath('//table[@summary="Results"]//td[@scope="row"]/a/@href').extract():
            absoluteLink = response.urljoin(link)
            yield scrapy.Request(url = absoluteLink, callback = self.parse_page)

    def parse_page(self, response):
        for links in response.xpath('//table[@summary="Results"]//a[@id="documentsbutton"]/@href').extract():
            targetLink = response.urljoin(links)
            yield {"links": targetLink}

Related

SCRAPY FORM REQUEST doesn't return any data

I am making a form request to a website. The request is made successfully, but it doesn't return any data.
LOGS:
2020-09-05 22:37:57 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:57 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:59 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:59 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-05 22:37:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
MY CODE:
# -*- coding: utf-8 -*-
import scrapy

codes = open('codes.txt').read().split('\n')

class MainSpider(scrapy.Spider):
    name = 'main'
    form_url = 'https://safer.fmcsa.dot.gov/query.asp'
    start_urls = ['https://safer.fmcsa.dot.gov/CompanySnapshot.aspx']

    def parse(self, response):
        for code in codes:
            data = {
                'searchtype': 'ANY',
                'query_type': 'queryCarrierSnapshot',
                'query_param': 'USDOT',
                'query_string': code,
            }
            yield scrapy.FormRequest(url=self.form_url, formdata=data, callback=self.parse_form)

    def parse_form(self, response):
        cargo = response.xpath('(//table[@summary="Cargo Carried"]/tbody/tr)[2]')
        for each in cargo:
            each_x = each.xpath('.//td[contains(text(), "X")]/following-sibling::td/font/text()').get()
            yield {
                "X Values": each_x if each_x else "N/A",
            }
The following are a few sample codes that I am using for the POST request.
2146709
273286
120670
2036998
690147
I believe all you need is to remove tbody from your XPath here:
cargo = response.xpath('(//table[@summary="Cargo Carried"]/tbody/tr)[2]')
Use it like this:
cargo = response.xpath('//table[@summary="Cargo Carried"]/tr[2]')
# I also removed the () inside the path because you don't need it, but that didn't cause the problem.
The reason is that Scrapy parses the original HTML source of the page, while your browser may render a tbody element even when it isn't present in the source.
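If you want to check the selector against the real response before re-running the whole spider, one option (a sketch only, reproducing the POST in the Scrapy shell with one of the sample USDOT codes from the question) is:
scrapy shell
>>> from scrapy import FormRequest
>>> data = {'searchtype': 'ANY', 'query_type': 'queryCarrierSnapshot',
...         'query_param': 'USDOT', 'query_string': '2146709'}
>>> fetch(FormRequest('https://safer.fmcsa.dot.gov/query.asp', formdata=data))
>>> response.xpath('//table[@summary="Cargo Carried"]/tr[2]')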

Unable to login to a website using a Web Crawler (scrapy)

I am working on a project for which I have to scrape the website "http://app.bmiet.net/student/login" after logging into it. However, I can't log in using Scrapy. I think it's because my code is unable to read the CSRF token from the website, but I am still learning Scrapy, so I am not sure. Please help me with my code and tell me what my mistake was. The code is given below.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser

class spidey(scrapy.Spider):
    name = 'spidyy'
    start_urls = [
        'http://app.bmiet.net/student/login'
    ]

def parse(self, response):
    token = response.css('form input::attr(value)').extract_first()
    return FormRequest.from_response(response, formdata={
        'csrf_token' : token,
        'username' : '//username//',
        'password' : '//password//'
    }, callback = self.start_scrapping)

def start_scrapping(self, response):
    open_in_browser(response)
    all = response.css('.table-hover td')
    for x in all:
        att = x.css('td:nth-child(2)::text').extract()
        sub = x.css('td~ td+ td::text').extract()
        yield {
            'Subject': sub,
            'Status': att
        }
I have removed username and password for obvious reasons.
I am also sharing what I get in the terminal when running the program.
2020-03-21 17:06:49 [scrapy.core.engine] INFO: Spider opened
2020-03-21 17:06:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-03-21 17:06:49 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-03-21 17:06:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://app.bmiet.net/robots.txt> (referer: None)
2020-03-21 17:06:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://app.bmiet.net/student/login> (referer: None)
2020-03-21 17:06:54 [scrapy.core.scraper] ERROR: Spider error processing <GET http://app.bmiet.net/student/login> (referer: None)
Traceback (most recent call last):
File "c:\users\administrator\pycharmprojects\sarthak_project\venv\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "c:\users\administrator\pycharmprojects\sarthak_project\venv\lib\site-packages\scrapy\spiders\__init__.py", line 84, in parse
raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
NotImplementedError: spidey.parse callback is not defined
2020-03-21 17:06:54 [scrapy.core.engine] INFO: Closing spider (finished)
I would suggest you reformat your code and indent the methods so that they are part of the class, like so:
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser

class spidey(scrapy.Spider):
    name = 'spidyy'
    start_urls = [
        'http://app.bmiet.net/student/login'
    ]

    def parse(self, response):
        token = response.css('form input::attr(value)').extract_first()
        return FormRequest.from_response(response, formdata={
            'csrf_token' : token,
            'username' : '//username//',
            'password' : '//password//'
        }, callback = self.start_scrapping)

    def start_scrapping(self, response):
        open_in_browser(response)
        all = response.css('.table-hover td')
        for x in all:
            att = x.css('td:nth-child(2)::text').extract()
            sub = x.css('td~ td+ td::text').extract()
            yield {
                'Subject': sub,
                'Status': att
            }
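Worth noting: FormRequest.from_response pre-populates the request with the values of the form's existing input fields, including hidden ones such as the CSRF token, and formdata only overrides the fields you name explicitly. So the parse method could be reduced to something like the sketch below (the field names are kept from the question and may not match the site's actual form):
    def parse(self, response):
        # from_response picks up hidden inputs (e.g. the CSRF token) automatically,
        # so only the credentials need to be supplied explicitly.
        return FormRequest.from_response(
            response,
            formdata={'username': '//username//', 'password': '//password//'},
            callback=self.start_scrapping
        )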

Get all URLs in a entire site using Scrapy

Folks!
I'm trying to get all internal URLs in an entire site for SEO purposes, and I recently discovered Scrapy to help me with this task. But my code always returns an error:
2017-10-11 10:32:00 [scrapy.core.engine] INFO: Spider opened
2017-10-11 10:32:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-11 10:32:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-11 10:32:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.**test**.com/> from <GET http://www.**test**.com/robots.txt>
2017-10-11 10:32:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.**test**.com/> (referer: None)
2017-10-11 10:32:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.**test**.com/> from <GET http://www.**test**.com>
2017-10-11 10:32:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.**test**.com/> (referer: None)
2017-10-11 10:32:03 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.**test**.com/> (referer: None)
Traceback (most recent call last):
File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "c:\python27\lib\site-packages\scrapy\spiders\__init__.py", line 90, in parse
raise NotImplementedError
NotImplementedError
I changed the original URL.
Here's the code I'm running:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["http://www.test.com"]
    start_urls = ["http://www.test.com"]
    rules = [Rule (LinkExtractor(allow=['.*']))]
Thanks!
EDIT:
This worked for me:
rules = (
    Rule(LinkExtractor(), callback='parse_item', follow=True),
)

def parse_item(self, response):
    filename = response.url
    arquivo = open("file.txt", "a")
    string = str(filename)
    arquivo.write(string + '\n')
    arquivo.close()
=D
The error you are getting is caused by the fact that you haven't defined a parse method in your spider, which is mandatory if you base your spider on the scrapy.Spider class.
For your purpose (i.e. crawling a whole website) it's best to base your spider on the scrapy.CrawlSpider class. Also, in Rule, you have to define the callback attribute as the method that will parse every page you visit. One last cosmetic change: in LinkExtractor, if you want to visit every page, you can leave out allow, as its default value is an empty tuple, which means it will match all links found.
Consult a CrawlSpider example for concrete code.
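A minimal sketch of what that could look like (the callback name and item shape are illustrative, and the domain is the placeholder from the question):
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TestSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["www.test.com"]  # domain only, not a full URL
    start_urls = ["http://www.test.com"]

    rules = (
        # An empty LinkExtractor matches every link found on each page.
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Yield each visited URL; export with e.g. `scrapy crawl test -o urls.csv`.
        yield {'url': response.url}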

Scrapy doesn't do any scraping after login on salesforce.com based site

I'm new to Scrapy and have an issue logging in to a salesforce.com based site. I use the loginform package to populate Scrapy's FormRequest. When run, it does a GET of the login page and a successful POST of the FormRequest login as expected. But then the spider stops and no page gets scraped.
[...]
2017-06-25 14:02:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://testdomain.secure.force.com/jSites_Home> (referer: None)
2017-06-25 14:02:29 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://login.salesforce.com/> (referer: https://testdomain.secure.force.com/jSites_Home)
2017-06-25 14:02:29 [scrapy.core.engine] INFO: Closing spider (finished)
[...]
The (slightly redacted) script:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from loginform import fill_login_form
from harvest.items import HarvestItem

class TestSpider(Spider):
    name = 'test'
    allowed_domains = ['testdomain.secure.force.com', 'login.salesforce.com']
    login_url = 'https://testdomain.secure.force.com/jSites_Home'
    login_user = 'someuser'
    login_password = 'p4ssw0rd'

    def start_requests(self):
        yield scrapy.Request(self.login_url, self.parse_login)

    def parse_login(self, response):
        data, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_password)
        return scrapy.FormRequest(url, formdata=dict(data), method=method, callback=self.get_assignments)

    def get_assignments(self, response):
        assignment_selector = response.xpath('//*[@id="nav"]/ul/li/a[@title="Assignments"]/@href')
        return Request(urljoin(response.url, assignment_selector.extract()), callback=self.parse_item)

    def parse_item(self, response):
        items = HarvestItem()
        items['startdatum'] = response.xpath('(//*/table[@class="detailList"])[2]/tbody/tr[1]/td[1]/span/text()').extract()
        return items
When I check the body of the FormRequest, it looks like a legit POST to 'login.salesforce.com'. If I log in manually, I notice several redirects. However, when I force a parse by adding callback='parse' to the FormRequest, still nothing happens.
Am I right in thinking the login went OK, looking at the 200 response?
I don't see any redirects in the scrapy output. Could it be that scrapy doesn't handle the redirects properly, causing the script to not do any scraping?
Any ideas on getting the script to scrape the final redirected page after login?
Thanks
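One way to see what the POST actually returned (a debugging sketch only, not a confirmed fix) is to inspect the response at the start of get_assignments, e.g. by adding from scrapy.utils.response import open_in_browser at the top of the script and then:
    def get_assignments(self, response):
        # A 200 here with no "Redirecting" lines in the log means the redirect
        # middleware never fired for this request; check what the server really
        # sent back before trying to extract anything from it.
        self.logger.info("status=%s url=%s", response.status, response.url)
        open_in_browser(response)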

InitSpider not crawling or capturing data

With the information available, I'm quite unsure which class I should be inheriting from for a crawling spider.
My example below attempts to start with an authentication page and then crawl all logged-in pages. As per the console output posted, it authenticates fine, but it cannot output even the first page to JSON and halts after the first 200-status page.
I get this (a new line, followed by a left square bracket):
JSON file
[
Console output
DEBUG: Crawled (200) <GET https://www.mydomain.com/users/sign_in> (referer: None)
DEBUG: Redirecting (302) to <GET https://www.mydomain.com/> from <POST https://www.mydomain.com/users/sign_in>
DEBUG: Crawled (200) <GET https://www.mydomain.com/> (referer: https://www.mydomain.com/users/sign_in)
DEBUG: am logged in
INFO: Closing spider (finished)
When running this:
scrapy crawl MY_crawler -o items.json
Using spider:
import scrapy
from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.spiders import Rule
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors import LinkExtractor
from cmrcrawler.items import MycrawlerItem

class MyCrawlerSpider(InitSpider):
    name = "MY_crawler"
    allowed_domains = ["mydomain.com"]
    login_page = 'https://www.mydomain.com/users/sign_in'
    start_urls = [
        "https://www.mydomain.com/",
    ]
    rules = (
        #requires trailing comma to force iterable vs tuple
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def init_request(self):
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        auth_token = response.xpath('authxpath').extract()[0]
        return FormRequest.from_response(
            response,
            formdata={'user[email]': '***', 'user[password]': ***, 'authenticity_token': auth_token},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "Signed in successfully" in response.body:
            self.log("am logged in")
            self.initialized()
        else:
            self.log("couldn't login")
            print response.body

    def parse_item(self, response):
        item = MycrawlerItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()[0]
        yield item
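For the "which class should I inherit from" part, one possible direction is a plain CrawlSpider that performs the login in start_requests and only hands the start URLs over to the rules once the login check passes. A rough sketch under that assumption (placeholders and field names kept from the question, untested against the real site):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request, FormRequest

class MyCrawlerSpider(CrawlSpider):
    name = "MY_crawler"
    allowed_domains = ["mydomain.com"]
    login_page = 'https://www.mydomain.com/users/sign_in'
    start_urls = ["https://www.mydomain.com/"]

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def start_requests(self):
        # Log in first; the normal crawl starts only after the login check passes.
        yield Request(self.login_page, callback=self.login)

    def login(self, response):
        auth_token = response.xpath('authxpath').extract()[0]  # placeholder XPath from the question
        return FormRequest.from_response(
            response,
            formdata={'user[email]': '***', 'user[password]': '***',
                      'authenticity_token': auth_token},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if b"Signed in successfully" in response.body:
            # No callback here: these requests fall through to CrawlSpider's own
            # parse logic, which applies the rules defined above.
            for url in self.start_urls:
                yield Request(url)

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.xpath('//title/text()').get()}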
