Scrapy aborts on HTTP 401

I'm having trouble with Python Scrapy.
I have a spider that attempts to log in to a site before crawling it. However, the site returns HTTP 401 on the login page, which stops the spider from continuing (even though the body of that response contains the login form to submit).
These are the relevant parts of my crawler:
class LoginSpider(Spider):
    name = "login"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # Initial user/pass submit
        self.log("Logging in...", level=log.INFO)
The above yields:
2014-02-23 11:52:09+0000 [login] DEBUG: Crawled (401) <GET https://example.com/login> (referer: None)
2014-02-23 11:52:09+0000 [login] INFO: Closing spider (finished)
However, if I give it another URL to start on (not the login page), which returns a 200:
2014-02-23 11:50:19+0000 [login] DEBUG: Crawled (200) <GET https://example.com/other-page> (referer: None)
2014-02-23 11:50:19+0000 [login] INFO: Logging in...
You see it goes on to execute my parse() method and make the log entry.
How do I make Scrapy continue to work with the page despite a 401 response code?

On the off-chance this question isn't closed as a duplicate: explicitly adding 401 to handle_httpstatus_list fixed the issue.
class LoginSpider(Spider):
    handle_httpstatus_list = [401]
    name = "login"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # Initial user/pass submit
        self.log("Logging in...", level=log.INFO)
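For reference, once the 401 is allowed through, the rest of the login flow can be written the usual way. A minimal sketch, assuming made-up form field names and a made-up success marker (neither comes from the original post):

import scrapy
from scrapy.http import FormRequest

class LoginSpider(scrapy.Spider):
    name = "login"
    handle_httpstatus_list = [401]  # let the 401 login page reach parse()
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # The form is present in the 401 body, so it can be submitted as usual.
        # "username" and "password" are placeholder field names.
        return FormRequest.from_response(
            response,
            formdata={"username": "someuser", "password": "somepass"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Placeholder success check; replace with whatever the real site returns.
        if b"Logout" in response.body:
            self.logger.info("Logged in, continue crawling from here")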


DEBUG: Crawled (404)

This is my code:
# -*- coding: utf-8 -*-
import scrapy

class SinasharesSpider(scrapy.Spider):
    name = 'SinaShares'
    allowed_domains = ['money.finance.sina.com.cn/mkt/']
    start_urls = ['http://money.finance.sina.com.cn/mkt//']

    def parse(self, response):
        contents = response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').extract()
        print(contents)
And I have set a user agent in settings.py.
Then I get an error:
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://money.finance.sina.com.cn/robots.txt> (referer: None)
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://money.finance.sina.com.cn/mkt//> (referer: None)
So how can I eliminate this error?
Maybe your IP is banned by the website; you may also need to add some cookies to crawl the data you need.
The HTTP status code 404 is received because Scrapy checks /robots.txt by default. In your case this file does not exist, so a 404 is received, but that does not have any impact. In case you want to avoid checking robots.txt, you can set ROBOTSTXT_OBEY = False in settings.py.
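If you do want to skip the robots.txt request entirely, that is a single line in settings.py (this only removes the harmless 404; it has nothing to do with the empty print output):

# settings.py
ROBOTSTXT_OBEY = False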
Then the website is accessed successfully (HTTP status code 200). No content is printed because, based on your XPath selection, nothing is selected. You have to fix your XPath selection.
If you want to test different XPath or CSS selections in order to figure out how to get your desired content, you might want to use the interactive scrapy shell:
scrapy shell "http://money.finance.sina.com.cn/mkt/"
You can find an example of a scrapy shell session in the official Scrapy documentation here.
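A session for this page might start like the following; the lines after the command are just examples of what to try at the interactive prompt (the selector is the one from the question, using the @ attribute syntax):

scrapy shell "http://money.finance.sina.com.cn/mkt/"
# at the >>> prompt, try for example:
#   response.status
#   response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').extract()
#   response.css('a::attr(href)').extract()[:5]
#   view(response)   # opens the downloaded page in your browser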

Scrapy doing retry after yield

I am new to Python and Scrapy, and I am making a simple Scrapy project to scrape posts from a forum. However, sometimes when crawling a post, it gets a 200 but redirects to an empty page (maybe because of an unstable forum server or some other reason, but whatever). I would like to retry all of those failed scrapes.
As it is too long to read in full, here is a summary of my questions:
1) Can I execute the retry using a custom RetryMiddleware in only one specific method?
2) Can I do something after the first scraping pass finishes?
Okay, let's start.
The overall logic of my code is as below:
Crawl the homepage of the forum
Crawl into every post from the homepage
Scrape the data from the post
def start_requests(self):
    yield scrapy.Request('https://www.forumurl.com', self.parse_page)

def parse_page(self, response):  # going into all the threads
    hrefs = response.xpath('blahblah')
    for href in hrefs:
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_post)

def parse_post(self, response):  # really scraping the content
    content_empty = len(response.xpath('//table[@class="content"]'))  # check if the content is empty
    if content_empty == 0:
        pass  # do something (this is where the retry should go)

    item = ForumItem()
    item['some_content'] = response.xpath('//someXpathCode')
    yield item
I have read a lot on Stack Overflow, and thought I could do it in two ways (and have done some coding):
1) Create a custom RetryMiddleware
2) Do the retry just inside the spider
However, I have had no luck with either. The reasons for the failures are as below:
For the custom RetryMiddleware, I followed this, but it checks every page I crawl, including robots.txt, so it is always retrying. What I want is to do the retry check only inside parse_post. Is this possible?
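For what it's worth, a downloader middleware can be limited to the parse_post requests by keying off something the spider puts in request.meta, rather than by inspecting every response. A rough sketch (the meta key, the emptiness test, and the retry cap are all made up for illustration, and the class still has to be enabled in DOWNLOADER_MIDDLEWARES):

class EmptyPostRetryMiddleware(object):
    """Re-issue requests whose response looks like an empty post.

    Only requests carrying meta['retry_empty'] = True are touched, so
    robots.txt and the listing pages are never retried.
    """

    MAX_EMPTY_RETRIES = 2  # arbitrary cap to avoid endless loops

    def process_response(self, request, response, spider):
        if not request.meta.get('retry_empty'):
            return response
        if response.xpath('//table[@class="content"]'):
            return response  # content is there, hand it to the spider
        retries = request.meta.get('empty_retry_times', 0)
        if retries >= self.MAX_EMPTY_RETRIES:
            return response
        retry = request.replace(dont_filter=True)
        retry.meta['empty_retry_times'] = retries + 1
        return retry

The spider would then yield its post requests with meta={'retry_empty': True} in parse_page, so only those are ever re-scheduled.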
For retrying inside the spider, I have tried two approaches.
First, I added a class variable _post_not_crawled = [] and appended response.url to it when the empty check was true. I adjusted start_requests to retry all failed scrapes after the first pass finished:
def start_requests(self):
    yield scrapy.Request('https://www.forumurl.com', self.parse_page)
    while self._post_not_crawled:
        yield scrapy.Request(self._post_not_crawled.pop(0), callback=self.parse_post)
But of course it doesn't work, because it runs before any data is actually scraped, so it only executes once, with an empty _post_not_crawled list, before scraping starts. Is it possible to do something after the first scraping pass finishes?
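On that point, one pattern that is sometimes used for "do something after the first pass" is the spider_idle signal, which Scrapy fires when the scheduler has run dry, so deferred retries can be fed in at that moment. A rough sketch under that assumption (the method names are mine, and note that engine.crawl takes only the request in recent Scrapy versions):

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class ForumSpider(scrapy.Spider):
    name = 'forum'

    def __init__(self, *args, **kwargs):
        super(ForumSpider, self).__init__(*args, **kwargs)
        self._post_not_crawled = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(ForumSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def start_requests(self):
        yield scrapy.Request('https://www.forumurl.com', self.parse_page)

    def parse_page(self, response):
        # thread extraction from the question goes here
        pass

    def parse_post(self, response):
        # post scraping from the question goes here; append the URLs of
        # empty posts to self._post_not_crawled
        pass

    def on_idle(self):
        # Fired when the scheduler has no requests left: feed in the retries.
        if self._post_not_crawled:
            while self._post_not_crawled:
                url = self._post_not_crawled.pop(0)
                self.crawler.engine.crawl(
                    scrapy.Request(url, callback=self.parse_post, dont_filter=True),
                    self)  # Scrapy >= 2.10 takes only the request here
            raise DontCloseSpider  # stay alive until the retries are done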
My second attempt was to retry directly inside parse_post():
if content_empty == 0:
    logging.warning('Post was empty: ' + response.url)
    retryrequest = scrapy.Request(response.url, callback=self.parse_post)
    retryrequest.dont_filter = True
    return retryrequest
else:
    pass  # do the scraping
Update: some logs from this method:
2017-09-03 05:15:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.hkgolden.com/view.aspx?type=BW&message=6778647> (referer: https://forum.hkgolden.com/topics.aspx?type=BW&page=2)
2017-09-03 05:15:43 [root] WARNING: Post was empty: https://forum.hkgolden.com/view.aspx?type=BW&message=6778647
2017-09-03 05:15:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.hkgolden.com/view.aspx?type=BW&message=6778568> (referer: https://forum.hkgolden.com/topics.aspx?type=BW&page=2)
2017-09-03 05:15:44 [root] WARNING: Post was empty: https://forum.hkgolden.com/view.aspx?type=BW&message=6778568
2017-09-03 05:15:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.hkgolden.com/view.aspx?type=BW&message=6774780> (referer: https://forum.hkgolden.com/topics.aspx?type=BW&page=2)
2017-09-03 05:15:46 [root] WARNING: Post was empty: https://forum.hkgolden.com/view.aspx?type=BW&message=6774780
But it doesn't work either; the retry request was just skipped without any sign.
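For completeness, the same idea can be written with dont_filter passed at construction time and a counter in meta, so a post that is always empty cannot retry forever. This is only a sketch of a drop-in replacement for parse_post, relying on the names already used in the question (scrapy, logging, ForumItem); the cap of 2 is arbitrary:

def parse_post(self, response):
    content_empty = len(response.xpath('//table[@class="content"]')) == 0
    if content_empty:
        retries = response.meta.get('empty_retries', 0)
        if retries < 2:  # arbitrary cap
            yield scrapy.Request(
                response.url,
                callback=self.parse_post,
                dont_filter=True,
                meta={'empty_retries': retries + 1},
            )
            return
        logging.warning('Giving up on empty post: ' + response.url)
        return
    item = ForumItem()
    item['some_content'] = response.xpath('//someXpathCode').extract()
    yield item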
Thanks for reading all of this. I appreciate all of your help.

Scrapy https tutorial

Hi everyone!
I'm new to the Scrapy framework, and I need to parse wisemapping.com.
First, I read the official Scrapy tutorial and tried to get access to one of the "wisemaps", but got errors:
[scrapy.core.engine] DEBUG: Crawled (404) <GET https://app.wisemapping.com/robots.txt> (referer: None)
[scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying
<GET https://app.wisemapping.com/c/maps/576786/public> (failed 3 times): 500 Internal Server Error
[scrapy.core.engine] DEBUG: Crawled (500) <GET https://app.wisemapping.com/c/maps/576786/public> (referer: None)
[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://app.wisemapping.com/c/maps/576786/public>: HTTP status code is not handled or not allowed
Please give me advice on solving the problems with the following code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://app.wisemapping.com/c/maps/576786/public',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'wisemape.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Navigating to https://app.wisemapping.com/c/maps/576786/public gives the error
"Outch!!. This map is not available anymore.
You do not have enough right access to see this map. This map has been changed to private or deleted."
Does this map exist? If so, try making it public.
If you know for a fact that the map you're trying to access exists, verify that the URL you're using is the correct one.
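If it turns out the map is reachable and you just want to look at what the 500 page contains, the status can be allowed through in the same way as the 401 in the first question. A sketch, reusing the spider from the question:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Let 500 responses reach parse() instead of being dropped by the
    # HttpError spider middleware.
    handle_httpstatus_list = [500]
    start_urls = ['https://app.wisemapping.com/c/maps/576786/public']

    def parse(self, response):
        self.log('Got %d for %s' % (response.status, response.url))
        with open('wisemape.html', 'wb') as f:
            f.write(response.body)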

Scrapy doesn't do any scraping after login on salesforce.com based site

I'm new to Scrapy and have an issue logging in to a salesforce.com based site. I use the loginform package to populate Scrapy's FormRequest. When run, it does a GET of the login page and a successful POST of the FormRequest login, as expected. But then the spider stops, and no page gets scraped.
[...]
2017-06-25 14:02:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://testdomain.secure.force.com/jSites_Home> (referer: None)
2017-06-25 14:02:29 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://login.salesforce.com/> (referer: https://testdomain.secure.force.com/jSites_Home)
2017-06-25 14:02:29 [scrapy.core.engine] INFO: Closing spider (finished)
[...]
The (slightly redacted) script:
import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Spider, CrawlSpider, Rule
from loginform import fill_login_form
from harvest.items import HarvestItem

class TestSpider(Spider):
    name = 'test'
    allowed_domains = ['testdomain.secure.force.com', 'login.salesforce.com']
    login_url = 'https://testdomain.secure.force.com/jSites_Home'
    login_user = 'someuser'
    login_password = 'p4ssw0rd'

    def start_requests(self):
        yield scrapy.Request(self.login_url, self.parse_login)

    def parse_login(self, response):
        data, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_password)
        return scrapy.FormRequest(url, formdata=dict(data), method=method, callback=self.get_assignments)

    def get_assignments(self, response):
        assignment_selector = response.xpath('//*[@id="nav"]/ul/li/a[@title="Assignments"]/@href')
        return Request(response.urljoin(assignment_selector.extract_first()), callback=self.parse_item)

    def parse_item(self, response):
        items = HarvestItem()
        items['startdatum'] = response.xpath('(//*/table[@class="detailList"])[2]/tbody/tr[1]/td[1]/span/text()')\
            .extract()
        return items
When I check the body of the FormRequest, it looks like a legit POST to 'login.salesforce.com'. If I log in manually, I notice several redirects. However, when I force a parse by adding a "callback='parse'" to the FormRequest, still nothing happens.
Am I right in thinking the login went OK, looking at the 200 response?
I don't see any redirects in the scrapy output. Could it be that scrapy doesn't handle the redirects properly, causing the script to not do any scraping?
Any ideas on getting the script to scrape the final redirected page after login?
Thanks
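One thing worth checking before suspecting the redirect handling is whether the 200 on the POST is actually a logged-in page at all; Salesforce often answers a failed login with a 200 page that still contains the login form or a JavaScript redirect. A small debugging version of the callback (the marker strings are guesses, not something from the site):

def get_assignments(self, response):
    # Log where the POST actually landed.
    self.logger.info('Landed on %s (status %s)', response.url, response.status)
    if b'password' in response.body.lower() or b'login' in response.url.lower().encode():
        self.logger.warning('Response still looks like a login page; '
                            'the POST probably did not authenticate')
        return
    # ...original link extraction continues here...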

InitSpider not crawling or capturing data

I'm quite unsure, from the information available, which class I should inherit from for a crawling spider.
My example below attempts to start with an authentication page and proceed to crawl all logged-in pages. As per the console output posted, it authenticates fine, but it cannot output even the first page to JSON and halts after the first 200 status page.
I get this (a new line, followed by a left square bracket):
JSON file
[
Console output
DEBUG: Crawled (200) <GET https://www.mydomain.com/users/sign_in> (referer: None)
DEBUG: Redirecting (302) to <GET https://www.mydomain.com/> from <POST https://www.mydomain.com/users/sign_in>
DEBUG: Crawled (200) <GET https://www.mydomain.com/> (referer: https://www.mydomain.com/users/sign_in)
DEBUG: am logged in
INFO: Closing spider (finished)
When running this:
scrapy crawl MY_crawler -o items.json
Using this spider:
import scrapy
from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.spiders import Rule
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors import LinkExtractor
from cmrcrawler.items import MycrawlerItem

class MyCrawlerSpider(InitSpider):
    name = "MY_crawler"
    allowed_domains = ["mydomain.com"]
    login_page = 'https://www.mydomain.com/users/sign_in'
    start_urls = [
        "https://www.mydomain.com/",
    ]
    rules = (
        # requires trailing comma to force iterable vs tuple
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def init_request(self):
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        auth_token = response.xpath('authxpath').extract()[0]
        return FormRequest.from_response(
            response,
            formdata={'user[email]': '***', 'user[password]': '***', 'authenticity_token': auth_token},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "Signed in successfully" in response.body:
            self.log("am logged in")
            self.initialized()
        else:
            self.log("couldn't login")
            print response.body

    def parse_item(self, response):
        item = MycrawlerItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()[0]
        yield item
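Not from the post itself, but one detail of the stock InitSpider is easy to miss: initialized() returns the queued start_urls requests, so the login callback has to return that value, otherwise nothing gets scheduled and the spider closes exactly as in the log above. Also, rules/LinkExtractor are only honoured by CrawlSpider, not InitSpider. A sketch of the changed callback under that assumption:

def check_login_response(self, response):
    if "Signed in successfully" in response.body:
        self.log("am logged in")
        # initialized() hands back the deferred start_urls requests;
        # they are silently discarded unless they are returned here.
        return self.initialized()
    self.log("couldn't login")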
