I'm trying to crawl a site and check whether any of its internal links are down. As there is no sitemap available, I'm using Scrapy to crawl the site and collect every link on every page, but I can't get it to output a file with all the links found and their status codes. The site I'm using to test the code is quotes.toscrape.com and my code is:
from scrapy.spiders import Spider
from mytest.items import MytestItem
from scrapy.http import Request
import re

class MySpider(Spider):
    name = "sample"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        # We stored already crawled links in this list
        crawledLinks = []
        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            if link not in crawledLinks:
                link = "http://quotes.toscrape.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)
I've tried adding the following lines after yield:
item = MytestItem()
item['url'] = link
item['status'] = response.status
yield item
But it gives me a bunch of duplicates and no URLs with status 404 or 301. Does anyone know how I can get all the URLs with their status?
By default, Scrapy does not pass unsuccessful responses to your callbacks, but you can fetch and handle them in one of your own functions if you set errback on the request.
def parse(self, response):
    # some code
    yield Request(link, self.parse, errback=self.parse_error)

def parse_error(self, failure):
    # log the response as an error
    pass
The failure parameter will contain more information on the exact reason for the failure: it could be an HTTP error (where you can fetch a response), but also a DNS lookup error and the like (where there is no response).
The documentation contains an example of how to use failure to determine the error reason and access the Response if one is available:
# Imports used by the failure checks below (from the same docs example):
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

def errback_httpbin(self, failure):
    # log all failures
    self.logger.error(repr(failure))

    # in case you want to do something special for some errors,
    # you may need the failure's type:
    if failure.check(HttpError):
        # these exceptions come from HttpError spider middleware
        # you can get the non-200 response
        response = failure.value.response
        self.logger.error('HttpError on %s', response.url)

    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)

    elif failure.check(TimeoutError, TCPTimedOutError):
        request = failure.request
        self.logger.error('TimeoutError on %s', request.url)
You should either use the HTTPERROR_ALLOW_ALL setting or set the meta key handle_httpstatus_all = True on all your requests; please refer to the docs for more information.
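Putting both pieces together, here is a minimal sketch of one way to record every crawled URL with its status code. It assumes your MytestItem declares url and status fields; the spider name and the parse_error helper are just illustrative:

import scrapy
from scrapy.spiders import Spider
from scrapy.http import Request

from mytest.items import MytestItem  # assumed to define 'url' and 'status' fields

class LinkStatusSpider(Spider):
    name = "linkstatus"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]
    # Hand non-2xx responses (404, 500, ...) to the callback as well
    custom_settings = {"HTTPERROR_ALLOW_ALL": True}

    def parse(self, response):
        # Record the status of the page we just fetched
        item = MytestItem()
        item['url'] = response.url
        item['status'] = response.status
        yield item

        # Follow internal links; Scrapy's dupefilter drops URLs already seen
        for href in response.xpath('//a/@href').extract():
            yield Request(response.urljoin(href), callback=self.parse,
                          errback=self.parse_error)

    def parse_error(self, failure):
        # DNS/timeout errors never produce a response, so log the request URL
        self.logger.error('Request failed: %s', failure.request.url)

Run it with scrapy crawl linkstatus -o links.csv and the feed export should give you one row per URL with its status.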
Related
I've written a script in Scrapy to make proxied requests using proxies newly generated by the get_proxies() method. I used the requests module to fetch the proxies in order to reuse them in the script. What I'm trying to do is parse all the movie links from its landing page and then fetch the name of each movie from its target page. My script can rotate proxies.
I know there is an easier way to change proxies, like the one described here HttpProxyMiddleware, but I would still like to stick to the way I'm trying here.
website link
This is my current attempt (it keeps using new proxies to fetch a valid response, but every time it gets 503 Service Unavailable):
import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

def get_proxies():
    response = requests.get("https://www.us-proxy.org/")
    soup = BeautifulSoup(response.text, "lxml")
    proxy = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text])
             for item in soup.select("table.table tbody tr") if "yes" in item.text]
    return proxy

class ProxySpider(scrapy.Spider):
    name = "proxiedscript"
    handle_httpstatus_list = [503]
    proxy_vault = get_proxies()
    check_url = "https://yts.am/browse-movies"

    def start_requests(self):
        random.shuffle(self.proxy_vault)
        proxy_url = next(cycle(self.proxy_vault))
        request = scrapy.Request(self.check_url, callback=self.parse, dont_filter=True)
        request.meta['https_proxy'] = f'http://{proxy_url}'
        yield request

    def parse(self, response):
        print(response.meta)
        if "DDoS protection by Cloudflare" in response.css(".attribution > a::text").get():
            random.shuffle(self.proxy_vault)
            proxy_url = next(cycle(self.proxy_vault))
            request = scrapy.Request(self.check_url, callback=self.parse, dont_filter=True)
            request.meta['https_proxy'] = f'http://{proxy_url}'
            yield request
        else:
            for item in response.css(".browse-movie-wrap a.browse-movie-title::attr(href)").getall():
                nlink = response.urljoin(item)
                yield scrapy.Request(nlink, callback=self.parse_details)

    def parse_details(self, response):
        name = response.css("#movie-info h1::text").get()
        yield {"Name": name}

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(ProxySpider)
    c.start()
To make sure the request is being proxied, I printed response.meta and got results like this: {'https_proxy': 'http://142.93.127.126:3128', 'download_timeout': 180.0, 'download_slot': 'yts.am', 'download_latency': 0.237013578414917, 'retry_times': 2, 'depth': 0}.
As I've overused the link to check how proxied requests work within Scrapy, I'm getting a 503 Service Unavailable error at the moment, and I can see the keyword DDoS protection by Cloudflare within the response. However, I get a valid response when I try the requests module with the same logic I implemented here.
My earlier question: why can't I get a valid response when (I suppose) I'm using proxies in the right way? [solved]
Bounty question: how can I define a try/except clause within my script so that it will try a different proxy once it throws a connection error with a certain proxy?
According to the scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware docs (and source), the proxy meta key is the one expected (not https_proxy):
#request.meta['https_proxy'] = f'http://{proxy_url}'
request.meta['proxy'] = f'http://{proxy_url}'
Since Scrapy didn't receive a valid meta key, your Scrapy application didn't use the proxies at all.
The start_requests() function is just the entry point. On subsequent requests, you would need to resupply this metadata to the Request object.
Also, errors can occur on two levels: at the proxy and at the target server.
We need to handle bad response codes from both. Proxy errors are passed by the downloader middleware to the errback function, while the target server's response can be handled during parsing via response.status:
import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

def get_proxies():
    response = requests.get("https://www.us-proxy.org/")
    soup = BeautifulSoup(response.text, "lxml")
    proxy = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in
             soup.select("table.table tbody tr") if "yes" in item.text]
    # proxy = ['https://52.0.0.1:8090', 'https://52.0.0.2:8090']
    return proxy

def get_random_proxy(proxy_vault):
    random.shuffle(proxy_vault)
    proxy_url = next(cycle(proxy_vault))
    return proxy_url

class ProxySpider(scrapy.Spider):
    name = "proxiedscript"
    handle_httpstatus_list = [503, 502, 401, 403]
    check_url = "https://yts.am/browse-movies"
    proxy_vault = get_proxies()

    def handle_middleware_errors(self, *args, **kwargs):
        # implement middleware error handling here
        print('Middleware Error')
        # retry request with different proxy
        yield self.make_request(url=args[0].request._url, callback=args[0].request._meta['callback'])

    def start_requests(self):
        yield self.make_request(url=self.check_url, callback=self.parse)

    def make_request(self, url, callback, dont_filter=True):
        return scrapy.Request(url,
                              meta={'proxy': f'https://{get_random_proxy(self.proxy_vault)}', 'callback': callback},
                              callback=callback,
                              dont_filter=dont_filter,
                              errback=self.handle_middleware_errors)

    def parse(self, response):
        print(response.meta)
        try:
            if response.status != 200:
                # implement server status code handling here - this loops forever
                print(f'Status code: {response.status}')
                raise
            else:
                for item in response.css(".browse-movie-wrap a.browse-movie-title::attr(href)").getall():
                    nlink = response.urljoin(item)
                    yield self.make_request(url=nlink, callback=self.parse_details)
        except:
            # if anything goes wrong fetching the lister page, try again
            yield self.make_request(url=self.check_url, callback=self.parse)

    def parse_details(self, response):
        print(response.meta)
        try:
            if response.status != 200:
                # implement server status code handling here - this loops forever
                print(f'Status code: {response.status}')
                raise
            name = response.css("#movie-info h1::text").get()
            yield {"Name": name}
        except:
            # if anything goes wrong fetching the detail page, try again
            yield self.make_request(url=response.request._url, callback=self.parse_details)

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(ProxySpider)
    c.start()
I haven't written any Python code in over 10 years. So I'm trying to use Scrapy to assemble some information off of a website:
import sys

import scrapy

class TutorialSpider(scrapy.Spider):
    name = "tutorial"

    def start_requests(self):
        urls = [
            'https://example.com/page/1',
            'https://example.com/page/2',
        ]
        for url in urls:
            print(f'{self.name} spider')
            print(f'url is {url}')
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.url)
        self.log(response.url)
        sys.stdout.write('hello')
I'm trying to parse the URL in the parse method. What I can't figure out is why those simple print statements won't print anything to stdout. They are silent. There doesn't seem to be a way to echo anything back to the console, and I am very curious about what I am missing here.
Both requests your spider makes receive 404 Not Found responses. By default, Scrapy ignores responses with such a status, so your callback doesn't get called.
In order to have your self.parse callback called for such responses, you have to add the 404 status code to the list of handled status codes using the handle_httpstatus_list meta key (more info here).
You could change your start_requests method so that the requests instruct Scrapy to handle even 404 responses:
import sys

import scrapy

class TutorialSpider(scrapy.Spider):
    name = "tutorial"

    def start_requests(self):
        urls = [
            'https://example.com/page/1',
            'https://example.com/page/2',
        ]
        for url in urls:
            print(f'{self.name} spider')
            print(f'url is {url}')
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={'handle_httpstatus_list': [404]},
            )

    def parse(self, response):
        print(response.url)
        self.log(response.url)
        sys.stdout.write('hello')
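If you want the same behaviour for every request the spider makes, you could instead set it once as a spider attribute (or use the HTTPERROR_ALLOWED_CODES setting); a minimal sketch:

import scrapy

class TutorialSpider(scrapy.Spider):
    name = "tutorial"
    # Every request from this spider will pass 404 responses through to its callback
    handle_httpstatus_list = [404]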
I often use Scrapy to check long lists of links for whether they're available or dead.
My problem is that when a link is incorrectly formatted, for example it doesn't start with http:// or https://, the crawler crashes:
ValueError: Missing scheme in request url: http.www.gobiernoenlinea.gob.ve/noticias/viewNewsUser01.jsp?applet=1&id_noticia=41492
I read the list of links from a pandas Series and check each of them. When the response is reachable I log it as "ok", otherwise as "dead".
import scrapy
import pandas as pd
from link_checker.items import LinkCheckerItem

class Checker(scrapy.Spider):
    name = "link_checker"

    def get_links(self):
        df = pd.read_csv(r"final_07Sep2018.csv")
        return df["Value"]

    def start_requests(self):
        urls = self.get_links()
        for url in urls.iteritems():
            index = {"index": url[0]}
            yield scrapy.Request(url=url[1], callback=self.get_response, errback=self.errback_httpbin, meta=index, dont_filter=True)

    def get_response(self, response):
        url = response.url
        yield LinkCheckerItem(index=response.meta["index"], url=url, code="ok")

    def errback_httpbin(self, failure):
        yield LinkCheckerItem(index=failure.request.meta["index"], url=failure.request.url, code="dead")
I am still interested in spotting those incorrectly formatted urls. How can I validate them and yield "dead" for those as well?
Just check whether the link starts with http: or https:. If not, prepend http:// manually:
if not LINK.startswith('http:') and not LINK.startswith('https:'):
LINK = "http://" + LINK
I'm trying to crawl the forum category of craigslist.org (https://forums.craigslist.org/).
My spider:
import scrapy
from scrapy import Request

class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["forums.craigslist.org"]
    start_urls = ['http://geo.craigslist.org/iso/us/']

    def error_handler(self, failure):
        print failure

    def parse(self, response):
        yield Request('https://forums.craigslist.org/',
                      self.getForumPage,
                      dont_filter=True,
                      errback=self.error_handler)

    def getForumPage(self, response):
        print "forum page"
I have this message by the error callback:
[Failure instance: Traceback: :
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:455:callback
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:563:_startRunCallbacks
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649:_runCallbacks
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1316:gotResult
--- ---
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1258:_inlineCallbacks
/usr/local/lib/python2.7/site-packages/twisted/python/failure.py:389:throwExceptionIntoGenerator
/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py:37:process_request
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649:_runCallbacks
/usr/local/lib/python2.7/site-packages/scrapy/downloadermiddlewares/robotstxt.py:46:process_request_2
]
But I have this problem only with the forum section of Craigslist. It might be because the forum section is served over HTTPS, unlike the rest of the website.
So it seems impossible to get a response...
Any ideas?
I'm posting a solution that I found to get around the problem.
I used the urllib2 library. Look:
import urllib2

import scrapy
from scrapy.http import HtmlResponse

class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["forums.craigslist.org"]
    start_urls = ['http://geo.craigslist.org/iso/us/']

    def error_handler(self, failure):
        print failure

    def parse(self, response):
        # Get a valid request with urllib2
        req = urllib2.Request('https://forums.craigslist.org/')
        # Get the content of this request
        pageContent = urllib2.urlopen(req).read()
        # Parse the content into an HtmlResponse compatible with Scrapy
        response = HtmlResponse(url=response.url, body=pageContent)
        print response.css(".forumlistcolumns li").extract()
With this solution, you can wrap the fetched content in a valid Scrapy response and parse it normally.
There is probably a better method, but this one is functional.
I think you are dealing with robots.txt. Try running your spider with
custom_settings = {
    "ROBOTSTXT_OBEY": False
}
You can also test it using command line settings: scrapy crawl craigslist -s ROBOTSTXT_OBEY=False.
While crawling through a site with scrapy, I get redirected to a user-blocked page about 1/5th of the time. I lose the pages that I get redirected from when that happens. I don't know which middleware to use or what settings to use in that middleware, but I want this:
DEBUG: Redirecting (302) to (GET http://domain.com/foo.aspx) from (GET http://domain.com/bar.htm)
To NOT drop bar.htm. I end up with no data from bar.htm when the scraper's done, but I'm rotating proxies, so if it tries bar.htm again (maybe a few more times), I should get it. How do I set the number of tries for that?
If it matters, I'm only allowing the crawler to use a very specific starting URL and then only follow "next page" links, so it should go in order through a small number of pages; that's why I need it either to retry, e.g., page 34, or come back to it later. The Scrapy documentation says it should retry 20 times by default, but I don't see it retrying at all. Also, if it helps: all redirects go to the same page (a "go away" page, the foo.aspx above). Is there a way to tell Scrapy that that particular page "doesn't count" and, if it's getting redirected there, to keep retrying? I saw something in the downloader middleware referring to particular HTTP codes in a list; can I add 302 to the "always keep trying this" list somehow?
I had the same problem today with a website that used 301..303 redirects, but also sometimes a meta redirect. I've built a retry middleware and used some chunks from the redirect middlewares:
from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_meta_refresh
from scrapy import log

class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        url = response.url
        if response.status in [301, 307]:
            log.msg("trying to redirect us: %s" % url, level=log.INFO)
            reason = 'redirect %d' % response.status
            return self._retry(request, reason, spider) or response

        interval, redirect_url = get_meta_refresh(response)
        # handle meta redirect
        if redirect_url:
            log.msg("trying to redirect us: %s" % url, level=log.INFO)
            reason = 'meta'
            return self._retry(request, reason, spider) or response

        hxs = HtmlXPathSelector(response)
        # test for captcha page
        captcha = hxs.select(".//input[contains(@id, 'captchacharacters')]").extract()
        if captcha:
            log.msg("captcha page %s" % url, level=log.INFO)
            reason = 'captcha'
            return self._retry(request, reason, spider) or response

        return response
In order to use this middleware, it's probably best to disable the existing redirect middlewares for this project in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'YOUR_PROJECT.scraper.middlewares.CustomRetryMiddleware': 120,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': None,
}
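On your question of how many tries you get: CustomRetryMiddleware inherits its retry budget from the built-in RetryMiddleware, which reads it from the settings, so you can raise the limit in settings.py as well. The values below are only examples:

# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 10  # number of retries on top of the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]  # response statuses that will be retried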
You can handle 302 responses by adding handle_httpstatus_list = [302] at the beginning of your spider like so:
class MySpider(CrawlSpider):
    handle_httpstatus_list = [302]

    def parse(self, response):
        if response.status == 302:
            # Store response.url somewhere and go back to it later
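For completeness, a rough sketch of what that comment could look like in practice: re-queue the blocked URL with dont_filter=True and cap the attempts with a counter in request.meta. The block_retries meta key and the limit of 5 are made up for illustration:

from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = "retry_on_block"
    handle_httpstatus_list = [302]
    max_block_retries = 5  # illustrative cap on how often we re-queue a blocked page

    def parse(self, response):
        if response.status == 302:
            # We were bounced to the "go away" page; re-queue the original URL
            retries = response.meta.get('block_retries', 0)
            if retries < self.max_block_retries:
                yield response.request.replace(
                    dont_filter=True,  # bypass the duplicate filter so the retry isn't dropped
                    meta={**response.meta, 'block_retries': retries + 1},
                )
            return
        # ...normal parsing of a successfully fetched page goes here...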