I use Scrapy very often to check whether the links in long lists are alive or dead.
My problem is that when a link is incorrectly formatted, for example it doesn't start with http:// or https://, the crawler crashes:
ValueError: Missing scheme in request url: http.www.gobiernoenlinea.gob.ve/noticias/viewNewsUser01.jsp?applet=1&id_noticia=41492
I read the list of links from a pandas Series and check each of them. When the response is reachable I log it as "ok", otherwise as "dead".
import scrapy
import pandas as pd

from link_checker.items import LinkCheckerItem


class Checker(scrapy.Spider):
    name = "link_checker"

    def get_links(self):
        df = pd.read_csv(r"final_07Sep2018.csv")
        return df["Value"]

    def start_requests(self):
        urls = self.get_links()
        for url in urls.iteritems():
            index = {"index": url[0]}
            yield scrapy.Request(url=url[1], callback=self.get_response, errback=self.errback_httpbin, meta=index, dont_filter=True)

    def get_response(self, response):
        url = response.url
        yield LinkCheckerItem(index=response.meta["index"], url=url, code="ok")

    def errback_httpbin(self, failure):
        yield LinkCheckerItem(index=failure.request.meta["index"], url=failure.request.url, code="dead")
I am still interested in spotting those incorrectly formatted urls. How can I validate them and yield "dead" for those as well?
Just check whether the link starts with http: or https:.
If not, prepend http:// manually.
if not LINK.startswith('http:') and not LINK.startswith('https:'):
    LINK = "http://" + LINK
I'm using scrapy.Spider to scrape, and I want to make a Request inside the callback function reached from start_requests, but that request doesn't work: it should return a response, yet it only returns a Request.
I followed a debug breakpoint and found that in class Request(object_ref) the request only finishes its initialization; it never goes into request = next(slot.start_requests) as expected to start requesting, and thus only the Request is returned.
Here is my code in brief:
class ProjSpider(scrapy.Spider):
    name = 'Proj'
    allowed_domains = ['mashable.com']

    def start_requests(self):
        # pages
        pages = 10
        for i in range(1, pages):
            url = "https://mashable.com/channeldatafeed/Tech/new/page/" + str(i)
            yield scrapy.Request(url, callback=self.parse_mashable)
The Request works fine up to this point. And the following is:
def parse_mashable(self, response):
    item = Item()
    json2parse = response.text
    json_response = json.loads(json2parse)
    d = json_response['dataFeed']  # a list containing dicts, in which there is a url for the detailed article
    for data in d:
        item_url = data['url']  # the url for the detailed article
        item_response = self.get_response_mashable(item_url)
        # here I want to parse the item_response to get the detail
        item['content'] = item_response.xpath("//body").get
        yield item

def get_response_mashable(self, url):
    response = scrapy.Request(url)
    # using self.parser. I've also defined my own parser and yield an item
    # but the problem is it never got to the callback
    return response  # tried yield also but failed
This is where the Request doesn't work. The URL is within allowed_domains and it's not a duplicate URL. I'm guessing it's because of Scrapy's asynchronous Request mechanism, but how could that affect the request in self.parse_mashable, given that by then the Request from start_requests has already finished?
I managed to make the second request with the Python Requests-html library, but I still couldn't figure out why.
So could anyone point out where I'm going wrong? Thanks in advance!
Scrapy doesn't really expect you to do this the way you're trying to, so it doesn't have a simple way to do it.
What you should be doing instead is passing the data you've scraped from the original page to the new callback using the request's meta dict.
For details, check Passing additional data to callback functions.
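As a rough illustration of that pattern based on the question's code (parse_article is a hypothetical callback name): instead of calling a helper that returns a Request object, the callback yields a new Request and carries the partially built item along in meta, so Scrapy can schedule the second request and hand its response to the next callback.

    def parse_mashable(self, response):
        for data in json.loads(response.text)['dataFeed']:
            item = Item()
            item['url'] = data['url']
            # hand the partially filled item to the next callback via meta;
            # Scrapy downloads the page asynchronously and calls parse_article
            yield scrapy.Request(data['url'], callback=self.parse_article,
                                 meta={'item': item})

    def parse_article(self, response):
        item = response.meta['item']
        item['content'] = response.xpath('//body').get()
        yield item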
I've written a script in Scrapy to make proxied requests using proxies newly generated by the get_proxies() method. I use the requests module to fetch the proxies so I can reuse them in the script. What I'm trying to do is parse all the movie links from the site's landing page and then fetch the name of each movie from its target page. My script can rotate proxies.
I know there is an easier way to change proxies, as described here in HttpProxyMiddleware, but I would still like to stick to the approach I'm trying here.
This is my current attempt (it keeps using new proxies to fetch a valid response, but every time it gets 503 Service Unavailable):
import scrapy
import random
import requests

from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess


def get_proxies():
    response = requests.get("https://www.us-proxy.org/")
    soup = BeautifulSoup(response.text, "lxml")
    proxy = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tbody tr") if "yes" in item.text]
    return proxy


class ProxySpider(scrapy.Spider):
    name = "proxiedscript"
    handle_httpstatus_list = [503]
    proxy_vault = get_proxies()
    check_url = "https://yts.am/browse-movies"

    def start_requests(self):
        random.shuffle(self.proxy_vault)
        proxy_url = next(cycle(self.proxy_vault))
        request = scrapy.Request(self.check_url, callback=self.parse, dont_filter=True)
        request.meta['https_proxy'] = f'http://{proxy_url}'
        yield request

    def parse(self, response):
        print(response.meta)
        if "DDoS protection by Cloudflare" in response.css(".attribution > a::text").get():
            random.shuffle(self.proxy_vault)
            proxy_url = next(cycle(self.proxy_vault))
            request = scrapy.Request(self.check_url, callback=self.parse, dont_filter=True)
            request.meta['https_proxy'] = f'http://{proxy_url}'
            yield request
        else:
            for item in response.css(".browse-movie-wrap a.browse-movie-title::attr(href)").getall():
                nlink = response.urljoin(item)
                yield scrapy.Request(nlink, callback=self.parse_details)

    def parse_details(self, response):
        name = response.css("#movie-info h1::text").get()
        yield {"Name": name}


if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(ProxySpider)
    c.start()
To make sure the request is being proxied, I printed response.meta and got results like this: {'https_proxy': 'http://142.93.127.126:3128', 'download_timeout': 180.0, 'download_slot': 'yts.am', 'download_latency': 0.237013578414917, 'retry_times': 2, 'depth': 0}.
Because I've overused the link while checking how proxied requests work within Scrapy, I'm getting a 503 Service Unavailable error at the moment, and I can see the phrase DDoS protection by Cloudflare within the response. However, I get a valid response when I try with the requests module, applying the same logic I implemented here.
My earlier question: why can't I get a valid response when (I suppose) I'm using proxies the right way? [solved]
Bounty question: how can I define a try/except clause within my script so that it will try a different proxy once it throws a connection error with a certain proxy?
According to the scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware docs
(and source),
the proxy meta key is the one expected (not https_proxy):
#request.meta['https_proxy'] = f'http://{proxy_url}'
request.meta['proxy'] = f'http://{proxy_url}'
Since Scrapy never received a valid meta key, your Scrapy application didn't actually use the proxies.
The start_requests() function is just the entry point. On subsequent requests, you need to resupply this metadata to the Request object.
Also, errors can occur on two levels: the proxy and the target server.
We need to handle bad response codes from both the proxy and the target server. Proxy errors are returned by the middleware to the errback function. The target server's response can be handled during parsing via response.status.
import scrapy
import random
import requests

from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess


def get_proxies():
    response = requests.get("https://www.us-proxy.org/")
    soup = BeautifulSoup(response.text, "lxml")
    proxy = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in
             soup.select("table.table tbody tr") if "yes" in item.text]
    # proxy = ['https://52.0.0.1:8090', 'https://52.0.0.2:8090']
    return proxy


def get_random_proxy(proxy_vault):
    random.shuffle(proxy_vault)
    proxy_url = next(cycle(proxy_vault))
    return proxy_url


class ProxySpider(scrapy.Spider):
    name = "proxiedscript"
    handle_httpstatus_list = [503, 502, 401, 403]
    check_url = "https://yts.am/browse-movies"
    proxy_vault = get_proxies()

    def handle_middleware_errors(self, *args, **kwargs):
        # implement middleware error handling here
        print('Middleware Error')
        # retry the request with a different proxy
        yield self.make_request(url=args[0].request._url, callback=args[0].request._meta['callback'])

    def start_requests(self):
        yield self.make_request(url=self.check_url, callback=self.parse)

    def make_request(self, url, callback, dont_filter=True):
        return scrapy.Request(url,
                              meta={'proxy': f'https://{get_random_proxy(self.proxy_vault)}', 'callback': callback},
                              callback=callback,
                              dont_filter=dont_filter,
                              errback=self.handle_middleware_errors)

    def parse(self, response):
        print(response.meta)
        try:
            if response.status != 200:
                # implement server status code handling here - this loops forever
                print(f'Status code: {response.status}')
                raise
            else:
                for item in response.css(".browse-movie-wrap a.browse-movie-title::attr(href)").getall():
                    nlink = response.urljoin(item)
                    yield self.make_request(url=nlink, callback=self.parse_details)
        except:
            # if anything goes wrong fetching the lister page, try again
            yield self.make_request(url=self.check_url, callback=self.parse)

    def parse_details(self, response):
        print(response.meta)
        try:
            if response.status != 200:
                # implement server status code handling here - this loops forever
                print(f'Status code: {response.status}')
                raise
            name = response.css("#movie-info h1::text").get()
            yield {"Name": name}
        except:
            # if anything goes wrong fetching the detail page, try again
            yield self.make_request(url=response.request._url, callback=self.parse_details)


if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(ProxySpider)
    c.start()
I haven't written any Python code in over 10 years, so I'm trying to use Scrapy to assemble some information from a website:
import scrapy
import sys


class TutorialSpider(scrapy.Spider):
    name = "tutorial"

    def start_requests(self):
        urls = [
            'https://example.com/page/1',
            'https://example.com/page/2',
        ]
        for url in urls:
            print(f'{self.name} spider')
            print(f'url is {url}')
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.url)
        self.log(response.url)
        sys.stdout.write('hello')
I'm trying to parse the URL in the parse method. What I can't figure out is why those simple print statements don't print anything to stdout; they are silent. There doesn't seem to be any way to echo anything back to the console, and I'm very curious about what I'm missing here.
Both requests you're doing in your spider receive 404 Not found responses. By default, Scrapy ignores responses with such a status and your callback doesn't get called.
In order to have your self.parse callback called for such responses, you have to add the 404 status code to the list of handled status codes using the handle_httpstatus_list meta key (more info here).
You could change your start_requests method so that the requests instruct Scrapy to handle even 404 responses:
import scrapy
import sys


class TutorialSpider(scrapy.Spider):
    name = "tutorial"

    def start_requests(self):
        urls = [
            'https://example.com/page/1',
            'https://example.com/page/2',
        ]
        for url in urls:
            print(f'{self.name} spider')
            print(f'url is {url}')
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={'handle_httpstatus_list': [404]},
            )

    def parse(self, response):
        print(response.url)
        self.log(response.url)
        sys.stdout.write('hello')
I'm trying to crawl a site and check that none of the links pointing to pages within the site are down. As there is no sitemap available, I'm using Scrapy to crawl the site and get all the links on every page, but I can't get it to output a file with all the links found and their status codes. The site I'm using to test the code is quotes.toscrape.com, and my code is:
from scrapy.spiders import Spider
from mytest.items import MytestItem
from scrapy.http import Request
import re


class MySpider(Spider):
    name = "sample"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        # We stored already crawled links in this list
        crawledLinks = []
        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            if link not in crawledLinks:
                link = "http://quotes.toscrape.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)
I've tried adding the following lines after yield:
item = MytestItem()
item['url'] = link
item['status'] = response.status
yield item
But it gets me a bunch of duplicates and no URLs with status 404 or 301. Does anyone know how I can get all the URLs along with their status?
Scrapy by default does not return any unsuccessful requests, but you can fetch them and handle them in one of your functions if you set errback on the request.
def parse(self, response):
    # some code
    yield Request(link, self.parse, errback=self.parse_error)

def parse_error(self, failure):
    # log the response as an error
    self.logger.error(repr(failure))

The parameter failure will contain more information on the exact reason for the failure, because it could be an HTTP error (where you can fetch a response), but also a DNS lookup error and such (where there is no response).
The documentation contains an example how to use failure to determine the error reason and access Response if available:
def errback_httpbin(self, failure):
    # log all failures
    self.logger.error(repr(failure))

    # in case you want to do something special for some errors,
    # you may need the failure's type:
    if failure.check(HttpError):
        # these exceptions come from HttpError spider middleware
        # you can get the non-200 response
        response = failure.value.response
        self.logger.error('HttpError on %s', response.url)

    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)

    elif failure.check(TimeoutError, TCPTimedOutError):
        request = failure.request
        self.logger.error('TimeoutError on %s', request.url)
You should use HTTPERROR_ALLOW_ALL in your settings or set the meta key handle_httpstatus_all = True in all your requests; please refer to the docs for more information.
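For illustration, a minimal sketch of both options (the spider, link variable, and parse_error callback are assumed to be the ones from the earlier snippets):

    # Option 1: project-wide, in settings.py
    HTTPERROR_ALLOW_ALL = True

    # Option 2: per request, via the meta key
    yield Request(link, self.parse, errback=self.parse_error,
                  meta={'handle_httpstatus_all': True})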
I am using sitemap spider in scrapy, python.
The sitemap seems to have an unusual format, with '//' in front of the URLs:
<url>
    <loc>//www.example.com/10/20-baby-names</loc>
</url>
<url>
    <loc>//www.example.com/elizabeth/christmas</loc>
</url>
myspider.py

from scrapy.contrib.spiders import SitemapSpider
from myspider.items import *


class MySpider(SitemapSpider):
    name = "myspider"
    sitemap_urls = ["http://www.example.com/robots.txt"]

    def parse(self, response):
        item = PostItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()
        return item
I am getting this error:
raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: //www.example.com/10/20-baby-names
How can I manually parse the url using sitemap spider?
If I see it correctly, you could (as a quick solution) override the default implementation of _parse_sitemap in SitemapSpider. It's not nice, because you will have to copy a lot of code, but it should work.
You'll have to add a method to generate a URL with a scheme.
"""if the URL starts with // take the current website scheme and make an absolute
URL with the same scheme"""
def _fix_url_bug(url, current_url):
if url.startswith('//'):
':'.join((urlparse.urlsplit(current_url).scheme, url))
else:
yield url
def _parse_sitemap(self, response):
    if response.url.endswith('/robots.txt'):
        for url in sitemap_urls_from_robots(response.body):
            yield Request(url, callback=self._parse_sitemap)
    else:
        body = self._get_sitemap_body(response)
        if body is None:
            log.msg(format="Ignoring invalid sitemap: %(response)s",
                    level=log.WARNING, spider=self, response=response)
            return

        s = Sitemap(body)
        if s.type == 'sitemapindex':
            for loc in iterloc(s):
                # added it before the follow-test, to allow the test to return true
                # if it includes the scheme (yet do not know if this is the better solution)
                loc = _fix_url_bug(loc, response.url)
                if any(x.search(loc) for x in self._follow):
                    yield Request(loc, callback=self._parse_sitemap)
        elif s.type == 'urlset':
            for loc in iterloc(s):
                loc = _fix_url_bug(loc, response.url)  # same here
                for r, c in self._cbs:
                    if r.search(loc):
                        yield Request(loc, callback=c)
                        break
This is just a general idea and untested, so it might not work at all, or there could be syntax errors. Please respond via comments so I can improve my answer.
The sitemap you are trying to parse seems to be wrong. A missing scheme is perfectly fine per the RFC, but sitemaps require URLs to begin with a scheme.
I think the nicest and cleanest solution would be to add a downloader middleware which changes the malformed URLs without the spider noticing.
import re
import urlparse

from scrapy.http import XmlResponse
from scrapy.utils.gz import gunzip, is_gzipped
from scrapy.contrib.spiders import SitemapSpider


# downloader middleware
class SitemapWithoutSchemeMiddleware(object):
    def process_response(self, request, response, spider):
        if isinstance(spider, SitemapSpider):
            body = self._get_sitemap_body(response)
            if body:
                scheme = urlparse.urlsplit(response.url).scheme
                body = re.sub(r'<loc>\/\/(.+)<\/loc>', r'<loc>%s://\1</loc>' % scheme, body)
                return response.replace(body=body)
        return response

    # this is from scrapy's Sitemap class, but the sitemap module is
    # only for internal use and its api can change without notice
    def _get_sitemap_body(self, response):
        """Return the sitemap body contained in the given response, or None if the
        response is not a sitemap."""
        if isinstance(response, XmlResponse):
            return response.body
        elif is_gzipped(response):
            return gunzip(response.body)
        elif response.url.endswith('.xml'):
            return response.body
        elif response.url.endswith('.xml.gz'):
            return gunzip(response.body)
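For the middleware to actually run, it also needs to be enabled in the project settings. A minimal sketch; the module path and priority value are assumptions that depend on your project layout:

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.SitemapWithoutSchemeMiddleware': 543,
    }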
I used the trick by @alecxe to parse the URLs within the spider. I made it work, but I'm not sure it's the best way to do it.
from urlparse import urlparse
import re

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.utils.response import body_or_str
from example.items import *


class ExampleSpider(BaseSpider):
    name = "example"
    start_urls = ["http://www.example.com/sitemap.xml"]

    def parse(self, response):
        nodename = 'loc'
        text = body_or_str(response)
        r = re.compile(r"(<%s[\s>])(.*?)(</%s>)" % (nodename, nodename), re.DOTALL)
        for match in r.finditer(text):
            url = match.group(2)
            if url.startswith('//'):
                url = 'http:' + url
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # print response.url
        item = PostItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()
        return item