Crawling of relative URLs with Scrapy [Python]

I'm an SEO specialist, not really into coding, but I want to try to build a broken-link checker in Python with the Scrapy module, one that will crawl my website and show me all internal links that return a 404 code.
So far I have managed to write this code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from crawler.items import Broken


class Spider(CrawlSpider):
    name = 'example'
    # let 404 responses reach the callback instead of being filtered out
    handle_httpstatus_list = [404]
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com']
    rules = [Rule(LinkExtractor(), callback='parse_info', follow=True)]

    def parse_info(self, response):
        report = [404]
        if response.status in report:
            Broken_URLs = Broken()
            # Broken_URLs['title'] = response.xpath('/html/head/title').get()
            Broken_URLs['referer'] = response.request.headers.get('Referer', None)
            Broken_URLs['status_code'] = response.status
            Broken_URLs['url'] = response.url
            Broken_URLs['anchor'] = response.meta.get('link_text')
            return Broken_URLs
It crawls fine as long as the site structure uses absolute URLs. But there are cases where the crawler comes across relative URLs and ends up with links like this:
Normally it should be:
https://www.example.com/en/...
But it gives me:
https://www.example.com/en/en/... (a doubled language folder, which ends up returning a 404 code).
I'm trying to find a way to get rid of this duplicated language folder so the final URLs have the correct structure.
Does somebody know how to fix it? I'd much appreciate it!

Scrapy uses urllib.parse.urljoin to resolve relative URLs against the URL of the page they were found on.
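That is most likely where the doubled folder comes from: if a page under /en/ contains an href written as en/... (relative, without a leading slash), urljoin resolves it against the current directory and the language segment gets repeated. A quick illustration (the exact paths are just assumptions for the example):

from urllib.parse import urljoin

# a relative href like "en/page" on a page under /en/ is resolved
# against the current directory, doubling the language folder
print(urljoin("https://www.example.com/en/", "en/page"))
# -> https://www.example.com/en/en/page

# an absolute path ("/en/page") resolves correctly
print(urljoin("https://www.example.com/en/", "/en/page"))
# -> https://www.example.com/en/page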
You can fix it by passing a custom function as process_request in the Rule definition:
def fix_urls():
    def process_request(request, response):
        # rewrite any doubled /en/en/ segment back to /en/
        return request.replace(url=request.url.replace("/en/en/", "/en/"))
    return process_request


class Spider(CrawlSpider):
    name = 'example'
    ...
    rules = [Rule(LinkExtractor(), process_request=fix_urls(), callback='parse_info', follow=True)]
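One version note, as an aside: newer Scrapy releases pass both the request and the originating response to the process_request callable, which is the two-argument signature used above; if you are on a much older release and get a TypeError, it probably only passes the request, so drop the second parameter. Once this is in place you can run the crawl and export the collected items, for example with scrapy crawl example -o broken_links.csv (assuming the usual project layout with the Broken item defined in crawler/items.py).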

Related

Python: why is my scrapy crawlspider not printing or doing anything?

I'm new to Scrapy and can't get it to do anything. Eventually I want to scrape all the HTML comments from a website by following internal links.
For now I'm just trying to scrape the internal links and add them to a list.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class comment_spider(CrawlSpider):
    name = 'test'
    allowed_domains = ['https://www.andnowuknow.com/']
    start_urls = ["https://www.andnowuknow.com/"]
    rules = (Rule(LinkExtractor(), callback='parse_start_url', follow=True),)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        urls = []
        for link in LinkExtractor(allow=(),).extract_links(response):
            urls.append(link)
        print(urls)
I'm just trying to get it to print something at this point, and nothing I've tried so far works.
It finishes with an exit code of 0 but doesn't print anything, so I can't tell what's happening.
What am I missing?
Your log messages should give us some hints, but I can see that your allowed_domains contains a URL instead of a domain. You should set it like this:
allowed_domains = ["andnowuknow.com"]
(See it in the official documentation)
Hope it helps.
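For reference, here is a minimal sketch of the spider with only that line changed (everything else is the code from the question). With a bare domain, the off-site filter can match the site's hostname correctly; with a full URL in allowed_domains, the extracted links can end up being filtered as off-site, depending on the Scrapy version:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class comment_spider(CrawlSpider):
    name = 'test'
    allowed_domains = ['andnowuknow.com']  # a domain, not a full URL
    start_urls = ["https://www.andnowuknow.com/"]
    rules = (Rule(LinkExtractor(), callback='parse_start_url', follow=True),)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        urls = []
        for link in LinkExtractor(allow=()).extract_links(response):
            urls.append(link)
        print(urls)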

How to use Scrapy sitemap spider on sites with text sitemaps?

I tried using a generic scrapy.Spider to follow links, but it didn't work, so I hit upon the idea of simplifying the process by accessing the sitemap.txt instead, but that didn't work either!
I wrote a simple example (to help me understand the algorithm) of a spider to follow the sitemap specified on my site: https://legion-216909.appspot.com/sitemap.txt. It is meant to navigate the URLs specified in the sitemap, print them to the screen and output the results into a links.txt file. The code:
import scrapy
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    name = "spyder_PAGE"
    sitemap_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        print(response.url)
        return response.url
I ran the above spider with scrapy crawl spyder_PAGE > links.txt, but that returned an empty text file. I have gone through the Scrapy docs multiple times, but something is missing. Where am I going wrong?
SitemapSpider expects an XML sitemap format, so it ignores your text sitemap and logs this warning:
[scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://legion-216909.appspot.com/sitemap.txt>
Since your sitemap.txt file is just a plain list of URLs, it would be easier to split it with a string method.
For example:
from scrapy import Spider, Request


class MySpider(Spider):
    name = "spyder_PAGE"
    start_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        links = response.text.split('\n')
        for link in links:
            # yield a request to get this link
            print(link)

# https://legion-216909.appspot.com/index.html
# https://legion-216909.appspot.com/content.htm
# https://legion-216909.appspot.com/Dataset/module_4_literature/Unit_1/.DS_Store
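If you also want to visit each of those URLs rather than just print them, a sketch along these lines should work (parse_page is a made-up callback name, and blank lines from the split are skipped):

from scrapy import Spider, Request


class MySpider(Spider):
    name = "spyder_PAGE"
    start_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        # one URL per line; strip whitespace and skip empty lines
        for link in response.text.splitlines():
            link = link.strip()
            if link:
                yield Request(link, callback=self.parse_page)

    def parse_page(self, response):
        # hypothetical callback: do whatever you need with each page
        print(response.url)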
Alternatively, you only need to override _parse_sitemap(self, response) from SitemapSpider with the following:
from scrapy import Request
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = [...]
    sitemap_rules = [...]

    def _parse_sitemap(self, response):
        # yield a request for each url in the txt file that matches your filters
        urls = response.text.splitlines()
        it = self.sitemap_filter(urls)
        for loc in it:
            for r, c in self._cbs:
                if r.search(loc):
                    yield Request(loc, callback=c)
                    break
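Two caveats worth knowing: sitemap_filter is an overridable hook that only exists in newer Scrapy releases, and _cbs is an internal attribute that SitemapSpider builds from sitemap_rules, so this approach leans on SitemapSpider internals. For the site in the question you would fill in sitemap_urls = ['https://legion-216909.appspot.com/sitemap.txt'] and something like sitemap_rules = [('', 'parse_page')] to send every URL to a single callback (parse_page being just an illustrative name).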

Scraping a domain for links recursively using Scrapy

Here is the code I'm using for scraping all the URLs of a domain:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class UrlsSpider(scrapy.Spider):
    name = 'urlsspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']
    rules = (Rule(LxmlLinkExtractor(allow=(), unique=True), callback='parse', follow=True))

    def parse(self, response):
        for link in LxmlLinkExtractor(allow_domains=self.allowed_domains, unique=True).extract_links(response):
            print(link.url)
            yield scrapy.Request(link.url, callback=self.parse)
As you can see, I've used unique=True, but it still prints duplicate URLs in the terminal, whereas I want only the unique URLs and no duplicates.
Any help on this matter will be much appreciated.
Since the code follows the extracted URLs recursively, you will see duplicates coming from the parsing of other pages: you essentially create a new instance of LxmlLinkExtractor() for every response, and unique=True only deduplicates links within a single response. Scrapy's built-in duplicate filter keeps the repeated requests from being crawled again, but that happens after your print call runs, so a URL that appears on several pages gets printed once per page.
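One way to print each URL only once is to track what has already been seen at the spider level; a minimal sketch keeping your structure (the seen_urls attribute name is made up for the example):

import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class UrlsSpider(scrapy.Spider):
    name = 'urlsspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen_urls = set()  # URLs already printed

    def parse(self, response):
        extractor = LxmlLinkExtractor(allow_domains=self.allowed_domains, unique=True)
        for link in extractor.extract_links(response):
            if link.url not in self.seen_urls:
                self.seen_urls.add(link.url)
                print(link.url)
                yield scrapy.Request(link.url, callback=self.parse)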

Python Scrapy 301 redirects

I have a little problem printing the redirected URLs (the new URLs after a 301 redirect) when scraping a given website. My idea is to only print them, not scrape them. My current piece of code is:
import scrapy
import os
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'rust'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # if response.status == 301:
        print(response.url)
However, this does not print the redirected URLs. Any help will be appreciated.
Thank you.
To parse any responses that are not 200 you'd need to do one of these things:
Project-wide
You can set HTTPERROR_ALLOWED_CODES = [301, 302, ...] in your settings.py file. Or, if you want to enable it for all codes, you can set HTTPERROR_ALLOW_ALL = True instead.
Spider-wide
Add a handle_httpstatus_list attribute to your spider. In your case something like:

class MySpider(scrapy.Spider):
    handle_httpstatus_list = [301]

(There is no spider-level handle_httpstatus_all attribute; to allow every status code, use the HTTPERROR_ALLOW_ALL setting above or the handle_httpstatus_all meta key below.)
Request-wide
You can set the handle_httpstatus_list = [301, 302, ...] meta key on your requests, or handle_httpstatus_all = True to allow everything:

scrapy.Request('http://url.com', meta={'handle_httpstatus_list': [301]})

To learn more, see HttpErrorMiddleware.
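Applied to the question, here is a minimal sketch of the spider-wide approach that prints where each 301 points (this assumes the crawled pages really do answer with 301; the redirect target of a 301 response is in its Location header, and note that handling 301s yourself means Scrapy will not follow those redirects for you):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'rust'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']
    handle_httpstatus_list = [301]  # let 301 responses reach the callback

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        if response.status == 301:
            # Location holds the redirect target (as bytes)
            print(response.url, '->', response.headers.get('Location'))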

Using python scrapy to extract links from a webpage

I am a beginner with Python, using Scrapy to extract links from the following webpage:
http://www.basketball-reference.com/leagues/NBA_2015_games.html.
The code that I have written is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from basketball.items import BasketballItem


class BasketballSpider(CrawlSpider):
    name = 'basketball'
    allowed_domains = ['basketball-reference.com/']
    start_urls = ['http://www.basketball-reference.com/leagues/NBA_2015_games.html']
    rules = [Rule(LinkExtractor(allow=['http://www.basketball-reference.com/boxscores/^\w+$']), 'parse_item')]

    def parse_item(self, response):
        item = BasketballItem()
        item['url'] = response.url
        return item
I run this code from the command prompt, but the file it creates does not have any links. Could someone please help?
It cannot find the links; fix your regular expression in the rule:

rules = [
    Rule(LinkExtractor(allow='boxscores/\w+'))
]
Also, allow can be set as a plain string instead of a list. Do keep the callback, though: if a rule has no callback, the matched links are only followed, not passed to parse_item.

rules = [
    Rule(LinkExtractor(allow='boxscores/\w+'), callback='parse_item')
]
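Putting it together, a sketch of the corrected spider, keeping your item, with the imports moved to the non-contrib paths used by current Scrapy versions and allowed_domains set to the bare domain (allowed_domains expects domains, not URLs or paths, as noted in an answer above):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from basketball.items import BasketballItem


class BasketballSpider(CrawlSpider):
    name = 'basketball'
    allowed_domains = ['basketball-reference.com']
    start_urls = ['http://www.basketball-reference.com/leagues/NBA_2015_games.html']
    rules = [Rule(LinkExtractor(allow=r'boxscores/\w+'), callback='parse_item')]

    def parse_item(self, response):
        item = BasketballItem()
        item['url'] = response.url
        return item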
