I tried using a generic scrapy.Spider to follow links, but it didn't work, so I hit upon the idea of simplifying the process by accessing the sitemap.txt instead - but that didn't work either!
I wrote a simple example (to help me understand the algorithm) of a spider to follow the sitemap specified on my site: https://legion-216909.appspot.com/sitemap.txt It is meant to navigate the URLs listed in the sitemap, print them to the screen and output the results into a links.txt file. The code:
import scrapy
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "spyder_PAGE"
    sitemap_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        print(response.url)
        return response.url
I ran the above spider with scrapy crawl spyder_PAGE > links.txt, but that returned an empty text file. I have gone through the Scrapy docs multiple times, but something is missing. Where am I going wrong?
SitemapSpider expects an XML sitemap format, so the spider ignores the file and logs this warning:
[scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://legion-216909.appspot.com/sitemap.txt>
Since your sitemap.txt file is just a plain list of URLs, it would be easier to simply split it with a string method.
For example:
from scrapy import Spider, Request

class MySpider(Spider):
    name = "spyder_PAGE"
    start_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        links = response.text.split('\n')
        for link in links:
            # yield a request to get this link
            print(link)

# https://legion-216909.appspot.com/index.html
# https://legion-216909.appspot.com/content.htm
# https://legion-216909.appspot.com/Dataset/module_4_literature/Unit_1/.DS_Store
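If you actually want to crawl each of those URLs rather than just print them, the loop can yield a Request instead; here is a rough sketch (parse_page is a hypothetical callback, not part of the original code):

from scrapy import Spider, Request

class MySpider(Spider):
    name = "spyder_PAGE"
    start_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        for link in response.text.splitlines():
            link = link.strip()
            if link:
                # follow every URL listed in the plain-text sitemap
                yield Request(link, callback=self.parse_page)

    def parse_page(self, response):
        # hypothetical callback: record the URL that was visited
        yield {'url': response.url}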
You only need to override _parse_sitemap(self, response) from SitemapSpider with the following:
from scrapy import Request
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = [...]
    sitemap_rules = [...]

    def _parse_sitemap(self, response):
        # yield a request for each url in the txt file that matches your filters
        urls = response.text.splitlines()
        it = self.sitemap_filter(urls)
        for loc in it:
            for r, c in self._cbs:
                if r.search(loc):
                    yield Request(loc, callback=c)
                    break
I'm an SEO specialist, not really into coding, but I want to try to create a broken-links checker in Python with the Scrapy module, which will crawl my website and show me all internal links with a 404 code.
So far I have managed to write this code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from crawler.items import Broken

class Spider(CrawlSpider):
    name = 'example'
    handle_httpstatus_list = [404]
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com']

    rules = [Rule(LinkExtractor(), callback='parse_info', follow=True)]

    def parse_info(self, response):
        report = [404]
        if response.status in report:
            Broken_URLs = Broken()
            #Broken_URLs['title'] = response.xpath('/html/head/title').get()
            Broken_URLs['referer'] = response.request.headers.get('Referer', None)
            Broken_URLs['status_code'] = response.status
            Broken_URLs['url'] = response.url
            Broken_URLs['anchor'] = response.meta.get('link_text')
            return Broken_URLs
It's crawling well, as long as we have absolute URLs in the site structure.
But there are some cases where the crawler comes across relative URLs and ends up with links like this:
Normally it should be:
https://www.example.com/en/...
But it gives me:
https://www.example.com/en/en/... - a doubled language folder, which ends up with a 404 code.
I'm trying to find a way to override this language duplication so the structure comes out correct.
Does somebody know how to fix it? I would much appreciate it!
Scrapy uses urllib.parse.urljoin for working with relative URLs.
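A quick illustration of why the duplication happens (example URLs only, not taken from the site in question):

from urllib.parse import urljoin

# A relative href like "en/other.html" is resolved against the directory of the
# current page, which is where the duplicated /en/en/ segment comes from:
print(urljoin("https://www.example.com/en/page.html", "en/other.html"))
# https://www.example.com/en/en/other.html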
You can fix it by adding a custom function as process_request in the Rule definition:
def fix_urls():
    def process_request(request, response):
        return request.replace(url=request.url.replace("/en/en/", "/en/"))
    return process_request

class Spider(CrawlSpider):
    name = 'example'
    ...

    rules = [Rule(LinkExtractor(), process_request=fix_urls(), callback='parse_info', follow=True)]
I have made a spider using Scrapy and I am trying to save the download links into a (Python) list, so I can later call a list entry using downloadlist[1].
But Scrapy saves the URLs as items instead of as a list. Is there a way to append each URL to a list?
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
import scrapy
from scrapy.linkextractors import LinkExtractor

DOMAIN = 'some-domain.com'
URL = 'http://' + str(DOMAIN)
linklist = []

class subtitles(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    # First parse returns all the links of the website and feeds them to parse2
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            yield Request(url, callback=self.parse2)

    # Second parse selects only the links that contain download
    def parse2(self, response):
        le = LinkExtractor(allow=("download"))
        for link in le.extract_links(response):
            yield Request(url=link.url, callback=self.parse2)
            print link.url

# prints list of urls, 'downloadlist' should be a list but isn't.
downloadlist = subtitles()
print downloadlist
You are misunderstanding how classes work; you are calling a class here, not a function.
Think about it this way: the spider that you define in class MySpider(Spider) is a template that is used by the Scrapy engine. When you run scrapy crawl myspider, Scrapy starts up an engine and reads your template to create an object that will be used to process the various responses.
So your idea here can be simply translated to:
def parse2(self, response):
    le = LinkExtractor(allow=("download"))
    for link in le.extract_links(response):
        yield {'url': link.url}
If you call this with scrapy crawl myspider -o items.json you'll get all of the download links in JSON format.
There is no reason to save the download links to a list inside the spider: once the crawl finishes, the spider object created from the template (class) you wrote is gone, so such a list would essentially serve no purpose.
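If you still want a plain Python list afterwards, one option (a sketch, run outside the spider after the crawl, assuming the items.json feed produced above) is:

import json

# Load the feed export produced by: scrapy crawl myspider -o items.json
with open('items.json') as f:
    items = json.load(f)

downloadlist = [item['url'] for item in items]
print(downloadlist[1])  # second download link, as in the original question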
I'm using the latest version of Scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make it crawl only the URL(s) fed to it as part of the start_urls list. In most cases I want to crawl only one page, but in some cases there may be multiple pages that I will specify. I don't want it to crawl to other pages.
I've tried setting the depth limit to 1, but I'm not sure that, in testing, it accomplished what I was hoping to achieve.
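For reference, this is roughly what I mean by setting the depth to 1 (assuming DEPTH_LIMIT is the relevant setting; the spider below is just a placeholder, not my real code):

import scrapy

class SingleLevelSpider(scrapy.Spider):
    # placeholder spider: restrict the crawl via Scrapy's DEPTH_LIMIT setting
    name = "single_level"
    start_urls = ["https://www.example.com"]
    custom_settings = {"DEPTH_LIMIT": 1}  # only follow links one hop away from start_urls

    def parse(self, response):
        yield {"url": response.url}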
Any help will be greatly appreciated!
Thank you!
2015-12-22 - Code update:
# -*- coding: utf-8 -*-
import scrapy
from generic.items import GenericItem

class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = GenericItem()
            item['entity_id'] = self.entity_id
            # gets the actual email address
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
            yield item
Below, in the first response, you mention using a generic spider. Isn't that what I'm doing in the code? Also, are you suggesting I remove the
callback=self.parse_dir_contents
from the parse function?
Thank you.
Looks like you are using CrawlSpider, which is a special kind of Spider for crawling multiple categories inside pages.
To crawl only the URLs specified inside start_urls, just override the parse method, as that is the default callback of the start requests.
Below is code for a spider that will scrape the title from a blog (note: the XPath might not be the same for every blog).
Filename: /spiders/my_spider.py
import scrapy

class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["www.blogtrepreneur.com"]
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]

    def parse(self, response):
        # DmozItem is assumed to be importable from the project's items module
        dive = response.xpath('//div[@id="tve_editor"]')
        items = []
        item = DmozItem()
        item["title"] = response.xpath('//h1/text()').extract()
        item["article"] = response.xpath('//div[@id="tve_editor"]//p//text()').extract()
        items.append(item)
        return items
The above code will only fetch the title and the article body of the given article.
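The snippet assumes an item class roughly like the following (hypothetical items.py; the original answer does not show it):

import scrapy

class DmozItem(scrapy.Item):
    # fields used by the spider above
    title = scrapy.Field()
    article = scrapy.Field()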
I got the same problem, because I was using:
import scrapy
from scrapy.spiders import CrawlSpider
Then I changed it to:
import scrapy
from scrapy.spiders import Spider
And changed the class to:
class mySpider(Spider):
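Putting those pieces together, a minimal version of that switch could look like this (the spider name, URL and XPath are placeholders, not from the original post):

import scrapy
from scrapy.spiders import Spider

class mySpider(Spider):
    # plain Spider: only the pages in start_urls are requested;
    # nothing is followed automatically the way CrawlSpider rules would
    name = "my_spider"
    start_urls = ["https://www.example.com"]

    def parse(self, response):
        yield {"title": response.xpath("//h1/text()").get()}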
I am building a Spider in Scrapy that follows all the links it can find, and sends the url to a pipeline. At the moment, this is my code:
from scrapy import Spider
from scrapy.http import Request
from scrapy.http import TextResponse
from scrapy.selector import Selector
from scrapyTest.items import TestItem
import urlparse

class TestSpider(Spider):
    name = 'TestSpider'
    allowed_domains = ['pyzaist.com']
    start_urls = ['http://pyzaist.com/drone']

    def parse(self, response):
        item = TestItem()
        item["url"] = response.url
        yield item
        links = response.xpath("//a/@href").extract()
        for link in links:
            yield Request(urlparse.urljoin(response.url, link))
This does the job, but throws an error whenever the response is just a Response, not a TextResponse or HtmlResponse. This is because there is no Response.xpath(). I tried to test for this by doing:
if type(response) is TextResponse:
    links = response.xpath("//a/@href").extract()
    ...
But to no avail. When I do that, it never enters the if statement. I am new to Python, so it might be a language thing. I appreciate any help.
Never mind, I found the answer. type() only gives information about the immediate type; it tells nothing about inheritance. I was looking for isinstance(). This code works:
if isinstance(response, TextResponse):
    links = response.xpath("//a/@href").extract()
    ...
https://stackoverflow.com/a/2225066/1455074, near the bottom
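A quick illustration of the difference, using a throwaway response object (HtmlResponse subclasses TextResponse in Scrapy):

from scrapy.http import HtmlResponse, TextResponse

response = HtmlResponse(url='http://pyzaist.com/drone', body=b'<a href="/x">x</a>')

print(type(response) is TextResponse)      # False - type() ignores inheritance
print(isinstance(response, TextResponse))  # True  - isinstance() accepts subclasses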
I am trying to recursively crawl URLs on a webpage and then parse those pages to get all the tags on each page. I tried crawling a single page using Scrapy without recursively following the URLs on the page, and it worked, but when I changed my code to make it crawl the entire site, it crawls the site but gives a really weird error at the end. The code for the spider and the error are given below; the code takes the list of domains to crawl as an argument in a file.
import scrapy
from tags.items import TagsItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class TagSpider(scrapy.Spider):
    name = "getTags"
    allowed_domains = []
    start_urls = []
    rules = (Rule(LinkExtractor(), callback='parse_tags', follow=True),)

    def __init__(self, filename=None):
        for line in open(filename, 'r').readlines():
            self.allowed_domains.append(line)
            self.start_urls.append('http://%s' % line)

    def parse_start_url(self, response):
        return self.parse_tags(response)

    def parse_tags(self, response):
        for sel in response.xpath('//*').re(r'</?\w+\s+[^>]*>'):
            item = TagsItem()
            item['tag'] = sel
            item['url'] = response.url
            print item
This is the error dump I am getting: