Crawling using Crawl spider class in Scrapy - python

I am trying to recursively crawl URL's on a webpage and then parse these pages to get all the tags on the page. I tried crawling a single page using scrapy without recursively going into URL's on the page and it worked, but when I tried to change my code to make it crawl the entire site it crawls the site but gives a really weird error at the end. The code for the spider and the error is given below, the code takes the list of domains to crawl as an argument in a file.
import scrapy
from tags.items import TagsItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
class TagSpider(scrapy.Spider):
name = "getTags"
allowed_domains = []
start_urls = []
rules = (Rule(LinkExtractor(), callback='parse_tags', follow=True),)
def __init__(self, filename=None):
for line in open(filename, 'r').readlines():
self.allowed_domains.append(line)
self.start_urls.append('http://%s' % line)
def parse_start_url(self,response):
return self.parse_tags(response)
def parse_tags(self, response):
for sel in response.xpath('//*').re(r'</?\w+\s+[^>]*>'):
item = TagsItem()
item['tag'] = sel
item['url'] = response.url
print item
This is the error dump I am getting:

Related

How to use Scrapy sitemap spider on sites with text sitemaps?

I tried using a generic Scrapy.spider to follow links, but it didn't work - so I hit upon the idea of simplifying the process by accessing the sitemap.txt instead, but that didn't work either!
I wrote a simple example (to help me understand the algorithm) of a spider to follow the sitemap specified on my site: https://legion-216909.appspot.com/sitemap.txt It is meant to navigate the URLs specified on the sitemap, print them out to screen and output the results into a links.txt file. The code:
import scrapy
from scrapy.spiders import SitemapSpider
class MySpider(SitemapSpider):
name = "spyder_PAGE"
sitemap_urls = ['https://legion-216909.appspot.com/sitemap.txt']
def parse(self, response):
print(response.url)
return response.url
I ran the above spider as Scrapy crawl spyder_PAGE > links.txt but that returned an empty text file. I have gone through the Scrapy docs multiple times, but there is something missing. Where am I going wrong?
SitemapSpider is expecting an XML sitemap format, causing the spider to exit with this error:
[scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://legion-216909.appspot.com/sitemap.txt>
Since your sitemap.txt file is just a simple list or URLs, it would be easier to just split them with a string method.
For example:
from scrapy import Spider, Request
class MySpider(Spider):
name = "spyder_PAGE"
start_urls = ['https://legion-216909.appspot.com/sitemap.txt']
def parse(self, response):
links = response.text.split('\n')
for link in links:
# yield a request to get this link
print(link)
# https://legion-216909.appspot.com/index.html
# https://legion-216909.appspot.com/content.htm
# https://legion-216909.appspot.com/Dataset/module_4_literature/Unit_1/.DS_Store
You only need to override _parse_sitemap(self, response) from SitemapSpider with the following:
from scrapy import Request
from scrapy.spiders import SitemapSpider
class MySpider(SitemapSpider):
sitemap_urls = [...]
sitemap_rules = [...]
def _parse_sitemap(self, response):
# yield a request for each url in the txt file that matches your filters
urls = response.text.splitlines()
it = self.sitemap_filter(urls)
for loc in it:
for r, c in self._cbs:
if r.search(loc):
yield Request(loc, callback=c)
break

parsing url links into a list

I have made a spider using scrapy and I am trying to save download links into a (python) list, so I can later call a list entry using downloadlist[1].
But scrapy saves the urls as items instead of as a list. Is there a way to append each url into a list?
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
import scrapy
from scrapy.linkextractors import LinkExtractor
DOMAIN = 'some-domain.com'
URL = 'http://' +str(DOMAIN)
linklist = []
class subtitles(scrapy.Spider):
name = DOMAIN
allowed_domains = [DOMAIN]
start_urls = [
URL
]
# First parse returns all the links of the website and feeds them to parse2
def parse(self, response):
hxs = HtmlXPathSelector(response)
for url in hxs.select('//a/#href').extract():
if not ( url.startswith('http://') or url.startswith('https://') ):
url= URL + url
yield Request(url, callback=self.parse2)
# Second parse selects only the links that contains download
def parse2(self, response):
le = LinkExtractor(allow=("download"))
for link in le.extract_links(response):
yield Request(url=link.url, callback=self.parse2)
print link.url
# prints list of urls, 'downloadlist' should be a list but isn't.
downloadlist = subtitles()
print downloadlist
You are misunderstanding how classes work, you are calling a class here not a function.
Think about it this way, your spider tht you define in class MySpider(Spider) is a template that is used by scrapy engine; when you start scrapy crawl myspider scrapy starts up an engine and reads your template to create an object that will be used to process various responses.
So your idea here can be simply translated to:
def parse2(self, response):
le = LinkExtractor(allow=("download"))
for link in le.extract_links(response):
yield {'url': link.urk}
If you call this with scrapy crawl myspider -o items.json you'll get all of the download links in json format.
There no reason to save downloads of to a list since it will be no longer of this spider template (class) that you wrote up and essentially it will have no purpose.

How to make Scrapy crawl only 1 page (make it non recursive)?

I'm using the latest version of scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make scrapy crawl only the URL(s) fed to it as part of start_url list. In most cases I want to crawl only 1 page, but in some cases there may be multiple pages that I will specify. I don't want it to crawl to other pages.
I've tried setting the depth level=1 but I'm not sure that in testing it accomplished what I was hoping to achieve.
Any help will be greatly appreciated!
Thank you!
2015-12-22 - Code update:
# -*- coding: utf-8 -*-
import scrapy
from generic.items import GenericItem
class GenericspiderSpider(scrapy.Spider):
name = "genericspider"
def __init__(self, domain, start_url, entity_id):
self.allowed_domains = [domain]
self.start_urls = [start_url]
self.entity_id = entity_id
def parse(self, response):
for href in response.css("a::attr('href')"):
url = response.urljoin(href.extract())
yield scrapy.Request(url, callback=self.parse_dir_contents)
def parse_dir_contents(self, response):
for sel in response.xpath("//body//a"):
item = GenericItem()
item['entity_id'] = self.entity_id
# gets the actual email address
item['emails'] = response.xpath("//a[starts-with(#href, 'mailto')]").re(r'mailto:\s*(.*?)"')
yield item
Below, in the first response, you mention using a generic spider --- isn't that what I'm doing in the code? Also are you suggesting I remove the
callback=self.parse_dir_contents
from the parse function?
Thank you.
looks like you are using CrawlSpider which is a special kind of Spider to crawl multiple categories inside pages.
For only crawling the urls specified inside start_urls just override the parse method, as that is the default callback of the start requests.
Below is a code for the spider that will scrape the title from a blog (Note: the xpath might not be the same for every blog)
Filename: /spiders/my_spider.py
class MySpider(scrapy.Spider):
name = "craig"
allowed_domains = ["www.blogtrepreneur.com"]
start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]
def parse(self, response):
hxs = HtmlXPathSelector(response)
dive = response.xpath('//div[#id="tve_editor"]')
items = []
item = DmozItem()
item["title"] = response.xpath('//h1/text()').extract()
item["article"] = response.xpath('//div[#id="tve_editor"]//p//text()').extract()
items.append(item)
return items
The above code will only fetch the title and the article body of the given article.
I got the same problem, because I was using
import scrapy from scrapy.spiders import CrawlSpider
Then I changed to
import scrapy from scrapy.spiders import Spider
And change the class to
class mySpider(Spider):

scrapy crawl multiple pages, extracting data and saving into mysql

Hi can someone help me out I seem to be stuck, I am learning how to crawl and save into mysql us scrapy. I am trying to get scrapy to crawl all of the website pages. Starting with "start_urls", but it does not seem to automatically crawl all of the pages only the one, it does save into mysql with pipelines.py. It does also crawl all pages when provided with urls in a f = open("urls.txt") as well as saves data using pipelines.py.
here is my code
test.py
import scrapy
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from gotp.items import GotPItem
from scrapy.log import *
from gotp.settings import *
from gotp.items import *
class GotP(CrawlSpider):
name = "gotp"
allowed_domains = ["www.craigslist.org"]
start_urls = ["http://sfbay.craigslist.org/search/sss"]
rules = [
Rule(SgmlLinkExtractor(
allow=('')),
callback ="parse",
follow=True
)
]
def parse(self, response):
hxs = HtmlXPathSelector(response)
prices = hxs.select("//div[#class="sliderforward arrow"]")
for price in prices:
item = GotPItem()
item ["price"] = price.select("text()").extract()
yield item
If I understand correctly, you are trying to follow the pagination and extract the results.
In this case, you can avoid using CrawlSpider and use regular Spider class.
The idea would be to parse the first page, extract total results count, calculate how much pages to go and yield scrapy.Request instances to the same URL providing s GET parameter value.
Implementation example:
import scrapy
class GotP(scrapy.Spider):
name = "gotp"
allowed_domains = ["www.sfbay.craigslist.org"]
start_urls = ["http://sfbay.craigslist.org/search/sss"]
results_per_page = 100
def parse(self, response):
total_count = int(response.xpath('//span[#class="totalcount"]/text()').extract()[0])
for page in xrange(0, total_count, self.results_per_page):
yield scrapy.Request("http://sfbay.craigslist.org/search/sss?s=%s&" % page, callback=self.parse_result, dont_filter=True)
def parse_result(self, response):
results = response.xpath("//p[#data-pid]")
for result in results:
try:
print result.xpath(".//span[#class='price']/text()").extract()[0]
except IndexError:
print "Unknown price"
This would follow the pagination and print prices on the console. Hope this is a good starting point.

Scrapy is Visiting same Url despite dont_filter=False

Problem: Scrapy keeps visiting a single url and keeps scraping it recursively. I have checked the response.url to ensure that this is a single page that it keeps scraping and there is no query string involved that may render the same page for different url.
What I have done to reolve it :
Under Scrapy/spider.py I noticed that dont_filter was set to True and changed it False. but it didn't help
I have set the unique = True also in the code, but this didn't help either.
Additional information
The Page thats given as start_url has only 1 link to a page a.html. Scrapy keeps scraping a.html again and again.
Code
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from kt.items import DmozItem
class DmozSpider(CrawlSpider):
name = "dmoz"
allowed_domains = ["datacaredubai.com"]
start_urls = ["http://www.datacaredubai.com/aj/link.html"]
rules = (
Rule(SgmlLinkExtractor(allow=('/aj'),unique=('Yes')), callback='parse_item'),
)
def parse_item(self, response):
sel = Selector(response)
sites = sel.xpath('//*')
items = []
for site in sites:
item = DmozItem()
item['title']= site.xpath('/html/head/meta[3]').extract()
item['req_url']= response.url
items.append(item)
return items
Scrapy, by default, would append into the output file if it exists. What you see in the output.csv is the results of multiple spider runs. Remove the output.csv before running the spider again.

Categories