Scrapy crawl all sitemap links - python

I want to crawl all the links present in the sitemap.xml of a fixed site. I came across Scrapy's SitemapSpider, and so far I've extracted all the URLs in the sitemap. Now I want to crawl each link in the sitemap. Any help would be appreciated. The code so far is:
from scrapy.spiders import SitemapSpider
class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]
    def parse(self, response):
        print(response.url)

Essentially you could create new request objects to crawl the urls created by the SitemapSpider and parse the responses with a new callback:
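from scrapy import Request
from scrapy.spiders import SitemapSpider
class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]
    def parse(self, response):
        print(response.url)
        # dont_filter=True is needed here: this URL has already been fetched,
        # so the duplicate filter would otherwise drop the new request
        return Request(response.url, callback=self.parse_sitemap_url, dont_filter=True)
    def parse_sitemap_url(self, response):
        # do stuff with your sitemap links
        pass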

You need to add sitemap_rules to process the data in the crawled URLs; you can create as many rules as you want.
For instance, say you have a page at http://www.xyz.nl/x/ and you want to create a rule for it:
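from scrapy.spiders import SitemapSpider
class MySpider(SitemapSpider):
    name = 'xyz'
    # sitemap_urls must be a list, not a single string
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # list of (regex, callback name) tuples - this example contains one rule
    sitemap_rules = [('/x/', 'parse_x')]
    def parse_x(self, response):
        paragraphs = response.xpath('//p').extract()
        # callbacks should yield items (dicts work too), not bare strings
        yield {'url': response.url, 'paragraphs': paragraphs}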
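To illustrate how several rules interact: as far as I understand it, SitemapSpider treats each pattern in sitemap_rules as a regular expression, tests it against every URL found in the sitemap, and uses the callback of the first rule that matches. The sketch below imitates that selection in plain Python. The pick_callback helper and the /blog/ rule are hypothetical names made up for illustration; they are not part of Scrapy.

```python
import re

def pick_callback(url, rules):
    """Return the callback name of the first rule whose regex matches url."""
    for pattern, callback in rules:
        if re.search(pattern, url):
            return callback
    return None  # no rule matched, so this URL would be skipped

# hypothetical rules: order matters, the first match wins
rules = [("/x/", "parse_x"), ("/blog/", "parse_blog")]

print(pick_callback("http://www.xyz.nl/x/page1", rules))
print(pick_callback("http://www.xyz.nl/blog/post", rules))
print(pick_callback("http://www.xyz.nl/about", rules))
```

Because the first match wins, more specific patterns are probably best listed before more general ones.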

Related

How to make Scrapy crawl only 1 page (make it non recursive)?

I'm using the latest version of Scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make Scrapy crawl only the URL(s) fed to it in the start_urls list. In most cases I want to crawl only 1 page, but in some cases there may be multiple pages that I will specify. I don't want it to crawl other pages.
I've tried setting the depth level to 1, but in testing I'm not sure it accomplished what I was hoping for.
Any help will be greatly appreciated!
Thank you!
2015-12-22 - Code update:
# -*- coding: utf-8 -*-
import scrapy
from generic.items import GenericItem
class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"
    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id
    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)
    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = GenericItem()
            item['entity_id'] = self.entity_id
            # gets the actual email address
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
            yield item
Below, in the first response, you mention using a generic spider --- isn't that what I'm doing in the code? Also are you suggesting I remove the
callback=self.parse_dir_contents
from the parse function?
Thank you.
It looks like you are using CrawlSpider, which is a special kind of Spider that follows links according to rules.
To crawl only the URLs specified in start_urls, just override the parse method, as that is the default callback for the start requests.
Below is the code for a spider that scrapes the title from a blog (note: the XPath might not be the same for every blog).
Filename: /spiders/my_spider.py
import scrapy
# DmozItem is assumed to be an Item defined in your project's items.py
# with "title" and "article" fields
class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["www.blogtrepreneur.com"]
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]
    def parse(self, response):
        # no new requests are yielded here, so only the start URL is scraped
        item = DmozItem()
        item["title"] = response.xpath('//h1/text()').extract()
        item["article"] = response.xpath('//div[@id="tve_editor"]//p//text()').extract()
        return [item]
The above code will only fetch the title and the article body of the given article.
I had the same problem, because I was using
import scrapy
from scrapy.spiders import CrawlSpider
Then I changed to
import scrapy
from scrapy.spiders import Spider
And changed the class to
class mySpider(Spider):

scrapy crawl multiple page using Request

I followed the documentation,
but I am still not able to crawl multiple pages.
My code looks like this:
def parse(self, response):
    for thing in response.xpath('//article'):
        item = MyItem()
        request = scrapy.Request(link,
                                 callback=self.parse_detail)
        request.meta['item'] = item
        yield request
def parse_detail(self, response):
    print("here\n")
    item = response.meta['item']
    item['test'] = "test"
    yield item
Running this code never calls the parse_detail function and no data is scraped. Any idea? Thanks!
I found that if I comment out allowed_domains it works. But that doesn't make sense, because link definitely belongs to allowed_domains.
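One possible cause, offered as a guess because the allowed_domains value is not shown in the question: entries that contain a scheme or a path never match, since the offsite check compares only the host name of each request against the listed domains. The sketch below approximates that comparison in plain Python. The host_is_allowed helper and the example.com values are hypothetical and are not Scrapy's own code.

```python
from urllib.parse import urlparse

def host_is_allowed(url, allowed_domains):
    """Approximate the offsite check: the host must equal an entry
    or be a subdomain of it."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

link = "http://www.example.com/article/1"  # hypothetical link

bare = host_is_allowed(link, ["example.com"])                    # bare domain
with_scheme = host_is_allowed(link, ["http://www.example.com"])  # entry with scheme
with_path = host_is_allowed(link, ["example.com/articles"])      # entry with path
print(bare, with_scheme, with_path)
```

Only the bare domain matches here; the other two entries reject the same link, which would look exactly like requests being silently dropped.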

Scrapy - Scrape multiple URLs using results from the first URL

I use Scrapy to scrape data from the first URL.
The first URL returns a response containing a list of URLs.
So far, so good. My question is: how can I go on to scrape this list of URLs? After searching, I know I can return a request from parse, but it seems that only one URL can be processed that way.
This is my parse:
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    return scrapy.Request(list[0])
    # It works, but how can I continue b.com and c.com?
Can I do something like this?
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        scrapy.Request(link)
        # This is wrong, though I need something like this
Full version:
import scrapy
class MySpider(scrapy.Spider):
    name = "mySpider"
    allowed_domains = ["x.com"]
    start_urls = ["http://x.com"]
    def parse(self, response):
        # Get the list of URLs, for example:
        list = ["http://a.com", "http://b.com", "http://c.com"]
        for link in list:
            scrapy.Request(link)
            # This is wrong, though I need something like this
I think what you're looking for is the yield statement:
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        request = scrapy.Request(link)
        yield request
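The difference between return and yield can be seen without running a crawl. The sketch below uses a stand-in Request class (an assumption made so that the example runs without Scrapy installed; it is not scrapy.Request) and two hypothetical functions that mirror the two versions of parse.

```python
class Request:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy."""
    def __init__(self, url):
        self.url = url

def parse_with_return(urls):
    # return hands back a single request and ends the function
    return Request(urls[0])

def parse_with_yield(urls):
    # yield turns the function into a generator: one request per URL
    for link in urls:
        yield Request(link)

urls = ["http://a.com", "http://b.com", "http://c.com"]
single = parse_with_return(urls)
many = list(parse_with_yield(urls))

print(single.url)
print([r.url for r in many])
```

Scrapy iterates over whatever the callback produces, so a generator lets it schedule every request rather than only the first one.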
For this purpose, you can subclass scrapy.Spider and define the list of URLs to start with in start_urls. Scrapy will then request each of them and pass every response to parse; it does not follow further links unless parse yields new requests.
Just do something like this:
import scrapy
class YourSpider(scrapy.Spider):
    name = "your_spider"
    allowed_domains = ["a.com", "b.com", "c.com"]
    start_urls = [
        "http://a.com/",
        "http://b.com/",
        "http://c.com/",
    ]
    def parse(self, response):
        # do whatever you want
        pass
You can find more information on the official documentation of Scrapy.
# within your parse method:
urlList = response.xpath('//a/@href').extract()
print(urlList)  # to see the list of URLs
for url in urlList:
    # urljoin turns relative hrefs into absolute URLs before requesting them
    yield scrapy.Request(response.urljoin(url), callback=self.parse)
This should work
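One caveat when building requests from extracted href values: they are often relative, and a request needs an absolute URL. The sketch below shows how relative links resolve against a page URL using the standard library's urljoin. As far as I know, Scrapy's response.urljoin resolves links against the page's base URL in much the same way. The page URL and hrefs here are made up for illustration.

```python
from urllib.parse import urljoin

page_url = "http://x.com/section/index.html"  # hypothetical page
hrefs = ["/about", "item.html", "http://b.com/", "../up"]

# resolve every href against the URL of the page it was found on
absolute = [urljoin(page_url, h) for h in hrefs]
for original, resolved in zip(hrefs, absolute):
    print(original, "->", resolved)
```

Links that are already absolute pass through unchanged, so it is safe to apply this to every extracted href.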

using scrapy to extract dynamic data - location based on postcodes

I'm new to Scrapy, and with some tutorials I was able to scrape a few simple websites, but I'm facing an issue now with a new website where I have to fill a search form and extract the results. The response I get doesn't have the results.
Let's say for example, for the following site: http://www.beaurepaires.com.au/store-locator/
I want to provide a list of postcodes and extract information about stores in each postcode (store name and address).
I'm using the following code but it's not working, and I'm not sure where to start from.
class BeaurepairesSpider(BaseSpider):
    name = "beaurepaires"
    allowed_domains = ["http://www.beaurepaires.com.au"]
    start_urls = ["http://www.beaurepaires.com.au/store-locator/"]
    #start_urls = ["http://www.beaurepaires.com.au/"]
    def parse(self, response):
        yield FormRequest.from_response(response, formname='frm_dealer_locator', formdata={'dealer_postcode_textfield':'2115'}, callback=self.parseBeaurepaires)
    def parseBeaurepaires(self, response):
        hxs = HtmlXPathSelector(response)
        filename = "postcodetest3.txt"
        open(filename, 'wb').write(response.body)
        table = hxs.select("//div[@id='jl_results']/table/tbody")
        headers = table.select("tr[position()<=1]")
        data_rows = table.select("tr[position()>1]")
Thanks!!
The page relies heavily on JavaScript to load its data, which makes it too complex for plain Scrapy. Here's an example of what I've come up with:
import re
from scrapy.http import FormRequest, Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
class BeaurepairesSpider(BaseSpider):
    name = "beaurepaires"
    allowed_domains = ["beaurepaires.com.au", "gdt.rightthere.com.au"]
    start_urls = ["http://www.beaurepaires.com.au/store-locator/"]
    def parse(self, response):
        yield FormRequest.from_response(response, formname='frm_dealer_locator',
                                        formdata={'dealer_postcode_textfield':'2115'},
                                        callback=self.parseBeaurepaires)
    def parseBeaurepaires(self, response):
        hxs = HtmlXPathSelector(response)
        script = str(hxs.select("//div[@id='jl_container']/script[4]/text()").extract()[0])
        url, script_name = re.findall(r'LoadScripts\("([a-zA-Z:/\.]+)", "(\w+)"', script)[0]
        url = "%s/locator/js/data/%s.js" % (url, script_name)
        yield Request(url=url, callback=self.parse_js)
    def parse_js(self, response):
        print(response.body)  # here are your locations - right, inside the js file
Notice that this relies on regular expressions and hardcoded URLs, and you would still have to parse the JavaScript to get your locations. It is too fragile, even if you finish it and get the locations.
Just switch to an in-browser tool like Selenium (or combine Scrapy with it).

How to follow links with Scrapy

How do I follow links in this example: http://snippets.scrapy.org/snippets/7/ ?
The script stops after visiting the link of the first page.
class MySpider(BaseSpider):
    """Our ad-hoc spider"""
    name = "myspider"
    start_urls = ["http://stackoverflow.com/"]
    question_list_xpath = '//div[@id="content"]//div[contains(@class, "question-summary")]'
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for qxs in hxs.select(self.question_list_xpath):
            loader = XPathItemLoader(QuestionItem(), selector=qxs)
            loader.add_xpath('title', './/h3/a/text()')
            loader.add_xpath('summary', './/h3/a/@title')
            loader.add_xpath('tags', './/a[@rel="tag"]/text()')
            loader.add_xpath('user', './/div[@class="started"]/a[2]/text()')
            loader.add_xpath('posted', './/div[@class="started"]/a[1]/span/@title')
            loader.add_xpath('votes', './/div[@class="votes"]/div[1]/text()')
            loader.add_xpath('answers', './/div[contains(@class, "answered")]/div[1]/text()')
            loader.add_xpath('views', './/div[@class="views"]/div[1]/text()')
            yield loader.load_item()
I've tried changing:
class MySpider(BaseSpider):
to
class MySpider(CrawlSpider):
and adding
rules = (
    Rule(SgmlLinkExtractor(allow=()),
         callback='parse', follow=True),
)
But it doesn't crawl the whole site.
Thanks,
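Yes, you need to subclass CrawlSpider and rename the parse function to something like parse_page (updating the callback in the rule to match), because CrawlSpider uses parse internally to apply its rules; overriding it stops the links from being followed.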
This was already answered
